cs.CL - 2023-07-22

Revisiting Distillation for Continual Learning on Visual Question Localized-Answering in Robotic Surgery

  • paper_url: http://arxiv.org/abs/2307.12045
  • repo_url: https://github.com/longbai1006/cs-vqla
  • paper_authors: Long Bai, Mobarakol Islam, Hongliang Ren
  • for: The paper investigates how a non-exemplar continual learning (CL) approach can improve visual-question localized-answering (VQLA) systems that serve as knowledgeable assistants in surgical education.
  • methods: It proposes a non-exemplar CL framework that tackles catastrophic forgetting in deep neural networks (DNNs): when DNNs learn new classes or tasks, their performance on old tasks drops sharply, and medical data privacy and licensing issues usually make old data unavailable when updating CL models. The framework therefore balances the rigidity and plasticity of DNNs under sequential learning (a distillation sketch follows the abstract).
  • results: Extensive experiments on three public surgical datasets show that the method outperforms conventional CL approaches on surgical VQLA, preserving performance on old tasks while learning new ones and adjusting the weight bias between old and new tasks.
    Abstract The visual-question localized-answering (VQLA) system can serve as a knowledgeable assistant in surgical education. Except for providing text-based answers, the VQLA system can highlight the interested region for better surgical scene understanding. However, deep neural networks (DNNs) suffer from catastrophic forgetting when learning new knowledge. Specifically, when DNNs learn on incremental classes or tasks, their performance on old tasks drops dramatically. Furthermore, due to medical data privacy and licensing issues, it is often difficult to access old data when updating continual learning (CL) models. Therefore, we develop a non-exemplar continual surgical VQLA framework, to explore and balance the rigidity-plasticity trade-off of DNNs in a sequential learning paradigm. We revisit the distillation loss in CL tasks, and propose rigidity-plasticity-aware distillation (RP-Dist) and self-calibrated heterogeneous distillation (SH-Dist) to preserve the old knowledge. The weight aligning (WA) technique is also integrated to adjust the weight bias between old and new tasks. We further establish a CL framework on three public surgical datasets in the context of surgical settings that consist of overlapping classes between old and new surgical VQLA tasks. With extensive experiments, we demonstrate that our proposed method excellently reconciles learning and forgetting on the continual surgical VQLA over conventional CL methods. Our code is publicly accessible.
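  • code sketch: The framework's core ingredient is a distillation term that keeps the updated model close to the frozen old model while a standard task loss learns the new classes. The snippet below is only a generic logit-distillation sketch of that idea; the paper's RP-Dist and SH-Dist losses and the weight-aligning step are more elaborate, and every name here is ours, not the released code.

```python
import torch
import torch.nn.functional as F

def distillation_loss(new_logits, old_logits, temperature=2.0):
    # Generic logit distillation: KL divergence between softened predictions
    # of the frozen old model and the model being updated.
    p_old = F.softmax(old_logits / temperature, dim=-1)
    log_p_new = F.log_softmax(new_logits / temperature, dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * temperature ** 2

def continual_step(new_model, old_model, images, questions, labels,
                   lambda_distill=1.0):
    # Cross-entropy on the new task plus distillation against the frozen old
    # model; no exemplars from old tasks are stored or replayed.
    new_logits = new_model(images, questions)
    with torch.no_grad():
        old_logits = old_model(images, questions)
    # Distil only over the classes the old model already knows.
    loss_old = distillation_loss(new_logits[..., : old_logits.size(-1)], old_logits)
    loss_new = F.cross_entropy(new_logits, labels)
    return loss_new + lambda_distill * loss_old
```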

FinPT: Financial Risk Prediction with Profile Tuning on Pretrained Foundation Models

  • paper_url: http://arxiv.org/abs/2308.00065
  • repo_url: https://github.com/yuweiyin/finpt
  • paper_authors: Yuwei Yin, Yazheng Yang, Jian Yang, Qi Liu
  • for: Financial risk prediction in the financial sector, specifically addressing the issues of outdated algorithms and the lack of a unified benchmark.
  • methods: Propose a novel approach called FinPT that leverages large pretrained foundation models and natural language processing techniques to improve financial risk prediction. FinPT fills financial tabular data into pre-defined instruction templates, obtains natural-language customer profiles by prompting LLMs, and fine-tunes large foundation models with the profile text for predictions (illustrated in the sketch after the abstract).
  • results: Demonstrate the effectiveness of FinPT by experimenting with a range of representative strong baselines on FinBench, a set of high-quality datasets on financial risks. Analytical studies further deepen the understanding of LLMs for financial risk prediction.
    Abstract Financial risk prediction plays a crucial role in the financial sector. Machine learning methods have been widely applied for automatically detecting potential risks and thus saving the cost of labor. However, the development in this field is lagging behind in recent years by the following two facts: 1) the algorithms used are somewhat outdated, especially in the context of the fast advance of generative AI and large language models (LLMs); 2) the lack of a unified and open-sourced financial benchmark has impeded the related research for years. To tackle these issues, we propose FinPT and FinBench: the former is a novel approach for financial risk prediction that conduct Profile Tuning on large pretrained foundation models, and the latter is a set of high-quality datasets on financial risks such as default, fraud, and churn. In FinPT, we fill the financial tabular data into the pre-defined instruction template, obtain natural-language customer profiles by prompting LLMs, and fine-tune large foundation models with the profile text to make predictions. We demonstrate the effectiveness of the proposed FinPT by experimenting with a range of representative strong baselines on FinBench. The analytical studies further deepen the understanding of LLMs for financial risk prediction.
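  • code sketch: As described in the abstract, FinPT has three stages: fill each row of financial tabular data into an instruction template, prompt an LLM to turn it into a natural-language customer profile, and fine-tune a foundation model on the profile text. The snippet below illustrates the first two stages only; the template wording, the `llm` callable, and the helper names are assumptions for illustration, not the released implementation.

```python
from typing import Dict

# Hypothetical instruction template; the actual templates ship with FinPT.
TEMPLATE = (
    "Describe this customer for a {task} risk assessment. "
    "Attributes: {attributes}."
)

def fill_template(row: Dict[str, object], task: str = "default") -> str:
    # Flatten one row of tabular data into the instruction template.
    attributes = ", ".join(f"{k} = {v}" for k, v in row.items())
    return TEMPLATE.format(task=task, attributes=attributes)

def build_profile(row: Dict[str, object], llm) -> str:
    # `llm` is any callable mapping a prompt string to generated text
    # (an API wrapper or a local model) -- an assumed interface.
    return llm(fill_template(row))

if __name__ == "__main__":
    row = {"age": 42, "income": 58000, "num_late_payments": 3}
    print(fill_template(row))
    # The resulting profile text would then be paired with the risk label to
    # fine-tune a foundation model in the usual supervised fashion.
```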

Learning Vision-and-Language Navigation from YouTube Videos

  • paper_url: http://arxiv.org/abs/2307.11984
  • repo_url: https://github.com/jeremylinky/youtube-vln
  • paper_authors: Kunyang Lin, Peihao Chen, Diwei Huang, Thomas H. Li, Mingkui Tan, Chuang Gan
  • for: Uses house tour videos from YouTube to train an embodied agent to navigate realistic 3D environments from natural language instructions.
  • methods: Builds a large-scale dataset of reasonable path-instruction pairs extracted from house tour videos and pre-trains the agent on it (the entropy-based node selection is sketched after the abstract).
  • results: Path-instruction pairs are constructed with an entropy-based method, an action-aware generator produces instructions from unlabeled trajectories, and a trajectory judgment pretext task encourages the agent to mine layout knowledge, yielding state-of-the-art performance on the R2R and REVERIE benchmarks.
    Abstract Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions. Existing VLN methods suffer from training on small-scale environments or unreasonable path-instruction datasets, limiting the generalization to unseen environments. There are massive house tour videos on YouTube, providing abundant real navigation experiences and layout information. However, these videos have not been explored for VLN before. In this paper, we propose to learn an agent from these videos by creating a large-scale dataset which comprises reasonable path-instruction pairs from house tour videos and pre-training the agent on it. To achieve this, we have to tackle the challenges of automatically constructing path-instruction pairs and exploiting real layout knowledge from raw and unlabeled videos. To address these, we first leverage an entropy-based method to construct the nodes of a path trajectory. Then, we propose an action-aware generator for generating instructions from unlabeled trajectories. Last, we devise a trajectory judgment pretext task to encourage the agent to mine the layout knowledge. Experimental results show that our method achieves state-of-the-art performance on two popular benchmarks (R2R and REVERIE). Code is available at https://github.com/JeremyLinky/YouTube-VLN
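  • code sketch: The abstract mentions an entropy-based method for constructing the nodes of a path trajectory from raw video. One plausible reading, sketched below purely for illustration, keeps a frame as a node whenever the entropy of its intensity histogram jumps relative to the last selected node; the paper's actual criterion may differ, and all names and thresholds here are ours.

```python
import numpy as np

def frame_entropy(frame: np.ndarray, bins: int = 32) -> float:
    # Shannon entropy of a grayscale intensity histogram -- a crude proxy for
    # how much visual content a frame carries.
    hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_nodes(frames, entropy_jump: float = 0.5):
    # Keep a frame as a trajectory node when its entropy changes by more than
    # `entropy_jump` relative to the last selected node (illustrative rule).
    nodes, last = [], None
    for idx, frame in enumerate(frames):
        h = frame_entropy(frame)
        if last is None or abs(h - last) > entropy_jump:
            nodes.append(idx)
            last = h
    return nodes

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_video = [rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
                  for _ in range(20)]
    print(select_nodes(fake_video))
```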

CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots

  • paper_url: http://arxiv.org/abs/2307.11865
  • repo_url: None
  • paper_authors: Nikhil Kakodkar, Dmitriy Rivkin, Bobak H. Baghi, Francois Hogan, Gregory Dudek
  • for: Explores how large language models (LLMs) can address problems at the intersection of spatial planning and natural-language navigation interfaces.
  • methods: Uses an LLM to interpret descriptive language queries that users give in conversation, and creates complex, repeatable scenarios in the 3D simulator AI2Thor (see the prompt sketch below).
  • results: Shows that, with an LLM, a robot parses descriptive conversational queries better than existing methods and better infers the user's navigation goal.
    Abstract This work explores the capacity of large language models (LLMs) to address problems at the intersection of spatial planning and natural language interfaces for navigation. Our focus is on following relatively complex instructions that are more akin to natural conversation than traditional explicit procedural directives seen in robotics. Unlike most prior work, where navigation directives are provided as imperative commands (e.g., go to the fridge), we examine implicit directives within conversational interactions. We leverage the 3D simulator AI2Thor to create complex and repeatable scenarios at scale, and augment it by adding complex language queries for 40 object types. We demonstrate that a robot can better parse descriptive language queries than existing methods by using an LLM to interpret the user interaction in the context of a list of the objects in the scene.
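  • code sketch: The mechanism described in the abstract is to hand the LLM the user's conversational utterance together with a list of objects present in the scene and let it commit to one navigation target. The snippet below shows one way such a prompt could be assembled; the prompt wording and the `query_llm` interface are assumptions, not the paper's code.

```python
from typing import Callable, List

def build_prompt(scene_objects: List[str], user_utterance: str) -> str:
    # Ground the conversational request against the objects actually in the
    # scene and ask for exactly one navigation target.
    objects = ", ".join(scene_objects)
    return (
        "You control a household robot. The scene contains these objects: "
        f"{objects}.\n"
        f'The user says: "{user_utterance}"\n'
        "Answer with the single object the robot should navigate to."
    )

def choose_target(scene_objects: List[str], user_utterance: str,
                  query_llm: Callable[[str], str]) -> str:
    # `query_llm` is any prompt-to-text function (API wrapper or local model).
    answer = query_llm(build_prompt(scene_objects, user_utterance)).lower()
    # Fall back to simple matching so the result is always a known object.
    for obj in scene_objects:
        if obj.lower() in answer:
            return obj
    return answer

if __name__ == "__main__":
    print(build_prompt(["fridge", "sofa", "television"],
                       "I'm starving after that run."))
```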

The Looming Threat of Fake and LLM-generated LinkedIn Profiles: Challenges and Opportunities for Detection and Prevention

  • paper_url: http://arxiv.org/abs/2307.11864
  • repo_url: None
  • paper_authors: Navid Ayoobi, Sadat Shahriar, Arjun Mukherjee
  • for: The study aims to detect fake and large language model (LLM)-generated accounts on the LinkedIn online social network, preventing imposters from acquiring legitimate users' private information and from building credibility for future scamming activities.
  • methods: The study uses the textual information provided in LinkedIn profiles and introduces the Section and Subsection Tag Embedding (SSTE) method to strengthen the discriminative characteristics of these data for distinguishing fake accounts, created manually or with an LLM, from legitimate ones (a minimal sketch follows the abstract).
  • results: The study reaches an accuracy of about 95% in separating fake from legitimate accounts, and shows that SSTE identifies LLM-generated accounts with an accuracy of about 90% when only 20 such profiles are added to the training set, even though none were used in the original training.
    Abstract In this paper, we present a novel method for detecting fake and Large Language Model (LLM)-generated profiles in the LinkedIn Online Social Network immediately upon registration and before establishing connections. Early fake profile identification is crucial to maintaining the platform's integrity since it prevents imposters from acquiring the private and sensitive information of legitimate users and from gaining an opportunity to increase their credibility for future phishing and scamming activities. This work uses textual information provided in LinkedIn profiles and introduces the Section and Subsection Tag Embedding (SSTE) method to enhance the discriminative characteristics of these data for distinguishing between legitimate profiles and those created by imposters manually or by using an LLM. Additionally, the dearth of a large publicly available LinkedIn dataset motivated us to collect 3600 LinkedIn profiles for our research. We will release our dataset publicly for research purposes. This is, to the best of our knowledge, the first large publicly available LinkedIn dataset for fake LinkedIn account detection. Within our paradigm, we assess static and contextualized word embeddings, including GloVe, Flair, BERT, and RoBERTa. We show that the suggested method can distinguish between legitimate and fake profiles with an accuracy of about 95% across all word embeddings. In addition, we show that SSTE has a promising accuracy for identifying LLM-generated profiles, despite the fact that no LLM-generated profiles were employed during the training phase, and can achieve an accuracy of approximately 90% when only 20 LLM-generated profiles are added to the training set. It is a significant finding since the proliferation of several LLMs in the near future makes it extremely challenging to design a single system that can identify profiles created with various LLMs.
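  • code sketch: SSTE enriches the profile text with embeddings of the section and subsection a span comes from, so the classifier can exploit where on the profile a phrase appears. Below is a minimal sketch of that idea using averaged word vectors plus learned tag embeddings; the dimensions, tag inventory, and combination rule are our assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SSTESketch(nn.Module):
    # Illustrative only: sum a text embedding with learned section and
    # subsection tag embeddings before a fake-vs-legitimate classifier.
    def __init__(self, vocab_size=10000, n_sections=8, n_subsections=32, dim=128):
        super().__init__()
        self.word_emb = nn.EmbeddingBag(vocab_size, dim, mode="mean")
        self.section_emb = nn.Embedding(n_sections, dim)
        self.subsection_emb = nn.Embedding(n_subsections, dim)
        self.classifier = nn.Linear(dim, 2)  # legitimate vs. fake

    def forward(self, token_ids, section_id, subsection_id):
        text = self.word_emb(token_ids)
        tags = self.section_emb(section_id) + self.subsection_emb(subsection_id)
        return self.classifier(text + tags)

if __name__ == "__main__":
    model = SSTESketch()
    tokens = torch.randint(0, 10000, (4, 20))   # 4 profile spans, 20 tokens each
    sections = torch.randint(0, 8, (4,))
    subsections = torch.randint(0, 32, (4,))
    print(model(tokens, sections, subsections).shape)  # torch.Size([4, 2])
```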

MythQA: Query-Based Large-Scale Check-Worthy Claim Detection through Multi-Answer Open-Domain Question Answering

  • paper_url: http://arxiv.org/abs/2307.11848
  • repo_url: https://github.com/tonyby/myth-qa
  • paper_authors: Yang Bai, Anthony Colas, Daisy Zhe Wang
  • for: The paper is written for detecting check-worthy claims directly from a large-scale information source, such as Twitter, to accelerate the fact-checking process.
  • methods: The paper introduces MythQA, a new multi-answer open-domain question answering task that involves contradictory stance mining for query-based large-scale check-worthy claim detection.
  • results: The paper presents a baseline system for MythQA and evaluates existing NLP models for each system component using the TweetMythQA dataset. The paper also provides initial benchmarks and identifies key challenges for future models to improve upon.
    Abstract Check-worthy claim detection aims at providing plausible misinformation to downstream fact-checking systems or human experts to check. This is a crucial step toward accelerating the fact-checking process. Many efforts have been put into how to identify check-worthy claims from a small scale of pre-collected claims, but how to efficiently detect check-worthy claims directly from a large-scale information source, such as Twitter, remains underexplored. To fill this gap, we introduce MythQA, a new multi-answer open-domain question answering(QA) task that involves contradictory stance mining for query-based large-scale check-worthy claim detection. The idea behind this is that contradictory claims are a strong indicator of misinformation that merits scrutiny by the appropriate authorities. To study this task, we construct TweetMythQA, an evaluation dataset containing 522 factoid multi-answer questions based on controversial topics. Each question is annotated with multiple answers. Moreover, we collect relevant tweets for each distinct answer, then classify them into three categories: "Supporting", "Refuting", and "Neutral". In total, we annotated 5.3K tweets. Contradictory evidence is collected for all answers in the dataset. Finally, we present a baseline system for MythQA and evaluate existing NLP models for each system component using the TweetMythQA dataset. We provide initial benchmarks and identify key challenges for future models to improve upon. Code and data are available at: https://github.com/TonyBY/Myth-QA
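  • code sketch: Operationally, the task is: for each candidate answer, classify its evidence tweets as Supporting, Refuting, or Neutral, and flag answers that attract both supporting and refuting evidence as check-worthy. The snippet below wires those steps together around a placeholder stance classifier; the classifier and all names are hypothetical, not the baseline system from the paper.

```python
from collections import Counter
from typing import Callable, Dict, List

Stance = str  # "Supporting", "Refuting", or "Neutral"

def mine_checkworthy(answer_to_tweets: Dict[str, List[str]],
                     stance_model: Callable[[str, str], Stance]) -> List[str]:
    # An answer is check-worthy when its retrieved tweets contradict each
    # other, i.e. it has both supporting and refuting evidence.
    checkworthy = []
    for answer, tweets in answer_to_tweets.items():
        counts = Counter(stance_model(answer, t) for t in tweets)
        if counts["Supporting"] > 0 and counts["Refuting"] > 0:
            checkworthy.append(answer)
    return checkworthy

if __name__ == "__main__":
    def toy_stance(answer: str, tweet: str) -> Stance:
        # Keyword stand-in for a real stance classifier (demo only).
        if "not" in tweet.lower():
            return "Refuting"
        if answer.lower() in tweet.lower():
            return "Supporting"
        return "Neutral"

    evidence = {
        "vaccine X causes Y": ["Vaccine X causes Y, a study says",
                               "This claim is not supported by any trial"],
        "city Z banned cars": ["The weather in city Z was nice today"],
    }
    print(mine_checkworthy(evidence, toy_stance))  # ['vaccine X causes Y']
```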

OUTFOX: LLM-generated Essay Detection through In-context Learning with Adversarially Generated Examples

  • paper_url: http://arxiv.org/abs/2307.11729
  • repo_url: None
  • paper_authors: Ryuto Koike, Masahiro Kaneko, Naoaki Okazaki
  • for: This work aims to improve the robustness of LLM-generated-text detectors and to evaluate their effectiveness in realistic settings.
  • methods: It proposes the OUTFOX framework, in which the detector and the attacker each take the other's output into account, and applies it to the domain of student essays (see the sketch below).
  • results: Experiments show that the proposed detector, trained in context on the attacker's output, improves detection performance by up to 41.3 F1 points on the attacked dataset, while the proposed attacker degrades detector performance by up to 57.0 F1 points compared with paraphrasing.
    Abstract Large Language Models (LLMs) have achieved human-level fluency in text generation, making it difficult to distinguish between human-written and LLM-generated texts. This poses a growing risk of misuse of LLMs and demands the development of detectors to identify LLM-generated texts. However, existing detectors degrade detection accuracy by simply paraphrasing LLM-generated texts. Furthermore, the effectiveness of these detectors in real-life situations, such as when students use LLMs for writing homework assignments (e.g., essays) and quickly learn how to evade these detectors, has not been explored. In this paper, we propose OUTFOX, a novel framework that improves the robustness of LLM-generated-text detectors by allowing both the detector and the attacker to consider each other's output and apply this to the domain of student essays. In our framework, the attacker uses the detector's prediction labels as examples for in-context learning and adversarially generates essays that are harder to detect. While the detector uses the adversarially generated essays as examples for in-context learning to learn to detect essays from a strong attacker. Our experiments show that our proposed detector learned in-context from the attacker improves the detection performance on the attacked dataset by up to +41.3 point F1-score. While our proposed attacker can drastically degrade the performance of the detector by up to -57.0 point F1-score compared to the paraphrasing method.
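  • code sketch: OUTFOX couples a detector and an attacker: the attacker conditions its essay generation on the detector's previous verdicts, and the detector adds the resulting adversarial essays to its own in-context examples. The snippet below sketches that alternation at the prompt-construction level; both prompts and the `LLM` interface are assumptions, not the paper's released implementation.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # any prompt -> text function (assumed interface)

def attacker_prompt(topic: str, labeled: List[Tuple[str, str]]) -> str:
    # The attacker sees essays with the detector's verdicts and is asked to
    # write one that will be judged human-written.
    examples = "\n".join(f"Essay: {e}\nDetector verdict: {v}" for e, v in labeled)
    return (f"{examples}\n"
            f"Write an essay on '{topic}' that the detector will judge "
            f"as human-written.")

def detector_prompt(candidate: str, labeled: List[Tuple[str, str]]) -> str:
    # The detector receives the adversarial essays as in-context examples
    # before judging the new candidate.
    examples = "\n".join(f"Essay: {e}\nLabel: {v}" for e, v in labeled)
    return (f"{examples}\n"
            f"Essay: {candidate}\nLabel (human-written or LLM-generated):")

def outfox_round(topic: str, history: List[Tuple[str, str]],
                 attacker: LLM, detector: LLM) -> Tuple[str, str]:
    # One attacker/detector exchange; callers append the result to `history`
    # so both sides keep adapting in context over successive rounds.
    essay = attacker(attacker_prompt(topic, history))
    verdict = detector(detector_prompt(essay, history))
    return essay, verdict
```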

GPT-4 Can’t Reason

  • paper_url: http://arxiv.org/abs/2308.03762
  • repo_url: https://github.com/vohidjon123/google
  • paper_authors: Konstantine Arkoudas
  • for: Assessing the reasoning ability of the GPT-4 model.
  • methods: Evaluates GPT-4's reasoning ability using a range of assessment approaches.
  • results: GPT-4 currently lacks the ability to reason; across the various evaluations it shows only occasional flashes of analytical brilliance.
    Abstract GPT-4 was released in March 2023 to wide acclaim, marking a very substantial improvement across the board over GPT-3.5 (OpenAI's previously best model, which had powered the initial release of ChatGPT). However, despite the genuinely impressive improvement, there are good reasons to be highly skeptical of GPT-4's ability to reason. This position paper discusses the nature of reasoning; criticizes the current formulation of reasoning problems in the NLP community, as well as the way in which LLM reasoning performance is currently evaluated; introduces a small collection of 21 diverse reasoning problems; and performs a detailed qualitative evaluation of GPT-4's performance on those problems. Based on this analysis, the paper concludes that, despite its occasional flashes of analytical brilliance, GPT-4 at present is utterly incapable of reasoning.