* For: This paper aims to help stakeholders in refugee status adjudications, such as lawyers, judges, governing bodies, and claimants, make better decisions through data-driven intelligence and increase understanding and transparency of the refugee application process.
* Methods: The paper presents a completed experiment on retrieving past cases and ongoing efforts related to analyzing legal decision-making processes on a dataset of Canadian cases, using NLP-based solutions.
* Results: The paper introduces a novel benchmark for future NLP research in refugee law and expects to achieve benefits such as reduced time-to-decision, fairer and more transparent outcomes, and improved decision quality.
* For: 这篇论文旨在帮助难民地位评估中的利益相关者,如律师、法官、管理机构和申请人,通过数据驱动智能做出更好的决策,并提高难民申请过程中所有参与者的理解和透明度。
* Methods: 论文介绍了一个已完成的历史案例检索实验,以及正在进行的、基于加拿大案例数据集的法律决策过程分析,均采用NLP技术。
* Results: 论文引入了一个面向未来难民法NLP研究的新基准,预计可以实现减少决策时间、更公平和透明的结果以及更高的决策质量等收益。
Abstract
Our project aims at helping and supporting stakeholders in refugee status adjudications, such as lawyers, judges, governing bodies, and claimants, in order to make better decisions through data-driven intelligence and increase the understanding and transparency of the refugee application process for all involved parties. This PhD project has two primary objectives: (1) to retrieve past cases, and (2) to analyze legal decision-making processes on a dataset of Canadian cases. In this paper, we present the current state of our work, which includes a completed experiment on part (1) and ongoing efforts related to part (2). We believe that NLP-based solutions are well-suited to address these challenges, and we investigate the feasibility of automating all steps involved. In addition, we introduce a novel benchmark for future NLP research in refugee law. Our methodology aims to be inclusive to all end-users and stakeholders, with expected benefits including reduced time-to-decision, fairer and more transparent outcomes, and improved decision quality.
摘要
我们的项目旨在帮助和支持难民地位审批的相关方,如律师、法官、管理机构和申请人,使其能够通过数据驱动智能做出更好的决策,并提高所有参与方对难民申请过程的理解和透明度。这个博士项目有两个主要目标:(1)检索历史案例,(2)分析加拿大案例数据集上的法律决策过程。在这篇论文中,我们介绍了当前的工作进展,包括关于目标(1)的一项已完成实验,以及围绕目标(2)正在进行的工作。我们认为NLP技术非常适合解决这些挑战,并且我们研究了将所有相关步骤自动化的可行性。此外,我们还引入了一个用于未来难民法NLP研究的新基准。我们的方法旨在包容所有终端用户和相关方,预期收益包括减少决策时间、更公平和透明的结果,以及更高的决策质量。
Unsupervised Prototype Adapter for Vision-Language Models
results: 我们的实验结果表明,我们的方法可以大幅超越8-shot CoOp、8-shot Tip-Adapter和现有的最先进UPL方法,并且在图像识别和领域泛化任务中表现出色。
Abstract
Recently, large-scale pre-trained vision-language models (e.g. CLIP and ALIGN) have demonstrated remarkable effectiveness in acquiring transferable visual representations. To leverage the valuable knowledge encoded within these models for downstream tasks, several fine-tuning approaches, including prompt tuning methods and adapter-based methods, have been developed to adapt vision-language models effectively with supervision. However, these methods rely on the availability of annotated samples, which can be labor-intensive and time-consuming to acquire, thus limiting scalability. To address this issue, in this work, we design an unsupervised fine-tuning approach for vision-language models called Unsupervised Prototype Adapter (UP-Adapter). Specifically, for the unannotated target datasets, we leverage the text-image aligning capability of CLIP to automatically select the most confident samples for each class. Utilizing these selected samples, we generate class prototypes, which serve as the initialization for the learnable prototype model. After fine-tuning, the prototype model prediction is combined with the original CLIP's prediction by a residual connection to perform downstream recognition tasks. Our extensive experimental results on image recognition and domain generalization show that the proposed unsupervised method outperforms 8-shot CoOp, 8-shot Tip-Adapter, and also the state-of-the-art UPL method by large margins.
摘要
近来,大规模预训练视觉语言模型(例如CLIP和ALIGN)在获取可迁移视觉表示方面表现出了显著的有效性。为了利用这些模型中蕴含的宝贵知识来完成下游任务,人们开发了多种微调方法,如提示调整方法和基于适配器的方法,以在有监督条件下有效地适配视觉语言模型。然而,这些方法依赖于标注样本,而标注样本的获取往往是劳动密集且耗时的,从而限制了可扩展性。为了解决这个问题,在这项工作中,我们设计了一种用于视觉语言模型的无监督微调方法,即无监督原型适配器(Unsupervised Prototype Adapter,UP-Adapter)。具体来说,对于无标注的目标数据集,我们利用CLIP的文本图像对齐能力,自动为每个类别选择置信度最高的样本。利用这些选出的样本,我们生成类别原型,并将其作为可学习原型模型的初始化。微调之后,原型模型的预测通过残差连接与原始CLIP的预测相结合,以执行下游识别任务。我们在图像识别和领域泛化上的大量实验结果表明,所提出的无监督方法大幅超越了8-shot CoOp、8-shot Tip-Adapter以及最先进的UPL方法。
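The residual combination described in the abstract (the fine-tuned prototype model's prediction added onto frozen CLIP's zero-shot prediction) can be sketched as follows. The shapes, `alpha`, and `temperature` values here are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # CLIP-style features are compared on the unit sphere
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def up_adapter_logits(image_feat, text_feats, prototypes, alpha=0.5, temperature=100.0):
    """Combine the prototype model's prediction with CLIP's zero-shot
    prediction via a residual connection (illustrative sketch)."""
    image_feat = l2_normalize(image_feat)
    clip_logits = temperature * image_feat @ l2_normalize(text_feats).T
    proto_logits = temperature * image_feat @ l2_normalize(prototypes).T
    # residual connection: adapter output is added onto the frozen CLIP logits
    return clip_logits + alpha * proto_logits
```

Here `prototypes` would be the class prototypes computed from CLIP's most confident unlabeled samples, and `alpha` weighs the adapter branch against the zero-shot branch.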
Can Authorship Representation Learning Capture Stylistic Features?
paper_authors: Andrew Wang, Cristina Aggazzotti, Rebecca Kotula, Rafael Rivera Soto, Marcus Bishop, Nicholas Andrews
for: 这篇论文的目的是以数据驱动的方式学习作者表示,以用于作者归属(authorship attribution)任务。
methods: 论文利用带有作者标签的大规模文本语料库,以纯数据驱动的方式学习作者表示,并通过一系列针对性实验系统地探查这些表示。
results: 实验结果表明,学习到的作者表示确实对写作风格敏感,因此有望对某些数据变化(如随时间的话题漂移)保持鲁棒。
Abstract
Automatically disentangling an author's style from the content of their writing is a longstanding and possibly insurmountable problem in computational linguistics. At the same time, the availability of large text corpora furnished with author labels has recently enabled learning authorship representations in a purely data-driven manner for authorship attribution, a task that ostensibly depends to a greater extent on encoding writing style than encoding content. However, success on this surrogate task does not ensure that such representations capture writing style since authorship could also be correlated with other latent variables, such as topic. In an effort to better understand the nature of the information these representations convey, and specifically to validate the hypothesis that they chiefly encode writing style, we systematically probe these representations through a series of targeted experiments. The results of these experiments suggest that representations learned for the surrogate authorship prediction task are indeed sensitive to writing style. As a consequence, authorship representations may be expected to be robust to certain kinds of data shift, such as topic drift over time. Additionally, our findings may open the door to downstream applications that require stylistic representations, such as style transfer.
摘要
自动地将作者的风格与写作内容分离,是计算语言学领域一个长期存在、甚至可能无法彻底解决的问题。与此同时,带有作者标签的大规模文本语料库的出现,使得人们可以通过纯数据驱动的方式为作者归属任务学习作者表示,而这一任务表面上更多依赖于编码写作风格而非编码内容。然而,在这个替代任务上的成功并不能保证这些表示捕捉到了写作风格,因为作者身份也可能与其他隐变量(如话题)相关。为了更好地理解这些表示所传递的信息,特别是为了验证它们主要编码写作风格这一假设,我们通过一系列针对性实验系统地探查了这些表示。实验结果表明,为替代性的作者预测任务学习到的表示确实对写作风格敏感。因此,可以预期作者表示对某些数据变化(如随时间的话题漂移)具有鲁棒性。此外,我们的发现可能为需要风格表示的下游应用(如风格迁移)打开大门。
Learning to generate and corr- uh I mean repair language in real-time
methods: 这篇论文使用先前学习的动态语法(Dynamic Syntax)文法和CHILDES语料库,开发了一个概率性的增量生成模型,用于实现实时语言处理。
results: 研究发现,该模型的输出在78%的情况下与金标准候选完全一致,ROUGE-L分数为0.86。此外,当生成目标在话语中途改变时,模型还能生成自修复;自动评估显示,模型在85%的情况下能正确生成自修复。小规模的人工评估也证实了生成的自修复自然且合乎语法。
Abstract
In conversation, speakers produce language incrementally, word by word, while continuously monitoring the appropriateness of their own contribution in the dynamically unfolding context of the conversation; and this often leads them to repair their own utterance on the fly. This real-time language processing capacity is furthermore crucial to the development of fluent and natural conversational AI. In this paper, we use a previously learned Dynamic Syntax grammar and the CHILDES corpus to develop, train and evaluate a probabilistic model for incremental generation where input to the model is a purely semantic generation goal concept in Type Theory with Records (TTR). We show that the model's output exactly matches the gold candidate in 78% of cases with a ROUGE-l score of 0.86. We further do a zero-shot evaluation of the ability of the same model to generate self-repairs when the generation goal changes mid-utterance. Automatic evaluation shows that the model can generate self-repairs correctly in 85% of cases. A small human evaluation confirms the naturalness and grammaticality of the generated self-repairs. Overall, these results further highlight the generalisation power of grammar-based models and lay the foundations for more controllable, and naturally interactive conversational AI systems.
摘要
在对话中,说话人逐词增量地生成语言,同时在动态展开的对话语境中持续监测自身话语是否恰当,并经常即时修复自己的话语。这种实时语言处理能力对于开发流畅自然的对话式AI至关重要。在这篇论文中,我们使用先前学习的动态语法(Dynamic Syntax)文法和CHILDES语料库来开发、训练和评估一个增量生成的概率模型,其输入是带记录的类型理论(TTR)中的纯语义生成目标概念。我们展示了该模型的输出在78%的情况下与金标准候选完全一致,ROUGE-L分数为0.86。我们进一步对同一模型在生成目标于话语中途改变时生成自修复的能力进行了零样本评估。自动评估显示,模型在85%的情况下能够正确生成自修复。一项小规模的人工评估也证实了生成的自修复自然且合乎语法。总的来说,这些结果进一步凸显了基于文法的模型的泛化能力,并为更可控、更自然交互的对话式AI系统奠定了基础。
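The ROUGE-L score quoted above is the longest-common-subsequence F-measure between the generated utterance and the gold candidate. A minimal reference implementation (assuming whitespace tokenization and no stemming, which may differ from the paper's exact evaluation setup):

```python
def lcs_length(a, b):
    # classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS precision and recall over words."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(cand), lcs / len(ref)
    return 2 * p * r / (p + r)
```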
SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
results: 作者的方法在多语言多模态相似性搜索任务上显著超越了现有的句子嵌入方法(如LASER3和LabSE)。此外,作者的语音编码器在相似性搜索任务上也表现出色,并且可以实现零样本语言与模态组合的语音翻译。
Abstract
We introduce SONAR, a new multilingual and multimodal fixed-size sentence embedding space. Our single text encoder, covering 200 languages, substantially outperforms existing sentence embeddings such as LASER3 and LabSE on the xsim and xsim++ multilingual similarity search tasks. Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. Our encoders outperform existing speech encoders on similarity search tasks. We also provide a text decoder for 200 languages, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations. Our text-to-text results are competitive compared to the state-of-the-art NLLB~1B model, despite the fixed-size bottleneck representation. Our zero-shot speech-to-text translation results compare favorably with strong supervised baselines such as Whisper.
摘要
我们介绍SONAR,一个新的多语言、多模态的固定大小句子嵌入空间。我们的单一文本编码器覆盖200种语言,在xsim和xsim++多语言相似性搜索任务上大幅超越了现有的句子嵌入方法(如LASER3和LabSE)。通过在语音转写数据上以教师-学生方式训练的语言特定语音编码器,语音片段可以被嵌入到同一个SONAR嵌入空间中。我们的编码器在相似性搜索任务上优于现有的语音编码器。我们还提供了覆盖200种语言的文本解码器,使我们能够进行文本到文本和语音到文本的机器翻译,包括零样本的语言与模态组合。尽管存在固定大小的瓶颈表示,我们的文本到文本结果与最先进的NLLB 1B模型相比仍具有竞争力。我们的零样本语音到文本翻译结果也可与Whisper等强有监督基线相媲美。
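The xsim-style evaluation mentioned above reduces to cosine nearest-neighbour retrieval in the shared fixed-size embedding space: each source sentence should retrieve its gold-aligned translation. A toy stand-in (not the official xsim implementation, which adds margin scoring and other refinements):

```python
import numpy as np

def xsim_error_rate(src_emb, tgt_emb):
    """Fraction of source sentences whose cosine-nearest target row is not
    the gold-aligned one (row i of src aligns with row i of tgt)."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = np.argmax(src @ tgt.T, axis=1)
    return float(np.mean(nearest != np.arange(len(src_emb))))
```

A lower error rate means the two languages (or modalities) are better aligned in the shared space.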
Extracting Relational Triples Based on Graph Recursive Neural Network via Dynamic Feedback Forest Algorithm
results: 提出了一种新的方法,可以在模型训练时通过推理操作连接各个子任务的表示,实现子任务的集成。
Abstract
Extracting relational triples (subject, predicate, object) from text enables the transformation of unstructured text data into structured knowledge. The named entity recognition (NER) and the relation extraction (RE) are two foundational subtasks in this knowledge generation pipeline. The integration of subtasks poses a considerable challenge due to their disparate nature. This paper presents a novel approach that converts the triple extraction task into a graph labeling problem, capitalizing on the structural information of dependency parsing and graph recursive neural networks (GRNNs). To integrate subtasks, this paper proposes a dynamic feedback forest algorithm that connects the representations of subtasks by inference operations during model training. Experimental results demonstrate the effectiveness of the proposed method.
摘要
从文本中提取关系三元组(主语、谓语、宾语),能够将非结构化文本数据转化为结构化知识。命名实体识别(NER)和关系提取(RE)是这条知识生成管道中的两个基础子任务。由于二者性质迥异,子任务的集成带来了相当大的挑战。本文提出了一种新方法,将三元组提取任务转化为图标注问题,充分利用依存句法分析的结构信息和图递归神经网络(GRNN)。为了集成子任务,本文提出了一种动态反馈森林算法,在模型训练过程中通过推理操作连接各子任务的表示。实验结果表明,所提出的方法是有效的。
Convoifilter: A case study of doing cocktail party speech recognition
results: 通过联合微调,模型可以将词错误率(WER)从单独调整时的26.4%进一步降低到14.5%。
Abstract
This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. The model utilizes a single-channel speech enhancement module that isolates the speaker's voice from background noise, along with an ASR module. Through this approach, the model is able to decrease the word error rate (WER) of ASR from 80% to 26.4%. Typically, these two components are adjusted independently due to variations in data requirements. However, speech enhancement can create anomalies that decrease ASR efficiency. By implementing a joint fine-tuning strategy, the model can reduce the WER from 26.4% in separate tuning to 14.5% in joint tuning.
摘要
这份研究报告介绍了一种端到端模型,用于改进特定说话人在嘈杂拥挤环境下的自动语音识别(ASR)。该模型使用单通道语音增强模块将说话人的声音从背景噪声中分离出来,并配有一个ASR模块。通过这种方法,模型可以将ASR的词错误率(WER)从80%降至26.4%。通常,由于数据需求不同,这两个组件是独立调整的;然而,语音增强可能产生降低ASR效率的异常。通过实施联合微调策略,模型可以将WER从单独调整时的26.4%进一步降低到14.5%。
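The WER figures above (80% → 26.4% → 14.5%) are word-level edit distance divided by reference length. A compact reference implementation using a single-row dynamic program:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance divided by the
    number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds the diagonal cell from the previous row
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)] / len(ref)
```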
M3PS: End-to-End Multi-Grained Multi-Modal Attribute-Aware Product Summarization in E-commerce
results: 实验结果表明,该方法在一个大规模的真实中文电商数据集上,就多项摘要评价指标而言显著优于现有的产品摘要方法。
Abstract
Given the long textual product information and the product image, Multi-Modal Product Summarization (MMPS) aims to attract customers' interest and increase their desire to purchase by highlighting product characteristics with a short textual summary. Existing MMPS methods have achieved promising performance. Nevertheless, there still exist several problems: 1) lack end-to-end product summarization, 2) lack multi-grained multi-modal modeling, and 3) lack multi-modal attribute modeling. To address these issues, we propose an end-to-end multi-grained multi-modal attribute-aware product summarization method (M3PS) for generating high-quality product summaries in e-commerce. M3PS jointly models product attributes and generates product summaries. Meanwhile, we design several multi-grained multi-modal tasks to better guide the multi-modal learning of M3PS. Furthermore, we model product attributes based on both text and image modalities so that multi-modal product characteristics can be manifested in the generated summaries. Extensive experiments on a real large-scale Chinese e-commerce dataset demonstrate that our model outperforms state-of-the-art product summarization methods w.r.t. several summarization metrics.
摘要
给定较长的产品文本信息和产品图像,多模态产品摘要(MMPS)旨在通过简短的文本摘要突出产品特点,从而吸引顾客兴趣、提高购买意愿。现有的MMPS方法已取得了可喜的成果。然而,仍存在一些问题:1)缺乏端到端的产品摘要;2)缺乏多粒度的多模态建模;3)缺乏多模态属性建模。为了解决这些问题,我们提出了一种端到端的多粒度多模态属性感知产品摘要方法(M3PS),用于在电商场景中生成高质量的产品摘要。M3PS在联合建模产品属性的同时生成产品摘要。此外,我们设计了多个多粒度多模态任务,以更好地引导M3PS的多模态学习。同时,我们基于文本和图像两种模态对产品属性建模,使生成的摘要能够体现多模态的产品特点。在一个大规模真实中文电商数据集上的大量实验表明,我们的模型在多项摘要指标上优于最先进的产品摘要方法。
LEAP: Efficient and Automated Test Method for NLP Software
paper_authors: Mingxuan Xiao, Yan Xiao, Hai Dong, Shunhui Ji, Pengcheng Zhang
for: 通过自动生成对抗测试用例,提高基于DNN的NLP软件的鲁棒性。
methods: 使用基于Levy飞行的自适应粒子群优化算法并结合文本特征;采用Levy飞行进行种群初始化以增加测试用例的多样性,设计惯性权重自适应更新算子以提高高维文本示例全局优化的效率,并采用基于贪心策略的变异算子以缩短搜索时间。
results: 对NLP软件进行了一系列测试,证明了LEAP能够以较高的成功率生成对抗测试用例,同时具有较高的效率和可迁移性。
Abstract
The widespread adoption of DNNs in NLP software has highlighted the need for robustness. Researchers proposed various automatic testing techniques for adversarial test cases. However, existing methods suffer from two limitations: weak error-discovering capabilities, with success rates ranging from 0% to 24.6% for BERT-based NLP software, and time inefficiency, taking 177.8s to 205.28s per test case, making them challenging for time-constrained scenarios. To address these issues, this paper proposes LEAP, an automated test method that uses LEvy flight-based Adaptive Particle swarm optimization integrated with textual features to generate adversarial test cases. Specifically, we adopt Levy flight for population initialization to increase the diversity of generated test cases. We also design an inertial weight adaptive update operator to improve the efficiency of LEAP's global optimization of high-dimensional text examples and a mutation operator based on the greedy strategy to reduce the search time. We conducted a series of experiments to validate LEAP's ability to test NLP software and found that the average success rate of LEAP in generating adversarial test cases is 79.1%, which is 6.1% higher than the next best approach (PSOattack). While ensuring high success rates, LEAP significantly reduces time overhead by up to 147.6s compared to other heuristic-based methods. Additionally, the experimental results demonstrate that LEAP can generate more transferable test cases and significantly enhance the robustness of DNN-based systems.
摘要
随着深度神经网络(DNN)在自然语言处理(NLP)软件中的广泛应用,鲁棒性问题日益受到关注。研究人员提出了多种针对对抗测试用例的自动测试技术,但现有方法存在两个局限:一是错误发现能力较弱,对基于BERT的NLP软件的成功率仅在0%到24.6%之间;二是时间效率低,每个测试用例需要177.8秒至205.28秒,难以用于时间受限的场景。为了解决这些问题,本文提出了LEAP,一种自动测试方法,利用基于Levy飞行的自适应粒子群优化并结合文本特征来生成对抗测试用例。具体来说,我们在种群初始化中采用Levy飞行,以增加生成测试用例的多样性;设计了惯性权重自适应更新算子,以提高LEAP对高维文本示例全局优化的效率;并设计了基于贪心策略的变异算子,以缩短搜索时间。我们通过一系列实验验证了LEAP测试NLP软件的能力:LEAP生成对抗测试用例的平均成功率为79.1%,比次优方法(PSOattack)高6.1%。在保证高成功率的同时,LEAP与其他基于启发式的方法相比,最多可减少147.6秒的时间开销。此外,实验结果还表明,LEAP能生成更具可迁移性的测试用例,并显著增强基于DNN的系统的鲁棒性。
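Levy flight, used above for population initialization, draws heavy-tailed random steps so that most candidates stay near the seed while a few jump far away, increasing diversity. A sketch using Mantegna's algorithm; `beta=1.5` is a common choice in the swarm-optimization literature, not necessarily LEAP's setting:

```python
import math
import random

def levy_step(beta=1.5):
    """One heavy-tailed Levy-flight step via Mantegna's algorithm."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma_u = (num / den) ** (1 / beta)
    u = random.gauss(0, sigma_u)
    v = random.gauss(0, 1)
    return u / abs(v) ** (1 / beta)

def init_population(base, n, scale=1.0):
    """Scatter n candidate positions around a base vector with Levy steps
    (in LEAP the positions would index perturbations of a text example)."""
    return [[x + scale * levy_step() for x in base] for _ in range(n)]
```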
HopPG: Self-Iterative Program Generation for Multi-Hop Question Answering over Heterogeneous Knowledge
results: 实验结果表明,HopPG在MMQA-T^2上表现出色,特别是在多跳问题上超过了现有的基于语义解析的基线。
Abstract
The semantic parsing-based method is an important research branch of knowledge-based question answering. It usually generates executable programs based on the question and then runs them to reason over a knowledge base and obtain answers. Benefiting from this inherent mechanism, it has advantages in performance and interpretability. However, traditional semantic parsing methods usually generate a complete program before executing it, which struggles with multi-hop question answering over heterogeneous knowledge. Firstly, a complete multi-hop program relies on multiple heterogeneous supporting facts, and it is difficult for models to receive these facts simultaneously. Secondly, these methods ignore the interaction information between the previous-hop execution result and the current-hop program generation. To alleviate these challenges, we propose a self-iterative framework for multi-hop program generation (HopPG) over heterogeneous knowledge, which leverages the previous-hop execution results to retrieve supporting facts and generate subsequent programs iteratively. We evaluate our model on MMQA-T^2. The experimental results show that HopPG outperforms existing semantic-parsing-based baselines, especially on the multi-hop questions.
摘要
基于语义解析的方法是基于知识的问答的重要研究分支。它通常依据问题生成可执行程序,然后在知识库上执行程序进行推理以获得答案。得益于这一内在机制,它在性能和可解释性方面具有优势。然而,传统的语义解析方法通常在执行之前生成完整的程序,这在异构知识上的多跳问答中面临困难。首先,完整的多跳程序依赖多个异构的支持事实,模型很难同时获取这些事实。其次,这些方法忽略了上一跳执行结果与当前跳程序生成之间的交互信息。为了解决这些挑战,我们提出了一个面向异构知识的多跳程序生成自迭代框架(HopPG),它利用上一跳的执行结果来检索支持事实,并迭代地生成后续程序。我们在MMQA-T^2上评估了我们的模型。实验结果显示,HopPG优于现有的基于语义解析的基线,尤其是在多跳问题上。
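The self-iterative mechanism can be pictured as feeding each hop's execution result into the next hop's program. A toy sketch over a triple knowledge base; the real HopPG generates each hop's program with a trained model over heterogeneous facts, so the function names and one-hop program format here are hypothetical simplifications:

```python
def execute(program, kb):
    """Execute a one-hop (subject, predicate) program against a triple KB."""
    subj, pred = program
    for s, p, o in kb:
        if s == subj and p == pred:
            return o
    return None

def hoppg_answer(start, hops, kb):
    """Iteratively run one-hop programs, conditioning each hop's program on
    the previous hop's execution result (toy stand-in for HopPG's loop)."""
    entity = start
    for predicate in hops:
        entity = execute((entity, predicate), kb)
        if entity is None:
            break
    return entity
```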
ViCo: Engaging Video Comment Generation with Human Preference Rewards
results: 实验结果表明,使用本研究提出的方法可以生成高质量的视频评论,特别是在考虑参与度时。
Abstract
Engaging video comments play an important role in video social media, as they are the carrier of feelings, thoughts, or humor of the audience. Preliminary works have made initial exploration for video comment generation by adopting caption-style encoder-decoder models. However, comment generation presents some unique challenges distinct from caption generation, which makes these methods somewhat less effective at generating engaging comments. In contrast to the objective and descriptive nature of captions, comments tend to be inherently subjective, making it hard to quantify and evaluate the engagement of comments. Furthermore, the scarcity of truly engaging comments brings difficulty to collecting enough high-quality training examples. In this paper, we propose ViCo with three novel designs to tackle the above challenges for generating engaging Video Comments. Firstly, to quantify the engagement of comments, we utilize the number of "likes" each comment receives as a proxy of human preference after an appropriate debiasing procedure. Secondly, to automatically evaluate the engagement of comments, we train a reward model to align its judgment to the above proxy. Our user studies indicate that this reward model effectively aligns with human judgments. Lastly, to alleviate the scarcity of high-quality comments, an initial generator is trained on readily available but noisy data to generate comments. Then the reward model is employed to offer feedback on the generated comments, thus optimizing the initial generator. To facilitate the research of video commenting, we collect a large video comment-dataset (ViCo-20k) with rich metadata from a popular video website. Experiments on ViCo-20k show that the comments generated by our ViCo model exhibit the best performance in terms of both quantitative and qualitative results, particularly when engagement is considered.
摘要
吸引人的视频评论在视频社交媒体中扮演着重要角色,因为它们承载着观众的情感、想法或幽默。已有的初步工作采用字幕式的编码器-解码器模型,对视频评论生成进行了初步探索。然而,评论生成存在一些不同于字幕生成的独特挑战,使得这些方法在生成有吸引力的评论时效果欠佳。与字幕的客观描述性不同,评论本质上往往是主观的,这使得评论的参与度难以量化和评估。此外,真正有吸引力的评论十分稀缺,难以收集足够的高质量训练样本。在这篇论文中,我们提出了ViCo,通过三项新设计来应对上述挑战,以生成有吸引力的视频评论。首先,为了量化评论的参与度,我们在适当的去偏处理之后,将每条评论获得的"点赞"数作为人类偏好的代理指标。其次,为了自动评估评论的参与度,我们训练了一个奖励模型,使其判断与上述代理指标保持一致;我们的用户研究表明,该奖励模型与人类判断有良好的一致性。最后,为缓解高质量评论的稀缺,我们先在易于获得但含噪声的数据上训练一个初始生成器来生成评论,再利用奖励模型对生成的评论提供反馈,从而优化初始生成器。为促进视频评论研究,我们从一个流行的视频网站收集了一个带有丰富元数据的大规模视频评论数据集(ViCo-20k)。在ViCo-20k上的实验显示,我们的ViCo模型生成的评论在定量和定性结果上均表现最佳,尤其是在考虑参与度时。
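Raw like counts are biased by how popular the video itself is, which is why the abstract mentions a debiasing procedure before using likes as a preference proxy. One simple debiasing idea, shown purely for illustration (the paper's actual procedure may differ), is to compare each comment's log-likes against the mean within its own video:

```python
import math
from collections import defaultdict

def debiased_engagement(comments):
    """Per-comment engagement score: log(1 + likes) centered on the mean
    log-likes of the same video, so video popularity cancels out.
    `comments` is a list of dicts with 'video' and 'likes' keys (toy schema)."""
    by_video = defaultdict(list)
    for c in comments:
        by_video[c["video"]].append(math.log1p(c["likes"]))
    means = {v: sum(xs) / len(xs) for v, xs in by_video.items()}
    return [math.log1p(c["likes"]) - means[c["video"]] for c in comments]
```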
LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning (Practical Experience Report)
paper_authors: Junyi Lu, Lei Yu, Xiaojia Li, Li Yang, Chun Zuo
for: automating code review activities
methods: utilizes parameter-efficient fine-tuning (PEFT) methods and LLaMA, a popular large language model
results: equals the performance of existing code-review-focused models with a small model size and limited tuning epochs
Abstract
The automation of code review activities, a long-standing pursuit in software engineering, has been primarily addressed by numerous domain-specific pre-trained models. Despite their success, these models frequently demand extensive resources for pre-training from scratch. In contrast, Large Language Models (LLMs) provide an intriguing alternative, given their remarkable capabilities when supplemented with domain-specific knowledge. However, their potential for automating code review tasks remains largely unexplored. In response to this research gap, we present LLaMA-Reviewer, an innovative framework that leverages the capabilities of LLaMA, a popular LLM, in the realm of code review. Mindful of resource constraints, this framework employs parameter-efficient fine-tuning (PEFT) methods, delivering high performance while using less than 1% of trainable parameters. An extensive evaluation of LLaMA-Reviewer is conducted on two diverse, publicly available datasets. Notably, even with the smallest LLaMA base model consisting of 6.7B parameters and a limited number of tuning epochs, LLaMA-Reviewer equals the performance of existing code-review-focused models. The ablation experiments provide insights into the influence of various fine-tuning process components, including input representation, instruction tuning, and different PEFT methods. To foster continuous progress in this field, the code and all PEFT-weight plugins have been made open-source.
摘要
代码审查活动的自动化是软件工程领域的一项长期追求,此前主要由众多面向特定领域的预训练模型来解决。尽管这些模型取得了成功,但它们往往需要大量资源从头开始预训练。相比之下,大语言模型(LLM)在补充领域知识后展现出卓越的能力,提供了一个引人注目的替代方案;然而,它们在自动化代码审查任务上的潜力在很大程度上尚未被探索。针对这一研究空白,我们提出了LLaMA-Reviewer,一个在代码审查领域利用流行LLM——LLaMA的能力的创新框架。考虑到资源限制,该框架采用参数高效微调(PEFT)方法,在仅使用不到1%可训练参数的情况下实现高性能。我们在两个多样化的公开数据集上对LLaMA-Reviewer进行了广泛评估。值得注意的是,即使使用仅含67亿参数的最小LLaMA基础模型和有限的微调轮数,LLaMA-Reviewer也能与现有的专注于代码审查的模型性能持平。消融实验揭示了各微调环节(包括输入表示、指令微调以及不同PEFT方法)的影响。为了推动该领域的持续进步,代码和所有PEFT权重插件均已开源。
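The "less than 1% of trainable parameters" figure is characteristic of LoRA-style PEFT: every frozen weight matrix receives only a small low-rank adapter. A back-of-the-envelope check; the shapes and rank below are illustrative, not LLaMA's exact configuration or the paper's chosen PEFT method:

```python
def lora_param_fraction(weight_shapes, rank):
    """Fraction of all parameters that are trainable when each frozen
    (d_out, d_in) weight matrix gets a rank-`rank` LoRA adapter, which
    adds rank * (d_in + d_out) trainable parameters."""
    frozen = sum(d_out * d_in for d_out, d_in in weight_shapes)
    adapters = sum(rank * (d_in + d_out) for d_out, d_in in weight_shapes)
    return adapters / (frozen + adapters)
```

For example, 32 square 4096-dimensional attention matrices with rank-8 adapters stay well under the 1% mark.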
NLP-based detection of systematic anomalies among the narratives of consumer complaints
results: 研究以Consumer Financial Protection Bureau的消费者投诉数据库中的投诉文本为例,展示了整个检测流程。
Abstract
We develop an NLP-based procedure for detecting systematic nonmeritorious consumer complaints, simply called systematic anomalies, among complaint narratives. While classification algorithms are used to detect pronounced anomalies, in the case of smaller and frequent systematic anomalies, the algorithms may falter due to a variety of reasons, including technical ones as well as natural limitations of human analysts. Therefore, as the next step after classification, we convert the complaint narratives into quantitative data, which are then analyzed using an algorithm for detecting systematic anomalies. We illustrate the entire procedure using complaint narratives from the Consumer Complaint Database of the Consumer Financial Protection Bureau.
摘要
我们开发了一种基于自然语言处理(NLP)技术的系统性异常检测程序,用于检测消费者投诉文本中的系统性异常。尽管分类算法可以检测明显的异常,但在小型和频繁的系统性异常情况下,算法可能会失败,这可能是技术上的限制以及人类分析员的自然限制。因此,我们将投诉文本转换成量化数据,然后使用一种检测系统性异常的算法进行分析。我们使用美国消费者金融保护署的消费者投诉数据库中的投诉文本进行示例。
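Once the complaint narratives have been converted into quantitative data, systematic anomalies can be flagged, for instance, as values that deviate strongly from the mean. A minimal z-score stand-in (the paper's actual detection algorithm is more involved; this only illustrates the final flagging step):

```python
from statistics import mean, stdev

def flag_systematic_anomalies(counts, threshold=3.0):
    """Return indices whose value deviates from the mean by more than
    `threshold` sample standard deviations."""
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]
```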
Towards Objective Evaluation of Socially-Situated Conversational Robots: Assessing Human-Likeness through Multimodal User Behaviors
results: 研究发现,通过分析用户行为,可以间接评估机器人的类人程度,并且这种方法可以增强客观性和可重复性。
Abstract
This paper tackles the challenging task of evaluating socially situated conversational robots and presents a novel objective evaluation approach that relies on multimodal user behaviors. In this study, our main focus is on assessing the human-likeness of the robot as the primary evaluation metric. While previous research often relied on subjective evaluations from users, our approach aims to evaluate the robot's human-likeness based on observable user behaviors indirectly, thus enhancing objectivity and reproducibility. To begin, we created an annotated dataset of human-likeness scores, utilizing user behaviors found in an attentive listening dialogue corpus. We then conducted an analysis to determine the correlation between multimodal user behaviors and human-likeness scores, demonstrating the feasibility of our proposed behavior-based evaluation method.
摘要
本文致力于解决评估处于社交情境中的对话机器人这一具有挑战性的任务,提出了一种依赖多模态用户行为的新型客观评估方法。在这项研究中,我们主要关注将机器人的类人程度作为首要评估指标。以往的研究往往依赖用户的主观评价,而我们的方法旨在通过可观察的用户行为间接评估机器人的类人程度,从而增强客观性和可重复性。首先,我们利用专注倾听对话语料库中的用户行为,构建了一个带类人程度评分标注的数据集。随后,我们分析了多模态用户行为与类人程度评分之间的相关性,证明了所提出的基于行为的评估方法的可行性。
Using language models in the implicit automated assessment of mathematical short answer items
results: 研究表明,这种基于值识别管道的方法比传统的基于评分标准(rubric)的评分更准确、信息量更大,可以为学生提供更有针对性的反馈,帮助学生加深对数学的理解。
Abstract
We propose a new way to assess certain short constructed responses to mathematics items. Our approach uses a pipeline that identifies the key values specified by the student in their response. This allows us to determine the correctness of the response, as well as identify any misconceptions. The information from the value identification pipeline can then be used to provide feedback to the teacher and student. The value identification pipeline consists of two fine-tuned language models. The first model determines if a value is implicit in the student response. The second model identifies where in the response the key value is specified. We consider both a generic model that can be used for any prompt and value, as well as models that are specific to each prompt and value. The value identification pipeline is a more accurate and informative way to assess short constructed responses than traditional rubric-based scoring. It can be used to provide more targeted feedback to students, which can help them improve their understanding of mathematics.
摘要
我们提出了一种评估数学题目简短建构式回答的新方法。我们的方法使用一个管道来识别学生在回答中给出的关键值,从而判断回答的正确性,并识别可能存在的误解。值识别管道产生的信息可用于向教师和学生提供反馈。该管道由两个微调过的语言模型组成:第一个模型判断学生回答中是否隐含某个关键值;第二个模型确定该关键值在回答中的位置。我们既考虑了可用于任何题目和值的通用模型,也考虑了针对每个题目和值的专用模型。与传统的基于评分标准的评分相比,值识别管道是一种更准确、信息量更大的简短建构式回答评估方式,可以为学生提供更有针对性的反馈,帮助他们加深对数学的理解。
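The two-stage pipeline above (detect whether a key value is present, then locate it) can be mocked end to end with simple stand-ins. The paper fine-tunes language models for both stages, so the regex matchers below are only placeholders that show the interface between the stages:

```python
import re

def value_present(response, key_value):
    """Stage 1 stand-in: does the student's response contain the key value?"""
    return re.search(rf"\b{re.escape(key_value)}\b", response) is not None

def locate_value(response, key_value):
    """Stage 2 stand-in: character span where the key value is specified,
    or None if it does not appear."""
    m = re.search(rf"\b{re.escape(key_value)}\b", response)
    return m.span() if m else None
```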
LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles
results: 研究发现,几乎所有大语言模型在交互中都难以运用横向思维;即使是最先进的GPT-4也只在一定程度上展现出优势,与人类相比仍有明显差距。
Abstract
With the continuous evolution and refinement of LLMs, they are endowed with impressive logical reasoning or vertical thinking capabilities. But can they think out of the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses the model's lateral thinking within an interactive framework. In our benchmark, we challenge LLMs with two aspects: the quality of questions posed by the model and the model's capability to integrate information for problem-solving. We find that nearly all LLMs struggle with employing lateral thinking during interactions. For example, even the most advanced model, GPT-4, exhibits an advantage to some extent, yet still maintains a noticeable gap when compared to humans. This evaluation benchmark provides LLMs with a highly challenging and distinctive task that is crucial to an effective AI assistant.
摘要
随着大语言模型(LLM)的不断演进和完善,它们被赋予了令人印象深刻的逻辑推理(纵向思维)能力。但它们能跳出框架思考吗?它们是否具备熟练的横向思维能力?参照横向思维谜题(Lateral Thinking Puzzles)的设置,我们提出了一个新的评估基准LatEval,在交互式框架内评估模型的横向思维能力。在该基准中,我们从两个方面对LLM提出挑战:模型所提问题的质量,以及模型整合信息以解决问题的能力。我们发现,几乎所有LLM在交互中都难以运用横向思维。例如,即使是最先进的GPT-4也只在一定程度上展现出优势,与人类相比仍有明显差距。该评估基准为LLM提供了一项极具挑战性且独特的任务,而这正是构建有效AI助手所必需的。
AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors in Agents
methods: 该框架借鉴人类群体动力学,使多智能体群组能够协作完成任务,并动态调整其组成,形成"整体大于部分之和"的系统。
results: 实验结果表明,该框架可以有效部署优于单个智能体的多智能体群组。此外,论文还探讨了群组协作完成任务过程中个体智能体之间涌现的社会行为。
Abstract
Autonomous agents empowered by Large Language Models (LLMs) have undergone significant improvements, enabling them to generalize across a broad spectrum of tasks. However, in real-world scenarios, cooperation among individuals is often required to enhance the efficiency and effectiveness of task accomplishment. Hence, inspired by human group dynamics, we propose a multi-agent framework \framework that can collaboratively and dynamically adjust its composition as a greater-than-the-sum-of-its-parts system. Our experiments demonstrate that \framework framework can effectively deploy multi-agent groups that outperform a single agent. Furthermore, we delve into the emergence of social behaviors among individual agents within a group during collaborative task accomplishment. In view of these behaviors, we discuss some possible strategies to leverage positive ones and mitigate negative ones for improving the collaborative potential of multi-agent groups. Our codes for \framework will soon be released at \url{https://github.com/OpenBMB/AgentVerse}.
摘要
由大语言模型(LLM)赋能的自主智能体取得了显著进步,使其能够泛化到广泛的任务上。然而,在现实场景中,通常需要个体之间的合作来提升任务完成的效率和效果。因此,受人类群体动力学的启发,我们提出了一个多智能体框架\framework,它可以协作并动态地调整自身组成,形成一个"整体大于部分之和"的系统。我们的实验表明,\framework框架能够有效部署优于单个智能体的多智能体群组。此外,我们深入研究了群组协作完成任务过程中个体智能体之间涌现的社会行为;针对这些行为,我们讨论了一些可能的策略,用于利用积极行为并抑制消极行为,以提升多智能体群组的协作潜力。\framework的代码即将在 \url{https://github.com/OpenBMB/AgentVerse} 发布。