results: Experiments show that this watermarking method can embed a watermark without changing the distribution over text, and that the watermark remains detectable under various attacks. Specifically, for the OPT-1.3B and LLaMA-7B models, the watermark can still be reliably detected (p <= 0.01) from 35 tokens even after 40%-50% of the tokens are corrupted by substitution, insertion, and deletion attacks. For the Alpaca-7B model, detection is harder because the responses have lower entropy, but the watermark is still detectable in about 25% of responses (p <= 0.01).
Abstract
We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers -- which we compute using a randomized watermark key -- to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models -- OPT-1.3B, LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text ($p \leq 0.01$) from $35$ tokens even after corrupting between $40$-$50$\% of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around $25\%$ of the responses -- whose median length is around $100$ tokens -- are detectable with $p \leq 0.01$, and the watermark is also less robust to certain automated paraphrasing attacks we implement.
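To make the sampling scheme concrete, below is a minimal sketch of exponential minimum sampling driven by a keyed pseudorandom sequence. The function names are ours, not the authors', and the alignment step that matches observed tokens back to the random number sequence — which is what gives the detector its robustness to insertions and deletions — is omitted here.

```python
import hashlib

import numpy as np

def keyed_uniforms(key: bytes, step: int, vocab_size: int) -> np.ndarray:
    """Derive a pseudorandom uniform vector over the vocabulary from the watermark key and step index."""
    seed = int.from_bytes(hashlib.sha256(key + step.to_bytes(4, "big")).digest()[:8], "big")
    return np.random.default_rng(seed).random(vocab_size)

def exponential_minimum_sample(probs: np.ndarray, r: np.ndarray) -> int:
    """Pick argmin_i -log(r_i) / p_i; marginally over the key, this is an exact sample from `probs`."""
    scores = -np.log(r) / np.maximum(probs, 1e-12)
    return int(np.argmin(scores))

def detection_statistic(token_ids, key: bytes, vocab_size: int) -> float:
    """Watermarked tokens tend to sit at coordinates where r_i is close to 1, inflating this sum."""
    total = 0.0
    for step, tok in enumerate(token_ids):
        r = keyed_uniforms(key, step, vocab_size)
        total += -np.log(1.0 - r[tok])
    return total  # compare against the null distribution of unwatermarked text to obtain a p-value
```

During generation, `probs` is the language model's next-token distribution at each step; because the selected token is an exact sample from it, the distribution over text is unchanged, as the abstract states.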
When to generate hedges in peer-tutoring interactions
results: The study finds that embedding layers, which capture the semantic information of the previous turns, significantly improve model performance. It also provides Shapley-value explanations of how much different features (such as interpersonal rapport and nonverbal behaviours) contribute to hedge prediction. The eye gaze of both the tutor and the tutee is strongly related to hedge prediction, an observation validated by a follow-up ablation study.
Abstract
This paper explores the application of machine learning techniques to predict where hedging occurs in peer-tutoring interactions. The study uses a naturalistic face-to-face dataset annotated for natural language turns, conversational strategies, tutoring strategies, and nonverbal behaviours. These elements are processed into a vector representation of the previous turns, which serves as input to several machine learning models. Results show that embedding layers, that capture the semantic information of the previous turns, significantly improves the model's performance. Additionally, the study provides insights into the importance of various features, such as interpersonal rapport and nonverbal behaviours, in predicting hedges by using Shapley values for feature explanation. We discover that the eye gaze of both the tutor and the tutee has a significant impact on hedge prediction. We further validate this observation through a follow-up ablation study.
All-for-One and One-For-All: Deep learning-based feature fusion for Synthetic Speech Detection
paper_authors: Daniele Mari, Davide Salvi, Paolo Bestagini, Simone Milani
for: Preventing speech deepfake frauds and identity theft — the paper addresses synthetic speech detection to counter these threats.
methods: Fuses three different feature sets proposed in the literature to improve synthetic speech detection performance.
results: The fused model achieves better overall performance than state-of-the-art solutions across different scenarios and datasets, with robustness to anti-forensic attacks and good generalization capabilities.
Abstract
Recent advances in deep learning and computer vision have made the synthesis and counterfeiting of multimedia content more accessible than ever, leading to possible threats and dangers from malicious users. In the audio field, we are witnessing the growth of speech deepfake generation techniques, which solicit the development of synthetic speech detection algorithms to counter possible mischievous uses such as frauds or identity thefts. In this paper, we consider three different feature sets proposed in the literature for the synthetic speech detection task and present a model that fuses them, achieving overall better performances with respect to the state-of-the-art solutions. The system was tested on different scenarios and datasets to prove its robustness to anti-forensic attacks and its generalization capabilities.
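As a rough illustration of the late-fusion idea (not the paper's actual architecture; feature names and dimensions are placeholders), each feature set can be projected separately and classified on the concatenation:

```python
import torch
import torch.nn as nn

class FusionDetector(nn.Module):
    """Project three per-utterance feature vectors (placeholder dims) and classify real vs. synthetic."""
    def __init__(self, dims=(256, 128, 64), hidden=256):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims
        )
        self.head = nn.Sequential(
            nn.Linear(hidden * len(dims), hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, feats):
        # feats: one tensor of shape (batch, dim_i) per feature set, in the same order as `dims`.
        fused = torch.cat([branch(f) for branch, f in zip(self.branches, feats)], dim=-1)
        return self.head(fused).squeeze(-1)  # one logit per utterance

model = FusionDetector()
logits = model([torch.randn(4, 256), torch.randn(4, 128), torch.randn(4, 64)])
print(logits.shape)  # torch.Size([4])
```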
‘What are you referring to?’ Evaluating the Ability of Multi-Modal Dialogue Models to Process Clarificational Exchanges
paper_authors: Javier Chiyah-Garcia, Alessandro Suglia, Arash Eshghi, Helen Hastie
for: Referential ambiguity in dialogue, i.e., cases where a referring expression does not uniquely identify the intended referent, and the repair mechanisms speakers use to resolve it.
methods: Uses the SIMMC 2.0 dataset to evaluate how well different state-of-the-art model architectures process Clarificational Exchanges (CEs), including CEs grounded in the dialogue history.
results: Language-based models can encode multimodal semantic information and process some CEs, while multimodal models can use additional learning objectives to obtain disentangled object representations, which prove crucial for handling multimodal referential ambiguity.
Abstract
Referential ambiguities arise in dialogue when a referring expression does not uniquely identify the intended referent for the addressee. Addressees usually detect such ambiguities immediately and work with the speaker to repair it using meta-communicative, Clarificational Exchanges (CE): a Clarification Request (CR) and a response. Here, we argue that the ability to generate and respond to CRs imposes specific constraints on the architecture and objective functions of multi-modal, visually grounded dialogue models. We use the SIMMC 2.0 dataset to evaluate the ability of different state-of-the-art model architectures to process CEs, with a metric that probes the contextual updates that arise from them in the model. We find that language-based models are able to encode simple multi-modal semantic information and process some CEs, excelling with those related to the dialogue history, whilst multi-modal models can use additional learning objectives to obtain disentangled object representations, which become crucial to handle complex referential ambiguities across modalities overall.
Oracle Computability and Turing Reducibility in the Calculus of Inductive Constructions
results: Turing reducibility forms an upper semilattice, transports decidability, and is strictly more expressive than truth-table reducibility. Moreover, whenever a predicate $p$ and its complement are both semi-decidable relative to an oracle $q$, then $p$ Turing-reduces to $q$.
Abstract
We develop synthetic notions of oracle computability and Turing reducibility in the Calculus of Inductive Constructions (CIC), the constructive type theory underlying the Coq proof assistant. As usual in synthetic approaches, we employ a definition of oracle computations based on meta-level functions rather than object-level models of computation, relying on the fact that in constructive systems such as CIC all definable functions are computable by construction. Such an approach lends itself well to machine-checked proofs, which we carry out in Coq. There is a tension in finding a good synthetic rendering of the higher-order notion of oracle computability. On the one hand, it has to be informative enough to prove central results, ensuring that all notions are faithfully captured. On the other hand, it has to be restricted enough to benefit from axioms for synthetic computability, which usually concern first-order objects. Drawing inspiration from a definition by Andrej Bauer based on continuous functions in the effective topos, we use a notion of sequential continuity to characterise valid oracle computations. As main technical results, we show that Turing reducibility forms an upper semilattice, transports decidability, and is strictly more expressive than truth-table reducibility, and prove that whenever both a predicate $p$ and its complement are semi-decidable relative to an oracle $q$, then $p$ Turing-reduces to $q$.
The Road to Quality is Paved with Good Revisions: A Detailed Evaluation Methodology for Revision Policies in Incremental Sequence Labelling
results: The study profiles the incremental behaviour of the three encoders and finds that it differs across tasks, which can inform better revision policies.
Abstract
Incremental dialogue model components produce a sequence of output prefixes based on incoming input. Mistakes can occur due to local ambiguities or to wrong hypotheses, making the ability to revise past outputs a desirable property that can be governed by a policy. In this work, we formalise and characterise edits and revisions in incremental sequence labelling and propose metrics to evaluate revision policies. We then apply our methodology to profile the incremental behaviour of three Transformer-based encoders in various tasks, paving the road for better revision policies.
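As a toy illustration of the behaviour the evaluation methodology has to quantify, the snippet below counts how often an incremental tagger revises a label it has already emitted, by comparing consecutive output prefixes; the paper's metrics characterise revision policies in more detail than this single count.

```python
def count_revisions(prefix_outputs):
    """Count label changes at positions that were already part of an earlier output prefix."""
    revisions = 0
    for prev, curr in zip(prefix_outputs, prefix_outputs[1:]):
        revisions += sum(p != c for p, c in zip(prev, curr[: len(prev)]))
    return revisions

# The label for the first token is revised from "B-LOC" to "B-ORG" at the third step.
steps = [["B-LOC"], ["B-LOC", "O"], ["B-ORG", "O", "O"]]
print(count_revisions(steps))  # -> 1
```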
The timing bottleneck: Why timing and overlap are mission-critical for conversational user interfaces, speech recognition and dialogue systems
paper_authors: Andreas Liesenfeld, Alianda Lopez, Mark Dingemanse
for: Evaluating how well current commercial speech recognition systems perform in conversational settings.
methods: Five major commercial ASR systems are evaluated on natural conversational data in six languages.
results: Word error rates on conversational data remain very high, and overlap is a key challenge for conversational recognition. The results help assess the current state of conversational ASR and contribute towards building more robust conversational speech technologies.
Abstract
Speech recognition systems are a key intermediary in voice-driven human-computer interaction. Although speech recognition works well for pristine monologic audio, real-life use cases in open-ended interactive settings still present many challenges. We argue that timing is mission-critical for dialogue systems, and evaluate 5 major commercial ASR systems for their conversational and multilingual support. We find that word error rates for natural conversational data in 6 languages remain abysmal, and that overlap remains a key challenge (study 1). This impacts especially the recognition of conversational words (study 2), and in turn has dire consequences for downstream intent recognition (study 3). Our findings help to evaluate the current state of conversational ASR, contribute towards multidimensional error analysis and evaluation, and identify phenomena that need most attention on the way to build robust interactive speech technologies.
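For reference, the word error rate reported in study 1 is the standard edit-distance metric over words; a minimal implementation (not the study's evaluation code) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("so are you coming tonight", "are you coming to night"))  # 0.6
```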
Cross-Modal Concept Learning and Inference for Vision-Language Models
results: The proposed CCLI method delivers substantial gains on downstream tasks such as few-shot learning and domain generalization, improving on current state-of-the-art methods by up to 8.0%.
Abstract
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this whole image matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-model concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method is able to improve the performance upon the current state-of-the-art methods by large margins, for example, by up to 8.0% improvement on few-shot learning and by up to 1.3% for domain generalization.
Trie-NLG: Trie Context Augmentation to Improve Personalized Query Auto-Completion for Short and Unseen Prefixes
results: Evaluated on two large QAC datasets, the method improves MRR by roughly 57% and 14% on average over a popular trie-based lookup and a BART-based baseline, respectively.
Abstract
Query auto-completion (QAC) aims at suggesting plausible completions for a given query prefix. Traditionally, QAC systems have leveraged tries curated from historical query logs to suggest most popular completions. In this context, there are two specific scenarios that are difficult to handle for any QAC system: short prefixes (which are inherently ambiguous) and unseen prefixes. Recently, personalized Natural Language Generation (NLG) models have been proposed to leverage previous session queries as context for addressing these two challenges. However, such NLG models suffer from two drawbacks: (1) some of the previous session queries could be noisy and irrelevant to the user intent for the current prefix, and (2) NLG models cannot directly incorporate historical query popularity. This motivates us to propose a novel NLG model for QAC, Trie-NLG, which jointly leverages popularity signals from trie and personalization signals from previous session queries. We train the Trie-NLG model by augmenting the prefix with rich context comprising of recent session queries and top trie completions. This simple modeling approach overcomes the limitations of trie-based and NLG-based approaches and leads to state-of-the-art performance. We evaluate the Trie-NLG model using two large QAC datasets. On average, our model achieves huge ~57% and ~14% boost in MRR over the popular trie-based lookup and the strong BART-based baseline methods, respectively. We make our code publicly available.
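Below is a minimal sketch of the trie side and of how top completions plus recent session queries might be packed into a single prompt for the NLG model; the prompt layout and separator token are assumptions, not the paper's exact format.

```python
from collections import defaultdict

class QueryTrie:
    """Store historical queries with counts and return the most popular completions for a prefix."""
    def __init__(self):
        self.children = defaultdict(QueryTrie)
        self.count = 0

    def insert(self, query: str):
        node = self
        for ch in query:
            node = node.children[ch]
        node.count += 1

    def top_completions(self, prefix: str, k: int = 3):
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        found = []
        def walk(n, suffix):
            if n.count:
                found.append((prefix + suffix, n.count))
            for ch, child in n.children.items():
                walk(child, suffix + ch)
        walk(node, "")
        return [q for q, _ in sorted(found, key=lambda x: -x[1])[:k]]

trie = QueryTrie()
for q in ["weather today", "weather tomorrow", "weather today radar"]:
    trie.insert(q)

# Hypothetical prompt layout: recent session queries + top trie completions + the current prefix.
session, prefix = ["rain forecast delhi"], "wea"
prompt = " [SEP] ".join(session + trie.top_completions(prefix) + [prefix])
print(prompt)
```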
CFN-ESA: A Cross-Modal Fusion Network with Emotion-Shift Awareness for Dialogue Emotion Recognition
results: Experiments show that CFN-ESA effectively improves ERC performance and clearly outperforms existing state-of-the-art models.
Abstract
Multimodal Emotion Recognition in Conversation (ERC) has garnered growing attention from research communities in various fields. In this paper, we propose a cross-modal fusion network with emotion-shift awareness (CFN-ESA) for ERC. Extant approaches employ each modality equally without distinguishing the amount of emotional information, rendering it hard to adequately extract complementary and associative information from multimodal data. To cope with this problem, in CFN-ESA, textual modalities are treated as the primary source of emotional information, while visual and acoustic modalities are taken as the secondary sources. Besides, most multimodal ERC models ignore emotion-shift information and overfocus on contextual information, leading to the failure of emotion recognition under emotion-shift scenario. We elaborate an emotion-shift module to address this challenge. CFN-ESA mainly consists of the unimodal encoder (RUME), cross-modal encoder (ACME), and emotion-shift module (LESM). RUME is applied to extract conversation-level contextual emotional cues while pulling together the data distributions between modalities; ACME is utilized to perform multimodal interaction centered on textual modality; LESM is used to model emotion shift and capture related information, thereby guide the learning of the main task. Experimental results demonstrate that CFN-ESA can effectively promote performance for ERC and remarkably outperform the state-of-the-art models.
Investigating the Learning Behaviour of In-context Learning: A Comparison with Supervised Learning
paper_authors: Xindi Wang, Yufei Wang, Can Xu, Xiubo Geng, Bowen Zhang, Chongyang Tao, Frank Rudzicz, Robert E. Mercer, Daxin Jiang
for: Investigating the learning behaviour of in-context learning (ICL) and comparing it with supervised learning (SL) under label perturbations.
results: Gold labels have a significant impact on downstream ICL performance, especially for large language models, whereas label imbalance matters little to ICL across model sizes. Compared with SL, ICL is less sensitive to label perturbations, and it gradually attains performance comparable to SL as model size increases.
Abstract
Large language models (LLMs) have shown remarkable capacity for in-context learning (ICL), where learning a new task from just a few training examples is done without being explicitly pre-trained. However, despite the success of LLMs, there has been little understanding of how ICL learns the knowledge from the given prompts. In this paper, to make progress toward understanding the learning behaviour of ICL, we train the same LLMs with the same demonstration examples via ICL and supervised learning (SL), respectively, and investigate their performance under label perturbations (i.e., noisy labels and label imbalance) on a range of classification tasks. First, via extensive experiments, we find that gold labels have significant impacts on the downstream in-context performance, especially for large language models; however, imbalanced labels matter little to ICL across all model sizes. Second, when comparing with SL, we show empirically that ICL is less sensitive to label perturbations than SL, and ICL gradually attains comparable performance to SL as the model size increases.
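As an illustration of the ICL condition under label perturbation, the snippet below assembles a demonstration prompt and flips a fraction of the gold labels; the prompt template, task, and flip rate are placeholders rather than the paper's setup.

```python
import random

def build_icl_prompt(demos, query, noise_rate=0.3, labels=("positive", "negative"), seed=0):
    """Assemble an in-context learning prompt, flipping some demonstration labels at random."""
    rng = random.Random(seed)
    lines = []
    for text, label in demos:
        if rng.random() < noise_rate:  # noisy-label condition: replace the gold label with a wrong one
            label = rng.choice([l for l in labels if l != label])
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

demos = [("Great movie, loved it.", "positive"), ("Terrible plot and acting.", "negative")]
print(build_icl_prompt(demos, "An absolute delight from start to finish."))
```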
Towards a Fully Unsupervised Framework for Intent Induction in Customer Support Dialogues
methods: Pre-processes the dialogue corpora to improve the quality of the results, and extracts the intent flows of dialogues by investigating the most common sequences.
results: Tested on the MultiWOZ dataset, the framework yields reliable results. Because it requires no prior knowledge, it is applicable to any possible use case, such as real-world customer support applications.
Abstract
State-of-the-art models in intent induction require annotated datasets. However, annotating dialogues is time-consuming, laborious and expensive. In this work, we propose a completely unsupervised framework for intent induction within a dialogue. In addition, we show how pre-processing the dialogue corpora can improve results. Finally, we show how to extract the dialogue flows of intentions by investigating the most common sequences. Although we test our work in the MultiWOZ dataset, the fact that this framework requires no prior knowledge makes it applicable to any possible use case, making it very relevant to real-world customer support applications across industry.
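One plausible unsupervised instantiation of such a pipeline is sketched below: embed customer turns, cluster them into induced intents, and count the most common intent sequences per dialogue. The vectorizer, cluster count, and toy dialogues are assumptions, not the paper's exact method.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

dialogues = [
    ["i need to book a hotel", "something cheap in the centre", "book it for two nights"],
    ["looking for a cheap hotel", "reserve it for friday please"],
]

# 1) Embed every turn and cluster into k induced "intents" (k chosen arbitrarily here).
turns = [t for d in dialogues for t in d]
X = TfidfVectorizer().fit_transform(turns)
intents = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# 2) Map each dialogue to its sequence of induced intents and count the most common flows.
flows, i = [], 0
for d in dialogues:
    flows.append(tuple(intents[i : i + len(d)]))
    i += len(d)
print(Counter(flows).most_common())
```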
Multilingual Tourist Assistance using ChatGPT: Comparing Capabilities in Hindi, Telugu, and Kannada
results: Hindi translations performed best, with superior accuracy and fluency, while Telugu translations lagged behind. Human evaluators rated both the accuracy and fluency of the translations, offering a comprehensive view of the language model's performance.
Abstract
This research investigates the effectiveness of ChatGPT, an AI language model by OpenAI, in translating English into Hindi, Telugu, and Kannada languages, aimed at assisting tourists in India's linguistically diverse environment. To measure the translation quality, a test set of 50 questions from diverse fields such as general knowledge, food, and travel was used. These were assessed by five volunteers for accuracy and fluency, and the scores were subsequently converted into a BLEU score. The BLEU score evaluates the closeness of a machine-generated translation to a human translation, with a higher score indicating better translation quality. The Hindi translations outperformed others, showcasing superior accuracy and fluency, whereas Telugu translations lagged behind. Human evaluators rated both the accuracy and fluency of translations, offering a comprehensive perspective on the language model's performance.
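Turning human-checked translations into a corpus BLEU score is a one-liner with sacreBLEU; the sentences below are illustrative only and are not drawn from the study's test set.

```python
import sacrebleu

# Hypothetical model outputs and their human reference translations (one reference per sentence).
hypotheses = ["the fort opens at nine in the morning", "please take me to the railway station"]
references = [["the fort opens at 9 am", "please take me to the train station"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 1))  # 0-100 scale; higher means closer to the human translations
```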
Teach Me How to Improve My Argumentation Skills: A Survey on Feedback in Argumentation
results: Existing computational models are useful for evaluating argument quality, but they often cannot explain why a particular argument is considered poor, which limits their ability to give learners constructive feedback.
Abstract
The use of argumentation in education has been shown to improve critical thinking skills for end-users such as students, and computational models for argumentation have been developed to assist in this process. Although these models are useful for evaluating the quality of an argument, they oftentimes cannot explain why a particular argument is considered poor or not, which makes it difficult to provide constructive feedback to users to strengthen their critical thinking skills. In this survey, we aim to explore the different dimensions of feedback (Richness, Visualization, Interactivity, and Personalization) provided by the current computational models for argumentation, and the possibility of enhancing the power of explanations of such models, ultimately helping learners improve their critical thinking skills.
BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers Models for Vietnamese Visual Question Answering
methods: Uses pre-trained Sequence-to-Sequence and bidirectional encoder representation from Image Transformers models in Vietnamese.
results: Experimental results show that the proposed model outperforms a strong baseline on six metrics: Accuracy, Precision, Recall, F1-score, WUPS 0.0, and WUPS 0.9.
Abstract
Visual Question Answering (VQA) is an intricate and demanding task that integrates natural language processing (NLP) and computer vision (CV), capturing the interest of researchers. The English language, renowned for its wealth of resources, has witnessed notable advancements in both datasets and models designed for VQA. However, there is a lack of models that target specific countries such as Vietnam. To address this limitation, we introduce a transformer-based Vietnamese model named BARTPhoBEiT. This model includes pre-trained Sequence-to-Sequence and bidirectional encoder representation from Image Transformers in Vietnamese and evaluates Vietnamese VQA datasets. Experimental results demonstrate that our proposed model outperforms the strong baseline and improves the state-of-the-art in six metrics: Accuracy, Precision, Recall, F1-score, WUPS 0.0, and WUPS 0.9.
SAP-sLDA: An Interpretable Interface for Exploring Unstructured Text
results: On synthetic corpora, the method yields more interpretable projections than baseline methods with only a fraction of labels provided; qualitatively similar results are obtained on a real corpus.
Abstract
A common way to explore text corpora is through low-dimensional projections of the documents, where one hopes that thematically similar documents will be clustered together in the projected space. However, popular algorithms for dimensionality reduction of text corpora, like Latent Dirichlet Allocation (LDA), often produce projections that do not capture human notions of document similarity. We propose a semi-supervised human-in-the-loop LDA-based method for learning topics that preserve semantically meaningful relationships between documents in low-dimensional projections. On synthetic corpora, our method yields more interpretable projections than baseline methods with only a fraction of labels provided. On a real corpus, we obtain qualitatively similar results.
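For context, the unsupervised baseline this builds on can be sketched in a few lines of scikit-learn — fit LDA document-topic mixtures and project them to 2-D; the semi-supervised, human-in-the-loop part of SAP-sLDA is not reproduced in this sketch.

```python
from sklearn.decomposition import PCA, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the team won the championship game",
    "parliament passed the new budget bill",
    "the striker scored twice in the final",
    "the senate debated the tax proposal",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)  # doc-topic mixtures
xy = PCA(n_components=2).fit_transform(theta)  # 2-D projection for plotting / exploration
print(xy.shape)  # (4, 2)
```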
TrafficSafetyGPT: Tuning a Pre-trained Large Language Model to a Domain-Specific Expert in Transportation Safety
results: TrafficSafetyGPT performs well on transportation-safety domain tasks and reduces the reliance on specialized transportation safety expertise.
Abstract
Large Language Models (LLMs) have shown remarkable effectiveness in various general-domain natural language processing (NLP) tasks. However, their performance in transportation safety domain tasks has been suboptimal, primarily attributed to the requirement for specialized transportation safety expertise in generating accurate responses [1]. To address this challenge, we introduce TrafficSafetyGPT, a novel LLAMA-based model, which has undergone supervised fine-tuning using TrafficSafety-2K dataset which has human labels from government produced guiding books and ChatGPT-generated instruction-output pairs. Our proposed TrafficSafetyGPT model and TrafficSafety-2K train dataset are accessible at https://github.com/ozheng1993/TrafficSafetyGPT.
ChatHome: Development and Evaluation of a Domain-Specific Language Model for Home Renovation
results: Experiments show that ChatHome not only amplifies domain-specific functionalities but also preserves its versatility.
Abstract
This paper presents the development and evaluation of ChatHome, a domain-specific language model (DSLM) designed for the intricate field of home renovation. Considering the proven competencies of large language models (LLMs) like GPT-4 and the escalating fascination with home renovation, this study endeavors to reconcile these aspects by generating a dedicated model that can yield high-fidelity, precise outputs relevant to the home renovation arena. ChatHome's novelty rests on its methodology, fusing domain-adaptive pretraining and instruction-tuning over an extensive dataset. This dataset includes professional articles, standard documents, and web content pertinent to home renovation. This dual-pronged strategy is designed to ensure that our model can assimilate comprehensive domain knowledge and effectively address user inquiries. Via thorough experimentation on diverse datasets, both universal and domain-specific, including the freshly introduced "EvalHome" domain dataset, we substantiate that ChatHome not only amplifies domain-specific functionalities but also preserves its versatility.
Multilingual Lexical Simplification via Paraphrase Generation
results: Experimental results show that the approach significantly outperforms BERT-based methods and a zero-shot GPT-3-based method on English, Spanish, and Portuguese.
Abstract
Lexical simplification (LS) methods based on pretrained language models have made remarkable progress, generating potential substitutes for a complex word through analysis of its contextual surroundings. However, these methods require separate pretrained models for different languages and disregard the preservation of sentence meaning. In this paper, we propose a novel multilingual LS method via paraphrase generation, as paraphrases provide diversity in word selection while preserving the sentence's meaning. We regard paraphrasing as a zero-shot translation task within multilingual neural machine translation that supports hundreds of languages. After feeding the input sentence into the encoder of paraphrase modeling, we generate the substitutes based on a novel decoding strategy that concentrates solely on the lexical variations of the complex word. Experimental results demonstrate that our approach surpasses BERT-based methods and zero-shot GPT3-based method significantly on English, Spanish, and Portuguese.
f-Divergence Minimization for Sequence-Level Knowledge Distillation
results: Experiments show that the proposed methods outperform existing distillation approaches, and that the symmetric distilling losses better force the student model to learn the teacher distribution.
Abstract
Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose an f-DISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our f-DISTILL methods. We further derive step-wise decomposition for our f-DISTILL, reducing intractable sequence-level divergence to word-level losses that can be computed in a tractable manner. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution.
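A minimal word-level distillation loss using forward KL — one member of the f-divergence family that f-DISTILL generalises — can be written in a few lines of PyTorch; this is a sketch, not the paper's f-DISTILL implementation.

```python
import torch
import torch.nn.functional as F

def word_level_kl(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student), summed over vocabulary and positions, averaged over the batch."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)

student_logits = torch.randn(8, 32, 5000)  # (batch, sequence length, vocabulary size)
teacher_logits = torch.randn(8, 32, 5000)
print(word_level_kl(student_logits, teacher_logits).item())
```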
results: Experiments show that the one-dimensional subspace returned by R-LACE (Ravfogel et al., 2022) can be used to control the concept value of text generated by a language model. Under the paper's causal controlled intervention, for at least one model, this subspace manipulates the concept value of the generated word with precision.
Abstract
Large language models rely on real-valued representations of text to make their predictions. These representations contain information learned from the data that the model has trained on, including knowledge of linguistic properties and forms of demographic bias, e.g., based on gender. A growing body of work has considered removing information about concepts such as these using orthogonal projections onto subspaces of the representation space. We contribute to this body of work by proposing a formal definition of $\textit{intrinsic}$ information in a subspace of a language model's representation space. We propose a counterfactual approach that avoids the failure mode of spurious correlations (Kumar et al., 2022) by treating components in the subspace and its orthogonal complement independently. We show that our counterfactual notion of information in a subspace is optimized by a $\textit{causal}$ concept subspace. Furthermore, this intervention allows us to attempt concept controlled generation by manipulating the value of the conceptual component of a representation. Empirically, we find that R-LACE (Ravfogel et al., 2022) returns a one-dimensional subspace containing roughly half of total concept information under our framework. Our causal controlled intervention shows that, for at least one model, the subspace returned by R-LACE can be used to manipulate the concept value of the generated word with precision.
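The kind of intervention described above reduces to simple linear algebra once a concept direction is in hand; in the sketch below the direction is random, standing in for a one-dimensional subspace returned by a method such as R-LACE.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768
concept = rng.normal(size=d)
concept /= np.linalg.norm(concept)   # unit vector spanning the 1-D concept subspace
h = rng.normal(size=d)               # a hidden representation from the language model

# Orthogonal projection removes the concept component; adding alpha * concept sets its value.
h_erased = h - (h @ concept) * concept
h_counterfactual = h_erased + 2.5 * concept   # intervene on the concept coordinate only

print(round(float(h @ concept), 3),
      round(float(h_erased @ concept), 3),          # ~0.0
      round(float(h_counterfactual @ concept), 3))  # 2.5
```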
results: Bag-of-words approaches can achieve similar or better results than gzip-based compression, and are more efficient.
Abstract
The effectiveness of compression in text classification ('gzip') has recently garnered lots of attention. In this note we show that `bag-of-words' approaches can achieve similar or better results, and are more efficient.
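The kind of bag-of-words baseline the note has in mind fits in a few lines of scikit-learn; the toy dataset and hyperparameters below are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["the match ended in a draw", "stocks fell sharply today",
         "the keeper saved a penalty", "the central bank raised rates"]
labels = ["sport", "finance", "sport", "finance"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["the bank raised interest rates"]))  # expected: ['finance']
```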
for: The paper proposes a new linear attention-based Large Language Model (LLM) called TransNormerLLM, which outperforms conventional softmax attention-based models in terms of both accuracy and efficiency.
methods: The paper introduces several advanced modifications to the previous linear attention architecture TransNormer, including positional embedding, linear attention acceleration, gating mechanism, tensor normalization, inference acceleration and stabilization. The paper also proposes a new technique called Lightning Attention to accelerate linear attention.
results: The paper achieves impressive acceleration of over 20% and reduces memory usage by a remarkable four times. The model also shows superior efficiency during both training and inference stages, and is scalable for seamless deployment on large-scale clusters. The paper also demonstrates the effectiveness of the model through comprehensive experiments on a self-collected corpus exceeding 6TB and containing over 2 trillion tokens.
Abstract
We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear attention architecture TransNormer by making advanced modifications that include positional embedding, linear attention acceleration, gating mechanism, tensor normalization, inference acceleration and stabilization. Specifically, we use LRPE together with an exponential decay to avoid attention dilution issues while allowing the model to retain global interactions between tokens. Additionally, we propose Lightning Attention, a cutting-edge technique that accelerates linear attention by more than twice in runtime and reduces memory usage by a remarkable four times. To further enhance the performance of TransNormer, we leverage a gating mechanism to smooth training and a new tensor normalization scheme to accelerate the model, resulting in an impressive acceleration of over 20%. Furthermore, we have developed a robust inference algorithm that ensures numerical stability and consistent inference speed, regardless of the sequence length, showcasing superior efficiency during both training and inference stages. Scalability is at the heart of our model's design, enabling seamless deployment on large-scale clusters and facilitating expansion to even more extensive models, all while maintaining outstanding performance metrics. Rigorous validation of our model design is achieved through a series of comprehensive experiments on our self-collected corpus, boasting a size exceeding 6TB and containing over 2 trillion tokens. To ensure data quality and relevance, we implement a new self-cleaning strategy to filter our collected data. Our pre-trained models will be released to foster community advancements in efficient LLMs.
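The core computational difference from softmax attention can be shown in a few lines: with a non-negative feature map, attention is computed as phi(Q)(phi(K)^T V) in O(n d^2) rather than materialising the n-by-n attention matrix. This is the generic linear-attention recipe only, not TransNormerLLM's exact layer, which additionally uses LRPE with exponential decay, gating, and tensor normalization.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """O(n d^2) attention: phi(Q) (phi(K)^T V), here with the common elu+1 feature map."""
    phi = lambda x: F.elu(x) + 1.0                              # keeps features non-negative
    q, k = phi(q), phi(k)
    kv = torch.einsum("bnd,bne->bde", k, v)                     # sum_n phi(k_n) v_n^T
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps       # per-query normaliser
    return torch.einsum("bnd,bde->bne", q, kv) / z.unsqueeze(-1)

q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([2, 128, 64])
```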
Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs
results: Experiments show that the method enables efficient incremental computation as a document is edited while maintaining accuracy comparable to the original model; when adapting the OPT-125M pre-trained language model, it requires 12.1x (median) fewer operations.
Abstract
Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs. For example, an AI writing assistant is required to update its suggestions in real time as a document is edited. Re-running the model each time is expensive, even with compression techniques like knowledge distillation, pruning, or quantization. Instead, we take an incremental computing approach, looking to reuse calculations as the inputs change. However, the dense connectivity of conventional architectures poses a major obstacle to incremental computation, as even minor input changes cascade through the network and restrict information reuse. To address this, we use vector quantization to discretize intermediate values in the network, which filters out noisy and unnecessary modifications to hidden neurons, facilitating the reuse of their values. We apply this approach to the transformers architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of the modified inputs. Our experiments with adapting the OPT-125M pre-trained language model demonstrate comparable accuracy on document classification while requiring 12.1X (median) fewer operations for processing sequences of atomic edits.
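A toy illustration of the reuse mechanism: quantize hidden vectors against a codebook and recompute a position's downstream work only when its code actually changes after an edit. The codebook size, hidden size, and "expensive" function are placeholders, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))  # 64 codes, hidden size 16 (placeholders)

def quantize(h):
    """Map each hidden vector to the index of its nearest codebook entry."""
    dists = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(-1)

def incremental_update(old_codes, new_hidden, cache, expensive_fn):
    """Recompute cached downstream values only at positions whose quantized code changed."""
    new_codes = quantize(new_hidden)
    changed = np.flatnonzero(new_codes != old_codes)
    for i in changed:  # for local edits this is typically a small fraction of positions
        cache[i] = expensive_fn(new_hidden[i])
    return new_codes, cache, len(changed)

hidden = rng.normal(size=(100, 16))
codes = quantize(hidden)
cache = {i: np.linalg.norm(hidden[i]) for i in range(100)}

edited = hidden.copy()
edited[42] = rng.normal(size=16)  # a local edit perturbs one position's hidden state
codes, cache, n = incremental_update(codes, edited, cache, expensive_fn=np.linalg.norm)
print(n, "positions recomputed")  # usually far fewer than 100
```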