cs.CL - 2023-09-26

Unsupervised Pre-Training for Vietnamese Automatic Speech Recognition in the HYKIST Project

paper_url: http://arxiv.org/abs/2309.15869
repo_url: None
paper_authors: Khai Le-Duc
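for: This work aims to develop a speech translation system that supports communication between patients and doctors.
methods: The study combines ASR and MT, and investigates different training schedules and data-combination strategies to improve system performance.
results: Publicly available models such as XLSR-53 reach relatively high recognition accuracy, and customized pre-trained models can also improve performance. The study also compares supervised and unsupervised training, using wav2vec 2.0 as the architecture.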

Abstract
In today's interconnected world, moving abroad is more and more prevalent, whether for employment, refugee resettlement, or other causes. Language difficulties between natives and immigrants are a common everyday issue, especially in the medical domain. They can make it difficult for patients and doctors to communicate during anamnesis or in the emergency room, which compromises patient care. The goal of the HYKIST Project is to develop a speech translation system to support patient-doctor communication with ASR and MT. ASR systems have recently displayed astounding performance on particular tasks for which sufficient quantities of training data are available, such as LibriSpeech. Building a good model is still difficult due to a variety of speaking styles, acoustic and recording settings, and a lack of in-domain training data. In this thesis, we describe our efforts to construct ASR systems for a conversational telephone speech recognition task in the medical domain for the Vietnamese language, to assist emergency room contact between doctors and patients across linguistic barriers. In order to enhance the system's performance, we investigate various training schedules and data-combining strategies. We also examine how best to make use of the little data that is available. The use of publicly accessible models like XLSR-53 is compared to the use of customized pre-trained models, and both supervised and unsupervised approaches are used, with wav2vec 2.0 as the architecture.

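Summary
In today's globalized society, more and more people move abroad, whether for work, refugee resettlement, or other reasons. Language barriers are an everyday problem, particularly in medicine, where they hinder communication between patients and doctors during anamnesis or in the emergency room and so compromise patient care. The HYKIST Project aims to develop a speech translation system to support patient-doctor communication. ASR systems already perform remarkably well on tasks with ample training data, such as LibriSpeech, but building a good model remains difficult because of varied speaking styles, acoustic and recording conditions, and a lack of in-domain training data. This thesis describes the construction of ASR systems for a speech recognition task in the medical domain. It investigates different training schedules and data-combination strategies to improve performance, examines how to make the most of limited data, and compares publicly available models such as XLSR-53 with customized pre-trained models, using both supervised and unsupervised approaches with the wav2vec 2.0 architecture.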

RAGAS: Automated Evaluation of Retrieval Augmented Generation

paper_url: http://arxiv.org/abs/2309.15217
repo_url: None
paper_authors: Shahul Es, Jithin James, Luis Espinosa-Anke, Steven Schockaert
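for: Reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines, that is, evaluation without ground-truth human annotations.
methods: RAG systems combine a retrieval module with an LLM-based generation module, supplying the LLM with knowledge from a reference textual database to reduce the risk of hallucinations.
results: Proposes a suite of metrics that need no human annotation and cover several dimensions of a RAG architecture: whether the retrieval module identifies relevant, focused context passages, whether the LLM exploits those passages faithfully, and the quality of the generation itself.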

Abstract
We introduce RAGAs (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAG systems are composed of a retrieval and an LLM based generation module, and provide LLMs with knowledge from a reference textual database, which enables them to act as a natural language layer between a user and textual databases, reducing the risk of hallucinations. Evaluating RAG architectures is, however, challenging because there are several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages in a faithful way, or the quality of the generation itself. With RAGAs, we put forward a suite of metrics which can be used to evaluate these different dimensions without having to rely on ground truth human annotations. We posit that such a framework can crucially contribute to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.

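Summary
We introduce RAGAs (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. A RAG system consists of a retrieval module and an LLM-based generation module; by passing knowledge from a reference text database to the LLM, it lets the LLM act as a natural language layer between the user and the database and reduces the risk of hallucinations. Evaluating RAG architectures involves several dimensions: whether the retrieval system finds relevant and focused passages, whether the LLM uses those passages faithfully, and the quality of the generation itself. We propose a set of metrics for these dimensions that do not depend on ground-truth human annotations. We believe such a framework can shorten evaluation cycles for RAG architectures, which matters given the rapid adoption of LLMs.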

STANCE-C3: Domain-adaptive Cross-target Stance Detection via Contrastive Learning and Counterfactual Generation

paper_url: http://arxiv.org/abs/2309.15176
repo_url: None
paper_authors: Nayoung Kim, David Mosallanezhad, Lu Cheng, Michelle V. Mancenido, Huan Liu
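for: This work proposes a stance detection model that adapts across domains, so that stances can be inferred efficiently for different domains and target topics.
methods: The model uses contrastive learning and counterfactual generation to strengthen domain-adaptive training, together with a modified self-supervised contrastive learning component that prevents overfitting and improves generalization across domains and targets.
results: Experiments on several datasets show performance improvements over existing methods and good generalization across domains and target topics.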

Abstract
Stance detection is the process of inferring a person's position or standpoint on a specific issue to deduce prevailing perceptions toward topics of general or controversial interest, such as health policies during the COVID-19 pandemic. Existing models for stance detection are trained to perform well for a single domain (e.g., COVID-19) and a specific target topic (e.g., masking protocols), but are generally ineffectual in other domains or targets due to distributional shifts in the data. However, constructing high-performing, domain-specific stance detection models requires an extensive corpus of labeled data relevant to the targeted domain, yet such datasets are not readily available. This poses a challenge as the process of annotating data is costly and time-consuming. To address these challenges, we introduce a novel stance detection model coined domain-adaptive Cross-target STANCE detection via Contrastive learning and Counterfactual generation (STANCE-C3) that uses counterfactual data augmentation to enhance domain-adaptive training by enriching the target domain dataset during the training process and requiring significantly less information from the new domain. We also propose a modified self-supervised contrastive learning as a component of STANCE-C3 to prevent overfitting for the existing domain and target and enable cross-target stance detection. Through experiments on various datasets, we show that STANCE-C3 shows performance improvement over existing state-of-the-art methods.

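Summary
Stance detection infers a person's position or standpoint on a specific issue in order to gauge prevailing views on topics of general or controversial interest, such as health policies during the COVID-19 pandemic. Existing stance detection models typically perform well only within a single domain (e.g., COVID-19) and on a specific target topic (e.g., masking protocols), and fall short in other domains or on other targets because of distributional shift in the data. Building high-performing, domain-specific models requires a large amount of labeled data from the relevant domain, but such data is not readily available, and annotation is costly and time-consuming. To address these challenges, we introduce a new stance detection model, domain-adaptive Cross-target STANCE detection via Contrastive learning and Counterfactual generation (STANCE-C3). STANCE-C3 uses counterfactual data augmentation to enrich the target-domain dataset during training, so it needs less information from the new domain. We also propose a modified self-supervised contrastive learning component to prevent overfitting to the existing domain and target and to enable cross-target stance detection. Experiments on several datasets show that STANCE-C3 improves on existing state-of-the-art methods.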

RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models

paper_url: http://arxiv.org/abs/2309.15088
repo_url: https://github.com/castorini/rank_llm
paper_authors: Ronak Pradeep, Sahel Sharifymoghaddam, Jimmy Lin
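for: Improving the quality of listwise reranking in information retrieval using modern large language models (LLMs).
methods: Uses an open-source 7B-parameter model for zero-shot listwise reranking, evaluated against reranking with GPT-3.5 and GPT-4.
results: Experiments show effectiveness comparable to zero-shot reranking with GPT-3.5, though slightly behind reranking with GPT-4. The authors hope the work provides a foundation for future research on listwise reranking.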

Abstract
Researchers have successfully applied large language models (LLMs) such as ChatGPT to reranking in an information retrieval context, but to date, such work has mostly been built on proprietary models hidden behind opaque API endpoints. This approach yields experimental results that are not reproducible and non-deterministic, threatening the veracity of outcomes that build on such shaky foundations. To address this significant shortcoming, we present RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting. Experimental results on the TREC 2019 and 2020 Deep Learning Tracks show that we can achieve effectiveness comparable to zero-shot reranking with GPT-3.5 with a much smaller 7B parameter model, although our effectiveness remains slightly behind reranking with GPT-4. We hope our work provides the foundation for future research on reranking with modern LLMs. All the code necessary to reproduce our results is available at https://github.com/castorini/rank_llm.

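Summary
LLMs such as ChatGPT have been applied successfully to reranking in information retrieval, but most of that work relies on proprietary models behind opaque API endpoints, so its results are neither reproducible nor deterministic. RankVicuna is presented as the first fully open-source LLM capable of high-quality listwise reranking in a zero-shot setting. On the TREC 2019 and 2020 Deep Learning Tracks, the 7B-parameter model achieves effectiveness comparable to zero-shot reranking with GPT-3.5, while remaining slightly behind GPT-4. The code needed to reproduce the results is available at https://github.com/castorini/rank_llm.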
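
The abstract does not describe the prompt or output format, so the following is only a minimal sketch of what listwise reranking involves: the model sees a query with numbered passages and returns an ordering. The prompt wording, the `[2] > [1]` answer format, and the function names `build_listwise_prompt` and `parse_ranking` are assumptions made for illustration, not the RankVicuna implementation.

```python
import re


def build_listwise_prompt(query, passages):
    """Build one prompt that lists every passage with a numeric identifier."""
    lines = [f"Rank the following {len(passages)} passages by relevance to the query: {query}"]
    lines += [f"[{i}] {text}" for i, text in enumerate(passages, start=1)]
    lines.append("Answer with identifiers only, most relevant first.")
    return "\n".join(lines)


def parse_ranking(response, num_passages):
    """Turn a response such as '[2] > [1]' into a list of 0-based indices.

    Duplicates and out-of-range identifiers are ignored; passages the model
    left out are appended in their original order so nothing is dropped.
    """
    order = []
    for match in re.findall(r"\[(\d+)\]", response):
        index = int(match) - 1
        if 0 <= index < num_passages and index not in order:
            order.append(index)
    order += [i for i in range(num_passages) if i not in order]
    return order


passages = ["Cats are mammals.", "Python is a programming language.", "Pythons are snakes."]
prompt = build_listwise_prompt("what is python used for", passages)
# A hypothetical model response; a real system would obtain this from the LLM.
ranking = parse_ranking("[2] > [3] > [2] > [9]", len(passages))
reranked = [passages[i] for i in ranking]
```

Parsing defensively matters here because a generative model may repeat or omit identifiers; the fallback keeps the output a valid permutation of the input list.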

Question-Answering Approach to Evaluate Legal Summaries

paper_url: http://arxiv.org/abs/2309.15016
repo_url: None
paper_authors: Huihui Xu, Kevin Ashley
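for: An evaluation framework for legal summaries.
methods: GPT-4 generates question-answer pairs from the reference summary, answers the questions from the generated summary, and grades the answers.
results: GPT-4 grading is compared with human grading; the correlation suggests the approach is a useful tool for assessing summary quality.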

Abstract
Traditional evaluation metrics like ROUGE compare lexical overlap between the reference and generated summaries without taking argumentative structure into account, which is important for legal summaries. In this paper, we propose a novel legal summarization evaluation framework that utilizes GPT-4 to generate a set of question-answer pairs that cover main points and information in the reference summary. GPT-4 is then used to generate answers based on the generated summary for the questions from the reference summary. Finally, GPT-4 grades the answers from the reference summary and the generated summary. We examined the correlation between GPT-4 grading with human grading. The results suggest that this question-answering approach with GPT-4 can be a useful tool for gauging the quality of the summary.

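Summary
Traditional evaluation metrics such as ROUGE compare the lexical overlap between reference and generated summaries without accounting for argumentative structure, which is important in legal summaries. This paper proposes a new evaluation framework for legal summarization in which GPT-4 generates a set of question-answer pairs covering the main points and information in the reference summary. GPT-4 then answers those questions using the generated summary, and finally grades the answers from the reference summary and the generated summary. We examined the correlation between GPT-4 grading and human grading. The results suggest that this question-answering approach with GPT-4 can be a useful tool for gauging summary quality.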
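
The three steps described above (generate question-answer pairs, answer from the generated summary, grade) can be sketched as a small pipeline. This is an illustration under stated assumptions: the function `evaluate_summary`, the averaging of grades into one score, and the toy stand-ins for the three GPT-4 calls are all hypothetical and not taken from the paper.

```python
from typing import Callable, List, Tuple

QAPair = Tuple[str, str]


def evaluate_summary(
    reference: str,
    candidate: str,
    generate_qa: Callable[[str], List[QAPair]],
    answer: Callable[[str, str], str],
    grade: Callable[[str, str, str], float],
) -> float:
    """Score a candidate summary by question answering against the reference.

    generate_qa, answer and grade stand in for the three GPT-4 calls; the
    mean of the per-question grades is an assumed way to aggregate them.
    """
    grades = []
    for question, reference_answer in generate_qa(reference):
        candidate_answer = answer(question, candidate)
        grades.append(grade(question, reference_answer, candidate_answer))
    return sum(grades) / len(grades) if grades else 0.0


# Toy stand-ins so the sketch runs without any model access.
CANDIDATE_PHRASES = {
    "Who won the appeal?": ["the tenant", "the landlord"],
    "What was awarded?": ["damages", "costs"],
}


def toy_generate_qa(reference: str) -> List[QAPair]:
    return [("Who won the appeal?", "the tenant"), ("What was awarded?", "damages")]


def toy_answer(question: str, summary: str) -> str:
    for phrase in CANDIDATE_PHRASES[question]:
        if phrase in summary.lower():
            return phrase
    return "unknown"


def toy_grade(question: str, reference_answer: str, candidate_answer: str) -> float:
    return 1.0 if reference_answer == candidate_answer else 0.0


score = evaluate_summary(
    "The tenant won the appeal and was awarded damages.",
    "The court ruled for the tenant and awarded costs.",
    toy_generate_qa,
    toy_answer,
    toy_grade,
)
```

Passing the three steps in as callables keeps the control flow separate from the model calls, so the same loop can be exercised with stubs like these or with real LLM requests.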

Updated Corpora and Benchmarks for Long-Form Speech Recognition

paper_url: http://arxiv.org/abs/2309.15013
repo_url: https://github.com/revdotcom/speech-datasets
paper_authors: Jennifer Drexler Fox, Desh Raj, Natalie Delworth, Quinn McNamara, Corey Miller, Migüel Jetté
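for: This paper studies the train-test mismatch problem in long-form automatic speech recognition (ASR).
methods: It re-releases three standard ASR corpora (TED-LIUM 3, GigaSpeech, and VoxPopuli-en) with updated transcriptions and alignments for long-form ASR research, and uses them to study the robustness of transducers and attention-based encoder-decoder (AED) models when training and test conditions differ.
results: AED models are more susceptible to the mismatch, and long-form training improves the robustness of these models.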

Abstract
The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-world ASR use-cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora - TED-LIUM 3, GigaSpeech, and VoxPopuli-en - with updated transcription and alignments to enable their use for long-form ASR research. We use these reconstituted corpora to study the train-test mismatch problem for transducers and attention-based encoder-decoders (AEDs), confirming that AEDs are more susceptible to this issue. Finally, we benchmark a simple long-form training for these models, showing its efficacy for model robustness under this domain shift.

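Summary
Most ASR research trains and tests on corpora that have been pre-segmented into utterances. In real-world use, however, test audio is usually not segmented, which creates a mismatch between the conditions a model was trained under and the conditions it meets at inference time. In this paper, we re-release three standard ASR corpora - TED-LIUM 3, GigaSpeech, and VoxPopuli-en - with updated transcriptions and alignments so that they can be used for long-form ASR research. We use these reconstituted corpora to study the train-test mismatch for transducers and AEDs, and find that AEDs are more susceptible to it. Finally, we benchmark a simple long-form training method and show that it improves robustness under this domain shift.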

Robustness of the Random Language Model

paper_url: http://arxiv.org/abs/2309.14913
repo_url: https://github.com/chrisneagu/FTC-Skystone-Dark-Angels-Romania-2020
paper_authors: Fatemeh Lalegani, Eric De Giuli
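for: This study examines the robustness of the Random Language Model, a model of the syntax of human and computer languages.
methods: It uses the Random Language Model (De Giuli 2019), an ensemble of stochastic context-free grammars that quantifies the syntax of human and computer languages.
results: The model's scenario is robust to explicit symmetry breaking. Comparison with human data on the clustering coefficient of syntax networks suggests that the observed transition is equivalent to the one children normally experience at age 24 months.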

Abstract
The Random Language Model (De Giuli 2019) is an ensemble of stochastic context-free grammars, quantifying the syntax of human and computer languages. The model suggests a simple picture of first language learning as a type of annealing in the vast space of potential languages. In its simplest formulation, it implies a single continuous transition to grammatical syntax, at which the symmetry among potential words and categories is spontaneously broken. Here this picture is scrutinized by considering its robustness against explicit symmetry breaking, an inevitable component of learning in the real world. It is shown that the scenario is robust to such symmetry breaking. Comparison with human data on the clustering coefficient of syntax networks suggests that the observed transition is equivalent to that normally experienced by children at age 24 months.

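Summary
The Random Language Model (De Giuli 2019) is an ensemble of stochastic context-free grammars that quantifies the syntax of human and computer languages. The model offers a simple picture of first language learning as a kind of annealing in the vast space of possible languages. In its simplest formulation, it implies a single continuous transition to grammatical syntax, at which the symmetry among potential words and categories is spontaneously broken. Here we examine how robust this picture is to explicit symmetry breaking, an unavoidable part of learning in the real world. The results show that the scenario is robust. Comparison with human data on the clustering coefficient of syntax networks suggests that the observed transition is equivalent to the one children normally experience at 24 months of age.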

Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition

paper_url: http://arxiv.org/abs/2309.15796
repo_url: https://github.com/k2-fsa/icefall
paper_authors: Dongji Gao, Hainan Xu, Desh Raj, Leibny Paola Garcia Perera, Daniel Povey, Sanjeev Khudanpur
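for: Improving the training of automatic speech recognition (ASR) systems when transcripts are imperfect, rather than relying on large amounts of well-curated paired data.
methods: Proposes Omni-temporal Classification (OTC), a training criterion that accounts for label uncertainty so the model can still learn speech-text alignments. OTC extends the conventional CTC objective using weighted finite state transducers (WFSTs).
results: Experiments on the LibriSpeech and LibriVox datasets show that ASR models trained with OTC avoid performance degradation even when the transcripts contain up to 70% errors.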

Abstract
Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts. OTC extends the conventional CTC objective for imperfect transcripts by leveraging weighted finite state transducers. Through experiments conducted on the LibriSpeech and LibriVox datasets, we demonstrate that training ASR models with OTC avoids performance degradation even with transcripts containing up to 70% errors, a scenario where CTC models fail completely. Our implementation is available at https://github.com/k2-fsa/icefall.

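Summary
Training automatic speech recognition (ASR) systems requires large amounts of high-quality paired data. Human annotators, however, usually produce "non-verbatim" transcriptions, which can lead to poorly trained models. In this paper, we propose a new training criterion, Omni-temporal Classification (OTC), which explicitly incorporates the label uncertainty that comes from such weak supervision. This lets the model learn speech-text alignments effectively while accommodating errors in the training transcripts. OTC extends the conventional CTC objective using weighted finite state transducers. Experiments show that ASR models trained with OTC avoid performance degradation even when the training transcripts contain up to 70% errors. Our implementation is available at https://github.com/k2-fsa/icefall.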

Segmentation-Free Streaming Machine Translation

paper_url: http://arxiv.org/abs/2309.14823
repo_url: None
paper_authors: Javier Iranzo-Sánchez, Jorge Iranzo-Sánchez, Adrià Giménez, Jorge Civera, Alfons Juan
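for: Proposes a Segmentation-Free framework for streaming machine translation (MT), so that an unbounded input stream can be translated in real time without prior segmentation.
methods: Delays the segmentation decision until the translation has been generated, so the model can translate an unsegmented source stream without a hard segmentation step.
results: The Segmentation-Free framework achieves a better quality-latency trade-off than competing approaches that use an independent segmentation model.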

Abstract
Streaming Machine Translation (MT) is the task of translating an unbounded input text stream in real-time. The traditional cascade approach, which combines an Automatic Speech Recognition (ASR) and an MT system, relies on an intermediate segmentation step which splits the transcription stream into sentence-like units. However, the incorporation of a hard segmentation constrains the MT system and is a source of errors. This paper proposes a Segmentation-Free framework that enables the model to translate an unsegmented source stream by delaying the segmentation decision until the translation has been generated. Extensive experiments show how the proposed Segmentation-Free framework has better quality-latency trade-off than competing approaches that use an independent segmentation model. Software, data and models will be released upon paper acceptance.

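Summary
Streaming machine translation (MT) is the task of translating an unbounded input text stream in real time. The traditional cascade approach, which combines an automatic speech recognition (ASR) system with an MT system, needs an intermediate segmentation step that splits the transcription stream into sentence-like units. Imposing a hard segmentation, however, constrains the MT system and is a source of errors. This paper proposes a Segmentation-Free framework that lets the model translate an unsegmented source stream by delaying the segmentation decision until the translation has been generated. Extensive experiments show that the proposed framework achieves a better quality-latency trade-off than competing approaches that use an independent segmentation model. Software, data, and models will be released upon paper acceptance.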

BLIP-Adapter: Parameter-Efficient Transfer Learning for Mobile Screenshot Captioning

paper_url: http://arxiv.org/abs/2309.14774
repo_url: https://github.com/rainyugg/blip-adapter
paper_authors: Ching-Yu Chiang, I-Hua Chang, Shih-Wei Liao
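for: This study explores efficient tuning methods for the screenshot captioning task.
methods: It proposes a combination of adapter methods, which requires tuning only the additional modules added to the model.
results: Freezing the parameters of the large pre-trained captioning model and training only the adapter weights achieves performance comparable to fine-tuning the entire model, with far fewer trainable parameters.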

Abstract
This study aims to explore efficient tuning methods for the screenshot captioning task. Recently, image captioning has seen significant advancements, but research in captioning tasks for mobile screens remains relatively scarce. Current datasets and use cases describing user behaviors within product screenshots are notably limited. Consequently, we sought to fine-tune pre-existing models for the screenshot captioning task. However, fine-tuning large pre-trained models can be resource-intensive, requiring considerable time, computational power, and storage due to the vast number of parameters in image captioning models. To tackle this challenge, this study proposes a combination of adapter methods, which necessitates tuning only the additional modules on the model. These methods are originally designed for vision or language tasks, and our intention is to apply them to address similar challenges in screenshot captioning. By freezing the parameters of the image caption models and training only the weights associated with the methods, performance comparable to fine-tuning the entire model can be achieved, while significantly reducing the number of parameters. This study represents the first comprehensive investigation into the effectiveness of combining adapters within the context of the screenshot captioning task. Through our experiments and analyses, this study aims to provide valuable insights into the application of adapters in vision-language models and contribute to the development of efficient tuning techniques for the screenshot captioning task. Our study is available at https://github.com/RainYuGG/BLIP-Adapter

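Summary
This study explores efficient tuning methods for the screenshot captioning task. Image captioning has advanced significantly in recent years, but research on captioning mobile screens remains comparatively scarce, and current datasets and use cases describing user behavior within product screenshots are limited. We therefore fine-tune pre-existing models for the task. Fine-tuning large pre-trained models, however, can consume a great deal of time, computation, and storage because image captioning models have so many parameters. To tackle this challenge, the study proposes a combination of adapter methods, which requires tuning only the additional modules on the model. These methods were originally designed for vision or language tasks, and we apply them to similar challenges in screenshot captioning. By freezing the parameters of the image captioning model and training only the weights of the adapter methods, performance comparable to fine-tuning the entire model can be achieved with far fewer trainable parameters. This is the first comprehensive study of combining adapters for screenshot captioning. Through experiments and analysis, it aims to offer useful insights into applying adapters in vision-language models and to contribute to efficient tuning techniques for the task. The study is available at https://github.com/RainYuGG/BLIP-Adapter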
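
The abstract does not say which adapter variants are combined, so the sketch below shows only the general idea with a generic bottleneck adapter: a small down-projection, a nonlinearity, an up-projection, and a residual connection, trained while the base model stays frozen. The class name `BottleneckAdapter`, the sizes, and the zero-initialized up-projection are illustrative assumptions, written in plain Python so it runs without a deep learning framework.

```python
import random


class BottleneckAdapter:
    """Generic residual bottleneck adapter (illustrative, not the paper's code)."""

    def __init__(self, hidden_size, bottleneck_size, seed=0):
        rng = random.Random(seed)
        # Down-projection: hidden_size -> bottleneck_size, small random weights.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
                     for _ in range(bottleneck_size)]
        self.down_bias = [0.0] * bottleneck_size
        # Up-projection: bottleneck_size -> hidden_size, zero-initialized so the
        # adapter starts as an identity mapping around the frozen layer.
        self.up = [[0.0] * bottleneck_size for _ in range(hidden_size)]
        self.up_bias = [0.0] * hidden_size

    def num_parameters(self):
        """Count the trainable values held by the adapter."""
        return (len(self.down) * len(self.down[0]) + len(self.down_bias)
                + len(self.up) * len(self.up[0]) + len(self.up_bias))

    def forward(self, hidden):
        """Return hidden + up(relu(down(hidden)))."""
        reduced = [max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
                   for row, b in zip(self.down, self.down_bias)]
        update = [sum(w * r for w, r in zip(row, reduced)) + b
                  for row, b in zip(self.up, self.up_bias)]
        return [h + u for h, u in zip(hidden, update)]


adapter = BottleneckAdapter(hidden_size=8, bottleneck_size=2)
frozen_layer_parameters = 8 * 8 + 8  # one dense layer of the same width, kept frozen
trainable_parameters = adapter.num_parameters()
output = adapter.forward([1.0] * 8)
```

The saving grows with model width: the adapter costs roughly `2 * hidden_size * bottleneck_size` values, while each dense layer it sits beside costs about `hidden_size ** 2`, so a narrow bottleneck leaves only a small fraction of the parameters trainable.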

KERMIT: Knowledge Graph Completion of Enhanced Relation Modeling with Inverse Transformation

paper_url: http://arxiv.org/abs/2309.14770
repo_url: None
paper_authors: Haotian Li, Lingzhi Wang, Yuliang Wei, Richard Yi Da Xu, Bailing Wang
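for: Filling in missing triples in a knowledge graph, improving accuracy on the knowledge graph completion task.
methods: Text-based methods complete knowledge graphs from textual descriptions of triples, but run into limits when a description fails to express the intended meaning accurately. To address this, the paper augments the data through two additional mechanisms. First, ChatGPT serves as an external knowledge base that generates coherent descriptions to bridge the semantic gap between queries and answers. Second, inverse relations are used to create a symmetric graph, which yields extra labels and supplementary information for link prediction and offers additional insight into the relationships between entities.
results: With these two mechanisms, the authors observe significant improvements in knowledge graph completion; the mechanisms increase the richness and diversity of the data, leading to more accurate results.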

Abstract
Knowledge graph completion is a task that revolves around filling in missing triples based on the information available in a knowledge graph. Among the current studies, text-based methods complete the task by utilizing textual descriptions of triples. However, this modeling approach may encounter limitations, particularly when the description fails to accurately and adequately express the intended meaning. To overcome these challenges, we propose the augmentation of data through two additional mechanisms. Firstly, we employ ChatGPT as an external knowledge base to generate coherent descriptions to bridge the semantic gap between the queries and answers. Secondly, we leverage inverse relations to create a symmetric graph, thereby creating extra labeling and providing supplementary information for link prediction. This approach offers additional insights into the relationships between entities. Through these efforts, we have observed significant improvements in knowledge graph completion, as these mechanisms enhance the richness and diversity of the available data, leading to more accurate results.

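Summary
Knowledge graph completion is the task of filling in missing triples based on the information already in a knowledge graph. Current text-based methods do this using textual descriptions of triples, but this modeling approach runs into limits, particularly when a description is inaccurate or incomplete. To overcome these challenges, we propose augmenting the data through two additional mechanisms. First, we use ChatGPT as an external knowledge base to generate coherent descriptions that bridge the semantic gap between queries and answers. Second, we use inverse relations to create a symmetric graph, which produces extra labels and supplementary information for link prediction. This approach offers additional insight into the relationships between entities. Through these efforts we observe significant improvements in knowledge graph completion, because the mechanisms increase the richness and diversity of the available data and lead to more accurate results.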
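
The second mechanism, adding inverse relations so that every edge can be traversed in both directions, is simple to illustrate. The sketch below is a generic version of that augmentation; the `_inv` naming convention, the helper names, and the example triples are assumptions for illustration rather than details from the paper.

```python
from collections import defaultdict


def add_inverse_triples(triples, inverse_suffix="_inv"):
    """Return the triples plus one inverse (tail, relation_inv, head) per triple."""
    augmented = list(triples)
    for head, relation, tail in triples:
        augmented.append((tail, relation + inverse_suffix, head))
    return augmented


def build_adjacency(triples):
    """Map each entity to the set of (relation, neighbour) pairs leaving it."""
    adjacency = defaultdict(set)
    for head, relation, tail in triples:
        adjacency[head].add((relation, tail))
    return adjacency


triples = [
    ("Paris", "capital_of", "France"),
    ("France", "located_in", "Europe"),
]
augmented = add_inverse_triples(triples)
adjacency = build_adjacency(augmented)
```

After augmentation, a tail-prediction query such as `(?, capital_of, France)` can be posed as the head-anchored query `(France, capital_of_inv, ?)`, which is how each original triple contributes an extra training label.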

ConPET: Continual Parameter-Efficient Tuning for Large Language Models

paper_url: http://arxiv.org/abs/2309.14763
repo_url: https://github.com/raincleared-song/conpet
paper_authors: Chenyang Song, Xu Han, Zheni Zeng, Kuai Li, Chen Chen, Zhiyuan Liu, Maosong Sun, Tao Yang
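for: This work proposes a continual learning method for large language models (LLMs) that reduces computation costs, memory consumption, and forgetting.
methods: The method builds on parameter-efficient tuning (PET) and comes in two versions, Static ConPET and Dynamic ConPET. Static ConPET adapts former continual learning methods to LLMs and greatly reduces tuning costs. Dynamic ConPET uses separate PET modules for different tasks, with a PET module selector that dynamically chooses the best modules.
results: Static ConPET helps multiple former methods reduce the number of tunable parameters by over 3,000 times and surpass the PET-only baseline by at least 5 points on five smaller benchmarks. Dynamic ConPET has the advantage on the largest dataset. Code and data are available at https://github.com/Raincleared-Song/ConPET.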

Abstract
Continual learning necessitates the continual adaptation of models to newly emerging tasks while minimizing the catastrophic forgetting of old ones. This is extremely challenging for large language models (LLMs) with vanilla full-parameter tuning due to high computation costs, memory consumption, and forgetting issue. Inspired by the success of parameter-efficient tuning (PET), we propose Continual Parameter-Efficient Tuning (ConPET), a generalizable paradigm for continual task adaptation of LLMs with task-number-independent training complexity. ConPET includes two versions with different application scenarios. First, Static ConPET can adapt former continual learning methods originally designed for relatively smaller models to LLMs through PET and a dynamic replay strategy, which largely reduces the tuning costs and alleviates the over-fitting and forgetting issue. Furthermore, to maintain scalability, Dynamic ConPET adopts separate PET modules for different tasks and a PET module selector for dynamic optimal selection. In our extensive experiments, the adaptation of Static ConPET helps multiple former methods reduce the scale of tunable parameters by over 3,000 times and surpass the PET-only baseline by at least 5 points on five smaller benchmarks, while Dynamic ConPET gains its advantage on the largest dataset. The codes and datasets are available at https://github.com/Raincleared-Song/ConPET.

Summary
Continual learning requires models to keep adapting to newly emerging tasks while minimizing catastrophic forgetting of old ones. For large language models (LLMs) with full-parameter tuning this is extremely hard because of high computation costs, memory consumption, and forgetting. Drawing on the success of parameter-efficient tuning (PET), we propose Continual Parameter-Efficient Tuning (ConPET), a generalizable paradigm for continual task adaptation of LLMs whose training complexity does not depend on the number of tasks. ConPET has two versions for different application scenarios. Static ConPET adapts former continual learning methods, originally designed for smaller models, to LLMs through PET and a dynamic replay strategy, which largely reduces tuning costs and alleviates over-fitting and forgetting. To maintain scalability, Dynamic ConPET uses separate PET modules for different tasks and a PET module selector for dynamic optimal selection. In extensive experiments, Static ConPET helps multiple former methods reduce the scale of tunable parameters by over 3,000 times and surpass the PET-only baseline by at least 5 points on five smaller benchmarks, while Dynamic ConPET gains its advantage on the largest dataset. The code and datasets are available at https://github.com/Raincleared-Song/ConPET.

QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models

paper_url: http://arxiv.org/abs/2309.14717
repo_url: https://github.com/eltociear/qa-lora
paper_authors: Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhengsu Chen, Xiaopeng Zhang, Qi Tian
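for: This paper proposes a quantization-aware low-rank adaptation (QA-LoRA) algorithm to make large language models (LLMs) practical to deploy on edge devices.
methods: It uses group-wise operators that increase the degrees of freedom of quantization while decreasing those of adaptation. The method is easy to implement, needing only a few lines of code.
results: With QA-LoRA, LLM weights are quantized during fine-tuning, reducing time and memory usage, and the fine-tuned model is integrated into a quantized model without loss of accuracy. The method is validated across different fine-tuning datasets and downstream scenarios.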

Abstract
Recent years have witnessed a rapid development of large language models (LLMs). Despite the strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness in different fine-tuning datasets and downstream scenarios. Code will be made available at https://github.com/yuhuixu1993/qa-lora.

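Summary
Large language models (LLMs) have developed rapidly in recent years. Despite their strong performance on many language-understanding tasks, their heavy computational burden limits their use, especially when they must be deployed on edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation is the imbalance between the degrees of freedom of quantization and of adaptation, and the solution is to use group-wise operators that increase the degrees of freedom of quantization while decreasing those of adaptation. QA-LoRA is easy to implement in a few lines of code and gives the original LoRA two abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. Code will be made available at https://github.com/yuhuixu1993/qa-lora.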
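
To make "group-wise" concrete, here is a minimal sketch of min-max quantization where each consecutive group of weights gets its own scale and zero offset. It also shows one reading of why this helps merging: if an adapter's update is constant within each group (an assumption here; the abstract does not spell out the mechanism), it can be folded into the per-group zero offsets without touching the integer codes. The function names and example values are illustrative and not the authors' code.

```python
def quantize_groupwise(weights, group_size, bits=4):
    """Min-max quantize a flat list of floats in consecutive groups.

    Returns (codes, scales, zeros): integer codes in [0, 2**bits - 1] plus one
    scale and one zero offset per group, so that w ~= scale * code + zero.
    """
    levels = 2 ** bits - 1
    codes, scales, zeros = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        low, high = min(group), max(group)
        scale = (high - low) / levels or 1.0  # fall back to 1.0 for a constant group
        scales.append(scale)
        zeros.append(low)
        codes.extend(round((w - low) / scale) for w in group)
    return codes, scales, zeros


def dequantize_groupwise(codes, scales, zeros, group_size):
    """Reconstruct approximate weights from codes and per-group parameters."""
    return [scales[i // group_size] * code + zeros[i // group_size]
            for i, code in enumerate(codes)]


weights = [0.1, -0.4, 0.25, 0.9, -1.0, 0.3, 0.0, 0.7]
codes, scales, zeros = quantize_groupwise(weights, group_size=4)
restored = dequantize_groupwise(codes, scales, zeros, group_size=4)

# Assumed adapter update that is constant within each group of four weights:
group_offsets = [0.05, -0.02]
merged_zeros = [zero + offset for zero, offset in zip(zeros, group_offsets)]
merged = dequantize_groupwise(codes, scales, merged_zeros, group_size=4)
```

Smaller groups mean more scale and zero parameters, hence more freedom for quantization, at the cost of extra storage; the merge step changes only the floating-point offsets, so the result is still a model with INT4 codes.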

A Simple Text to Video Model via Transformer

paper_url: http://arxiv.org/abs/2309.14683
repo_url: https://github.com/vividitytech/text2videogpt
paper_authors: Gang Chen
for: This work proposes a general and simple text-to-video model based on the Transformer architecture.
methods: The model uses a Transformer to capture temporal consistency across text and image sequences, with GPT2 as the language model.
results: Tested on the UCF101 dataset, the method generates promising videos.

Abstract
We present a general and simple text to video model based on Transformer. Since both text and video are sequential data, we encode both texts and images into the same hidden space, which are further fed into a Transformer to capture the temporal consistency and then into a decoder to generate either text or images. Considering the image signal may become weak in the long sequence, we introduce the U-Net to reconstruct the image from its noised version. Specifically, we increase the noise level applied to the original image in the long sequence, then use the $down$ module from U-Net to encode the noised images, which are further input to the Transformer to predict the next clear images. We also add a constraint to promote motion between any generated image pair in the video. We use GPT2 and test our approach on the UCF101 dataset and show it can generate promising videos.

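Summary
We present a general and simple text-to-video model based on the Transformer. Because text and video are both sequential data, we encode text and images into the same hidden space and pass them to the Transformer to capture temporal consistency. To handle the weakening of the image signal over long sequences, we introduce a U-Net to reconstruct images. Specifically, we raise the noise level applied to the original image, encode the noised images with the U-Net's $down$ module, and feed the result to the Transformer to predict the next clear frame. We also add a constraint that promotes motion between any pair of generated images in the video. We use GPT2, test the approach on the UCF101 dataset, and show that it can generate promising videos.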