methods: The paper applies five weakly supervised text classification (WSTC) techniques to classify domain-specific medical FTC data and extract the reported health-related quality of life (HRQoL) themes.
results: The study finds that WSTC techniques can extract health-related quality of life (HRQoL) themes from domain-specific medical FTC data, but performance is limited mainly by model precision and varies across themes.
Abstract
Free text comments (FTC) in patient-reported outcome measures (PROMs) data are typically analysed using manual methods, such as content analysis, which is labour-intensive and time-consuming. Machine learning analysis methods are largely unsupervised, necessitating post-analysis interpretation. Weakly supervised text classification (WSTC) can be a valuable method of analysis to classify domain-specific text data in which there is limited labelled data. In this paper, we apply five WSTC techniques to FTC in PROMs data to identify health-related quality of life (HRQoL) themes reported by colorectal cancer patients. The WSTC methods label all the themes mentioned in the FTC. The results showed moderate performance on the PROMs data, mainly due to the precision of the models, and variation between themes. Evaluation of the classification performance illustrated the potential and limitations of keyword based WSTC to label PROMs FTC when labelled data is limited.
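The abstract does not detail the labelling mechanics, but keyword-based WSTC of the kind evaluated here can be illustrated with a minimal weak-labelling sketch; the theme names and seed keywords below are hypothetical placeholders, not taken from the paper.

```python
# Minimal sketch of keyword-based weak labelling for multi-label HRQoL theme
# classification. Theme names and seed keywords are hypothetical placeholders.
THEME_KEYWORDS = {
    "bowel_function": ["bowel", "stoma", "toilet", "diarrhoea"],
    "fatigue": ["tired", "fatigue", "energy"],
    "emotional_wellbeing": ["anxious", "worried", "depressed", "mood"],
}

def weak_label(comment: str) -> list[str]:
    """Assign every theme whose seed keywords appear in the free-text comment."""
    text = comment.lower()
    return [theme for theme, keywords in THEME_KEYWORDS.items()
            if any(kw in text for kw in keywords)]

if __name__ == "__main__":
    example = "I feel tired all the time and get anxious about my stoma."
    print(weak_label(example))  # ['bowel_function', 'fatigue', 'emotional_wellbeing']
```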
Assessing Guest Nationality Composition from Hotel Reviews
results: The study finds that a relatively simple architecture provides a better performance-runtime tradeoff than more complex language models.
Abstract
Many hotels target guest acquisition efforts to specific markets in order to best anticipate individual preferences and needs of their guests. Likewise, such strategic positioning is a prerequisite for efficient marketing budget allocation. Official statistics report on the number of visitors from different countries, but no fine-grained information on the guest composition of individual businesses exists. There is, however, growing interest in such data from competitors, suppliers, researchers and the general public. We demonstrate how machine learning can be leveraged to extract references to guest nationalities from unstructured text reviews in order to dynamically assess and monitor the dynamics of guest composition of individual businesses. In particular, we show that a rather simple architecture of pre-trained embeddings and stacked LSTM layers provides a better performance-runtime tradeoff than more complex state-of-the-art language models.
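As a rough illustration of the architecture described (pre-trained embeddings feeding stacked LSTM layers), the sketch below frames nationality extraction as token-level tagging; the dimensions, tag set, and tagging formulation are assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn as nn

class NationalityTagger(nn.Module):
    """Embeddings followed by stacked bidirectional LSTMs and a per-token classifier."""
    def __init__(self, vocab_size=30000, emb_dim=300, hidden=256, layers=2, n_tags=3):
        super().__init__()
        # In practice the embedding matrix would be initialised from pre-trained
        # vectors (e.g. fastText) rather than learned from scratch.
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=layers,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)  # e.g. O / B-NAT / I-NAT tags

    def forward(self, token_ids):
        x = self.emb(token_ids)
        x, _ = self.lstm(x)
        return self.out(x)  # per-token tag logits

model = NationalityTagger()
dummy = torch.randint(1, 30000, (4, 50))  # a batch of 4 reviews, 50 tokens each
print(model(dummy).shape)                 # torch.Size([4, 50, 3])
```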
Task Conditioned BERT for Joint Intent Detection and Slot-filling
results: Experimental results show that conditioning the model's input on multiple dialogue inference tasks improves joint intent and slot detection on the MultiWOZ dataset by 3.2%, 10.8%, and 14.4% (conditioning on intent, on slots, and on both, respectively). In addition, on real conversations with Farfetch customers, the proposed conditioned BERT achieves high joint-goal and intent detection performance throughout a dialogue.
Abstract
Dialogue systems need to deal with the unpredictability of user intents to track dialogue state and the heterogeneity of slots to understand user preferences. In this paper we investigate the hypothesis that solving these challenges as one unified model will allow the transfer of parameter support data across the different tasks. The proposed principled model is based on a Transformer encoder, trained on multiple tasks, and leveraged by a rich input that conditions the model on the target inferences. Conditioning the Transformer encoder on multiple target inferences over the same corpus, i.e., intent and multiple slot types, allows learning richer language interactions than a single-task model would be able to. In fact, experimental results demonstrate that conditioning the model on an increasing number of dialogue inference tasks leads to improved results: on the MultiWOZ dataset, the joint intent and slot detection can be improved by 3.2% by conditioning on intent, 10.8% by conditioning on slot and 14.4% by conditioning on both intent and slots. Moreover, on real conversations with Farfetch customers, the proposed conditioned BERT can achieve high joint-goal and intent detection performance throughout a dialogue.
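The abstract does not spell out how the encoder is conditioned on the target inferences; one plausible reading, sketched below, is to prepend a marker token naming the inference to the input. The marker names, intent count, and classification head are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Hypothetical condition markers naming the target inference.
tokenizer.add_tokens(["[INTENT]", "[SLOT:price]", "[SLOT:area]"])
encoder.resize_token_embeddings(len(tokenizer))

intent_head = nn.Linear(encoder.config.hidden_size, 10)  # 10 example intents

def encode(utterance: str, condition: str) -> torch.Tensor:
    """Encode an utterance conditioned on one target inference."""
    batch = tokenizer(f"{condition} {utterance}", return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state
    return hidden[:, 0]  # [CLS] representation

logits = intent_head(encode("find me a cheap hotel in the centre", "[INTENT]"))
print(logits.shape)  # torch.Size([1, 10])
```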
Identification of the Relevance of Comments in Codes Using Bag of Words and Transformer Based Models
results: The study finds that the bag-of-words models perform best on the training corpus, but overall performance on both the training and test corpora is not satisfactory.
Abstract
The Forum for Information Retrieval (FIRE) started a shared task this year for the classification of comments on different code segments. This is a binary text classification task where the objective is to identify whether comments given for certain code segments are relevant or not. The BioNLP-IISERB group at the Indian Institute of Science Education and Research Bhopal (IISERB) participated in this task and submitted five runs for five different models. The paper presents an overview of the models and other significant findings on the training corpus. The methods involve different feature engineering schemes and text classification techniques. The classical bag of words model and transformer-based models were explored to identify significant features from the given training corpus. We explored different classifiers, viz. random forest, support vector machine and logistic regression, using the bag of words model. Furthermore, pre-trained transformer-based models such as BERT, RoBERT and ALBERT were also used by fine-tuning them on the given training corpus. The performance of these models on the training corpus was reported, and the best five models were applied to the given test corpus. The empirical results show that the bag of words model outperforms the transformer-based models; however, the performance of our runs is not reasonably good on either the training or the test corpus. This paper also addresses the limitations of the models and the scope for further improvement.
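A minimal sketch of the bag-of-words baselines named above (random forest, support vector machine, and logistic regression over a term-count representation) could look like the following; the toy comments and labels are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for code comments labelled relevant (1) or not relevant (0).
comments = ["returns the index of the first match",
            "TODO remove this later",
            "computes the checksum of the buffer",
            "no idea what this does"]
labels = [1, 0, 1, 0]

for clf in (RandomForestClassifier(), LinearSVC(), LogisticRegression()):
    pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), clf)
    pipeline.fit(comments, labels)
    print(type(clf).__name__, pipeline.predict(["checks whether the list is empty"]))
```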
Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping
results: The method compares favorably with prior approaches on the LRS3 dataset, reaching 26 WER, and unlike state-of-the-art (SoTA) methods, it maintains reasonable performance on the VoxCeleb test set.
Abstract
Visual Speech Recognition (VSR) differs from the common perception tasks as it requires deeper reasoning over the video sequence, even by human experts. Despite the recent advances in VSR, current approaches rely on labeled data to fully train or finetune their models predicting the target speech. This hinders their ability to generalize well beyond the training set and leads to performance degeneration under out-of-distribution challenging scenarios. Unlike previous works that involve auxiliary losses or complex training procedures and architectures, we propose a simple approach, named Lip2Vec, that is based on learning a prior model. Given a robust visual speech encoder, this network maps the encoded latent representations of the lip sequence to their corresponding latents from the audio pair, which are sufficiently invariant for effective text decoding. The generated audio representation is then decoded to text using an off-the-shelf Audio Speech Recognition (ASR) model. The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving 26 WER. Unlike SoTA approaches, our model keeps a reasonable performance on the VoxCeleb test set. We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
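A shape-level sketch of the latent-to-latent idea: a small prior network maps frozen visual-encoder latents into the audio latent space that an off-the-shelf ASR model expects. The dimensions and module choices below are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

visual_dim, audio_dim = 768, 1024

prior = nn.Sequential(              # learned lip-latent -> audio-latent mapping
    nn.Linear(visual_dim, 1024),
    nn.GELU(),
    nn.Linear(1024, audio_dim),
)

lip_latents = torch.randn(1, 120, visual_dim)  # stand-in for a frozen VSR encoder's output
audio_like = prior(lip_latents)                # representations an ASR decoder could consume
print(audio_like.shape)                        # torch.Size([1, 120, 1024])
# In the full pipeline the predicted audio latents would be fed to a pre-trained
# ASR model to decode the transcript.
```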
Fly-Swat or Cannon? Cost-Effective Language Model Choice via Meta-Modeling
results: The study finds that the framework matches the performance of the largest LM while reducing cost by 63%, which can help researchers and practitioners save substantial amounts of money.
Abstract
Generative language models (LMs) have become omnipresent across data science. For a wide variety of tasks, inputs can be phrased as natural language prompts for an LM, from whose output the solution can then be extracted. LM performance has consistently been increasing with model size - but so has the monetary cost of querying the ever larger models. Importantly, however, not all inputs are equally hard: some require larger LMs for obtaining a satisfactory solution, whereas for others smaller LMs suffice. Based on this fact, we design a framework for Cost-Effective Language Model Choice (CELMOC). Given a set of inputs and a set of candidate LMs, CELMOC judiciously assigns each input to an LM predicted to do well on the input according to a so-called meta-model, aiming to achieve high overall performance at low cost. The cost-performance trade-off can be flexibly tuned by the user. Options include, among others, maximizing total expected performance (or the number of processed inputs) while staying within a given cost budget, or minimizing total cost while processing all inputs. We evaluate CELMOC on 14 datasets covering five natural language tasks, using four candidate LMs of vastly different size and cost. With CELMOC, we match the performance of the largest available LM while achieving a cost reduction of 63%. Via our publicly available library, researchers as well as practitioners can thus save large amounts of money without sacrificing performance.
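A toy sketch of the routing idea: a meta-model predicts how well each candidate LM will do on an input, and the input is sent to the cheapest model whose predicted score clears a threshold. The costs, capability numbers, and threshold below are invented for illustration; the actual framework also supports budget-constrained objectives.

```python
LM_COST = {"small": 0.001, "medium": 0.01, "large": 0.1}  # hypothetical $ per query

def route(inputs, predict_score, threshold=0.6):
    """Assign each input to the cheapest LM whose predicted score is acceptable."""
    assignments = {}
    for x in inputs:
        for lm in sorted(LM_COST, key=LM_COST.get):   # cheapest first
            if predict_score(x, lm) >= threshold:
                assignments[x] = lm
                break
        else:
            assignments[x] = max(LM_COST, key=LM_COST.get)  # fall back to the largest LM
    return assignments

def toy_meta_model(x, lm):
    """Stand-in meta-model: longer inputs are assumed harder, bigger LMs more capable."""
    capability = {"small": 0.7, "medium": 0.85, "large": 0.97}[lm]
    difficulty = min(len(x) / 60, 1.0)
    return capability - 0.3 * difficulty

queries = ["What is 2+2?", "Summarise this 40-page contract ..."]
print(route(queries, toy_meta_model))  # short query -> small LM, longer one -> a larger LM
```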
A Case Study on Context Encoding in Multi-Encoder based Document-Level Neural Machine Translation
results: The researchers find that even when the context is random, the model can still perform well on the ContraPro test set. They also find that mixing the selected context with the random context generally works better than the other settings.
Abstract
Recent studies have shown that the multi-encoder models are agnostic to the choice of context, and the context encoder generates noise which helps improve the models in terms of BLEU score. In this paper, we further explore this idea by training multi-encoder models on three different context settings, viz. the previous two sentences, two random sentences, and a mix of both as context, and evaluating them with a context-aware pronoun translation test set. Specifically, we evaluate the models on the ContraPro test set to study how different contexts affect pronoun translation accuracy. The results show that the model can perform well on the ContraPro test set even when the context is random. We also analyze the source representations to study whether the context encoder generates noise. Our analysis shows that the context encoder provides sufficient information to learn discourse-level information. Additionally, we observe that mixing the selected context (the previous two sentences in this case) and the random context is generally better than the other settings.
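The three context settings compared here can be illustrated with a small helper that builds the context passed to the context encoder for a given sentence; exactly how the mixed setting is sampled is an assumption in this sketch.

```python
import random

def build_context(doc, i, setting, rng=random.Random(0)):
    """Return the context sentences paired with doc[i] for a multi-encoder model."""
    others = doc[:i] + doc[i + 1:]
    if setting == "previous":
        return doc[max(0, i - 2):i]
    if setting == "random":
        return rng.sample(others, k=min(2, len(others)))
    if setting == "mixed":          # assumed split: one previous + one random sentence
        return doc[max(0, i - 1):i] + rng.sample(others, k=1)
    raise ValueError(setting)

doc = ["The cat slept.", "It was tired.", "The dog barked.", "It wanted food."]
for setting in ("previous", "random", "mixed"):
    print(setting, "->", build_context(doc, 3, setting))
```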
Evaluating Picture Description Speech for Dementia Detection using Image-text Alignment
paper_authors: Youxiang Zhu, Nana Lin, Xiaohui Liang, John A. Batsis, Robert M. Roth, Brian MacWhinney
for: Dementia detection from picture description speech
methods: Pre-process the samples using both the picture and the description text, and leverage large pre-trained image-text alignment models.
results: Proposes the first dementia detection models that take both the picture and the description text as inputs and incorporate image-text alignment models, achieving state-of-the-art performance with a detection accuracy of 83.44%, higher than the text-only baseline at 79.91%.
Abstract
Using picture description speech for dementia detection has been studied for 30 years. Despite the long history, previous models focus on identifying the differences in speech patterns between healthy subjects and patients with dementia but do not utilize the picture information directly. In this paper, we propose the first dementia detection models that take both the picture and the description texts as inputs and incorporate knowledge from large pre-trained image-text alignment models. We observe a difference between dementia and healthy samples in terms of the text's relevance to the picture and the focused area of the picture. We thus consider that such a difference could be used to enhance dementia detection accuracy. Specifically, we use the text's relevance to the picture to rank and filter the sentences of the samples. We also identified focused areas of the picture as topics and categorized the sentences according to the focused areas. We propose three advanced models that pre-process the samples based on their relevance to the picture, sub-image, and focused areas. The evaluation results show that our advanced models, with knowledge of the picture and large image-text alignment models, achieve state-of-the-art performance with the best detection accuracy at 83.44%, which is higher than the text-only baseline model at 79.91%. Lastly, we visualize the sample and picture results to explain the advantages of our models.
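A minimal sketch of the relevance-ranking step using an off-the-shelf image-text alignment model; CLIP is used here purely as a stand-in, since the abstract does not name the exact model or the filtering rules. The sentences and the blank stimulus image are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

picture = Image.new("RGB", (224, 224))  # stand-in for the stimulus picture
sentences = [
    "The boy is reaching for the cookie jar.",
    "The weather was nice last week.",
    "Water is overflowing from the sink.",
]

inputs = processor(text=sentences, images=picture, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image.squeeze(0)  # one relevance score per sentence

# Rank (and potentially filter) the description sentences by image-text relevance.
for sent, score in sorted(zip(sentences, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:6.2f}  {sent}")
```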
PIPPA: A Partially Synthetic Conversational Dataset
results: The study provides a partially synthetic dataset named PIPPA, which contains a rich collection of conversation logs and serves as a valuable resource for developing conversational AI systems in role-play scenarios.
Abstract
With the emergence of increasingly powerful large language models, there is a burgeoning interest in leveraging these models for casual conversation and role-play applications. However, existing conversational and role-playing datasets often fail to capture the diverse and nuanced interactions typically exhibited by real-world role-play participants. To address this limitation and contribute to the rapidly growing field, we introduce a partially-synthetic dataset named PIPPA (Personal Interaction Pairs between People and AI). PIPPA is a result of a community-driven crowdsourcing effort involving a group of role-play enthusiasts. The dataset comprises over 1 million utterances that are distributed across 26,000 conversation sessions and provides a rich resource for researchers and AI developers to explore and refine conversational AI systems in the context of role-play scenarios.
EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis
paper_authors: Tu Anh Nguyen, Wei-Ning Hsu, Antony D’Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux
results: The paper introduces a high-quality expressive speech dataset comprising both read speech and improvised dialogues, and provides an expressive resynthesis benchmark for evaluating the expressive quality of different self-supervised discrete units.
Abstract
Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate and invariance to speaker and style. The dataset, evaluation metrics, and baseline models are all open source.
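A common recipe for the low-bitrate discrete units the abstract refers to is to quantise self-supervised speech features with k-means; the sketch below uses a HuBERT model from torchaudio as a stand-in encoder. The layer choice, number of clusters, and single-utterance codebook fit are simplifications, not the benchmark's actual configuration.

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform = torch.randn(1, int(4 * bundle.sample_rate))  # 4 s of noise as a stand-in utterance
with torch.no_grad():
    features, _ = model.extract_features(waveform)
frames = features[6].squeeze(0).numpy()  # one intermediate layer (layer choice is an assumption)

# The codebook would normally be trained over a whole corpus, not one utterance.
kmeans = KMeans(n_clusters=100, n_init=10).fit(frames)
units = kmeans.predict(frames)           # the low-bitrate discrete unit sequence
print(units[:20])
```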
A Preliminary Study of the Intrinsic Relationship between Complexity and Alignment
methods: Improves performance by controlling the complexity of instruction data. Proposes a method named "tree-instruct", which generates new instruction data by adding a specified number of nodes to the instruction semantic tree, and controls the difficulty level by adjusting the number of added nodes.
results: Experiments show that increasing complexity leads to sustained performance improvements; for example, using 1,000 instruction data points with 10 added nodes improves the win rate by 24%. Under the same token budget, a few complex instructions can outperform diverse but simple instructions. In addition, curriculum instruction tuning may not yield the anticipated results; the key lies in increasing complexity.
Abstract
Training large language models (LLMs) with open-domain instruction data has yielded remarkable success in aligning to end tasks and user preferences. Extensive research has highlighted that enhancing the quality and diversity of instruction data consistently improves performance. However, the impact of data complexity, as a crucial metric, remains relatively unexplored in three aspects: (1) scaling law, where the sustainability of performance improvements with increasing complexity is uncertain, (2) additional tokens, whether the improvement brought by complexity comes from introducing more training tokens, and (3) curriculum tuning, where the potential advantages of incorporating instructions ranging from easy to difficult are not yet fully understood. In this paper, we propose tree-instruct to systematically enhance the complexity of instruction data in a controllable manner. This approach adds a specified number of nodes into the instruction semantic tree, yielding new instruction data based on the modified tree. By adjusting the number of added nodes, we can control the difficulty level in the modified instruction data. Our preliminary experiments reveal the following insights: (1) Increasing complexity consistently leads to sustained performance improvements. For instance, using 1,000 instruction data and 10 nodes resulted in a substantial 24% increase in win rate. (2) Under the same token budget, a few complex instructions outperform diverse yet simple instructions. (3) Curriculum instruction tuning might not yield the anticipated results; focusing on increasing complexity appears to be the key.
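The node-count control behind tree-instruct can be pictured with a toy semantic tree: a fixed number of new nodes (extra constraints) is attached and the modified tree is rendered back into a harder instruction. In the paper both the tree manipulation and the rendering are done with an LLM; the dictionary tree and string rendering below are simplified stand-ins.

```python
import random

def add_nodes(tree: dict, new_nodes: list, rng=random.Random(0)) -> dict:
    """Attach each new node (an extra constraint) under a randomly chosen existing node."""
    for label in new_nodes:
        parent = rng.choice(list(tree))
        tree[parent].append(label)
        tree[label] = []
    return tree

def render(tree: dict, root: str) -> str:
    """Flatten the modified tree back into an instruction-like string."""
    return " -> ".join([root] + [render(tree, child) for child in tree[root]])

instruction_tree = {"write a summary": ["of the article"], "of the article": []}
harder = add_nodes(instruction_tree,
                   ["in three sentences", "for a lay audience", "citing one statistic"])
print(render(harder, "write a summary"))
```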
Finding Already Debunked Narratives via Multistage Retrieval: Enabling Cross-Lingual, Cross-Dataset and Zero-Shot Learning
for: This paper aims to detect stories that have already been debunked to reduce the manual efforts of professional fact-checkers and slow the spread of misinformation.
methods: The paper creates a novel dataset for cross-lingual retrieval of already debunked narratives using tweets as queries to a database of fact-checking articles. It also presents an extensive experiment to benchmark fine-tuned and off-the-shelf multilingual pre-trained Transformer models for this task.
results: The results show that the task of cross-lingual retrieval of already debunked narratives is challenging, and off-the-shelf Transformer models fail to outperform a strong lexical-based baseline (BM25). However, the paper’s multistage retrieval framework is robust and outperforms BM25 in most scenarios, enabling cross-domain and zero-shot learning without significantly harming the model’s performance.
Abstract
The task of retrieving already debunked narratives aims to detect stories that have already been fact-checked. The successful detection of claims that have already been debunked not only reduces the manual efforts of professional fact-checkers but can also contribute to slowing the spread of misinformation. Mainly due to the lack of readily available data, this is an understudied problem, particularly when considering the cross-lingual task, i.e. the retrieval of fact-checking articles in a language different from the language of the online post being checked. This paper fills this gap by (i) creating a novel dataset to enable research on cross-lingual retrieval of already debunked narratives, using tweets as queries to a database of fact-checking articles; (ii) presenting an extensive experiment to benchmark fine-tuned and off-the-shelf multilingual pre-trained Transformer models for this task; and (iii) proposing a novel multistage framework that divides this cross-lingual debunk retrieval task into refinement and re-ranking stages. Results show that the task of cross-lingual retrieval of already debunked narratives is challenging and off-the-shelf Transformer models fail to outperform a strong lexical-based baseline (BM25). Nevertheless, our multistage retrieval framework is robust, outperforming BM25 in most scenarios and enabling cross-domain and zero-shot learning, without significantly harming the model's performance.
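A toy sketch of the retrieve-then-re-rank structure described above: a lexical BM25 stage shortlists candidates, which a neural cross-encoder then re-ranks. The fact-check snippets, the query, and the English-only re-ranker are placeholders; the paper's setting is cross-lingual and its exact models are not specified here.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

fact_checks = [
    "No, 5G towers do not spread viruses.",
    "Claim that garlic cures COVID-19 is false.",
    "The moon landing was not staged, experts confirm.",
]
tweet = "my neighbour says eating garlic stops covid, is that true?"

# Stage 1 (refinement): BM25 over whitespace-tokenised fact-checking articles.
bm25 = BM25Okapi([doc.lower().split() for doc in fact_checks])
scores = bm25.get_scores(tweet.lower().split())
candidates = sorted(range(len(fact_checks)), key=lambda i: -scores[i])[:2]

# Stage 2 (re-ranking): a cross-encoder re-scores the shortlisted candidates.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(tweet, fact_checks[i]) for i in candidates]
reranked = sorted(zip(candidates, reranker.predict(pairs)), key=lambda p: -p[1])
print(fact_checks[reranked[0][0]])
```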