cs.CL - 2023-09-07

Evaluation and Mitigation of Agnosia in Multimodal Large Language Models

  • paper_url: http://arxiv.org/abs/2309.04041
  • repo_url: None
  • paper_authors: Jiaying Lu, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Baochen Sun, Carl Yang, Jie Yang
  • for: This study aims to evaluate and mitigate agnosia in Multimodal Large Language Models (MLLMs) in order to improve their performance on vision-language tasks.
  • methods: We propose a framework named EMMA (Evaluation and Mitigation of Multimodal Agnosia) consisting of an evaluation module and a mitigation module. The evaluation module automatically creates diverse visual question answering examples to assess the extent and aspects of agnosia in MLLMs; the mitigation module reduces agnosia through multimodal instruction tuning on fine-grained conversations.
  • results: We evaluate seven state-of-the-art MLLMs on 9K test samples and find that most exhibit agnosia to varying degrees and across different aspects. We further build a fine-grained instruction set and tune the MLLMs with it, which yields notable improvements in accuracy.
    Abstract While Multimodal Large Language Models (MLLMs) are widely used for a variety of vision-language tasks, one observation is that they sometimes misinterpret visual inputs or fail to follow textual instructions even in straightforward cases, leading to irrelevant responses, mistakes, and ungrounded claims. This observation is analogous to a phenomenon in neuropsychology known as Agnosia, an inability to correctly process sensory modalities and recognize things (e.g., objects, colors, relations). In our study, we adapt this similar concept to define "agnosia in MLLMs", and our goal is to comprehensively evaluate and mitigate such agnosia in MLLMs. Inspired by the diagnosis and treatment process in neuropsychology, we propose a novel framework EMMA (Evaluation and Mitigation of Multimodal Agnosia). In EMMA, we develop an evaluation module that automatically creates fine-grained and diverse visual question answering examples to assess the extent of agnosia in MLLMs comprehensively. We also develop a mitigation module to reduce agnosia in MLLMs through multimodal instruction tuning on fine-grained conversations. To verify the effectiveness of our framework, we evaluate and analyze agnosia in seven state-of-the-art MLLMs using 9K test samples. The results reveal that most of them exhibit agnosia across various aspects and degrees. We further develop a fine-grained instruction set and tune MLLMs to mitigate agnosia, which led to notable improvement in accuracy.
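The evaluation side of such a framework can be pictured as turning structured scene annotations into aspect-tagged visual questions and scoring a model per aspect. The sketch below is only a hedged illustration of that idea: the scene format, question templates, and the `ask_mllm` callable are invented here and are not EMMA's actual pipeline.

```python
from collections import defaultdict

# Hypothetical structured scene annotation for one image.
scene = {
    "objects": [{"name": "cup", "color": "red"}, {"name": "book", "color": "blue"}],
    "relations": [("cup", "on", "table")],
}

def make_probes(scene):
    """Turn scene annotations into (aspect, question, gold answer) probes."""
    probes = []
    for obj in scene["objects"]:
        probes.append(("object", f"Is there a {obj['name']} in the image?", "yes"))
        probes.append(("color", f"What color is the {obj['name']}?", obj["color"]))
    for subj, rel, obj in scene["relations"]:
        probes.append(("relation", f"Is the {subj} {rel} the {obj}?", "yes"))
    return probes

def evaluate_agnosia(ask_mllm, scene):
    """ask_mllm(question) -> answer string; returns per-aspect accuracy,
    so weaknesses on objects, colors, or relations show up separately."""
    correct, total = defaultdict(int), defaultdict(int)
    for aspect, question, gold in make_probes(scene):
        total[aspect] += 1
        if gold.lower() in ask_mllm(question).lower():
            correct[aspect] += 1
    return {aspect: correct[aspect] / total[aspect] for aspect in total}

# Example with a dummy model that always answers "yes":
print(evaluate_agnosia(lambda q: "yes", scene))
```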

Multiple Representation Transfer from Large Language Models to End-to-End ASR Systems

  • paper_url: http://arxiv.org/abs/2309.04031
  • repo_url: None
  • paper_authors: Takuma Udagawa, Masayuki Suzuki, Gakuto Kurata, Masayasu Muraoka, George Saon
  • for: Incorporating the knowledge of large language models (LLMs) into end-to-end automatic speech recognition (ASR) systems
  • methods: Explore a wide range of techniques for obtaining and transferring multiple LLM representations drawn from different layers, contexts, and models
  • results: Show that transferring multiple LLM representations into a transducer-based ASR system is an effective alternative to transferring only a single representation
    Abstract Transferring the knowledge of large language models (LLMs) is a promising technique to incorporate linguistic knowledge into end-to-end automatic speech recognition (ASR) systems. However, existing works only transfer a single representation of LLM (e.g. the last layer of pretrained BERT), while the representation of a text is inherently non-unique and can be obtained variously from different layers, contexts and models. In this work, we explore a wide range of techniques to obtain and transfer multiple representations of LLMs into a transducer-based ASR system. While being conceptually simple, we show that transferring multiple representations of LLMs can be an effective alternative to transferring only a single representation.
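As a rough illustration of what "multiple representations" can mean in practice, the sketch below pulls hidden states from every layer of a BERT model and fuses them with learned weights into a single target that an ASR encoder could be trained to match. The fusion scheme and the auxiliary-loss comment are assumptions for illustration, not the paper's exact transfer method.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

class MultiLayerFusion(nn.Module):
    """Mix representations from all layers with learned softmax weights."""
    def __init__(self, num_layers=13, hidden=768):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        stacked = torch.stack(hidden_states, dim=0)          # (layers, batch, seq, hidden)
        weights = torch.softmax(self.layer_weights, dim=0)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(0)  # (batch, seq, hidden)

fusion = MultiLayerFusion()
enc = tokenizer("hello world how are you", return_tensors="pt")
with torch.no_grad():
    hidden_states = bert(**enc).hidden_states                # tuple of 13 tensors (embeddings + 12 layers)
target = fusion(list(hidden_states))                          # fused LLM representation

# In training, the ASR encoder output (projected to the same width) would be
# pulled toward `target` with an auxiliary loss such as nn.MSELoss().
```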

TIDE: Textual Identity Detection for Evaluating and Augmenting Classification and Language Models

  • paper_url: http://arxiv.org/abs/2309.04027
  • repo_url: https://github.com/google-research-datasets/TIDAL
  • paper_authors: Emmanuel Klu, Sameer Sethi
  • for: Improving fairness in text classifiers and language models, especially when the underlying text datasets are unfair or imbalanced
  • methods: A new, more comprehensive identity lexicon (TIDAL) containing 15,123 identity terms and associated sense context across three demographic categories, plus an identity annotation and augmentation tool that improves the availability of identity context and the effectiveness of ML fairness techniques
  • results: The assistive annotation technique improves the reliability and speed of human-in-the-loop processes; the dataset and methods uncover more disparities during evaluation and produce fairer models during remediation
    Abstract Machine learning models can perpetuate unintended biases from unfair and imbalanced datasets. Evaluating and debiasing these datasets and models is especially hard in text datasets where sensitive attributes such as race, gender, and sexual orientation may not be available. When these models are deployed into society, they can lead to unfair outcomes for historically underrepresented groups. In this paper, we present a dataset coupled with an approach to improve text fairness in classifiers and language models. We create a new, more comprehensive identity lexicon, TIDAL, which includes 15,123 identity terms and associated sense context across three demographic categories. We leverage TIDAL to develop an identity annotation and augmentation tool that can be used to improve the availability of identity context and the effectiveness of ML fairness techniques. We evaluate our approaches using human contributors, and additionally run experiments focused on dataset and model debiasing. Results show our assistive annotation technique improves the reliability and velocity of human-in-the-loop processes. Our dataset and methods uncover more disparities during evaluation, and also produce more fair models during remediation. These approaches provide a practical path forward for scaling classifier and generative model fairness in real-world settings.
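A lexicon such as TIDAL can support annotation roughly as in the minimal sketch below; the lexicon entries are made-up stand-ins for illustration, not actual TIDAL records or the released tool (https://github.com/google-research-datasets/TIDAL).

```python
import re

# Illustrative entries only; real TIDAL records carry richer sense context.
identity_lexicon = {
    "woman":  {"category": "gender", "sense": "person"},
    "latina": {"category": "race/ethnicity", "sense": "person"},
    "gay":    {"category": "sexual orientation", "sense": "person"},
}

def annotate_identities(text: str):
    """Return (matched term, character span, metadata) for every lexicon hit."""
    annotations = []
    for term, meta in identity_lexicon.items():
        for match in re.finditer(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
            annotations.append((match.group(), match.span(), meta))
    return annotations

print(annotate_identities("The panel featured a Latina engineer and a gay rights activist."))
```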

LanSER: Language-Model Supported Speech Emotion Recognition

  • paper_url: http://arxiv.org/abs/2309.03978
  • repo_url: None
  • paper_authors: Taesik Gong, Josh Belanich, Krishna Somandepalli, Arsha Nagrani, Brian Eoff, Brendan Jou
  • for: Replacing costly human-labeled data so that speech emotion recognition can scale to large speech datasets and nuanced emotion taxonomies
  • methods: Weakly-supervised learning that infers weak emotion labels for unlabeled speech via pre-trained large language models
  • results: Models pre-trained with this weak supervision outperform baseline models on standard SER tasks when fine-tuned, show improved label efficiency, and their representations appear to capture the prosodic content of speech
    Abstract Speech emotion recognition (SER) models typically rely on costly human-labeled data for training, making scaling methods to large speech datasets and nuanced emotion taxonomies difficult. We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels via pre-trained large language models through weakly-supervised learning. For inferring weak labels constrained to a taxonomy, we use a textual entailment approach that selects an emotion label with the highest entailment score for a speech transcript extracted via automatic speech recognition. Our experimental results show that models pre-trained on large datasets with this weak supervision outperform other baseline models on standard SER datasets when fine-tuned, and show improved label efficiency. Despite being pre-trained on labels derived only from text, we show that the resulting representations appear to model the prosodic content of speech.
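The weak-labeling step can be approximated with an off-the-shelf zero-shot NLI pipeline, as in the sketch below; the `facebook/bart-large-mnli` checkpoint, the toy taxonomy, and the hypothesis template are illustrative assumptions, not the paper's exact setup.

```python
from transformers import pipeline

# Zero-shot classification scores each candidate label by textual entailment.
nli = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

EMOTION_TAXONOMY = ["anger", "joy", "sadness", "fear", "surprise", "neutral"]

def weak_emotion_label(asr_transcript: str) -> str:
    """Pick the taxonomy label whose hypothesis is most entailed by the
    ASR transcript of the utterance."""
    result = nli(
        asr_transcript,
        candidate_labels=EMOTION_TAXONOMY,
        hypothesis_template="The speaker sounds {}.",
    )
    return result["labels"][0]  # label with the highest entailment score

print(weak_emotion_label("I can't believe you did that again, this is ridiculous!"))
```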

ImageBind-LLM: Multi-modality Instruction Tuning

  • paper_url: http://arxiv.org/abs/2309.03905
  • repo_url: https://github.com/opengvlab/llama-adapter
  • paper_authors: Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao
  • for: A multi-modality instruction tuning method for large language models (LLMs) via ImageBind
  • methods: A learnable bind network aligns LLaMA's embedding space with ImageBind's image encoder, and the transformed image features are injected into the word tokens of all LLaMA layers through an attention-free, zero-initialized gating mechanism; only image-text alignment training is required.
  • results: Without any multi-modality training beyond image-text alignment, ImageBind-LLM responds to instructions in diverse modalities (audio, 3D point clouds, video, and their embedding-space arithmetic) while demonstrating strong language generation quality.
    Abstract We present ImageBind-LLM, a multi-modality instruction tuning method of large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning, different from which, our ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic by only image-text alignment training. During training, we adopt a learnable bind network to align the embedding space between LLaMA and ImageBind's image encoder. Then, the image features transformed by the bind network are added to word tokens of all layers in LLaMA, which progressively injects visual instructions via an attention-free and zero-initialized gating mechanism. Aided by the joint embedding of ImageBind, the simple image-text training enables our model to exhibit superior multi-modality instruction-following capabilities. During inference, the multi-modality inputs are fed into the corresponding ImageBind encoders, and processed by a proposed visual cache model for further cross-modal embedding enhancement. The training-free cache model retrieves from three million image features extracted by ImageBind, which effectively mitigates the training-inference modality discrepancy. Notably, with our approach, ImageBind-LLM can respond to instructions of diverse modalities and demonstrate significant language generation quality. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.
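The zero-initialized, attention-free gating can be pictured as below: features transformed by a bind network are added to the word tokens of each layer, scaled by a gate that starts at zero so training begins from the unmodified LLM. Dimensions and module structure here are illustrative, not the released implementation.

```python
import torch
from torch import nn

class BindGate(nn.Module):
    """Toy sketch of injecting ImageBind features into an LLM layer."""
    def __init__(self, imagebind_dim=1024, llm_dim=4096):
        super().__init__()
        self.bind_net = nn.Sequential(
            nn.Linear(imagebind_dim, llm_dim),
            nn.SiLU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Gate initialized to zero: at the start of training the injected
        # visual signal contributes nothing, preserving the pretrained LLM.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, word_tokens, image_embedding):
        # word_tokens: (batch, seq_len, llm_dim); image_embedding: (batch, imagebind_dim)
        visual = self.bind_net(image_embedding).unsqueeze(1)  # (batch, 1, llm_dim)
        return word_tokens + self.gate * visual               # broadcast over seq_len

layer_input = torch.randn(2, 16, 4096)
image_emb = torch.randn(2, 1024)
print(BindGate()(layer_input, image_emb).shape)  # torch.Size([2, 16, 4096])
```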

Zero-Shot Audio Captioning via Audibility Guidance

  • paper_url: http://arxiv.org/abs/2309.03884
  • repo_url: None
  • paper_authors: Tal Shaharabany, Ariel Shaulov, Lior Wolf
  • for: The paper proposes a method for audio captioning, with the goal of generating fluent and faithful text descriptions of audio files.
  • methods: A combination of three networks: a large language model (GPT-2), a multimodal matching network (ImageBind), and a text classifier. The method does not learn to perform captioning; captions are produced purely as an inference process over the input audio.
  • results: On the AudioCap dataset, the method significantly outperforms a baseline lacking audibility guidance, producing captions that are fluent, faithful to the input audio, and audible.
    Abstract The task of audio captioning is similar in essence to tasks such as image and video captioning. However, it has received much less attention. We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and the somewhat related (iii) audibility, which is the quality of being able to be perceived based only on audio. Our method is a zero-shot method, i.e., we do not learn to perform captioning. Instead, captioning occurs as an inference process that involves three networks that correspond to the three desired qualities: (i) A Large Language Model, in our case, for reasons of convenience, GPT-2, (ii) A model that provides a matching score between an audio file and a text, for which we use a multimodal matching network called ImageBind, and (iii) A text classifier, trained using a dataset we collected automatically by instructing GPT-4 with prompts designed to direct the generation of both audible and inaudible sentences. We present our results on the AudioCap dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline, which lacks this objective.

On Large Language Models’ Selection Bias in Multi-Choice Questions

  • paper_url: http://arxiv.org/abs/2309.03882
  • repo_url: None
  • paper_authors: Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang
  • for: This work examines large language models (LLMs) on multi-choice questions (MCQs) and finds that they exhibit an inherent selection bias, i.e., a preference for options placed at specific positions (such as Option C).
  • methods: A new method called PriDe that mitigates selection bias. PriDe first decomposes the observed prediction distribution into an intrinsic prediction over option contents and a prior distribution over option IDs; it then estimates the prior by permuting option contents on a small number of test samples and uses it to debias subsequent test samples.
  • results: Experiments show that PriDe debiases LLMs more effectively and with less computation than strong baselines, and the priors it estimates generalize well across domains.
    Abstract Multi-choice questions (MCQs) serve as a common yet important task format in the research of large language models (LLMs). Our work shows that LLMs exhibit an inherent "selection bias" in MCQs, which refers to LLMs' preferences to select options located at specific positions (like "Option C"). This bias is prevalent across various LLMs, making their performance vulnerable to option position changes in MCQs. We identify that one primary cause resulting in selection bias is option numbering, i.e., the ID symbols A/B/C/D associated with the options. To mitigate selection bias, we propose a new method called PriDe. PriDe first decomposes the observed model prediction distribution into an intrinsic prediction over option contents and a prior distribution over option IDs. It then estimates the prior by permutating option contents on a small number of test samples, which is used to debias the subsequent test samples. We demonstrate that, as a label-free, inference-time method, PriDe achieves a more effective and computation-efficient debiasing than strong baselines. We further show that the priors estimated by PriDe generalize well across different domains, highlighting its practical potential in broader scenarios.
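A small numpy sketch of the decomposition idea: under a multiplicative split of the observed distribution into an ID prior and an intrinsic content prediction, averaging log-probabilities over content permutations cancels the content term and leaves the prior, which can then be divided out at test time. The permutation bookkeeping and the toy numbers below are invented for illustration; the estimator shown only approximates PriDe's actual procedure.

```python
import numpy as np

def estimate_option_id_prior(obs_probs_per_perm: np.ndarray) -> np.ndarray:
    """Estimate the prior over option IDs (A/B/C/D) from observed prediction
    distributions obtained by cycling option contents through the ID slots.

    obs_probs_per_perm: shape (n_permutations, n_options); each row is the
    model's probability over IDs for one content permutation, and each
    content appears at each ID equally often across rows. Under
        P_obs(ID) ∝ prior(ID) * P_intrinsic(content placed at ID),
    averaging log-probabilities over such permutations cancels the content
    term and leaves the ID prior up to normalization."""
    log_prior = np.log(obs_probs_per_perm).mean(axis=0)
    prior = np.exp(log_prior)
    return prior / prior.sum()

def debias(obs_probs: np.ndarray, prior: np.ndarray) -> np.ndarray:
    """Divide out the estimated ID prior for a new sample and renormalize."""
    intrinsic = obs_probs / prior
    return intrinsic / intrinsic.sum()

# Toy example: a model that systematically over-weights option C.
perm_probs = np.array([
    [0.20, 0.20, 0.45, 0.15],
    [0.15, 0.25, 0.40, 0.20],
    [0.10, 0.20, 0.50, 0.20],
    [0.25, 0.15, 0.45, 0.15],
])
prior = estimate_option_id_prior(perm_probs)
print("estimated ID prior:", prior.round(3))
print("debiased prediction:", debias(np.array([0.10, 0.20, 0.55, 0.15]), prior).round(3))
```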

Introducing “Forecast Utterance” for Conversational Data Science

  • paper_url: http://arxiv.org/abs/2309.03877
  • repo_url: None
  • paper_authors: Md Mahadi Hassan, Alex Knipper, Shubhra Kanti Karmaker
  • for: Assisting users in conducting forecasting tasks through natural conversations, without requiring in-depth knowledge of the underlying machine learning processes
  • methods: Introduces the concept of a "Forecast Utterance" and frames the automatic, accurate interpretation of a user's prediction goal as a slot-filling problem, solved with two zero-shot methods: 1) Entity Extraction (EE) and 2) Question-Answering (QA) techniques
  • results: Experiments on three meticulously crafted datasets validate the feasibility of the goal and demonstrate the effectiveness of both EE and QA techniques in interpreting Forecast Utterances
    Abstract Envision an intelligent agent capable of assisting users in conducting forecasting tasks through intuitive, natural conversations, without requiring in-depth knowledge of the underlying machine learning (ML) processes. A significant challenge for the agent in this endeavor is to accurately comprehend the user's prediction goals and, consequently, formulate precise ML tasks. In this paper, we take a pioneering step towards this ambitious goal by introducing a new concept called Forecast Utterance and then focus on the automatic and accurate interpretation of users' prediction goals from these utterances. Specifically, we frame the task as a slot-filling problem, where each slot corresponds to a specific aspect of the goal prediction task. We then employ two zero-shot methods for solving the slot-filling task, namely: 1) Entity Extraction (EE), and 2) Question-Answering (QA) techniques. Our experiments, conducted with three meticulously crafted data sets, validate the viability of our ambitious goal and demonstrate the effectiveness of both EE and QA techniques in interpreting Forecast Utterances.
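The QA-style zero-shot slot filling can be sketched with a standard extractive QA pipeline, as below; the checkpoint, the slot names, and the question phrasings are assumptions for illustration, not the paper's setup.

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

utterance = "Can you predict next month's total sales for each store using last year's data?"

# Each slot of the ML prediction task is filled by asking a question
# against the user's Forecast Utterance.
slot_questions = {
    "target_variable": "What quantity should be predicted?",
    "time_horizon":    "For what time period should the prediction be made?",
    "grouping":        "At what level should the prediction be broken down?",
}

slots = {slot: qa(question=q, context=utterance)["answer"]
         for slot, q in slot_questions.items()}
print(slots)
```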

USA: Universal Sentiment Analysis Model & Construction of Japanese Sentiment Text Classification and Part of Speech Dataset

  • paper_url: http://arxiv.org/abs/2309.03787
  • repo_url: https://huggingface.co/ganchengguang/USA-7B-instruction-incontext-learning
  • paper_authors: Chengguang Gan, Qinghao Zhang, Tatsunori Mori
  • for: Improving sentiment analysis by exploiting the Mutual Reinforcement Effect (MRE) between individual word polarity and the overall sentiment of a text
  • methods: A large-language-model-based approach that leverages how word-level sentiment polarity influences the overarching sentiment of a passage, supported by four newly annotated Sentiment text Classification and Part of Speech (SCPOS) datasets and a 7-billion-parameter Universal Sentiment Analysis (USA) model
  • results: The model surpasses gpt-3.5-turbo on all four SCPOS datasets
    Abstract Sentiment analysis is a pivotal task in the domain of natural language processing. It encompasses both text-level sentiment polarity classification and word-level Part of Speech(POS) sentiment polarity determination. Such analysis challenges models to understand text holistically while also extracting nuanced information. With the rise of Large Language Models(LLMs), new avenues for sentiment analysis have opened. This paper proposes enhancing performance by leveraging the Mutual Reinforcement Effect(MRE) between individual words and the overall text. It delves into how word polarity influences the overarching sentiment of a passage. To support our research, we annotated four novel Sentiment Text Classification and Part of Speech(SCPOS) datasets, building upon existing sentiment classification datasets. Furthermore, we developed a Universal Sentiment Analysis(USA) model, with a 7-billion parameter size. Experimental results revealed that our model surpassed the performance of gpt-3.5-turbo across all four datasets, underscoring the significance of MRE in sentiment analysis.

The Daunting Dilemma with Sentence Encoders: Success on Standard Benchmarks, Failure in Capturing Basic Semantic Properties

  • paper_url: http://arxiv.org/abs/2309.03747
  • repo_url: None
  • paper_authors: Yash Mahajan, Naman Bansal, Shubhra Kanti Karmaker
  • for: A retrospective comparison of five widely used sentence encoders (Sentence-BERT, Universal Sentence Encoder (USE), LASER, InferSent, and Doc2vec) in terms of downstream-task performance versus their ability to capture basic semantic properties
  • methods: Evaluation of the five encoders on the SentEval benchmark, plus four newly designed semantic evaluation criteria: Paraphrasing, Synonym Replacement, Antonym Replacement, and Sentence Jumbling
  • results: Sentence-BERT and USE pass the paraphrasing criterion, with SBERT the stronger of the two; LASER dominates on synonym replacement; interestingly, all encoders fail the antonym replacement and jumbling criteria. Despite strong SentEval performance, these encoders still struggle to capture some basic semantic properties.
    Abstract In this paper, we adopted a retrospective approach to examine and compare five existing popular sentence encoders, i.e., Sentence-BERT, Universal Sentence Encoder (USE), LASER, InferSent, and Doc2vec, in terms of their performance on downstream tasks versus their capability to capture basic semantic properties. Initially, we evaluated all five sentence encoders on the popular SentEval benchmark and found that multiple sentence encoders perform quite well on a variety of popular downstream tasks. However, being unable to find a single winner in all cases, we designed further experiments to gain a deeper understanding of their behavior. Specifically, we proposed four semantic evaluation criteria, i.e., Paraphrasing, Synonym Replacement, Antonym Replacement, and Sentence Jumbling, and evaluated the same five sentence encoders using these criteria. We found that the Sentence-Bert and USE models pass the paraphrasing criterion, with SBERT being the superior between the two. LASER dominates in the case of the synonym replacement criterion. Interestingly, all the sentence encoders failed the antonym replacement and jumbling criteria. These results suggest that although these popular sentence encoders perform quite well on the SentEval benchmark, they still struggle to capture some basic semantic properties, thus, posing a daunting dilemma in NLP research.
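The four criteria amount to checking how an encoder's similarity scores react to controlled perturbations of a sentence. Below is a minimal probe along those lines, assuming a Sentence-BERT checkpoint and hand-written perturbations rather than the paper's actual data or thresholds.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

original = "The service at the restaurant was excellent."
variants = {
    "paraphrase":       "The restaurant's service was outstanding.",
    "synonym_replaced": "The service at the restaurant was superb.",
    "antonym_replaced": "The service at the restaurant was terrible.",
    "jumbled":          "restaurant The at excellent was the service.",
}

emb_orig = model.encode(original, convert_to_tensor=True)
for name, sentence in variants.items():
    emb = model.encode(sentence, convert_to_tensor=True)
    sim = util.cos_sim(emb_orig, emb).item()
    # An encoder that captures these properties should keep paraphrase and
    # synonym similarities high, and push antonym/jumbled similarities down.
    print(f"{name:17s} cosine similarity = {sim:.3f}")
```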

Word segmentation granularity in Korean

  • paper_url: http://arxiv.org/abs/2309.03713
  • repo_url: None
  • paper_authors: Jungyeul Park, Mija Kim
  • for: Analyzing word segmentation granularity in Korean language processing
  • methods: Analysis of the multiple granularity levels that have been proposed and used for specific language processing and corpus annotation tasks, with examples of Korean language processing systems
  • results: Segmenting only functional morphemes (including case markers and verbal endings) while keeping other derivational suffixes attached yields the best phrase-structure parsing performance, contradicting the previous de facto standard of separating all morphemes.
    Abstract This paper describes word segmentation granularity in Korean language processing. From a word separated by blank space, which is termed an eojeol, to a sequence of morphemes in Korean, there are multiple possible levels of word segmentation granularity in Korean. For specific language processing and corpus annotation tasks, several different granularity levels have been proposed and utilized, because the agglutinative languages including Korean language have a one-to-one mapping between functional morpheme and syntactic category. Thus, we analyze these different granularity levels, presenting the examples of Korean language processing systems for future reference. Interestingly, the granularity by separating only functional morphemes including case markers and verbal endings, and keeping other suffixes for morphological derivation results in the optimal performance for phrase structure parsing. This contradicts previous best practices for Korean language processing, which has been the de facto standard for various applications that require separating all morphemes.

Exploring an LM to generate Prolog Predicates from Mathematics Questions

  • paper_url: http://arxiv.org/abs/2309.03667
  • repo_url: None
  • paper_authors: Xiaocheng Yang, Yik-Cheung Tam
  • for: This paper investigates whether fine-tuning a model to generate Prolog predicates, which are then passed to a compiler, can improve reasoning accuracy on mathematics questions.
  • methods: Chain-of-thought fine-tuning of LLaMA7B as a baseline, plus fine-tuned LLaMA7B variants that generate Prolog code, Prolog code + chain-of-thought, and chain-of-thought + Prolog code, respectively.
  • results: The Prolog generation model surpasses the baseline, while the combined generation models do not yield significant improvements. The GSM8K-based Prolog corpus and the corresponding fine-tuned Prolog generation model are released to the research community.
    Abstract Recently, there has been a surge in interest in NLP driven by ChatGPT. ChatGPT, a transformer-based generative language model of substantial scale, exhibits versatility in performing various tasks based on natural language. Nevertheless, large language models often exhibit poor performance in solving mathematics questions that require reasoning. Prior research has demonstrated the effectiveness of chain-of-thought prompting in enhancing reasoning capabilities. Now, we aim to investigate whether fine-tuning a model for the generation of Prolog codes, a logic language, and subsequently passing these codes to a compiler can further improve accuracy. Consequently, we employ chain-of-thought to fine-tune LLaMA7B as a baseline model and develop other fine-tuned LLaMA7B models for the generation of Prolog code, Prolog code + chain-of-thought, and chain-of-thought + Prolog code, respectively. The results reveal that the Prolog generation model surpasses the baseline in performance, while the combination generation models do not yield significant improvements. The Prolog corpus based on GSM8K and the correspondingly finetuned Prolog generation model based on LLaMA7B are released to the research community.
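The "generate Prolog, then execute it" pipeline can be sketched as writing the model's output to a file and evaluating it with SWI-Prolog. The example program below is hand-written to stand in for model output, and the sketch assumes `swipl` is installed.

```python
import subprocess
import tempfile
from pathlib import Path

generated_prolog = """
% Hypothetical model output for: "Tom has 3 apples and buys 5 more,
% then gives away 2. How many apples does he have?"
answer(X) :- Start = 3, Bought = 5, Given = 2, X is Start + Bought - Given.
"""

with tempfile.TemporaryDirectory() as tmp:
    program = Path(tmp) / "solution.pl"
    program.write_text(generated_prolog)
    # Load the generated program, run the goal, print the bound answer.
    result = subprocess.run(
        ["swipl", "-q", "-f", str(program), "-g", "answer(X), write(X), nl", "-t", "halt"],
        capture_output=True, text=True, timeout=10,
    )
    print(result.stdout.strip())  # expected: 6
```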

BNS-Net: A Dual-channel Sarcasm Detection Method Considering Behavior-level and Sentence-level Conflicts

  • paper_url: http://arxiv.org/abs/2309.03658
  • repo_url: None
  • paper_authors: Liming Zhou, Xiaowei Xu, Xiaodong Wang
  • for: Developing a model that effectively identifies sarcastic expressions in text, which matters for human-computer interaction in real-world applications
  • methods: A dual-channel sarcasm detection model, BNS-Net, that considers behavior-level and sentence-level conflicts: Channel 1 reconstructs the text around core verbs and uses a modified attention mechanism to highlight conflict information, while Channel 2 introduces external sentiment knowledge to segment the text into explicit and implicit sentences and capture the conflicts between them
  • results: Comparative and ablation experiments on three public sarcasm datasets show that BNS-Net effectively identifies sarcasm in text and achieves state-of-the-art performance
    Abstract Sarcasm detection is a binary classification task that aims to determine whether a given utterance is sarcastic. Over the past decade, sarcasm detection has evolved from classical pattern recognition to deep learning approaches, where features such as user profile, punctuation and sentiment words have been commonly employed for sarcasm detection. In real-life sarcastic expressions, behaviors without explicit sentimental cues often serve as carriers of implicit sentimental meanings. Motivated by this observation, we proposed a dual-channel sarcasm detection model named BNS-Net. The model considers behavior and sentence conflicts in two channels. Channel 1: Behavior-level Conflict Channel reconstructs the text based on core verbs while leveraging the modified attention mechanism to highlight conflict information. Channel 2: Sentence-level Conflict Channel introduces external sentiment knowledge to segment the text into explicit and implicit sentences, capturing conflicts between them. To validate the effectiveness of BNS-Net, several comparative and ablation experiments are conducted on three public sarcasm datasets. The analysis and evaluation of experimental results demonstrate that the BNS-Net effectively identifies sarcasm in text and achieves the state-of-the-art performance.

Loquacity and Visible Emotion: ChatGPT as a Policy Advisor

  • paper_url: http://arxiv.org/abs/2309.03595
  • repo_url: None
  • paper_authors: Claudia Biancotti, Carolina Camassa
  • for: Assessing the potential of ChatGPT for complex writing tasks
  • methods: The authors ask ChatGPT to compose a policy brief for the Board of the Bank of Italy and evaluate the generated content
  • results: ChatGPT can accelerate workflows by providing well-structured content suggestions and producing extensive, linguistically correct text in a matter of seconds, but it requires significant expert supervision; used naively, the output can be incorrect, superficial, or irrelevant
    Abstract ChatGPT, a software seeking to simulate human conversational abilities, is attracting increasing attention. It is sometimes portrayed as a groundbreaking productivity aid, including for creative work. In this paper, we run an experiment to assess its potential in complex writing tasks. We ask the software to compose a policy brief for the Board of the Bank of Italy. We find that ChatGPT can accelerate workflows by providing well-structured content suggestions, and by producing extensive, linguistically correct text in a matter of seconds. It does, however, require a significant amount of expert supervision, which partially offsets productivity gains. If the app is used naively, output can be incorrect, superficial, or irrelevant. Superficiality is an especially problematic limitation in the context of policy advice intended for high-level audiences.

Evaluating the Efficacy of Supervised Learning vs Large Language Models for Identifying Cognitive Distortions and Suicidal Risks in Chinese Social Media

  • paper_url: http://arxiv.org/abs/2309.03564
  • repo_url: https://github.com/thudm/chatglm2-6b
  • paper_authors: Hongzhi Qi, Qing Zhao, Changwei Song, Wei Zhai, Dan Luo, Shuo Liu, Yi Jing Yu, Fan Wang, Huijing Zou, Bing Xiang Yang, Jianqiang Li, Guanghui Fu
  • for: Exploring the applicability of large language models to psychology-related tasks on Chinese social media platforms
  • methods: Supervised learning as a baseline, compared against large language models under three strategies: zero-shot, few-shot, and fine-tuning
  • results: A clear performance gap between the large language models and traditional supervised learning on these tasks, attributed mainly to the models' inability to fully grasp subtle categories; GPT-4 outperforms its counterparts in multiple scenarios, and GPT-3.5 improves markedly on suicide risk classification after fine-tuning
    Abstract Large language models, particularly those akin to the rapidly progressing GPT series, are gaining traction for their expansive influence. While there is keen interest in their applicability within medical domains such as psychology, tangible explorations on real-world data remain scant. Concurrently, users on social media platforms are increasingly vocalizing personal sentiments; under specific thematic umbrellas, these sentiments often manifest as negative emotions, sometimes escalating to suicidal inclinations. Timely discernment of such cognitive distortions and suicidal risks is crucial to effectively intervene and potentially avert dire circumstances. Our study ventured into this realm by experimenting on two pivotal tasks: suicidal risk and cognitive distortion identification on Chinese social media platforms. Using supervised learning as a baseline, we examined and contrasted the efficacy of large language models via three distinct strategies: zero-shot, few-shot, and fine-tuning. Our findings revealed a discernible performance gap between the large language models and traditional supervised learning approaches, primarily attributed to the models' inability to fully grasp subtle categories. Notably, while GPT-4 outperforms its counterparts in multiple scenarios, GPT-3.5 shows significant enhancement in suicide risk classification after fine-tuning. To our knowledge, this investigation stands as the maiden attempt at gauging large language models on Chinese social media tasks. This study underscores the forward-looking and transformative implications of using large language models in the field of psychology. It lays the groundwork for future applications in psychological research and practice.

All Labels Together: Low-shot Intent Detection with an Efficient Label Semantic Encoding Paradigm

  • paper_url: http://arxiv.org/abs/2309.03563
  • repo_url: None
  • paper_authors: Jiangshu Du, Congying Xia, Wenpeng Yin, Tingting Liang, Philip S. Yu
  • for: An end-to-end One-to-All system that compares an input utterance with all intent label candidates at once, fully exploiting label semantics for few-shot intent detection
  • methods: The One-to-All architecture together with a novel pretraining strategy that uses indirect supervision from paraphrasing, enabling zero-shot cross-domain generalization
  • results: On three few-shot intent detection tasks, One-to-All is especially effective when training resources are extremely scarce, achieving state-of-the-art performance in 1-, 3-, and 5-shot settings
    Abstract In intent detection tasks, leveraging meaningful semantic information from intent labels can be particularly beneficial for few-shot scenarios. However, existing few-shot intent detection methods either ignore the intent labels, (e.g. treating intents as indices) or do not fully utilize this information (e.g. only using part of the intent labels). In this work, we present an end-to-end One-to-All system that enables the comparison of an input utterance with all label candidates. The system can then fully utilize label semantics in this way. Experiments on three few-shot intent detection tasks demonstrate that One-to-All is especially effective when the training resource is extremely scarce, achieving state-of-the-art performance in 1-, 3- and 5-shot settings. Moreover, we present a novel pretraining strategy for our model that utilizes indirect supervision from paraphrasing, enabling zero-shot cross-domain generalization on intent detection tasks. Our code is at https://github.com/jiangshdd/AllLablesTogether.
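A hedged sketch of the one-to-all idea, scoring every intent label inside a single joint input rather than pairwise: the encoder choice, the [SEP]-based label pooling, and the (here untrained) linear scorer are illustrative assumptions, not the authors' architecture.

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)
scorer = nn.Linear(encoder.config.hidden_size, 1)  # trained jointly in practice

def one_to_all_scores(utterance: str, intent_labels: list[str]) -> torch.Tensor:
    # Build a single input holding the utterance and *all* label candidates,
    # so every label is scored in the context of the others.
    joint = utterance + " " + " ".join(f"[SEP] {label}" for label in intent_labels)
    enc = tokenizer(joint, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state[0]            # (seq_len, hidden)

    # Locate the first token of each label span and score it.
    sep_id = tokenizer.sep_token_id
    sep_positions = (enc["input_ids"][0] == sep_id).nonzero().squeeze(-1)
    label_starts = sep_positions[:-1] + 1                   # skip the final appended [SEP]
    label_reprs = hidden[label_starts]                      # (num_labels, hidden)
    return scorer(label_reprs).squeeze(-1)                  # one score per label

labels = ["book_flight", "cancel_booking", "check_balance"]
print(one_to_all_scores("I want to fly to Tokyo next week", labels))
```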

An Anchor Learning Approach for Citation Field Learning

  • paper_url: http://arxiv.org/abs/2309.03559
  • repo_url: None
  • paper_authors: Zilin Yuan, Borun Chen, Yimeng Dai, Yinghui Li, Hai-Tao Zheng, Rui Zhang
  • for: Improving citation field learning with a new anchor-learning-based algorithm, CIFAL
  • methods: Anchor learning, which is model-agnostic for any pre-trained language model, helps capture citation patterns from data in different citation styles
  • results: CIFAL outperforms state-of-the-art methods in citation field learning, improving field-level F1-scores by 2.83%, and extensive analysis confirms its effectiveness both quantitatively and qualitatively
    Abstract Citation field learning is to segment a citation string into fields of interest such as author, title, and venue. Extracting such fields from citations is crucial for citation indexing, researcher profile analysis, etc. User-generated resources like academic homepages and Curriculum Vitae, provide rich citation field information. However, extracting fields from these resources is challenging due to inconsistent citation styles, incomplete sentence syntax, and insufficient training data. To address these challenges, we propose a novel algorithm, CIFAL (citation field learning by anchor learning), to boost the citation field learning performance. CIFAL leverages the anchor learning, which is model-agnostic for any Pre-trained Language Model, to help capture citation patterns from the data of different citation styles. The experiments demonstrate that CIFAL outperforms state-of-the-art methods in citation field learning, achieving a 2.83% improvement in field-level F1-scores. Extensive analysis of the results further confirms the effectiveness of CIFAL quantitatively and qualitatively.

Machine Learning for Tangible Effects: Natural Language Processing for Uncovering the Illicit Massage Industry & Computer Vision for Tactile Sensing

  • paper_url: http://arxiv.org/abs/2309.03470
  • repo_url: None
  • paper_authors: Rui Ouyang
  • for: This thesis explores how computer science can be used to fight human trafficking, specifically in the illicit massage industry, and how computer vision can create a sense of touch.
  • methods: The thesis uses natural language processing (NLP) to monitor the industry and create datasets, and also considers the use of agent-based models to create synthetic financial data. Additionally, the thesis describes the development of a novel sensor, the Digger Finger, which adapts the Gelsight sensor to find objects in granular media, and a low-cost six-axis force-torque sensor using a webcam and printed reference marker.
  • results: The thesis shows how NLP can be used to derive insights into the labor pressures and language barriers faced by employees in the industry, as well as the income, demographics, and societal pressures affecting sex buyers. Additionally, the thesis reports on the development of a novel sensor that is up to a hundred times less expensive than commercial sensors, allowing for a wider range of applications.
    Abstract I explore two questions in this thesis: how can computer science be used to fight human trafficking? And how can computer vision create a sense of touch? I use natural language processing (NLP) to monitor the United States illicit massage industry (IMI), a multi-billion dollar industry that offers not just therapeutic massages but also commercial sexual services. Employees of this industry are often immigrant women with few job opportunities, leaving them vulnerable to fraud, coercion, and other facets of human trafficking. Monitoring spatiotemporal trends helps prevent trafficking in the IMI. By creating datasets with three publicly-accessible websites: Google Places, Rubmaps, and AMPReviews, combined with NLP techniques such as bag-of-words and Word2Vec, I show how to derive insights into the labor pressures and language barriers that employees face, as well as the income, demographics, and societal pressures affecting sex buyers. I include a call-to-action to other researchers given these datasets. I also consider how to create synthetic financial data, which can aid with counter-trafficking in the banking sector. I use an agent-based model to create both tabular and payee-recipient graph data. I then consider the role of computer vision in making tactile sensors. I report on a novel sensor, the Digger Finger, that adapts the Gelsight sensor to find objects in granular media. Changes include using a wedge shape to facilitate digging, replacing the internal lighting LEDs with fluorescent paint, and adding a vibrator motor to counteract jamming. Finally, I also show how to use a webcam and a printed reference marker, or fiducial, to create a low-cost six-axis force-torque sensor. This sensor is up to a hundred times less expensive than commercial sensors, allowing for a wider range of applications. For this and earlier chapters I release design files and code as open source.
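The two text-featurization techniques named above (bag-of-words and Word2Vec) can be sketched on toy review-like snippets as follows; the snippets are invented placeholders, not data drawn from Google Places, Rubmaps, or AMPReviews.

```python
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec

reviews = [
    "new staff every month prices listed in cash only",
    "open very late door locked ring bell to enter",
    "friendly staff clean rooms standard prices",
]

# Bag-of-words: sparse token-count features for downstream classifiers.
bow = CountVectorizer()
X = bow.fit_transform(reviews)
print(X.shape, bow.get_feature_names_out()[:5])

# Word2Vec: dense word vectors learned from the same corpus.
w2v = Word2Vec([r.split() for r in reviews], vector_size=32, min_count=1, epochs=50)
print(w2v.wv.most_similar("staff", topn=3))
```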

Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty

  • paper_url: http://arxiv.org/abs/2309.03433
  • repo_url: None
  • paper_authors: Chen Ling, Xujiang Zhao, Xuchao Zhang, Yanchi Liu, Wei Cheng, Haoyu Wang, Zhengzhang Chen, Takao Osaki, Katsushi Matsuda, Haifeng Chen, Liang Zhao
  • for: Improving the performance of large language models (LLMs) on the open information extraction (OIE) task
  • methods: Various in-context learning strategies to enhance the LLMs' instruction-following ability, plus a demonstration uncertainty quantification module to enhance the confidence of the generated relations
  • results: Experiments on three OIE benchmark datasets show that the approach holds its own against established supervised methods, both quantitatively and qualitatively
    Abstract Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text, typically in the form of (subject, relation, object) triples. Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks due to two key issues. First, LLMs struggle to distinguish irrelevant context from relevant relations and generate structured output due to the restrictions on fine-tuning the model. Second, LLMs generates responses autoregressively based on probability, which makes the predicted relations lack confidence. In this paper, we assess the capabilities of LLMs in improving the OIE task. Particularly, we propose various in-context learning strategies to enhance LLM's instruction-following ability and a demonstration uncertainty quantification module to enhance the confidence of the generated relations. Our experiments on three OIE benchmark datasets show that our approach holds its own against established supervised methods, both quantitatively and qualitatively.
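In-context OIE prompting can be sketched as below. The `generate` callable is a hypothetical stand-in for any LLM completion API, and the voting filter is a generic self-consistency-style confidence heuristic, not the paper's demonstration uncertainty quantification module.

```python
from collections import Counter

DEMONSTRATIONS = """Extract (subject; relation; object) triples from the sentence.

Sentence: Marie Curie was born in Warsaw.
Triples: (Marie Curie; was born in; Warsaw)

Sentence: The company acquired the startup in 2020 for $2 billion.
Triples: (The company; acquired; the startup) (The company; acquired the startup in; 2020)
"""

def extract_triples(sentence: str, generate, num_samples: int = 5, min_votes: int = 3):
    """generate(prompt) -> completion string (hypothetical LLM call)."""
    prompt = f"{DEMONSTRATIONS}\nSentence: {sentence}\nTriples:"
    votes = Counter()
    for _ in range(num_samples):            # sample several completions
        completion = generate(prompt)
        for triple in completion.strip().split(") ("):
            votes[triple.strip("() \n")] += 1
    # Keep only triples the model produces consistently across samples.
    return [t for t, v in votes.items() if v >= min_votes]
```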

From Base to Conversational: Japanese Instruction Dataset and Tuning Large Language Models

  • paper_url: http://arxiv.org/abs/2309.03412
  • repo_url: https://github.com/retarfi/jallm
  • paper_authors: Masahiro Suzuki, Masanori Hirano, Hiroki Sakaji
  • for: Constructing a Japanese instruction dataset and verifying that instruction tuning is effective for making large language models (LLMs) interactive
  • methods: Expanding and filtering existing datasets to build a Japanese instruction dataset, then applying Low-Rank Adaptation (LoRA) tuning to existing Japanese and English pre-trained base models
  • results: The effectiveness of the Japanese instruction dataset is confirmed, and instruction tuning improves downstream-task performance even for relatively small LLMs. The instruction dataset, tuned models, and implementation are publicly available online.
    Abstract Instruction tuning is essential for large language models (LLMs) to become interactive. While many instruction tuning datasets exist in English, there is a noticeable lack in other languages. Also, their effectiveness has not been well verified in non-English languages. We construct a Japanese instruction dataset by expanding and filtering existing datasets and apply the dataset to a Japanese pre-trained base model. We performed Low-Rank Adaptation (LoRA) tuning on both Japanese and English existing models using our instruction dataset. We evaluated these models from both quantitative and qualitative perspectives. As a result, the effectiveness of Japanese instruction datasets is confirmed. The results also indicate that even with relatively small LLMs, performances in downstream tasks would be improved through instruction tuning. Our instruction dataset, tuned models, and implementation are publicly available online.
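LoRA instruction tuning of a causal LM can be set up with Hugging Face peft roughly as follows; the base model name, LoRA hyperparameters, and prompt format are illustrative choices, not the paper's exact configuration (their code is at https://github.com/retarfi/jallm).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "rinna/japanese-gpt-neox-3.6b"   # assumed Japanese base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(base_model_name)

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query_key_value"],    # attention projection in GPT-NeoX-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the LoRA adapters are trainable

# Instruction examples are formatted as prompt/response pairs before tuning.
example = {
    "instruction": "次の文を英語に翻訳してください。",
    "input": "今日はいい天気です。",
    "output": "The weather is nice today.",
}
prompt = f"### 指示:\n{example['instruction']}\n### 入力:\n{example['input']}\n### 応答:\n"
```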