cs.CL - 2023-09-21

Towards Lexical Analysis of Dog Vocalizations via Online Videos

  • paper_url: http://arxiv.org/abs/2309.13086
  • repo_url: None
  • paper_authors: Yufei Wang, Chunhao Zhang, Jieyi Huang, Mengyue Wu, Kenny Zhu
  • for: This work tackles the grand challenge of deciphering the semantics of dog vocalizations.
  • methods: A data-driven approach that analyzes the conditional probability between dog vocalizations and the corresponding locations and activities to probe the semantics of dog sounds.
  • results: The study finds evidence supporting previous observational research on the semantics of dog sounds (e.g., growls can signify interaction) and yields new findings, such as that whimpers can be subdivided into two types: attention-seeking and discomfort.
    Abstract Deciphering the semantics of animal language has been a grand challenge. This study presents a data-driven investigation into the semantics of dog vocalizations via correlating different sound types with consistent semantics. We first present a new dataset of Shiba Inu sounds, along with contextual information such as location and activity, collected from YouTube with a well-constructed pipeline. The framework is also applicable to other animal species. Based on the analysis of conditional probability between dog vocalizations and corresponding location and activity, we discover supporting evidence for previous heuristic research on the semantic meaning of various dog sounds. For instance, growls can signify interactions. Furthermore, our study yields new insights: existing word types can be subdivided into finer-grained subtypes, and the minimal semantic unit for Shiba Inu is word-related. For example, whimper can be subdivided into two types, attention-seeking and discomfort.
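The paper's core analysis, correlating sound types with contexts via conditional probability, can be sketched as follows. This is a minimal illustration with made-up labels and counts; the actual dataset, sound taxonomy, and context annotations come from the authors' YouTube pipeline.

```python
from collections import Counter, defaultdict

def conditional_probabilities(observations):
    """Estimate P(context | sound_type) from (sound_type, context) pairs."""
    counts = defaultdict(Counter)
    for sound, context in observations:
        counts[sound][context] += 1
    probs = {}
    for sound, ctx_counts in counts.items():
        total = sum(ctx_counts.values())
        probs[sound] = {ctx: n / total for ctx, n in ctx_counts.items()}
    return probs

# Hypothetical annotations: (sound type, activity context)
obs = [
    ("growl", "interaction"), ("growl", "interaction"), ("growl", "alone"),
    ("whimper", "seeking-attention"), ("whimper", "discomfort"),
]
p = conditional_probabilities(obs)
print(p["growl"]["interaction"])  # 2 of 3 growls co-occur with interaction
```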

Foundation Metrics: Quantifying Effectiveness of Healthcare Conversations powered by Generative AI

  • paper_url: http://arxiv.org/abs/2309.12444
  • repo_url: None
  • paper_authors: Mahyar Abbasian, Elahe Khatibi, Iman Azimi, David Oniani, Zahra Shakeri Hossein Abad, Alexander Thieme, Ram Sriram, Zhongqi Yang, Yanshan Wang, Bryant Lin, Olivier Gevaert, Li-Jia Li, Ramesh Jain, Amir M. Rahmani
  • for: To establish evaluation metrics for healthcare chatbots, with the goal of improving patient health outcomes.
  • methods: The study surveys existing evaluation metrics for large language models and adapts and extends them to the specific characteristics of healthcare chatbots.
  • results: Existing metrics cannot fully assess healthcare chatbots because they lack an understanding of medical concepts and patient needs; the proposed metrics better capture language processing ability, impact on real-world clinical tasks, and effectiveness in user-interactive conversations.
    Abstract Generative Artificial Intelligence is set to revolutionize healthcare delivery by transforming traditional patient care into a more personalized, efficient, and proactive process. Chatbots, serving as interactive conversational models, will probably drive this patient-centered transformation in healthcare. Through the provision of various services, including diagnosis, personalized lifestyle recommendations, and mental health support, the objective is to substantially augment patient health outcomes, all the while mitigating the workload burden on healthcare providers. The life-critical nature of healthcare applications necessitates establishing a unified and comprehensive set of evaluation metrics for conversational models. Existing evaluation metrics proposed for various generic large language models (LLMs) demonstrate a lack of comprehension regarding medical and health concepts and their significance in promoting patients' well-being. Moreover, these metrics neglect pivotal user-centered aspects, including trust-building, ethics, personalization, empathy, user comprehension, and emotional support. The purpose of this paper is to explore state-of-the-art LLM-based evaluation metrics that are specifically applicable to the assessment of interactive conversational models in healthcare. Subsequently, we present a comprehensive set of evaluation metrics designed to thoroughly assess the performance of healthcare chatbots from an end-user perspective. These metrics encompass an evaluation of language processing abilities, impact on real-world clinical tasks, and effectiveness in user-interactive conversations. Finally, we engage in a discussion concerning the challenges associated with defining and implementing these metrics, with particular emphasis on confounding factors such as the target audience, evaluation methods, and prompt techniques involved in the evaluation process.

Active Learning for Multilingual Fingerspelling Corpora

  • paper_url: http://arxiv.org/abs/2309.12443
  • repo_url: None
  • paper_authors: Shuai Wang, Eric Nalisnick
  • for: Alleviating data scarcity in sign languages.
  • methods: Active learning.
  • results: Pre-training may exploit shared hand configurations across sign languages, but the observed benefit may stem from visual rather than linguistic similarities.
    Abstract We apply active learning to help with data scarcity problems in sign languages. In particular, we perform a novel analysis of the effect of pre-training. Since many sign languages are linguistic descendants of French sign language, they share hand configurations, which pre-training can hopefully exploit. We test this hypothesis on American, Chinese, German, and Irish fingerspelling corpora. We do observe a benefit from pre-training, but this may be due to visual rather than linguistic similarities.
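The active-learning setup can be sketched as a standard uncertainty-sampling loop. Everything below is a toy stand-in: the paper's acquisition strategy, models, and fingerspelling data are not reproduced here.

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_sampling(unlabeled, predict_proba, budget):
    """Pick the `budget` pool items the model is least certain about."""
    return sorted(unlabeled, key=lambda x: entropy(predict_proba(x)), reverse=True)[:budget]

# Toy binary "model": each item's positive-class probability is the item itself
pool = [0.1, 0.45, 0.9, 0.55, 0.99]
proba = lambda x: (x, 1 - x)
picked = uncertainty_sampling(pool, proba, budget=2)
print(picked)  # the two most uncertain items, i.e. those closest to 0.5
```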

Reranking for Natural Language Generation from Logical Forms: A Study based on Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12294
  • repo_url: None
  • paper_authors: Levon Haroutunian, Zhuang Li, Lucian Galescu, Philip Cohen, Raj Tumuluri, Gholamreza Haffari
  • for: Improving the quality of natural language generated by large language models (LLMs) from logical forms.
  • methods: A generate-and-rerank approach: first prompt an LLM to generate a set of candidate outputs, then rerank them with a task-specific reranker model.
  • results: Extensive experiments show the approach improves the quality of LLM outputs in terms of semantic consistency and fluency.
    Abstract Large language models (LLMs) have demonstrated impressive capabilities in natural language generation. However, their output quality can be inconsistent, posing challenges for generating natural language from logical forms (LFs). This task requires the generated outputs to embody the exact semantics of LFs, without missing any LF semantics or creating any hallucinations. In this work, we tackle this issue by proposing a novel generate-and-rerank approach. Our approach involves initially generating a set of candidate outputs by prompting an LLM and subsequently reranking them using a task-specific reranker model. In addition, we curate a manually collected dataset to evaluate the alignment between different ranking metrics and human judgements. The chosen ranking metrics are utilized to enhance the training and evaluation of the reranker model. By conducting extensive experiments on three diverse datasets, we demonstrate that the candidates selected by our reranker outperform those selected by baseline methods in terms of semantic consistency and fluency, as measured by three comprehensive metrics. Our findings provide strong evidence for the effectiveness of our approach in improving the quality of generated outputs.
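The generate-and-rerank procedure reduces to a few lines. The sampler and reranker below are toy stand-ins for the LLM and the task-specific reranker model described in the abstract.

```python
import itertools

def generate_and_rerank(prompt, generate, rerank_score, n=5):
    """Sample n candidate outputs, then return the reranker's top choice."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=rerank_score)

# Stand-ins for the LLM sampler and the task-specific reranker
fake_outputs = itertools.cycle(["ok answer", "bad", "best answer here"])
generate = lambda prompt: next(fake_outputs)
rerank_score = len  # toy reranker: longer output scores higher
result = generate_and_rerank("Logical form: ...", generate, rerank_score, n=3)
print(result)  # -> best answer here
```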

Inspire the Large Language Model by External Knowledge on BioMedical Named Entity Recognition

  • paper_url: http://arxiv.org/abs/2309.12278
  • repo_url: None
  • paper_authors: Junyi Bian, Jiaxuan Zheng, Yuyi Zhang, Shanfeng Zhu
  • for: Tackling Biomedical Named Entity Recognition (BioNER) with large language models (LLMs).
  • methods: A two-step approach that decomposes NER into entity span extraction and entity type determination; for type determination, entity knowledge is injected into the LLM to compensate for its lack of domain knowledge.
  • results: The two-step BioNER approach significantly improves over previous few-shot LLM baselines, and injecting external knowledge markedly improves entity type determination.
    Abstract Large language models (LLMs) have demonstrated dominating performance in many NLP tasks, especially on generative tasks. However, they often fall short in some information extraction tasks, particularly those requiring domain-specific knowledge, such as Biomedical Named Entity Recognition (NER). In this paper, inspired by Chain-of-thought, we leverage the LLM to solve the Biomedical NER step-by-step: break down the NER task into entity span extraction and entity type determination. Additionally, for entity type determination, we inject entity knowledge to address the LLM's lack of domain knowledge when predicting entity categories. Experimental results show a significant improvement in our two-step BioNER approach compared to previous few-shot LLM baselines. Additionally, the incorporation of external knowledge significantly enhances entity category determination performance.
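The two-step scheme, span extraction followed by knowledge-augmented type determination, can be sketched like this. The extractor, classifier, and knowledge base are hypothetical stand-ins for the paper's LLM prompts and external knowledge source.

```python
def two_step_ner(sentence, extract_spans, classify_type, knowledge_base):
    """Step 1: extract entity spans; step 2: type each span with injected knowledge."""
    entities = []
    for span in extract_spans(sentence):
        hint = knowledge_base.get(span.lower(), "no external knowledge found")
        entities.append((span, classify_type(span, hint)))
    return entities

# Toy stand-ins for the two LLM calls and the external knowledge source
kb = {"aspirin": "a nonsteroidal anti-inflammatory drug"}
extract = lambda s: [w for w in s.split() if w.lower() in kb]
classify = lambda span, hint: "Chemical" if "drug" in hint else "Unknown"
entities = two_step_ner("Aspirin reduces fever", extract, classify, kb)
print(entities)  # -> [('Aspirin', 'Chemical')]
```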

Improving VTE Identification through Adaptive NLP Model Selection and Clinical Expert Rule-based Classifier from Radiology Reports

  • paper_url: http://arxiv.org/abs/2309.12273
  • repo_url: None
  • paper_authors: Jamie Deng, Yusen Wu, Hilary Hayssen, Brain Englum, Aman Kankaria, Minerva Mayorga-Carlin, Shalini Sahoo, John Sorkin, Brajesh Lal, Yelena Yesha, Phuong Nguyen
  • for: Improving the identification of deep vein thrombosis (DVT) and pulmonary embolism (PE) in unstructured (free-text) radiology reports to support treatment of this severe cardiovascular condition.
  • methods: Natural language processing (NLP) combined with deep learning (DL) and data augmentation to improve identification of VTE events.
  • results: The model achieves 97% accuracy and a 97% F1 score for predicting DVT, and 98.3% accuracy and a 98.4% F1 score for predicting PE.
    Abstract Rapid and accurate identification of Venous thromboembolism (VTE), a severe cardiovascular condition including deep vein thrombosis (DVT) and pulmonary embolism (PE), is important for effective treatment. Leveraging Natural Language Processing (NLP) on radiology reports, automated methods have shown promising advancements in identifying VTE events from retrospective data cohorts or aiding clinical experts in identifying VTE events from radiology reports. However, effectively training Deep Learning (DL) and the NLP models is challenging due to limited labeled medical text data, the complexity and heterogeneity of radiology reports, and data imbalance. This study proposes novel method combinations of DL methods, along with data augmentation, adaptive pre-trained NLP model selection, and a clinical expert NLP rule-based classifier, to improve the accuracy of VTE identification in unstructured (free-text) radiology reports. Our experimental results demonstrate the model's efficacy, achieving an impressive 97% accuracy and 97% F1 score in predicting DVT, and an outstanding 98.3% accuracy and 98.4% F1 score in predicting PE. These findings emphasize the model's robustness and its potential to significantly contribute to VTE research.
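One way the clinical expert rule-based classifier can be combined with an NLP model is a rules-first cascade; note that the precedence order below is an assumption for illustration, not necessarily the paper's configuration.

```python
def classify_vte(report, model_predict, expert_rules):
    """Apply clinical expert rules first; fall back to the NLP model otherwise."""
    text = report.lower()
    for pattern, label in expert_rules:
        if pattern in text:
            return label
    return model_predict(report)

# Hypothetical rules and a toy stand-in for the trained NLP model
rules = [("no evidence of dvt", "negative"), ("acute deep vein thrombosis", "positive")]
model = lambda report: "positive" if "thrombus" in report.lower() else "negative"
result = classify_vte("Findings: no evidence of DVT.", model, rules)
result2 = classify_vte("Nonocclusive thrombus in the left femoral vein.", model, rules)
print(result, result2)  # rule fires first, then the model handles the second case
```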

The Cambridge Law Corpus: A Corpus for Legal AI Research

  • paper_url: http://arxiv.org/abs/2309.12269
  • repo_url: None
  • paper_authors: Andreas Östling, Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, Felix Steffek
  • for: This paper presents the first release of the Cambridge Law Corpus (CLC), a corpus for legal AI research comprising over 250,000 UK court cases; most are from the 21st century, but the corpus reaches back to the 16th century.
  • methods: The release includes raw text and metadata, plus case-outcome annotations for 638 cases produced by legal experts; using these annotations, GPT-3, GPT-4, and RoBERTa models are trained and evaluated on case outcome extraction.
  • results: The authors provide benchmarks for case outcome extraction with GPT-3, GPT-4, and RoBERTa for future legal AI research, alongside an extensive legal and ethical discussion addressing the potentially sensitive nature of the material.
    Abstract We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research. It consists of over 250 000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and meta-data. Together with the corpus, we provide annotations on case outcomes for 638 cases, done by legal experts. Using our annotated data, we have trained and evaluated case outcome extraction with GPT-3, GPT-4 and RoBERTa models to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.

On the Relationship between Skill Neurons and Robustness in Prompt Tuning

  • paper_url: http://arxiv.org/abs/2309.12263
  • repo_url: None
  • paper_authors: Leon Ackermann, Xenia Ohmer
  • for: Studying the robustness of Prompt Tuning for pre-trained language models (PLMs) and its relation to the task-specific "skill neurons" it activates.
  • methods: Experiments with RoBERTa and T5 across different tasks, including adversarial data.
  • results: Prompts tuned for one task transfer to tasks of the same type but are not very robust to adversarial data, with T5 more robust than RoBERTa; the existence of skill neurons is replicated in RoBERTa and also observed in T5, and T5's skill neurons remain among its most predictive neurons on adversarial data.
    Abstract Prompt Tuning is a popular parameter-efficient finetuning method for pre-trained large language models (PLMs). Recently, based on experiments with RoBERTa, it has been suggested that Prompt Tuning activates specific neurons in the transformer's feed-forward networks, that are highly predictive and selective for the given task. In this paper, we study the robustness of Prompt Tuning in relation to these "skill neurons", using RoBERTa and T5. We show that prompts tuned for a specific task are transferable to tasks of the same type but are not very robust to adversarial data, with higher robustness for T5 than RoBERTa. At the same time, we replicate the existence of skill neurons in RoBERTa and further show that skill neurons also seem to exist in T5. Interestingly, the skill neurons of T5 determined on non-adversarial data are also among the most predictive neurons on the adversarial data, which is not the case for RoBERTa. We conclude that higher adversarial robustness may be related to a model's ability to activate the relevant skill neurons on adversarial data.
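A common operationalization of skill-neuron predictivity, the accuracy of the best activation threshold for a binary task, can be sketched as follows; the activations and labels are toy values, not from the paper.

```python
def neuron_predictivity(activations, labels):
    """Best accuracy achievable by thresholding one neuron's activation (binary labels)."""
    best = 0.0
    for t in activations:  # candidate thresholds: the observed activations
        preds = [int(a > t) for a in activations]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        best = max(best, acc, 1 - acc)  # allow either sign of the decision rule
    return best

acts = [0.9, 0.8, 0.1, 0.2]  # hypothetical activations of one neuron on 4 inputs
labels = [1, 1, 0, 0]
p_neuron = neuron_predictivity(acts, labels)
print(p_neuron)  # 1.0: this toy neuron perfectly separates the two classes
```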

SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References

  • paper_url: http://arxiv.org/abs/2309.12250
  • repo_url: None
  • paper_authors: Matteo Gabburo, Siddhant Garg, Rik Koncel Kedziorski, Alessandro Moschitti
  • for: Proposing a new evaluation metric for assessing the correctness of sentence-form question answering systems.
  • methods: SQuArE evaluates sentence-level QA systems against multiple reference answers (combining multiple correct and incorrect references), using a transformer LM encoder-based similarity metric.
  • results: SQuArE outperforms previous baselines and achieves the highest correlation with human annotations on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems across multiple academic and industrial datasets.
    Abstract Evaluation of QA systems is very challenging and expensive, with the most reliable approach being human annotations of correctness of answers for questions. Recent works (AVA, BEM) have shown that transformer LM encoder based similarity metrics transfer well for QA evaluation, but they are limited by the usage of a single correct reference answer. We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation), using multiple reference answers (combining multiple correct and incorrect references) for sentence-form QA. We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems, across multiple academic and industrial datasets, and show that it outperforms previous baselines and obtains the highest correlation with human annotations.
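The multi-reference idea behind SQuArE can be sketched as a margin between the best similarity to a correct reference and the best similarity to an incorrect one. The token-overlap similarity below is a toy stand-in for the transformer-encoder metric, and the margin formulation is an illustrative assumption, not necessarily the paper's exact aggregation.

```python
def token_f1(a, b):
    """Toy similarity: token-overlap F1, standing in for an encoder-based metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(tb), overlap / len(ta)
    return 2 * p * r / (p + r)

def square_score(answer, positive_refs, negative_refs, sim=token_f1):
    """Margin between best match to a correct reference and best match to a wrong one."""
    return (max(sim(answer, ref) for ref in positive_refs)
            - max(sim(answer, ref) for ref in negative_refs))

score = square_score(
    "the capital of france is paris",
    positive_refs=["paris is the capital of france"],
    negative_refs=["the capital of france is lyon"],
)
print(score > 0)  # True: the answer sits closer to the correct reference
```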

Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition

  • paper_url: http://arxiv.org/abs/2309.12234
  • repo_url: https://github.com/xuchennlp/s2t
  • paper_authors: Chen Xu, Xiaoqian Liu, Erfeng He, Yuhao Zhang, Qianqian Dong, Tong Xiao, Jingbo Zhu, Dapeng Man, Wu Yang
  • for: A synchronous bilingual Connectionist Temporal Classification (CTC) framework for speech translation that bridges the gaps between audio and text and between source and target languages.
  • methods: The model is trained with transcripts and translations as concurrent CTC objectives, thereby bridging the modality and language gaps.
  • results: The enhanced variant, BiL-CTC+, establishes new state-of-the-art performance on the MuST-C ST benchmarks under resource-constrained scenarios, and also yields significant improvements in speech recognition, revealing a cross-lingual learning effect on transcription.
    Abstract In this study, we present synchronous bilingual Connectionist Temporal Classification (CTC), an innovative framework that leverages dual CTC to bridge the gaps of both modality and language in the speech translation (ST) task. Utilizing transcript and translation as concurrent objectives for CTC, our model bridges the gap between audio and text as well as between source and target languages. Building upon the recent advances in CTC application, we develop an enhanced variant, BiL-CTC+, that establishes new state-of-the-art performances on the MuST-C ST benchmarks under resource-constrained scenarios. Intriguingly, our method also yields significant improvements in speech recognition performance, revealing the effect of cross-lingual learning on transcription and demonstrating its broad applicability. The source code is available at https://github.com/xuchennlp/S2T.
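The dual-objective training can be caricatured as a weighted sum of two CTC losses over the same encoder output, one against the transcript and one against the translation. Both the equal weighting and the stand-in loss function below are assumptions, not the paper's reported setup; a real implementation would use something like `torch.nn.CTCLoss`.

```python
def bilingual_ctc_loss(ctc_loss, logits, transcript, translation, weight=0.5):
    """Weighted sum of CTC losses against the transcript and the translation."""
    return (weight * ctc_loss(logits, transcript)
            + (1 - weight) * ctc_loss(logits, translation))

# Toy stand-in loss: length mismatch between "logits" and the target sequence
fake_ctc = lambda logits, target: abs(len(logits) - len(target))
loss = bilingual_ctc_loss(fake_ctc, [0] * 10, transcript=[0] * 8, translation=[0] * 6)
print(loss)  # 0.5 * 2 + 0.5 * 4 = 3.0
```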

Towards Answering Health-related Questions from Medical Videos: Datasets and Approaches

  • paper_url: http://arxiv.org/abs/2309.12224
  • repo_url: None
  • paper_authors: Deepak Gupta, Kush Attal, Dina Demner-Fushman
  • for: Answering health-related questions from the public by providing visual answers from medical videos.
  • methods: A pipelined approach is used to create two large-scale datasets, HealthVidQA-CRF and HealthVidQA-Prompt; monomodal and multimodal approaches are then proposed to extract visual answers from medical videos for natural language questions.
  • results: The created datasets improve model training for the medical visual answer localization task, and visual features enhance the performance of both the monomodal and multimodal approaches.
    Abstract The increase in the availability of online videos has transformed the way we access information and knowledge. A growing number of individuals now prefer instructional videos as they offer a series of step-by-step procedures to accomplish particular tasks. The instructional videos from the medical domain may provide the best possible visual answers to first aid, medical emergency, and medical education questions. Toward this, this paper is focused on answering health-related questions asked by the public by providing visual answers from medical videos. The scarcity of large-scale datasets in the medical domain is a key challenge that hinders the development of applications that can help the public with their health-related questions. To address this issue, we first proposed a pipelined approach to create two large-scale datasets: HealthVidQA-CRF and HealthVidQA-Prompt. Later, we proposed monomodal and multimodal approaches that can effectively provide visual answers from medical videos to natural language questions. We conducted a comprehensive analysis of the results, focusing on the impact of the created datasets on model training and the significance of visual features in enhancing the performance of the monomodal and multi-modal approaches. Our findings suggest that these datasets have the potential to enhance the performance of medical visual answer localization tasks and provide a promising future direction to further enhance the performance by using pre-trained language-vision models.

Code Soliloquies for Accurate Calculations in Large Language Models

  • paper_url: http://arxiv.org/abs/2309.12161
  • repo_url: https://github.com/luffycodes/tutorbot-spock-phys
  • paper_authors: Shashank Sonkar, MyCo Le, Xinghe Chen, Naiming Liu, Debshila Basu Mallick, Richard G. Baraniuk
  • for: Improving the quality of intelligent tutoring systems (ITS) with large language model (LLM) backends by constructing high-quality conversational datasets for student-ITS interactions.
  • methods: Synthetic student-teacher dialogues are generated with GPT-4, using a novel stateful prompt design to work around GPT-4's unreliability on even simple multiplication tasks.
  • results: The stateful prompt design notably enhances the quality of the mock-dialogue datasets, especially for calculation-intensive science topics, and fine-tuning on them improves both the accuracy and the computational reliability of the LLM backend's responses.
    Abstract High-quality conversational datasets are integral to the successful development of Intelligent Tutoring Systems (ITS) that employ a Large Language Model (LLM) backend. These datasets, when used to fine-tune the LLM backend, significantly enhance the quality of interactions between students and ITS. A common strategy for developing these datasets involves generating synthetic student-teacher dialogues using advanced GPT-4 models. However, challenges arise when these dialogues demand complex calculations, common in subjects like physics. Despite its advanced capabilities, GPT-4's performance falls short in reliably handling even simple multiplication tasks, marking a significant limitation in its utility for these subjects. To address these challenges, this paper introduces an innovative stateful prompt design. Our approach generates a mock conversation between a student and a tutorbot, both roles simulated by GPT-4. Each student response triggers a soliloquy (an inner monologue) in the GPT-tutorbot, which assesses whether its response would necessitate calculations. If so, it proceeds to script the required code in Python and then uses the resulting output to construct its response to the student. Our approach notably enhances the quality of synthetic conversation datasets, especially for subjects that are calculation-intensive. Our findings show that our Higgs model -- a LLaMA finetuned with datasets generated through our novel stateful prompt design -- proficiently utilizes Python for computations. Consequently, finetuning with our datasets enriched with code soliloquies enhances not just the accuracy but also the computational reliability of Higgs' responses.
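The stateful soliloquy loop, decide whether a reply needs computation, script it in Python, run it, then answer with the result, can be sketched as below; all callbacks are hypothetical stand-ins for the GPT-4 roles described in the abstract.

```python
def respond_with_soliloquy(student_msg, needs_calculation, write_code, compose):
    """Inner monologue: if the reply needs math, script it in Python and run it first."""
    if needs_calculation(student_msg):
        code = write_code(student_msg)  # in the paper, produced by the GPT-tutorbot
        scope = {}
        exec(code, scope)               # execute the scripted calculation
        return compose(student_msg, scope.get("result"))
    return compose(student_msg, None)

# Hypothetical stand-ins for the GPT-4 soliloquy steps
needs_calc = lambda msg: any(ch.isdigit() for ch in msg)
write_code = lambda msg: "result = 9.8 * 2"  # e.g. a scripted physics step
compose = lambda msg, res: f"The answer is {res}" if res is not None else "Let's reason it out."
reply = respond_with_soliloquy("A 2 kg mass falls; what force acts on it?",
                               needs_calc, write_code, compose)
print(reply)  # The answer is 19.6
```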

How-to Guides for Specific Audiences: A Corpus and Initial Findings

  • paper_url: http://arxiv.org/abs/2309.12117
  • repo_url: None
  • paper_authors: Nicola Fanton, Agnieszka Falenska, Michael Roth
  • for: This paper investigates whether how-to guides on wikiHow differ in practice depending on the intended audience.
  • methods: The authors use case studies and computational methods to examine biases in the texts.
  • results: How-to guides on wikiHow, like other text genres, reflect social norms and are subject to subtle biases.
    Abstract Instructional texts for specific target groups should ideally take into account the prior knowledge and needs of the readers in order to guide them efficiently to their desired goals. However, targeting specific groups also carries the risk of reflecting disparate social norms and subtle stereotypes. In this paper, we investigate the extent to which how-to guides from one particular platform, wikiHow, differ in practice depending on the intended audience. We conduct two case studies in which we examine qualitative features of texts written for specific audiences. In a generalization study, we investigate which differences can also be systematically demonstrated using computational methods. The results of our studies show that guides from wikiHow, like other text genres, are subject to subtle biases. We aim to raise awareness of these inequalities as a first step to addressing them in future work.
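A typical computational check for such audience-dependent wording is a smoothed log-odds comparison of word frequencies between two sub-corpora. The simplified smoothing below is an illustrative assumption, not the paper's exact method.

```python
import math
from collections import Counter

def log_odds(word, corpus_a, corpus_b, alpha=0.5):
    """Smoothed log-odds of `word` between two token lists (audience A vs. B)."""
    ca, cb = Counter(corpus_a), Counter(corpus_b)
    pa = (ca[word] + alpha) / (len(corpus_a) + alpha)
    pb = (cb[word] + alpha) / (len(corpus_b) + alpha)
    return math.log(pa / pb)

# Hypothetical guide snippets written for two different audiences
guides_for_a = "be gentle and polite and smile".split()
guides_for_b = "be direct and assertive".split()
lo_val = log_odds("gentle", guides_for_a, guides_for_b)
print(lo_val > 0)  # True: "gentle" is tilted toward the audience-A guides
```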

A Computational Analysis of Vagueness in Revisions of Instructional Texts

  • paper_url: http://arxiv.org/abs/2309.12107
  • repo_url: None
  • paper_authors: Alok Debnath, Michael Roth
  • for: Analyzing revision histories on wikiHow to identify edits that address vagueness in instructions.
  • methods: A neural model is evaluated on a pairwise ranking task over revision pairs.
  • results: The neural model can effectively distinguish the revised version of an instruction from the original, improving over existing baselines.
    Abstract WikiHow is an open-domain repository of instructional articles for a variety of tasks, which can be revised by users. In this paper, we extract pairwise versions of an instruction before and after a revision was made. Starting from a noisy dataset of revision histories, we specifically extract and analyze edits that involve cases of vagueness in instructions. We further investigate the ability of a neural model to distinguish between two versions of an instruction in our data by adopting a pairwise ranking task from previous work and showing improvements over existing baselines.
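The pairwise ranking task can be sketched as follows: a scorer sees both versions of an instruction, and accuracy is the fraction of pairs where the revision outranks the original. The vagueness-word heuristic is a hypothetical stand-in for the paper's neural model.

```python
def pairwise_accuracy(pairs, score):
    """Fraction of (original, revised) pairs where the scorer ranks revised higher."""
    correct = sum(score(revised) > score(original) for original, revised in pairs)
    return correct / len(pairs)

# Hypothetical clarity heuristic: fewer vague words = clearer instruction
VAGUE = {"some", "thing", "stuff", "it"}
score = lambda text: -sum(w in VAGUE for w in text.lower().split())
pairs = [
    ("add some stuff", "add two cups of flour"),
    ("mix it well", "mix the batter well"),
]
acc = pairwise_accuracy(pairs, score)
print(acc)  # 1.0: the revision wins every pair under this toy scorer
```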

SemEval-2022 Task 7: Identifying Plausible Clarifications of Implicit and Underspecified Phrases in Instructional Texts

  • paper_url: http://arxiv.org/abs/2309.12102
  • repo_url: https://github.com/acidann/claire
  • paper_authors: Michael Roth, Talita Anthonio, Anna Sauer
  • for: Evaluating the plausibility of clarifications in instructional texts.
  • methods: Manually clarified how-to guides were used to generate alternative clarifications, for which human plausibility judgements were collected.
  • results: 21 systems participated; the best achieved 68.9% accuracy, and in an additional evaluation the top team's predictions identify contexts with multiple plausible clarifications with 75.2% accuracy.
    Abstract We describe SemEval-2022 Task 7, a shared task on rating the plausibility of clarifications in instructional texts. The dataset for this task consists of manually clarified how-to guides for which we generated alternative clarifications and collected human plausibility judgements. The task of participating systems was to automatically determine the plausibility of a clarification in the respective context. In total, 21 participants took part in this task, with the best system achieving an accuracy of 68.9%. This report summarizes the results and findings from 8 teams and their system descriptions. Finally, we show in an additional evaluation that predictions by the top participating team make it possible to identify contexts with multiple plausible clarifications with an accuracy of 75.2%.

AceGPT, Localizing Large Language Models in Arabic

  • paper_url: http://arxiv.org/abs/2309.12053
  • repo_url: https://github.com/freedomintelligence/acegpt
  • paper_authors: Huang Huang, Fei Yu, Jianqing Zhu, Xuening Sun, Hao Cheng, Dingjie Song, Zhihong Chen, Abdulmohsen Alharthi, Bang An, Juncai He, Ziche Liu, Zhiyi Zhang, Junying Chen, Jianquan Li, Benyou Wang, Lian Zhang, Ruoyu Sun, Xiang Wan, Haizhou Li, Jinchao Xu
  • for: Developing a large language model (LLM) localized for Arabic, a language with unique cultural characteristics inadequately addressed by current mainstream models.
  • methods: A comprehensive solution: further pre-training on Arabic texts, Supervised Fine-Tuning (SFT) with native Arabic instructions and Arabic GPT-4 responses, and Reinforcement Learning with AI Feedback (RLAIF) using a reward model attuned to local culture and values.
  • results: The resulting model, AceGPT, sets the state of the art for open Arabic LLMs across benchmarks including the instruction-following benchmarks Arabic Vicuna-80 and Arabic AlpacaEval, the knowledge benchmarks Arabic MMLU and EXAMs, and a newly introduced Arabic Cultural and Value Alignment benchmark. Notably, AceGPT outperforms Turbo on the popular Vicuna-80 benchmark when evaluated with GPT-4, despite the benchmark's limited scale. Code, data, and models are available at https://github.com/FreedomIntelligence/AceGPT.
    Abstract This paper is devoted to the development of a localized Large Language Model (LLM) specifically for Arabic, a language imbued with unique cultural characteristics inadequately addressed by current mainstream models. Significant concerns emerge when addressing cultural sensitivity and local values. To address this, the paper proposes a comprehensive solution that includes further pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic, alongside Reinforcement Learning with AI Feedback (RLAIF) employing a reward model attuned to local culture and values. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities. Comprehensive evaluations reveal that the resulting model, dubbed 'AceGPT', sets the state-of-the-art standard for open Arabic LLMs across various benchmarks, including the instruction-following benchmark (i.e., Arabic Vicuna-80 and Arabic AlpacaEval), knowledge benchmark (i.e., Arabic MMLU and EXAMs), and the newly introduced Arabic Cultural and Value Alignment benchmark. Notably, AceGPT outperforms Turbo in the popular Vicuna-80 benchmark when evaluated with GPT-4, despite the benchmark's limited scale. Codes, data, and models are in https://github.com/FreedomIntelligence/AceGPT.
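As a rough illustration of the SFT stage described above, one instruction-tuning record pairs a native Arabic instruction with an Arabic GPT-4 response. The field names and prompt template below are illustrative assumptions, not the actual AceGPT data schema.

```python
# Hypothetical shape of one supervised fine-tuning (SFT) record; the
# "### Instruction / ### Response" template and field names are
# illustrative assumptions, not the AceGPT data format.
def build_sft_example(instruction: str, response: str) -> dict:
    """Wrap an instruction/response pair into a single training text."""
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return {"prompt": prompt, "completion": response, "text": prompt + response}

example = build_sft_example("ما هي عاصمة المملكة العربية السعودية؟", "الرياض.")
```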

CAMERA: A Multimodal Dataset and Benchmark for Ad Text Generation

  • paper_url: http://arxiv.org/abs/2309.12030
  • repo_url: None
  • paper_authors: Masato Mita, Soichiro Murakami, Akihiko Kato, Peinan Zhang
  • for: To advance research on automatic ad text generation (ATG) by providing the field with a comprehensive benchmark and a well-defined problem set so that different methods can be compared.
  • methods: The study redefines ATG as a cross-application task that leverages multimodal information. Concretely, the authors propose the first benchmark dataset for this setting, CA Multimodal Evaluation for Ad Text GeneRAtion (CAMERA), designed specifically for ATG to enable evaluation with multimodal information.
  • results: Evaluation experiments with multiple baseline models, which vary in the pre-trained language model used and in whether multimodal information is incorporated, demonstrate the usefulness of the proposed benchmark. The authors also discuss the current state of the task and future challenges.
    Abstract In response to the limitations of manual online ad production, significant research has been conducted in the field of automatic ad text generation (ATG). However, comparing different methods has been challenging because of the lack of benchmarks encompassing the entire field and the absence of well-defined problem sets with clear model inputs and outputs. To address these challenges, this paper aims to advance the field of ATG by introducing a redesigned task and constructing a benchmark. Specifically, we defined ATG as a cross-application task encompassing various aspects of the Internet advertising. As part of our contribution, we propose a first benchmark dataset, CA Multimodal Evaluation for Ad Text GeneRAtion (CAMERA), carefully designed for ATG to be able to leverage multi-modal information and conduct an industry-wise evaluation. Furthermore, we demonstrate the usefulness of our proposed benchmark through evaluation experiments using multiple baseline models, which vary in terms of the type of pre-trained language model used and the incorporation of multi-modal information. We also discuss the current state of the task and the future challenges.

Stock Market Sentiment Classification and Backtesting via Fine-tuned BERT

  • paper_url: http://arxiv.org/abs/2309.11979
  • repo_url: None
  • paper_authors: Jiashu Lou
  • for: This paper studies quantitative trading on low-latency automatic trading platforms built on real-time information acquisition, and how sentiment factors can improve trading performance.
  • methods: The methods include constructing and fine-tuning a BERT natural language processing model, and building a regression model that combines the Alpha191 factor model with sentiment labels.
  • results: Experiments show that incorporating the sentiment factor into the Alpha191 model significantly improves returns: over the trading period, the return rate is 73.8% higher than the baseline and 32.41% higher than the original Alpha191 model.
    Abstract With the rapid development of big data and computing devices, low-latency automatic trading platforms based on real-time information acquisition have become the main components of the stock trading market, so the topic of quantitative trading has received widespread attention. And for non-strongly efficient trading markets, human emotions and expectations always dominate market trends and trading decisions. Therefore, this paper starts from the theory of emotion, taking East Money as an example, crawling user comment titles data from its corresponding stock bar and performing data cleaning. Subsequently, a natural language processing model BERT was constructed, and the BERT model was fine-tuned using existing annotated data sets. The experimental results show that the fine-tuned model has different degrees of performance improvement compared to the original model and the baseline model. Subsequently, based on the above model, the user comment data crawled is labeled with emotional polarity, and the obtained label information is combined with the Alpha191 model to participate in regression, and significant regression results are obtained. Subsequently, the regression model is used to predict the average price change for the next five days, and use it as a signal to guide automatic trading. The experimental results show that the incorporation of emotional factors increased the return rate by 73.8\% compared to the baseline during the trading period, and by 32.41\% compared to the original alpha191 model. Finally, we discuss the advantages and disadvantages of incorporating emotional factors into quantitative trading, and give possible directions for further research in the future.
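The pipeline of labeling comment titles with sentiment polarity and aggregating them into a tradable daily factor can be sketched as follows; the keyword scorer is a stand-in for the fine-tuned BERT classifier, and all names are illustrative.

```python
# Sketch: aggregate per-comment sentiment labels (here from a toy
# keyword scorer instead of the fine-tuned BERT model) into a daily
# factor that could be appended to Alpha191-style features.
POSITIVE = {"涨", "利好", "buy"}
NEGATIVE = {"跌", "利空", "sell"}

def polarity(title: str) -> int:
    """Toy stand-in for the fine-tuned BERT classifier: +1, -1, or 0."""
    score = sum(w in title for w in POSITIVE) - sum(w in title for w in NEGATIVE)
    return (score > 0) - (score < 0)

def daily_sentiment(titles: list[str]) -> float:
    """Mean polarity over one day's comment titles (the sentiment factor)."""
    return sum(polarity(t) for t in titles) / len(titles) if titles else 0.0

factor = daily_sentiment(["今天必涨", "利好不断", "快跑要跌了"])
```

In the paper's setup, the resulting daily factor joins the Alpha191 features in a regression that predicts the average price change over the next five days.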

SPICED: News Similarity Detection Dataset with Multiple Topics and Complexity Levels

  • paper_url: http://arxiv.org/abs/2309.13080
  • repo_url: None
  • paper_authors: Elena Shushkevich, Long Mai, Manuel V. Loureiro, Steven Derby, Tri Kurniawan Wijaya
  • for: To introduce a new dataset of similar news, together with four approaches for generating news pairs, for training news similarity detection models.
  • methods: News articles are organized into seven topics, and four distinct pair-generation approaches are used, drawing on text summarization, text classification, named entity recognition, and text comparison.
  • results: The created datasets are benchmarked with MinHash, BERT, SBERT, and SimCSE models with good results: the models detect news similarity accurately, and topic segmentation enables more precise detection within each topic.
    Abstract Nowadays, the use of intelligent systems to detect redundant information in news articles has become especially prevalent with the proliferation of news media outlets in order to enhance user experience. However, the heterogeneous nature of news can lead to spurious findings in these systems: Simple heuristics such as whether a pair of news are both about politics can provide strong but deceptive downstream performance. Segmenting news similarity datasets into topics improves the training of these models by forcing them to learn how to distinguish salient characteristics under more narrow domains. However, this requires the existence of topic-specific datasets, which are currently lacking. In this article, we propose a new dataset of similar news, SPICED, which includes seven topics: Crime & Law, Culture & Entertainment, Disasters & Accidents, Economy & Business, Politics & Conflicts, Science & Technology, and Sports. Futhermore, we present four distinct approaches for generating news pairs, which are used in the creation of datasets specifically designed for news similarity detection task. We benchmarked the created datasets using MinHash, BERT, SBERT, and SimCSE models.
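Of the benchmarked models above, MinHash is simple enough to sketch directly: it estimates the Jaccard similarity between the word n-gram sets of two articles. A self-contained version, assuming word 3-gram shingles and an MD5-based hash family (the real benchmark's parameters are not specified here):

```python
import hashlib

# Minimal MinHash sketch for news-pair similarity: estimate the Jaccard
# similarity of the word 3-gram sets of two articles using k hash
# "permutations" derived from seeded MD5.
def shingles(text: str, n: int = 3) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def _hash(token: str, seed: int) -> int:
    return int(hashlib.md5(f"{seed}:{token}".encode()).hexdigest(), 16)

def minhash_similarity(a: str, b: str, k: int = 64) -> float:
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    matches = sum(
        min(_hash(t, seed) for t in sa) == min(_hash(t, seed) for t in sb)
        for seed in range(k)
    )
    return matches / k
```

The fraction of seeds on which the two minimums agree is an unbiased estimate of the Jaccard similarity of the shingle sets.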

Scaling up COMETKIWI: Unbabel-IST 2023 Submission for the Quality Estimation Shared Task

  • paper_url: http://arxiv.org/abs/2309.11925
  • repo_url: None
  • paper_authors: Ricardo Rei, Nuno M. Guerreiro, José Pombal, Daan van Stigt, Marcos Treviso, Luisa Coheur, José G. C. de Souza, André F. T. Martins
  • for: This paper describes the team's participation in the WMT 2023 shared task on Quality Estimation (QE).
  • methods: The authors build on the COMETKIWI-22 model (Rei et al., 2022b) and adopt a multilingual approach.
  • results: The submissions rank first on all tasks, reaching state-of-the-art performance for sentence- and word-level quality prediction. Compared to the previous state of the art, COMETKIWI-22, they show large improvements in correlation with human judgements (up to 10 Spearman points) and surpass the second-best multilingual submission by up to 3.8 absolute points.
    Abstract We present the joint contribution of Unbabel and Instituto Superior T\'ecnico to the WMT 2023 Shared Task on Quality Estimation (QE). Our team participated on all tasks: sentence- and word-level quality prediction (task 1) and fine-grained error span detection (task 2). For all tasks, we build on the COMETKIWI-22 model (Rei et al., 2022b). Our multilingual approaches are ranked first for all tasks, reaching state-of-the-art performance for quality estimation at word-, span- and sentence-level granularity. Compared to the previous state-of-the-art COMETKIWI-22, we show large improvements in correlation with human judgements (up to 10 Spearman points). Moreover, we surpass the second-best multilingual submission to the shared-task with up to 3.8 absolute points.
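The "Spearman points" above refer to Spearman rank correlation between metric scores and human judgements. A minimal implementation (assuming no tied scores) of the statistic:

```python
# Minimal Spearman rank correlation, assuming no tied scores, of the
# kind used to compare QE metric outputs with human judgements.
def spearman(xs: list[float], ys: list[float]) -> float:
    n = len(xs)

    def ranks(vs: list[float]) -> list[int]:
        order = sorted(range(n), key=lambda i: vs[i])
        r = [0] * n
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Because it is rank-based, any monotone relationship (not just a linear one) between metric scores and human ratings yields a correlation of 1.0.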

InstructERC: Reforming Emotion Recognition in Conversation with a Retrieval Multi-task LLMs Framework

  • paper_url: http://arxiv.org/abs/2309.11911
  • repo_url: https://github.com/LIN-SHANG/InstructERC
  • paper_authors: Shanglin Lei, Guanting Dong, Xiaoping Wang, Keheng Wang, Sirui Wang
  • for: This paper aims to improve the development of emotion recognition in dialogue (ERC) by proposing a novel approach called InstructERC, which transforms the ERC task from a discriminative framework to a generative framework based on Large Language Models (LLMs).
  • methods: The proposed InstructERC approach uses a simple yet effective retrieval template module to explicitly integrate multi-granularity dialogue supervision information, as well as two additional emotion alignment tasks (speaker identification and emotion prediction) to implicitly model dialogue role relationships and future emotional tendencies.
  • results: The LLM-based plug-and-play plugin framework achieved comprehensive SOTA on three commonly used ERC datasets, outperforming all previous models. Extensive analysis of parameter-efficient and data-scaling experiments provide empirical guidance for applying InstructERC in practical scenarios.
    Abstract The development of emotion recognition in dialogue (ERC) has been consistently hindered by the complexity of pipeline designs, leading to ERC models that often overfit to specific datasets and dialogue patterns. In this study, we propose a novel approach, namely InstructERC, to reformulates the ERC task from a discriminative framework to a generative framework based on Large Language Models (LLMs) . InstructERC has two significant contributions: Firstly, InstructERC introduces a simple yet effective retrieval template module, which helps the model explicitly integrate multi-granularity dialogue supervision information by concatenating the historical dialog content, label statement, and emotional domain demonstrations with high semantic similarity. Furthermore, we introduce two additional emotion alignment tasks, namely speaker identification and emotion prediction tasks, to implicitly model the dialogue role relationships and future emotional tendencies in conversations. Our LLM-based plug-and-play plugin framework significantly outperforms all previous models and achieves comprehensive SOTA on three commonly used ERC datasets. Extensive analysis of parameter-efficient and data-scaling experiments provide empirical guidance for applying InstructERC in practical scenarios. Our code will be released after blind review.
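The retrieval template idea — concatenating historical dialogue content, a label statement, and retrieved demonstrations into one generative prompt — might look roughly like this; the exact template wording and label set are assumptions, not InstructERC's actual prompt.

```python
# Sketch of a retrieval-template prompt for generative ERC: fold the
# dialogue history, retrieved demonstrations, and a label statement
# into one prompt string. Wording and label set are illustrative.
LABELS = ["neutral", "happy", "sad", "angry", "surprised", "fearful", "disgusted"]

def build_prompt(history: list[tuple[str, str]], target_utterance: str,
                 demonstrations: list[str]) -> str:
    context = "\n".join(f"{spk}: {utt}" for spk, utt in history)
    demos = "\n".join(f"- {d}" for d in demonstrations)
    return (
        f"Similar labeled examples:\n{demos}\n\n"
        f"Dialogue so far:\n{context}\n\n"
        f"Choose one label from {LABELS} for the utterance: \"{target_utterance}\"\nLabel:"
    )

prompt = build_prompt([("A", "I got the job!"), ("B", "No way!")],
                      "No way!", ["'That's amazing!' -> happy"])
```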

Focal Inferential Infusion Coupled with Tractable Density Discrimination for Implicit Hate Speech Detection

  • paper_url: http://arxiv.org/abs/2309.11896
  • repo_url: https://github.com/lcs2-iiitd/fiadd
  • paper_authors: Sarah Masud, Ashutosh Bajpai, Tanmoy Chakraborty
  • for: This work aims to improve the ability of pre-trained large language models (PLMs) to recognize subtle, implicit expressions of hate speech.
  • methods: The study combines two approaches — augmenting external context and enforcing label separation via distance-based metrics — into a novel Focused Inferential Adaptive Density Discrimination (FiADD) framework.
  • results: Evaluated on three implicit-hate datasets, FiADD yields significant improvements on both two-way and three-way hate classification. It shows similar gains on detecting sarcasm, irony, and stance, tasks in which surface and implied forms also differ.
    Abstract Although pre-trained large language models (PLMs) have achieved state-of-the-art on many NLP tasks, they lack understanding of subtle expressions of implicit hate speech. Such nuanced and implicit hate is often misclassified as non-hate. Various attempts have been made to enhance the detection of (implicit) hate content by augmenting external context or enforcing label separation via distance-based metrics. We combine these two approaches and introduce FiADD, a novel Focused Inferential Adaptive Density Discrimination framework. FiADD enhances the PLM finetuning pipeline by bringing the surface form of an implicit hate speech closer to its implied form while increasing the inter-cluster distance among various class labels. We test FiADD on three implicit hate datasets and observe significant improvement in the two-way and three-way hate classification tasks. We further experiment on the generalizability of FiADD on three other tasks, namely detecting sarcasm, irony, and stance, in which surface and implied forms differ, and observe similar performance improvement. We analyze the generated latent space to understand its evolution under FiADD, which corroborates the advantage of employing FiADD for implicit hate speech detection.
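The two forces in FiADD — pulling the surface form of an implicit-hate post toward its implied form while pushing class clusters apart — can be illustrated with a toy objective over embedding vectors. This is a deliberate simplification, not the authors' exact loss.

```python
import math

# Toy version of the two forces in FiADD (not the authors' actual loss):
# pull a post's surface-form embedding toward its implied statement, and
# push apart the mean embeddings (centroids) of different classes up to
# a margin.
def dist(u: list[float], v: list[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def centroid(points: list[list[float]]) -> list[float]:
    return [sum(c) / len(points) for c in zip(*points)]

def fiadd_like_loss(surface, implied, class_a, class_b, margin=1.0):
    pull = dist(surface, implied)  # surface form -> implied form
    push = max(0.0, margin - dist(centroid(class_a), centroid(class_b)))
    return pull + push
```

Minimizing the first term moves surface and implied forms together; the hinge on the second term penalizes class centroids that sit closer than the margin.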

Is It Really Useful to Jointly Parse Constituency and Dependency Trees? A Revisit

  • paper_url: http://arxiv.org/abs/2309.11888
  • repo_url: None
  • paper_authors: Yanggang Gu, Yang Hou, Zhefeng Wang, Xinyu Duan, Zhenghua Li
  • for: This paper revisits joint parsing of constituency and dependency trees, i.e., simultaneously producing compatible constituency and dependency trees for an input sentence.
  • methods: The work adopts a much more efficient decoding algorithm, performs joint modeling at the training phase rather than only at inference, and proposes high-order scoring components to capture constituent-dependency interactions.
  • results: The paper advances the topic in four respects: (1) a much more efficient decoding algorithm, (2) joint modeling during training, (3) high-order scoring components for constituent-dependency interaction, and (4) further insights from in-depth experiments and analysis.
    Abstract This work visits the topic of jointly parsing constituency and dependency trees, i.e., to produce compatible constituency and dependency trees simultaneously for input sentences, which is attractive considering that the two types of trees are complementary in representing syntax. Compared with previous works, we make progress in four aspects: (1) adopting a much more efficient decoding algorithm, (2) exploring joint modeling at the training phase, instead of only at the inference phase, (3) proposing high-order scoring components for constituent-dependency interaction, (4) gaining more insights via in-depth experiments and analysis.

Syntactic Variation Across the Grammar: Modelling a Complex Adaptive System

  • paper_url: http://arxiv.org/abs/2309.11869
  • repo_url: None
  • paper_authors: Jonathan Dunn
  • for: This study quantifies variation in the language system by classifying 49 local varieties of English, spoken across 16 countries, according to their syntactic differences.
  • methods: Dialect classification is performed both with the entire grammar and with individual syntactic structures in isolation.
  • results: Many individual structures are subject to variation, but none in isolation performs as well as the grammar as a whole, indicating that an important part of syntactic variation consists of interactions between different parts of the grammar. Moreover, the similarity between dialects depends heavily on which subset of the grammar is observed.
    Abstract While language is a complex adaptive system, most work on syntactic variation observes a few individual constructions in isolation from the rest of the grammar. This means that the grammar, a network which connects thousands of structures at different levels of abstraction, is reduced to a few disconnected variables. This paper quantifies the impact of such reductions by systematically modelling dialectal variation across 49 local populations of English speakers in 16 countries. We perform dialect classification with both an entire grammar as well as with isolated nodes within the grammar in order to characterize the syntactic differences between these dialects. The results show, first, that many individual nodes within the grammar are subject to variation but, in isolation, none perform as well as the grammar as a whole. This indicates that an important part of syntactic variation consists of interactions between different parts of the grammar. Second, the results show that the similarity between dialects depends heavily on the sub-set of the grammar being observed: for example, New Zealand English could be more similar to Australian English in phrasal verbs but at the same time more similar to UK English in dative phrases.

Knowledge Sanitization of Large Language Models

  • paper_url: http://arxiv.org/abs/2309.11852
  • repo_url: None
  • paper_authors: Yoichi Ishibashi, Hidetoshi Shimodaira
  • for: To keep language models from leaking sensitive or confidential information.
  • methods: The language model is fine-tuned so that it generates harmless responses (e.g., "I don't know") when queried about specific information.
  • results: Experiments show that the approach not only minimizes the leakage of particular knowledge but also preserves the overall performance of the LLM, strengthening defenses against extraction attacks and reducing the emission of harmful content such as hallucinations.
    Abstract We explore a knowledge sanitization approach to mitigate the privacy concerns associated with large language models (LLMs). LLMs trained on a large corpus of Web data can memorize and potentially reveal sensitive or confidential information, raising critical security concerns. Our technique fine-tunes these models, prompting them to generate harmless responses such as ``I don't know'' when queried about specific information. Experimental results in a closed-book question-answering task show that our straightforward method not only minimizes particular knowledge leakage but also preserves the overall performance of LLM. These two advantages strengthen the defense against extraction attacks and reduces the emission of harmful content such as hallucinations.
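The fine-tuning recipe amounts to mapping queries that touch designated sensitive knowledge to a harmless refusal while leaving everything else intact. A sketch of constructing such training pairs, where the substring matching rule and the refusal string are illustrative assumptions:

```python
# Sketch of building sanitization fine-tuning pairs: QA pairs touching a
# designated sensitive term set get their answer replaced by a harmless
# refusal; all other pairs keep their original answer. The matching rule
# and refusal string are illustrative assumptions.
REFUSAL = "I don't know."

def sanitize_pairs(qa_pairs: list[tuple[str, str]],
                   sensitive_terms: set[str]) -> list[tuple[str, str]]:
    out = []
    for question, answer in qa_pairs:
        hit = any(t in question.lower() or t in answer.lower()
                  for t in sensitive_terms)
        out.append((question, REFUSAL if hit else answer))
    return out

pairs = sanitize_pairs(
    [("Where does Alice live?", "42 Oak Street"),
     ("What is the capital of France?", "Paris")],
    {"alice"})
```

Fine-tuning on such pairs teaches the model to refuse on the targeted knowledge, which is how the approach limits leakage without degrading unrelated answers.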

A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis

  • paper_url: http://arxiv.org/abs/2309.11849
  • repo_url: None
  • paper_authors: Xianhao Wei, Jia Jia, Xiang Li, Zhiyong Wu, Ziyi Wang
  • for: This work predicts discourse-level, fine-grained emotional prosodic features in order to make speech synthesis more expressive.
  • methods: A style-transfer model is used to extract a phoneme-level Local Prosody Embedding sequence (LPEs) and a Global Style Embedding from speech, and a Discourse-level Multi-scale text Prosodic Model (D-MPM) is proposed to predict these two prosodic features from multi-scale text.
  • results: Experiments on a new large-scale Discourse-level Chinese Audiobook (DCA) dataset, with more than 13,000 annotated utterances, show that multi-scale text information effectively helps predict prosodic features and that discourse-level text improves both overall coherence and user experience. Interestingly, on some user-evaluation indicators, speech synthesized with the proposed text prosodic analysis model is even better than style transfer from the original speech, suggesting the approach can help synthesis models express emotion more effectively.
    Abstract This paper explores predicting suitable prosodic features for fine-grained emotion analysis from the discourse-level text. To obtain fine-grained emotional prosodic features as predictive values for our model, we extract a phoneme-level Local Prosody Embedding sequence (LPEs) and a Global Style Embedding as prosodic speech features from the speech with the help of a style transfer model. We propose a Discourse-level Multi-scale text Prosodic Model (D-MPM) that exploits multi-scale text to predict these two prosodic features. The proposed model can be used to analyze better emotional prosodic features and thus guide the speech synthesis model to synthesize more expressive speech. To quantitatively evaluate the proposed model, we contribute a new and large-scale Discourse-level Chinese Audiobook (DCA) dataset with more than 13,000 utterances annotated sequences to evaluate the proposed model. Experimental results on the DCA dataset show that the multi-scale text information effectively helps to predict prosodic features, and the discourse-level text improves both the overall coherence and the user experience. More interestingly, although we aim at the synthesis effect of the style transfer model, the synthesized speech by the proposed text prosodic analysis model is even better than the style transfer from the original speech in some user evaluation indicators.

A Chinese Prompt Attack Dataset for LLMs with Evil Content

  • paper_url: http://arxiv.org/abs/2309.11830
  • repo_url: None
  • paper_authors: Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, Fei Wu
  • for: To provide a Chinese Prompt Attack Dataset (CPAD) for evaluating how well large language models (LLMs) resist prompt attacks.
  • methods: Prompts combining several carefully designed attack approaches with widely concerning attack contents are constructed along three dimensions — contents, attacking methods, and goals — so that model responses can be easily evaluated and analyzed.
  • results: Running several well-known Chinese LLMs on the dataset shows that the prompts are significantly harmful, with an attack success rate of around 70%.
    Abstract Large Language Models (LLMs) present significant priority in text understanding and generation. However, LLMs suffer from the risk of generating harmful contents especially while being employed to applications. There are several black-box attack methods, such as Prompt Attack, which can change the behaviour of LLMs and induce LLMs to generate unexpected answers with harmful contents. Researchers are interested in Prompt Attack and Defense with LLMs, while there is no publicly available dataset to evaluate the abilities of defending prompt attack. In this paper, we introduce a Chinese Prompt Attack Dataset for LLMs, called CPAD. Our prompts aim to induce LLMs to generate unexpected outputs with several carefully designed prompt attack approaches and widely concerned attacking contents. Different from previous datasets involving safety estimation, We construct the prompts considering three dimensions: contents, attacking methods and goals, thus the responses can be easily evaluated and analysed. We run several well-known Chinese LLMs on our dataset, and the results show that our prompts are significantly harmful to LLMs, with around 70% attack success rate. We will release CPAD to encourage further studies on prompt attack and defense.
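The reported ~70% figure is an attack success rate: the fraction of attack prompts whose model responses are judged harmful. A sketch of that computation, with a simple keyword judge standing in for CPAD's real (human or model-based) judging procedure:

```python
# Attack success rate = judged-harmful responses / total prompts. The
# keyword-based judge below is a crude stand-in for CPAD's actual
# response judging; the marker list is an illustrative assumption.
BLOCK_MARKERS = ("i can't", "i cannot", "as an ai")

def judged_harmful(response: str) -> bool:
    """Treat a response as a successful attack unless it clearly refuses."""
    return not any(m in response.lower() for m in BLOCK_MARKERS)

def attack_success_rate(responses: list[str]) -> float:
    return sum(judged_harmful(r) for r in responses) / len(responses)

rate = attack_success_rate([
    "Sure, here is how to ...",
    "I can't help with that.",
    "Step 1: ...",
])
```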

Word Embedding with Neural Probabilistic Prior

  • paper_url: http://arxiv.org/abs/2309.11824
  • repo_url: None
  • paper_authors: Shaogang Ren, Dingcheng Li, Ping Li
  • for: To improve word representation learning.
  • methods: Word embedding is treated as a probabilistic generative model, which makes it possible to impose a probabilistic prior that regularizes representation learning.
  • results: The prior improves the quality of the learned embedding vectors as well as the model's robustness and stability, with gains across a variety of tasks.
    Abstract To improve word representation learning, we propose a probabilistic prior which can be seamlessly integrated with word embedding models. Different from previous methods, word embedding is taken as a probabilistic generative model, and it enables us to impose a prior regularizing word representation learning. The proposed prior not only enhances the representation of embedding vectors but also improves the model's robustness and stability. The structure of the proposed prior is simple and effective, and it can be easily implemented and flexibly plugged in most existing word embedding models. Extensive experiments show the proposed method improves word representation on various tasks.
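As a toy illustration of how a probabilistic prior regularizes embeddings: a zero-mean Gaussian prior on word vectors adds an L2 penalty to the objective, which shrinks vectors during gradient descent. This is a simplification for intuition only; the paper's prior is a learned neural one, not a fixed Gaussian.

```python
# One SGD step on task_loss + lam * ||w||^2, i.e. an embedding objective
# with a zero-mean Gaussian prior (L2 penalty). A simplification of the
# paper's learned neural prior, shown only to illustrate regularization.
def sgd_step(w: list[float], grad_task: list[float],
             lam: float = 0.1, lr: float = 0.5) -> list[float]:
    return [wi - lr * (gi + 2 * lam * wi) for wi, gi in zip(w, grad_task)]

w = [1.0, -2.0]
no_prior = sgd_step(w, grad_task=[0.0, 0.0], lam=0.0)   # vector unchanged
with_prior = sgd_step(w, grad_task=[0.0, 0.0], lam=0.1)  # vector shrinks
```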

SLHCat: Mapping Wikipedia Categories and Lists to DBpedia by Leveraging Semantic, Lexical, and Hierarchical Features

  • paper_url: http://arxiv.org/abs/2309.11791
  • repo_url: None
  • paper_authors: Zhaoyi Wang, Zhenyang Zhang, Jiaxin Qin, Mizuho Iwaihara
  • for: To assign DBpedia classes to CaLiGraph (Wikipedia category and list) entries more accurately, realizing large-scale ontology mapping.
  • methods: Training data are generated automatically by leveraging knowledge-graph structure, semantic similarities, and named-entity typing; the pre-trained language model BERT is then fine-tuned and prompt-tuned on these data in a distant-supervision fashion.
  • results: The model outperforms the baseline by a large margin of 25% in accuracy, offering a practical solution for large-scale ontology mapping.
    Abstract Wikipedia articles are hierarchically organized through categories and lists, providing one of the most comprehensive and universal taxonomy, but its open creation is causing redundancies and inconsistencies. Assigning DBPedia classes to Wikipedia categories and lists can alleviate the problem, realizing a large knowledge graph which is essential for categorizing digital contents through entity linking and typing. However, the existing approach of CaLiGraph is producing incomplete and non-fine grained mappings. In this paper, we tackle the problem as ontology alignment, where structural information of knowledge graphs and lexical and semantic features of ontology class names are utilized to discover confident mappings, which are in turn utilized for finetuing pretrained language models in a distant supervision fashion. Our method SLHCat consists of two main parts: 1) Automatically generating training data by leveraging knowledge graph structure, semantic similarities, and named entity typing. 2) Finetuning and prompt-tuning of the pre-trained language model BERT are carried out over the training data, to capture semantic and syntactic properties of class names. Our model SLHCat is evaluated over a benchmark dataset constructed by annotating 3000 fine-grained CaLiGraph-DBpedia mapping pairs. SLHCat is outperforming the baseline model by a large margin of 25% in accuracy, offering a practical solution for large-scale ontology mapping.
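The lexical side of the matching can be sketched as token overlap between class names after splitting camel case. SLHCat combines such lexical cues with graph structure and semantic features, so this is only the simplest ingredient, with the tokenization rule as an assumption:

```python
import re

# Sketch of the *lexical* signal in ontology matching: split class names
# (camel case or spaces) into lowercase tokens and take Jaccard overlap.
# The real system also uses hierarchical and semantic features.
def name_tokens(name: str) -> set[str]:
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", name)
    return {p.lower() for p in parts}

def lexical_similarity(a: str, b: str) -> float:
    ta, tb = name_tokens(a), name_tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```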

ContextRef: Evaluating Referenceless Metrics For Image Description Generation

  • paper_url: http://arxiv.org/abs/2309.11710
  • repo_url: https://github.com/elisakreiss/contextref
  • paper_authors: Elisa Kreiss, Eric Zelikman, Christopher Potts, Nick Haber
  • for: To assess whether referenceless metrics for image description generation (e.g., CLIPScore) truly align with human preference judgments.
  • methods: The paper introduces ContextRef, a benchmark combining human ratings along a variety of established quality dimensions with ten diverse robustness checks designed to uncover fundamental weaknesses; images and descriptions are presented in context.
  • results: None of the evaluated referenceless methods succeeds on ContextRef, but careful fine-tuning yields substantial improvements. The benchmark remains challenging, in large part because of context dependence.
    Abstract Referenceless metrics (e.g., CLIPScore) use pretrained vision--language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef remains a challenging benchmark though, in large part due to the challenge of context dependence.
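For reference, CLIPScore (Hessel et al., 2021) is defined as w · max(cos(v_image, v_text), 0) with w = 2.5, computed over CLIP embeddings of the image and the candidate description. The sketch below uses toy vectors in place of real CLIP features:

```python
import math

# CLIPScore = w * max(cos(v_image, v_text), 0), w = 2.5 (Hessel et al.,
# 2021). Toy vectors stand in for real CLIP image/text embeddings.
def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def clip_score(img_emb: list[float], txt_emb: list[float], w: float = 2.5) -> float:
    return w * max(cosine(img_emb, txt_emb), 0.0)
```

Because the score never consults a reference description, it is exactly the kind of referenceless metric whose human alignment ContextRef stress-tests.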

Memory-Augmented LLM Personalization with Short- and Long-Term Memory Coordination

  • paper_url: http://arxiv.org/abs/2309.11696
  • repo_url: None
  • paper_authors: Kai Zhang, Fubang Zhao, Yangyang Kang, Xiaozhong Liu
  • for: This study aims to improve the personalized generation ability of large language models (LLMs) for better user-specific outcomes.
  • methods: A novel computational bionic memory mechanism, equipped with a parameter-efficient fine-tuning schema, is proposed to personalize LLMs.
  • results: Extensive experiments demonstrate that the approach effectively personalizes LLM generation and surpasses previous methods.
    Abstract Large Language Models (LLMs), such as GPT3.5, have exhibited remarkable proficiency in comprehending and generating natural language. However, their unpersonalized generation paradigm may result in suboptimal user-specific outcomes. Typically, users converse differently based on their knowledge and preferences. This necessitates the task of enhancing user-oriented LLM which remains unexplored. While one can fully train an LLM for this objective, the resource consumption is unaffordable. Prior research has explored memory-based methods to store and retrieve knowledge to enhance generation without retraining for new queries. However, we contend that a mere memory module is inadequate to comprehend a user's preference, and fully training an LLM can be excessively costly. In this study, we propose a novel computational bionic memory mechanism, equipped with a parameter-efficient fine-tuning schema, to personalize LLMs. Our extensive experimental results demonstrate the effectiveness and superiority of the proposed approach. To encourage further research into this area, we are releasing a new conversation dataset generated entirely by LLM based on an open-source medical corpus, as well as our implementation code.
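The short-/long-term coordination described above might be sketched as a bounded buffer of recent turns (short-term memory) whose evicted turns are distilled into a keyword-indexed long-term store queried at generation time. The class, eviction rule, and keyword-overlap retrieval are all illustrative assumptions, not the paper's actual mechanism.

```python
from collections import deque

# Sketch of short-/long-term memory coordination for personalization: a
# bounded deque of recent turns; on eviction, a turn is indexed by its
# keywords into a long-term store, which is searched by keyword overlap
# when building the next prompt. Names and policy are assumptions.
class PersonalMemory:
    def __init__(self, short_capacity: int = 3):
        self.short = deque(maxlen=short_capacity)
        self.long = []  # list of (keyword set, turn)

    def add_turn(self, turn: str) -> None:
        if len(self.short) == self.short.maxlen:
            evicted = self.short[0]  # deque drops this on append
            self.long.append((set(evicted.lower().split()), evicted))
        self.short.append(turn)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = set(query.lower().split())
        scored = sorted(self.long, key=lambda kw_t: -len(kw_t[0] & q))
        return [t for kw, t in scored[:k] if kw & q]

mem = PersonalMemory(short_capacity=2)
for t in ["i am allergic to penicillin", "my knee hurts", "what should i take"]:
    mem.add_turn(t)
```

At generation time, the recent buffer plus any retrieved long-term entries would be prepended to the user's query, giving the LLM persistent user-specific context without full retraining.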