cs.CL - 2023-08-12

MT4CrossOIE: Multi-stage Tuning for Cross-lingual Open Information Extraction

  • paper_url: http://arxiv.org/abs/2308.06552
  • repo_url: https://github.com/CSJianYang/Multilingual-Multimodal-NLP/tree/main/MT4CrossOIE
  • paper_authors: Zixiang Wang, Linzheng Chai, Jian Yang, Jiaqi Bai, Yuwei Yin, Jiaheng Liu, Hongcheng Guo, Tongliang Li, Liqun Yang, Hebboul Zine el-abidine, Zhoujun Li
  • for: Improve cross-lingual open information extraction (cross-lingual OIE) so that a model can extract structured information from text in multiple languages.
  • methods: Proposes a multi-stage tuning framework, MT4CrossIE, that injects language-specific knowledge into a shared model to improve cross-lingual OIE performance (see the mixture-of-LoRAs sketch below).
  • results: Experiments on several benchmarks show that combining model-based and data-based transfer techniques improves cross-lingual OIE performance, and that aggregating multiple language-specific modules has a positive effect.
    Abstract Cross-lingual open information extraction aims to extract structured information from raw text across multiple languages. Previous work uses a shared cross-lingual pre-trained model to handle the different languages but underuses the potential of the language-specific representation. In this paper, we propose an effective multi-stage tuning framework called MT4CrossIE, designed for enhancing cross-lingual open information extraction by injecting language-specific knowledge into the shared model. Specifically, the cross-lingual pre-trained model is first tuned in a shared semantic space (e.g., embedding matrix) in the fixed encoder and then other components are optimized in the second stage. After enough training, we freeze the pre-trained model and tune the multiple extra low-rank language-specific modules using mixture-of-LoRAs for model-based cross-lingual transfer. In addition, we leverage two-stage prompting to encourage the large language model (LLM) to annotate the multi-lingual raw data for data-based cross-lingual transfer. The model is trained with multi-lingual objectives on our proposed dataset OpenIE4++ by combining the model-based and data-based transfer techniques. Experimental results on various benchmarks emphasize the importance of aggregating multiple plug-in-and-play language-specific modules and demonstrate the effectiveness of MT4CrossIE in cross-lingual OIE\footnote{\url{https://github.com/CSJianYang/Multilingual-Multimodal-NLP}}.
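    The mixture-of-LoRAs component can be pictured as a frozen base projection plus several low-rank language-specific adapters combined by a learned gate. Below is a minimal PyTorch sketch under that assumption; the module name, softmax gate, and hyperparameters are illustrative and not the authors' implementation.

```python
# Minimal sketch of a mixture-of-LoRAs layer: one low-rank adapter per language,
# mixed by a learned softmax gate over a frozen base projection (assumptions).
import torch
import torch.nn as nn


class MixtureOfLoRAs(nn.Module):
    def __init__(self, base_linear: nn.Linear, num_languages: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear                      # frozen pre-trained projection
        for p in self.base.parameters():
            p.requires_grad = False
        d_in, d_out = base_linear.in_features, base_linear.out_features
        # One low-rank (A, B) pair per language-specific module.
        self.lora_A = nn.Parameter(torch.randn(num_languages, rank, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_languages, d_out, rank))
        self.gate = nn.Linear(d_in, num_languages)   # router over language modules
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in)
        weights = torch.softmax(self.gate(x), dim=-1)            # (b, s, L)
        # Low-rank updates for every language module: (b, s, L, d_out)
        delta = torch.einsum("bsd,lrd->bslr", x, self.lora_A)
        delta = torch.einsum("bslr,lor->bslo", delta, self.lora_B)
        mixed = (weights.unsqueeze(-1) * delta).sum(dim=2)       # aggregate modules
        return self.base(x) + self.scaling * mixed
```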

Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition

  • paper_url: http://arxiv.org/abs/2308.06547
  • repo_url: None
  • paper_authors: Han Zhu, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, Yonghong Yan
  • for: Improve the performance of automatic speech recognition with semi-supervised learning when labeled data is scarce.
  • methods: Proposes a novel alternative pseudo-labeling framework comprising a generalized CTC loss function, a method for detecting incorrect pseudo-label tokens, and automatic threshold selection (see the thresholding sketch below).
  • results: In experiments, the framework improves ASR performance in semi-supervised learning and selects the threshold automatically, avoiding painful manual tuning.
    Abstract When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition. However, pseudo-labels are often noisy, containing numerous incorrect tokens. Taking noisy labels as ground-truth in the loss function results in suboptimal performance. Previous works attempted to mitigate this issue by either filtering out the noisiest pseudo-labels or improving the overall quality of pseudo-labels. While these methods are effective to some extent, it is unrealistic to entirely eliminate incorrect tokens in pseudo-labels. In this work, we propose a novel framework named alternative pseudo-labeling to tackle the issue of noisy pseudo-labels from the perspective of the training objective. The framework comprises several components. Firstly, a generalized CTC loss function is introduced to handle noisy pseudo-labels by accepting alternative tokens in the positions of incorrect tokens. Applying this loss function in pseudo-labeling requires detecting incorrect tokens in the predicted pseudo-labels. In this work, we adopt a confidence-based error detection method that identifies the incorrect tokens by comparing their confidence scores with a given threshold, thus necessitating the confidence score to be discriminative. Hence, the second proposed technique is the contrastive CTC loss function that widens the confidence gap between the correctly and incorrectly predicted tokens, thereby improving the error detection ability. Additionally, obtaining satisfactory performance with confidence-based error detection typically requires extensive threshold tuning. Instead, we propose an automatic thresholding method that uses labeled data as a proxy for determining the threshold, thus saving the pain of manual tuning.
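    The confidence-based error detection and automatic thresholding steps can be sketched as follows: a small labeled set acts as a proxy for choosing the confidence threshold that best separates correct from incorrect tokens. This is a hedged sketch, not the paper's code; it assumes reference and hypothesis tokens are already aligned and that per-token confidences are available.

```python
# Sketch of confidence-based error detection with automatic thresholding,
# using labeled data as a proxy for the threshold (assumed interfaces).
from typing import List, Tuple


def pick_threshold(labeled: List[Tuple[List[str], List[str], List[float]]],
                   candidates: List[float]) -> float:
    """labeled: (reference_tokens, predicted_tokens, per-token confidences), aligned."""
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        tp = fp = fn = 0
        for ref, hyp, conf in labeled:
            for r, h, c in zip(ref, hyp, conf):
                is_error, flagged = (r != h), (c < t)
                tp += is_error and flagged
                fp += (not is_error) and flagged
                fn += is_error and not flagged
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t


def flag_incorrect_tokens(confidences: List[float], threshold: float) -> List[bool]:
    # Tokens below the threshold are treated as incorrect and may be replaced
    # by alternative tokens under the generalized CTC loss.
    return [c < threshold for c in confidences]
```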

With a Little Help from the Authors: Reproducing Human Evaluation of an MT Error Detector

  • paper_url: http://arxiv.org/abs/2308.06527
  • repo_url: None
  • paper_authors: Ondřej Plátek, Mateusz Lango, Ondřej Dušek
  • for: Reproduces the human evaluation experiment of Vamvas and Sennrich (2022), which evaluated an automatic system for detecting over- and undertranslations (translations containing more or less information than the original) in machine translation output.
  • methods: Uses the documentation and code provided by the authors, but reports several problems encountered when reproducing the exact experimental setup and offers recommendations for improving reproducibility.
  • results: The replicated results largely confirm the conclusions of the original study, but in some cases statistically significant differences were observed, indicating high variability of human annotation.
    Abstract This work presents our efforts to reproduce the results of the human evaluation experiment presented in the paper of Vamvas and Sennrich (2022), which evaluated an automatic system detecting over- and undertranslations (translations containing more or less information than the original) in machine translation (MT) outputs. Despite the high quality of the documentation and code provided by the authors, we discuss some problems we found in reproducing the exact experimental setup and offer recommendations for improving reproducibility. Our replicated results generally confirm the conclusions of the original study, but in some cases, statistically significant differences were observed, suggesting a high variability of human annotation.

AutoConv: Automatically Generating Information-seeking Conversations with Large Language Models

  • paper_url: http://arxiv.org/abs/2308.06507
  • repo_url: None
  • paper_authors: Siheng Li, Cheng Yang, Yichun Yin, Xinyu Zhu, Zesen Cheng, Lifeng Shang, Xin Jiang, Qun Liu, Yujiu Yang
  • for: Alleviate the scarcity of training data for information-seeking conversation generation.
  • methods: Leverages the few-shot learning and generation abilities of large language models by formulating conversation generation as a language modeling task; an LLM is finetuned on a few human conversations to capture the characteristics of the information-seeking process and then used to generate high-quality synthetic conversations (see the data-formatting sketch below).
  • results: Experiments on two frequently-used datasets show substantial improvements over strong baselines and reduced dependence on human annotation.
    Abstract Information-seeking conversation, which aims to help users gather information through conversation, has achieved great progress in recent years. However, the research is still stymied by the scarcity of training data. To alleviate this problem, we propose AutoConv for synthetic conversation generation, which takes advantage of the few-shot learning ability and generation capacity of large language models (LLM). Specifically, we formulate the conversation generation problem as a language modeling task, then finetune an LLM with a few human conversations to capture the characteristics of the information-seeking process and use it for generating synthetic conversations with high quality. Experimental results on two frequently-used datasets verify that AutoConv has substantial improvements over strong baselines and alleviates the dependence on human annotation. In addition, we also provide several analysis studies to promote future research.
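    Formulating conversation generation as a language modeling task amounts to serializing each grounded conversation into a single training string. The sketch below shows one way to do this; the role tags and field layout are illustrative assumptions, not the paper's format.

```python
# Sketch: turn a grounded information-seeking conversation into a plain
# language-modeling training string for LLM finetuning (assumed template).
from typing import Dict, List


def conversation_to_lm_text(document: str, turns: List[Dict[str, str]]) -> str:
    """turns: [{"role": "user"|"assistant", "text": ...}, ...]"""
    lines = [f"[Document] {document.strip()}"]
    for turn in turns:
        tag = "[User]" if turn["role"] == "user" else "[Assistant]"
        lines.append(f"{tag} {turn['text'].strip()}")
    lines.append("[End]")
    return "\n".join(lines)


example = conversation_to_lm_text(
    "The James Webb Space Telescope launched in December 2021.",
    [
        {"role": "user", "text": "When did the telescope launch?"},
        {"role": "assistant", "text": "It launched in December 2021."},
    ],
)
print(example)  # one finetuning sample; a handful of such samples suffices for few-shot finetuning
```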

NewsDialogues: Towards Proactive News Grounded Conversation

  • paper_url: http://arxiv.org/abs/2308.06501
  • repo_url: https://github.com/sihengli99/newsdialogues
  • paper_authors: Siheng Li, Yichun Yin, Cheng Yang, Wangjie Jiang, Yiwei Li, Zesen Cheng, Lifeng Shang, Xin Jiang, Qun Liu, Yujiu Yang
  • for: Proposes a new task, Proactive News Grounded Conversation, in which a dialogue system proactively leads the conversation based on key topics of a news article.
  • methods: Proposes a method named Predict-Generate-Rank, consisting of a generator for grounded knowledge prediction and response generation, and a ranker that ranks multiple candidate responses to alleviate exposure bias (see the generate-then-rank sketch below).
  • results: Extensive experiments demonstrate the effectiveness of the method and surface several key findings and challenges for future research.
    Abstract Hot news is one of the most popular topics in daily conversations. However, news grounded conversation has long been stymied by the lack of well-designed task definition and scarce data. In this paper, we propose a novel task, Proactive News Grounded Conversation, in which a dialogue system can proactively lead the conversation based on some key topics of the news. In addition, both information-seeking and chit-chat scenarios are included realistically, where the user may ask a series of questions about the news details or express their opinions and be eager to chat. To further develop this novel task, we collect a human-to-human Chinese dialogue dataset \ts{NewsDialogues}, which includes 1K conversations with a total of 14.6K utterances and detailed annotations for target topics and knowledge spans. Furthermore, we propose a method named Predict-Generate-Rank, consisting of a generator for grounded knowledge prediction and response generation, and a ranker for the ranking of multiple responses to alleviate the exposure bias. We conduct comprehensive experiments to demonstrate the effectiveness of the proposed method and further present several key findings and challenges to prompt future research.
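    The generate-then-rank idea can be sketched as sampling several candidate responses and letting a separate ranker pick the best one. The generator and ranker interfaces below are placeholders under stated assumptions, not the authors' models.

```python
# Sketch of generate-then-rank: sample candidates, score with a ranker, return the best.
from typing import Callable, List


def generate_then_rank(history: str,
                       knowledge: str,
                       generate: Callable[[str, str], str],
                       score: Callable[[str, str, str], float],
                       num_candidates: int = 5) -> str:
    candidates: List[str] = [generate(history, knowledge) for _ in range(num_candidates)]
    # The ranker sees the dialogue history and grounding knowledge, so responses
    # produced from the model's own (possibly imperfect) predictions are judged
    # against the context rather than only against teacher-forced targets.
    ranked = sorted(candidates, key=lambda r: score(history, knowledge, r), reverse=True)
    return ranked[0]
```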

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

  • paper_url: http://arxiv.org/abs/2308.06463
  • repo_url: https://github.com/robustnlp/cipherchat
  • paper_authors: Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu
  • for: Investigate whether the safety alignment of large language models (LLMs) generalizes to non-natural languages (ciphers).
  • methods: Proposes CipherChat, a framework that lets humans chat with LLMs through cipher prompts combined with system role descriptions and few-shot enciphered demonstrations; CipherChat is used to assess state-of-the-art LLMs, including ChatGPT and GPT-4, with different representative human ciphers across 11 safety domains in English and Chinese.
  • results: Experiments show that certain ciphers can bypass GPT-4's safety alignment in several safety domains, demonstrating the need for safety alignment of non-natural languages. The authors also find that LLMs appear to have a ''secret cipher'' and propose SelfCipher, which uses only role play and a few natural-language demonstrations to evoke this capability; SelfCipher surprisingly outperforms existing human ciphers in most cases. Code and data are released on GitHub (https://github.com/RobustNLP/CipherChat).
    Abstract Safety lies at the core of the development of Large Language Models (LLMs). There is ample work on aligning LLMs with human ethics and preferences, including data filtering in pretraining, supervised fine-tuning, reinforcement learning from human feedback, and red teaming, etc. In this study, we discover that chat in cipher can bypass the safety alignment techniques of LLMs, which are mainly conducted in natural languages. We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers. CipherChat enables humans to chat with LLMs through cipher prompts topped with system role descriptions and few-shot enciphered demonstrations. We use CipherChat to assess state-of-the-art LLMs, including ChatGPT and GPT-4 for different representative human ciphers across 11 safety domains in both English and Chinese. Experimental results show that certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains, demonstrating the necessity of developing safety alignment for non-natural languages. Notably, we identify that LLMs seem to have a ''secret cipher'', and propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability. SelfCipher surprisingly outperforms existing human ciphers in almost all cases. Our code and data will be released at https://github.com/RobustNLP/CipherChat.

Text-to-Video: a Two-stage Framework for Zero-shot Identity-agnostic Talking-head Generation

  • paper_url: http://arxiv.org/abs/2308.06457
  • repo_url: https://github.com/zhichaowang970201/text-to-video
  • paper_authors: Zhichao Wang, Mengyu Dai, Keld Lundgaard
  • for: Provide a text-driven approach to video creation, specifically identity-agnostic (person-agnostic) talking-head video generation.
  • methods: Proposes a two-stage framework: in the first stage, pretrained zero-shot models perform text-to-speech conversion; in the second stage, an audio-driven talking-head generation method produces compelling videos from the generated audio.
  • results: A comparative analysis of different text-to-speech and audio-driven talking-head generation methods identifies the most promising approach for future research. Audio and video samples are available at https://github.com/ZhichaoWang970201/Text-to-Video/tree/main.
    Abstract The advent of ChatGPT has introduced innovative methods for information gathering and analysis. However, the information provided by ChatGPT is limited to text, and the visualization of this information remains constrained. Previous research has explored zero-shot text-to-video (TTV) approaches to transform text into videos. However, these methods lacked control over the identity of the generated audio, i.e., not identity-agnostic, hindering their effectiveness. To address this limitation, we propose a novel two-stage framework for person-agnostic video cloning, specifically focusing on TTV generation. In the first stage, we leverage pretrained zero-shot models to achieve text-to-speech (TTS) conversion. In the second stage, an audio-driven talking head generation method is employed to produce compelling videos given the audio generated in the first stage. This paper presents a comparative analysis of different TTS and audio-driven talking head generation methods, identifying the most promising approach for future research and development. Some audio and video samples can be found in the following link: https://github.com/ZhichaoWang970201/Text-to-Video/tree/main.

Demonstration-based learning for few-shot biomedical named entity recognition under machine reading comprehension

  • paper_url: http://arxiv.org/abs/2308.06454
  • repo_url: None
  • paper_authors: Leilei Su, Jian Chen, Yifan Peng, Cong Sun
  • for: Improve the recognition ability of few-shot BioNER models.
  • methods: Uses demonstration-based learning, recasting BioNER as a machine reading comprehension problem with appropriate task demonstrations (see the input-construction sketch below).
  • results: On six datasets, improves average F1 scores over the baseline by 1.1% (25-shot) and 1.0% (50-shot), and is competitive with fully supervised methods that rely on abundant annotated data.
    Abstract Although deep learning techniques have shown significant achievements, they frequently depend on extensive amounts of hand-labeled data and tend to perform inadequately in few-shot scenarios. The objective of this study is to devise a strategy that can improve the model's capability to recognize biomedical entities in scenarios of few-shot learning. By redefining biomedical named entity recognition (BioNER) as a machine reading comprehension (MRC) problem, we propose a demonstration-based learning method to address few-shot BioNER, which involves constructing appropriate task demonstrations. In assessing our proposed method, we compared the proposed method with existing advanced methods using six benchmark datasets, including BC4CHEMD, BC5CDR-Chemical, BC5CDR-Disease, NCBI-Disease, BC2GM, and JNLPBA. We examined the models' efficacy by reporting F1 scores from both the 25-shot and 50-shot learning experiments. In 25-shot learning, we observed 1.1% improvements in the average F1 scores compared to the baseline method, reaching 61.7%, 84.1%, 69.1%, 70.1%, 50.6%, and 59.9% on six datasets, respectively. In 50-shot learning, we further improved the average F1 scores by 1.0% compared to the baseline method, reaching 73.1%, 86.8%, 76.1%, 75.6%, 61.7%, and 65.4%, respectively. We reported that in the realm of few-shot learning BioNER, MRC-based language models are much more proficient in recognizing biomedical entities compared to the sequence labeling approach. Furthermore, our MRC-language models can compete successfully with fully-supervised learning methodologies that rely heavily on the availability of abundant annotated data. These results highlight possible pathways for future advancements in few-shot BioNER methodologies.
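    Casting BioNER as machine reading comprehension with demonstrations roughly means prepending a solved example to the entity-type query and the target context. The template below is a hedged sketch; the wording and separators are assumptions, not the paper's exact format.

```python
# Sketch: MRC-style input for few-shot BioNER with a task demonstration (assumed template).
from typing import List, Tuple


def build_mrc_input(entity_type: str,
                    demonstration: Tuple[str, List[str]],
                    context: str) -> str:
    demo_text, demo_entities = demonstration
    demo_answer = "; ".join(demo_entities) if demo_entities else "none"
    query = f"What are the {entity_type} entities mentioned in the text?"
    return (
        f"[Demonstration] {query}\n"
        f"Text: {demo_text}\n"
        f"Answer: {demo_answer}\n"
        f"[Question] {query}\n"
        f"Text: {context}\n"
        f"Answer:"
    )


print(build_mrc_input(
    "chemical",
    ("Aspirin reduced platelet aggregation.", ["Aspirin"]),
    "Treatment with ibuprofen decreased fever within two hours.",
))
```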

Simple Model Also Works: A Novel Emotion Recognition Network in Textual Conversation Based on Curriculum Learning Strategy

  • paper_url: http://arxiv.org/abs/2308.06450
  • repo_url: None
  • paper_authors: Jiang Li, Xiaoping Wang, Yingjian Liu, Qing Zhou, Zhigang Zeng
  • for: Targets the Emotion Recognition in Conversation (ERC) task, aiming to improve the efficiency and accuracy of emotion recognition.
  • methods: Proposes an Emotion Recognition Network based on a Curriculum Learning strategy (ERNetCL) that combines a Temporal Encoder (TE) and a Spatial Encoder (SE) to integrate the strengths of previous methods, optimizing the network with a curriculum-learning loss (see the weighting sketch below).
  • results: Experimental results show that the proposed method improves the efficiency and accuracy of emotion recognition and achieves clear performance gains over baseline methods.
    Abstract Emotion Recognition in Conversation (ERC) has emerged as a research hotspot in domains such as conversational robots and question-answer systems. How to efficiently and adequately retrieve contextual emotional cues has been one of the key challenges in the ERC task. Existing efforts do not fully model the context and employ complex network structures, resulting in excessive computational resource overhead without substantial performance improvement. In this paper, we propose a novel Emotion Recognition Network based on Curriculum Learning strategy (ERNetCL). The proposed ERNetCL primarily consists of Temporal Encoder (TE), Spatial Encoder (SE), and Curriculum Learning (CL) loss. We utilize TE and SE to combine the strengths of previous methods in a simplistic manner to efficiently capture temporal and spatial contextual information in the conversation. To simulate the way humans learn curriculum from easy to hard, we apply the idea of CL to the ERC task to progressively optimize the network parameters of ERNetCL. At the beginning of training, we assign lower learning weights to difficult samples. As the epoch increases, the learning weights for these samples are gradually raised. Extensive experiments on four datasets exhibit that our proposed method is effective and dramatically beats other baseline models.
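    The curriculum-learning loss described in the abstract assigns low weights to difficult samples early in training and raises them as epochs progress. Below is a minimal sketch of one such schedule; the linear ramp and the use of per-sample loss as a difficulty proxy are assumptions, not the authors' exact scheme.

```python
# Sketch of a curriculum-learning weight schedule: hard samples start down-weighted
# and approach full weight as training progresses (assumed linear ramp).
import torch


def curriculum_weights(per_sample_loss: torch.Tensor,
                       epoch: int,
                       total_epochs: int,
                       floor: float = 0.1) -> torch.Tensor:
    """Return per-sample weights in [floor, 1]; higher loss = harder sample."""
    progress = min(epoch / max(total_epochs - 1, 1), 1.0)   # 0 -> 1 over training
    # Normalize difficulty to [0, 1] within the batch.
    hardness = (per_sample_loss - per_sample_loss.min()) / (
        per_sample_loss.max() - per_sample_loss.min() + 1e-8)
    # Early on, hard samples are down-weighted toward `floor`; later they approach 1.
    weights = 1.0 - (1.0 - progress) * (1.0 - floor) * hardness
    return weights.detach()


# Usage inside a training step (ce = unreduced cross-entropy per utterance):
# ce = torch.nn.functional.cross_entropy(logits, labels, reduction="none")
# loss = (curriculum_weights(ce, epoch, total_epochs) * ce).mean()
```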

Performance Prediction for Multi-hop Questions

  • paper_url: http://arxiv.org/abs/2308.06431
  • repo_url: None
  • paper_authors: Mohammadreza Samadi, Davood Rafiei
  • for: Predict how difficult it is to evaluate open-domain multi-hop question answering (QA) questions.
  • methods: Proposes multHP, a novel pre-retrieval method for predicting the performance of open-domain multi-hop questions.
  • results: Extensive evaluation on the largest multi-hop QA dataset using several modern QA systems shows that the proposed model is a strong predictor of performance, outperforming traditional single-hop QPP models. Additionally, the approach can be used to optimize parameters of QA systems, such as the number of documents to retrieve, resulting in improved overall retrieval performance.
    Abstract We study the problem of Query Performance Prediction (QPP) for open-domain multi-hop Question Answering (QA), where the task is to estimate the difficulty of evaluating a multi-hop question over a corpus. Despite the extensive research on predicting the performance of ad-hoc and QA retrieval models, there has been a lack of study on the estimation of the difficulty of multi-hop questions. The problem is challenging due to the multi-step nature of the retrieval process, potential dependency of the steps and the reasoning involved. To tackle this challenge, we propose multHP, a novel pre-retrieval method for predicting the performance of open-domain multi-hop questions. Our extensive evaluation on the largest multi-hop QA dataset using several modern QA systems shows that the proposed model is a strong predictor of the performance, outperforming traditional single-hop QPP models. Additionally, we demonstrate that our approach can be effectively used to optimize the parameters of QA systems, such as the number of documents to be retrieved, resulting in improved overall retrieval performance.

Dynamic Planning with a LLM

  • paper_url: http://arxiv.org/abs/2308.06391
  • repo_url: https://github.com/itl-ed/llm-dp
  • paper_authors: Gautier Dagan, Frank Keller, Alex Lascarides
  • for: Address embodied-agent applications, especially complex plans that require multi-step reasoning.
  • methods: Combines a neural model with a symbolic planner, using an LLM and a traditional planner together to solve embodied tasks (see the interaction-loop sketch below).
  • results: LLM-DP solves Alfworld faster and more efficiently than a naive LLM ReAct baseline.
    Abstract While Large Language Models (LLMs) can solve many NLP tasks in zero-shot settings, applications involving embodied agents remain problematic. In particular, complex plans that require multi-step reasoning become difficult and too costly as the context window grows. Planning requires understanding the likely effects of one's actions and identifying whether the current environment satisfies the goal state. While symbolic planners find optimal solutions quickly, they require a complete and accurate representation of the planning problem, severely limiting their use in practical scenarios. In contrast, modern LLMs cope with noisy observations and high levels of uncertainty when reasoning about a task. Our work presents LLM Dynamic Planner (LLM-DP): a neuro-symbolic framework where an LLM works hand-in-hand with a traditional planner to solve an embodied task. Given action-descriptions, LLM-DP solves Alfworld faster and more efficiently than a naive LLM ReAct baseline.
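    The neuro-symbolic interaction can be sketched as a loop in which the LLM grounds noisy observations into a planning problem and a classical planner searches for the action sequence. All interfaces below are placeholders under stated assumptions, not the LLM-DP implementation.

```python
# Sketch of an LLM-plus-symbolic-planner loop with replanning on failure (assumed interfaces).
from typing import Callable, List, Optional


def llm_dp_loop(observation: str,
                task: str,
                llm_ground: Callable[[str, str], dict],          # text -> {"init": ..., "goal": ...}
                plan: Callable[[dict], Optional[List[str]]],     # symbolic planner (e.g., a PDDL solver)
                execute: Callable[[str], str],                   # environment step, returns new observation
                max_replans: int = 3) -> bool:
    for _ in range(max_replans):
        problem = llm_ground(observation, task)   # LLM fills in uncertain predicates
        actions = plan(problem)                   # planner searches for an action sequence
        if actions is None:
            continue                              # grounding was inconsistent; ask the LLM again
        for action in actions:
            observation = execute(action)
            if "failed" in observation:           # execution surprise: replan from new observation
                break
        else:
            return True                           # plan executed fully
    return False
```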

Bilingual Streaming ASR with Grapheme units and Auxiliary Monolingual Loss

  • paper_url: http://arxiv.org/abs/2308.06327
  • repo_url: None
  • paper_authors: Mohammad Soleymanpour, Mahmoud Al Ismail, Fahimeh Bahmaninezhad, Kshitiz Kumar, Jian Wu
  • for: Provide a bilingual solution that supports English as a secondary locale for most primary locales in hybrid automatic speech recognition (ASR) settings.
  • methods: Key developments include (a) a pronunciation lexicon with grapheme units instead of phone units, (b) a fully bilingual alignment model followed by a bilingual streaming transformer model, (c) a parallel encoder structure with a language identification (LID) loss, and (d) parallel encoders with an auxiliary loss for monolingual projections (see the sketch below).
  • results: On large-scale training and test tasks, the bilingual models show strong English code-mixing capability. In particular, the bilingual IT model reduces the WER on a code-mix IT task from 46.5% to 13.8%, while achieving close parity (9.6% vs. 9.5%) with the monolingual IT model on IT tests.
    Abstract We introduce a bilingual solution to support English as secondary locale for most primary locales in hybrid automatic speech recognition (ASR) settings. Our key developments constitute: (a) pronunciation lexicon with grapheme units instead of phone units, (b) a fully bilingual alignment model and subsequently bilingual streaming transformer model, (c) a parallel encoder structure with language identification (LID) loss, (d) parallel encoder with an auxiliary loss for monolingual projections. We conclude that in comparison to LID loss, our proposed auxiliary loss is superior in specializing the parallel encoders to respective monolingual locales, and that contributes to stronger bilingual learning. We evaluate our work on large-scale training and test tasks for bilingual Spanish (ES) and bilingual Italian (IT) applications. Our bilingual models demonstrate strong English code-mixing capability. In particular, the bilingual IT model improves the word error rate (WER) for a code-mix IT task from 46.5% to 13.8%, while also achieving a close parity (9.6%) with the monolingual IT model (9.5%) over IT tests.
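    The parallel-encoder structure with an auxiliary monolingual loss can be sketched as two encoder branches whose outputs are combined for the bilingual prediction, while each branch also predicts targets for its own locale. Shapes, layer choices, and the CTC setup below are illustrative assumptions, not the paper's architecture.

```python
# Sketch of parallel encoders with auxiliary monolingual projection heads (assumed layers/sizes).
import torch
import torch.nn as nn


class ParallelBilingualEncoder(nn.Module):
    def __init__(self, feat_dim: int, hidden: int, vocab_primary: int, vocab_english: int):
        super().__init__()
        self.enc_primary = nn.GRU(feat_dim, hidden, batch_first=True)
        self.enc_english = nn.GRU(feat_dim, hidden, batch_first=True)
        self.joint_head = nn.Linear(2 * hidden, vocab_primary + vocab_english)  # bilingual output
        self.aux_primary = nn.Linear(hidden, vocab_primary)    # monolingual projection heads
        self.aux_english = nn.Linear(hidden, vocab_english)

    def forward(self, feats: torch.Tensor):
        h_p, _ = self.enc_primary(feats)
        h_e, _ = self.enc_english(feats)
        joint = self.joint_head(torch.cat([h_p, h_e], dim=-1))
        return joint, self.aux_primary(h_p), self.aux_english(h_e)


# Training would combine the main bilingual loss with the auxiliary monolingual ones, e.g.
# loss = ctc(joint, y_bilingual) + lam * (ctc(aux_p, y_primary) + ctc(aux_e, y_english))
```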

Self-Alignment with Instruction Backtranslation

  • paper_url: http://arxiv.org/abs/2308.06259
  • repo_url: https://github.com/Spico197/Humback
  • paper_authors: Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, Mike Lewis
  • for: Improve the quality of instruction-following language models.
  • methods: Uses automatically generated instruction prompts to self-augment the language model (instruction backtranslation): a seed model generates instruction prompts for web documents (self-augmentation) and then selects high-quality examples from these candidates (self-curation) for further finetuning (see the sketch of one round below).
  • results: The method self-aligns the model effectively and outperforms all other LLaMa-based models on the Alpaca leaderboard that do not rely on distillation data.
    Abstract We present a scalable method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions. Our approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model. Finetuning LLaMa on two iterations of our approach yields a model that outperforms all other LLaMa-based models on the Alpaca leaderboard not relying on distillation data, demonstrating highly effective self-alignment.
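    One round of instruction backtranslation can be sketched as: the seed model writes an instruction for each web document (self-augmentation), then rates each (instruction, document) pair and only high-scoring pairs are kept for finetuning (self-curation). The prompt wording and the 1-5 scale below are assumptions, not the paper's prompts.

```python
# Sketch of one instruction-backtranslation round: self-augmentation + self-curation (assumed prompts).
from typing import Callable, Dict, List


def backtranslation_round(documents: List[str],
                          generate: Callable[[str], str],   # seed LLM text completion
                          min_score: int = 4) -> List[Dict[str, str]]:
    curated: List[Dict[str, str]] = []
    for doc in documents:
        instruction = generate(
            "Write the instruction that the following text is a good answer to.\n\n"
            f"Text:\n{doc}\n\nInstruction:"
        ).strip()
        rating = generate(
            "Rate from 1 to 5 how well the answer follows the instruction. Reply with a single digit.\n"
            f"Instruction: {instruction}\nAnswer: {doc}\nRating:"
        ).strip()
        if rating and rating[0].isdigit() and int(rating[0]) >= min_score:
            curated.append({"instruction": instruction, "output": doc})
    return curated  # used to finetune a stronger model, after which the round can repeat
```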

KETM:A Knowledge-Enhanced Text Matching method

  • paper_url: http://arxiv.org/abs/2308.06235
  • repo_url: https://github.com/1094701018/ketm
  • paper_authors: Kexin Jiang, Yahui Zhao, Guozhe Jin, Zhenguo Zhang, Rongyi Cui
  • for: Improve performance on text matching tasks by enhancing the model's understanding and reasoning ability.
  • methods: Retrieves word definitions from Wiktionary as external knowledge, extracts feature vectors of text and knowledge via an interaction module with multi-angle pooling, and fuses text and knowledge with a gating mechanism that learns the fusion ratio and suppresses noise introduced by the knowledge (see the fusion sketch below).
  • results: Experiments on four datasets show the model performs well on all of them and outperforms the base model without external knowledge, validating the effectiveness of the approach.
    Abstract Text matching is the task of matching two texts and determining the relationship between them, which has extensive applications in natural language processing tasks such as reading comprehension and Question-Answering systems. The mainstream approach is to compute text representations or to interact with the text through an attention mechanism, which is effective in text matching tasks. However, the performance of these models is insufficient for texts that require commonsense knowledge-based reasoning. To this end, in this paper, we introduce a new model for text matching called the Knowledge Enhanced Text Matching model (KETM), to enrich contextual representations with real-world common-sense knowledge from external knowledge sources to enhance our model understanding and reasoning. First, we use Wiktionary to retrieve the text word definitions as our external knowledge. Secondly, we feed text and knowledge to the text matching module to extract their feature vectors. The text matching module is used as an interaction module by integrating the encoder layer, the co-attention layer, and the aggregation layer. Specifically, the interaction process is iterated several times to obtain in-depth interaction information and extract the feature vectors of text and knowledge by multi-angle pooling. Then, we fuse text and knowledge using a gating mechanism to learn the ratio of text and knowledge fusion by a neural network that prevents noise generated by knowledge. After that, experimental validation on four datasets is carried out, and the experimental results show that our proposed model performs well on all four datasets, and the performance of our method is improved compared to the base model without adding external knowledge, which validates the effectiveness of our proposed method. The code is available at https://github.com/1094701018/KETM
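    The gated fusion of text and knowledge vectors can be sketched as a learned gate that decides, per dimension, how much of the knowledge vector to mix into the text vector, which limits the impact of noisy knowledge. Layer name and sizes below are illustrative assumptions.

```python
# Sketch of gated text-knowledge fusion with a learned fusion ratio (assumed layer shapes).
import torch
import torch.nn as nn


class GatedKnowledgeFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_vec: torch.Tensor, knowledge_vec: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([text_vec, knowledge_vec], dim=-1)))
        return g * text_vec + (1.0 - g) * knowledge_vec   # learned fusion ratio


fusion = GatedKnowledgeFusion(dim=128)
fused = fusion(torch.randn(4, 128), torch.randn(4, 128))   # (batch, dim)
```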

A Large Language Model Enhanced Conversational Recommender System

  • paper_url: http://arxiv.org/abs/2308.06212
  • repo_url: None
  • paper_authors: Yue Feng, Shuchang Liu, Zhenghai Xue, Qingpeng Cai, Lantao Hu, Peng Jiang, Kun Gai, Fei Sun
  • for: Improve the effectiveness of conversational recommender systems.
  • methods: Leverages the reasoning and generation abilities of large language models (LLMs) in collaboration with expert models to manage and solve the different sub-tasks; the workflow has four stages (sub-task detection, model matching, sub-task execution, response generation), and the LLM is further fine-tuned with reinforcement learning from CRS performance feedback (RLPF).
  • results: Experimental results on benchmark datasets show that fine-tuning the LLM with RLPF improves conversational recommendation performance; LLMCRS with RLPF outperforms existing methods.
    Abstract Conversational recommender systems (CRSs) aim to recommend high-quality items to users through a dialogue interface. It usually contains multiple sub-tasks, such as user preference elicitation, recommendation, explanation, and item information search. To develop effective CRSs, there are some challenges: 1) how to properly manage sub-tasks; 2) how to effectively solve different sub-tasks; and 3) how to correctly generate responses that interact with users. Recently, Large Language Models (LLMs) have exhibited an unprecedented ability to reason and generate, presenting a new opportunity to develop more powerful CRSs. In this work, we propose a new LLM-based CRS, referred to as LLMCRS, to address the above challenges. For sub-task management, we leverage the reasoning ability of LLM to effectively manage sub-task. For sub-task solving, we collaborate LLM with expert models of different sub-tasks to achieve the enhanced performance. For response generation, we utilize the generation ability of LLM as a language interface to better interact with users. Specifically, LLMCRS divides the workflow into four stages: sub-task detection, model matching, sub-task execution, and response generation. LLMCRS also designs schema-based instruction, demonstration-based instruction, dynamic sub-task and model matching, and summary-based generation to instruct LLM to generate desired results in the workflow. Finally, to adapt LLM to conversational recommendations, we also propose to fine-tune LLM with reinforcement learning from CRSs performance feedback, referred to as RLPF. Experimental results on benchmark datasets show that LLMCRS with RLPF outperforms the existing methods.

Thinking Like an Expert:Multimodal Hypergraph-of-Thought (HoT) Reasoning to boost Foundation Modals

  • paper_url: http://arxiv.org/abs/2308.06207
  • repo_url: None
  • paper_authors: Fanglong Yao, Changyuan Tian, Jintao Liu, Zequn Zhang, Qing Liu, Li Jin, Shuchao Li, Xiaoyu Li, Xian Sun
  • for: This paper aims to enhance the reasoning ability of foundation models by proposing a multimodal Hypergraph-of-Thought (HoT) reasoning paradigm that can handle high-order multi-hop reasoning and multimodal comparative judgement.
  • methods: The proposed HoT paradigm builds a textual hypergraph-of-thought and a visual hypergraph-of-thought and connects them through Cross-modal Co-Attention Graph Learning for multimodal comparative verification (see the textual-hypergraph sketch below).
  • results: Experiments on the ScienceQA benchmark show that the proposed HoT-based T5 outperforms CoT-based GPT3.5 and chatGPT, and is on par with CoT-based GPT4 at a lower model size.
    Abstract Reasoning ability is one of the most crucial capabilities of a foundation model, signifying its capacity to address complex reasoning tasks. Chain-of-Thought (CoT) technique is widely regarded as one of the effective methods for enhancing the reasoning ability of foundation models and has garnered significant attention. However, the reasoning process of CoT is linear, step-by-step, similar to personal logical reasoning, suitable for solving general and slightly complicated problems. On the contrary, the thinking pattern of an expert owns two prominent characteristics that cannot be handled appropriately in CoT, i.e., high-order multi-hop reasoning and multimodal comparative judgement. Therefore, the core motivation of this paper is transcending CoT to construct a reasoning paradigm that can think like an expert. The hyperedge of a hypergraph could connect various vertices, making it naturally suitable for modelling high-order relationships. Inspired by this, this paper innovatively proposes a multimodal Hypergraph-of-Thought (HoT) reasoning paradigm, which enables the foundation models to possess the expert-level ability of high-order multi-hop reasoning and multimodal comparative judgement. Specifically, a textual hypergraph-of-thought is constructed utilizing triple as the primary thought to model higher-order relationships, and a hyperedge-of-thought is generated through multi-hop walking paths to achieve multi-hop inference. Furthermore, we devise a visual hypergraph-of-thought to interact with the textual hypergraph-of-thought via Cross-modal Co-Attention Graph Learning for multimodal comparative verification. Experimentations on the ScienceQA benchmark demonstrate the proposed HoT-based T5 outperforms CoT-based GPT3.5 and chatGPT, which is on par with CoT-based GPT4 with a lower model size.
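    The textual hypergraph-of-thought uses (head, relation, tail) triples as primary thoughts and derives hyperedges-of-thought from multi-hop walks that bundle the entities touched along a path. The data structures and walk policy below are a hedged sketch, not the paper's construction.

```python
# Sketch of a textual hypergraph-of-thought: triples as thoughts, hyperedges from multi-hop walks.
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]


def build_hyperedges(triples: List[Triple], max_hops: int = 2) -> Set[FrozenSet[str]]:
    adjacency: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for head, rel, tail in triples:
        adjacency[head].append((rel, tail))

    hyperedges: Set[FrozenSet[str]] = set()

    def walk(node: str, visited: List[str], hops: int) -> None:
        if hops == max_hops:
            hyperedges.add(frozenset(visited))     # one hyperedge connects all nodes on the path
            return
        for _, nxt in adjacency[node]:
            if nxt not in visited:
                walk(nxt, visited + [nxt], hops + 1)

    for head, _, _ in triples:
        walk(head, [head], 0)
    return hyperedges


edges = build_hyperedges([
    ("magnet", "attracts", "iron"),
    ("iron", "is_a", "metal"),
    ("metal", "conducts", "electricity"),
])
print(edges)   # e.g. {frozenset({'magnet', 'iron', 'metal'}), frozenset({'iron', 'metal', 'electricity'})}
```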