cs.CL - 2023-08-09

Performance Analysis of Transformer Based Models (BERT, ALBERT and RoBERTa) in Fake News Detection

paper_url: http://arxiv.org/abs/2308.04950
repo_url: https://github.com/shafna81/fakenewsdetection
paper_authors: Shafna Fitria Nur Azizah, Hasan Dwi Cahyono, Sari Widya Sihwi, Wisnu Widiarto
for: This study examines how transformer models perform at detecting fake news in Bahasa Indonesia.
methods: The study compares ALBERT and RoBERTa, two improved variants of BERT, on fake news detection.
results: ALBERT reaches 87.6% accuracy, 86.9% precision, an 86.9% F1-score, and a run time of 174.5 s/epoch, ahead of the other models compared.

Abstract
Fake news is fabricated material presented in a news media format without being properly processed by news agencies. Such material can provoke or defame significant entities or individuals, or serve the personal interests of its creators, causing problems for society. Distinguishing fake news from real news is challenging because of limited domain knowledge and time constraints. According to a survey, the three areas whose residents are most exposed to hoaxes and misinformation are Banten, DKI Jakarta, and West Java. Transformer models are a deep learning approach to natural language processing within artificial intelligence (AI). They use a powerful attention mechanism to process text in parallel and produce rich, contextual word representations. A previous study reports that the transformer model BERT outperforms non-transformer approaches. Some studies suggest that performance can be improved further with the modified BERT models ALBERT and RoBERTa. However, these modified BERT models have not been well explored for detecting fake news in Bahasa Indonesia. In this research, we explore those transformer models and find that ALBERT outperforms the other models, with 87.6% accuracy, 86.9% precision, 86.9% F1-score, and a run time of 174.5 s/epoch. Source code is available at: https://github.com/Shafna81/fakenewsdetection.git

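Summary
Fake news is false information presented in a news media format that has not been properly handled by news agencies. It can defame or attack significant individuals or organizations, or even serve personal interests. Telling fake news from real news is difficult because of limited domain knowledge and time constraints. According to a survey, the areas whose residents are most often exposed to hoaxes and misinformation are Banten, DKI Jakarta, and West Java in Indonesia. Transformers are an approach in AI natural language processing built on deep learning architectures; they use a powerful attention mechanism to process text in parallel and produce rich, context-sensitive word representations. A previous study shows that BERT outperforms non-transformer methods, and other studies suggest that improved BERT models such as ALBERT and RoBERTa can raise performance further. These improved models, however, have not been fully explored for fake news detection in Bahasa Indonesia. This study explores these transformer models and finds that ALBERT performs best, with 87.6% accuracy, 86.9% precision, an 86.9% F1-score, and a run time of 174.5 s/epoch. The source code is available on GitHub: https://github.com/Shafna81/fakenewsdetection.git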
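
The accuracy, precision, and F1 figures reported for this paper follow standard classification definitions. The sketch below shows how they are computed for a binary fake/real task. The labels are toy values invented for illustration, and the single-positive-class averaging used here is an assumption; the paper may use a different averaging scheme.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels (1 = fake, 0 = real); illustrative only, not data from the paper.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Run time per epoch, the fourth figure in the abstract, is a wall-clock measurement and is not covered by this sketch.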

Extrapolating Large Language Models to Non-English by Aligning Languages

paper_url: http://arxiv.org/abs/2308.04948
repo_url: None
paper_authors: Wenhao Zhu, Yunzhe Lv, Qingxiu Dong, Fei Yuan, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, Lei Li
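for: Improving the ability of large language models (LLMs) on non-English languages.
methods: Semantic alignment across languages and instruction-tuning are used to strengthen a pre-trained LLM on non-English languages.
results: On cross-lingual benchmarks, x-LLaMA outperforms the English instruction-tuned counterpart (Alpaca) by 42.50% on average across six non-English languages, and it improves on Chinese humanities tasks by 8.2%.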

Abstract
Due to the unbalanced training data distribution, the language ability of large language models (LLMs) is often biased towards English. In this paper, we propose to empower pre-trained LLMs on non-English languages by building semantic alignment across languages. We perform instruction-tuning on LLaMA with both translation task data and cross-lingual general task data to obtain cross-lingual models (x-LLaMA). Experimental results on the cross-lingual benchmarks XQUAD and MLQA show that x-LLaMA models outperform the English instruction-tuned counterpart (Alpaca) by 42.50% on average on six non-English languages. Further experiments on the Chinese benchmark C-Eval show that x-LLaMA achieves significant improvement on Chinese humanities tasks, outperforming Alpaca by 8.2%. We also discover that incorporating non-English text on the target side of translation data is particularly effective for boosting non-English ability. Besides, we find that semantic alignment within the LLM can be further strengthened as translation task data scales up, and we present the formulation of the underlying scaling law. Evaluation results on the translation dataset Flores-101 show that our method outperforms previous LLaMA-based models in all evaluated directions. Code and data will be available at: https://github.com/OwenNJU/x-LLM.

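Summary
Because training data are unevenly distributed, the language ability of large language models (LLMs) is often biased towards English. This paper proposes strengthening pre-trained LLMs on non-English languages through semantic alignment across languages. LLaMA is instruction-tuned on translation task data and cross-lingual general task data to obtain cross-lingual models (x-LLaMA). On the cross-lingual benchmarks XQUAD and MLQA, x-LLaMA outperforms the English instruction-tuned counterpart (Alpaca) by 42.50% on average across six non-English languages. Further experiments show a significant improvement on Chinese humanities tasks, where x-LLaMA outperforms Alpaca by 8.2%. The authors also find that placing non-English text on the target side of the translation data is particularly effective for boosting non-English ability, and that semantic alignment within the LLM strengthens as translation task data scales up; they formulate the underlying scaling law. Evaluation on the translation dataset Flores-101 shows that the method outperforms previous LLaMA-based models in all evaluated directions. Code and data will be made available at https://github.com/OwenNJU/x-LLM.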

Integrating large language models and active inference to understand eye movements in reading and dyslexia

paper_url: http://arxiv.org/abs/2308.04941
repo_url: https://github.com/donnarumma/hai_language
paper_authors: Francesco Donnarumma, Mirco Frosolone, Giovanni Pezzulo
for: simulating reading and eye movements using a computational model
methods: hierarchical active inference, combining strengths of large language models and active inference
results: proficiency in reading known and unknown words and sentences, exploration of maladaptive inference effects in dyslexia, potential implications for understanding and addressing dyslexia

Abstract
We present a novel computational model employing hierarchical active inference to simulate reading and eye movements. The model characterizes linguistic processing as inference over a hierarchical generative model, facilitating predictions and inferences at various levels of granularity, from syllables to sentences. Our approach combines the strengths of large language models for realistic textual predictions and active inference for guiding eye movements to informative textual information, enabling the testing of predictions. The model exhibits proficiency in reading both known and unknown words and sentences, adhering to the distinction between lexical and nonlexical routes in dual-route theories of reading. Notably, our model permits the exploration of maladaptive inference effects on eye movements during reading, such as in dyslexia. To simulate this condition, we attenuate the contribution of priors during the reading process, leading to incorrect inferences and a more fragmented reading style, characterized by a greater number of shorter saccades. This alignment with empirical findings regarding eye movements in dyslexic individuals highlights the model's potential to aid in understanding the cognitive processes underlying reading and eye movements, as well as how reading deficits associated with dyslexia may emerge from maladaptive predictive processing. In summary, our model represents a significant advancement in comprehending the intricate cognitive processes involved in reading and eye movements, with potential implications for understanding and addressing dyslexia through the simulation of maladaptive inference. It may offer valuable insights into this condition and contribute to the development of more effective interventions for treatment.

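Summary
We propose a new computational model that uses hierarchical active inference to simulate reading and eye movements. The model treats linguistic processing as inference over a hierarchical generative model, which supports predictions and inferences at different levels, from syllables to sentences. Our approach combines the realistic textual predictions of large language models with the guidance that active inference provides for eye movements, so that predictions can be tested. The model can read both known and unknown words and sentences, and it follows the distinction between lexical and nonlexical routes in dual-route theories. In particular, the model allows exploration of how maladaptive inference affects eye movements, as in dyslexia. To simulate this condition, we reduce the influence of priors during reading, which leads to incorrect inferences and a larger number of shorter saccades, in line with empirical observations of dyslexic readers. The model may therefore offer valuable insight into the cognitive processes behind reading and eye movements, and into how simulating maladaptive inference can help in understanding and treating dyslexia.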

Unsupervised Out-of-Distribution Dialect Detection with Mahalanobis Distance

paper_url: http://arxiv.org/abs/2308.04886
repo_url: None
paper_authors: Sourya Dipta Das, Yash Vadi, Abhishek Unnam, Kuldeep Yadav
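for: The study aims to improve the overall performance of dialect classification systems and to handle anomalous inputs that may occur in real-world scenarios.
methods: An unsupervised method based on Mahalanobis distance is proposed; it uses the embeddings from all intermediate layers of a wav2vec 2.0 transformer model for multi-task learning.
results: The method significantly outperforms other existing OOD detection methods.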

Abstract
Dialect classification is used in a variety of applications, such as machine translation and speech recognition, to improve the overall performance of the system. In a real-world scenario, a deployed dialect classification model can encounter anomalous inputs that differ from the training data distribution, also called out-of-distribution (OOD) samples. Those OOD samples can lead to unexpected outputs, as dialects of those samples are unseen during model training. Out-of-distribution detection is a new research area that has received little attention in the context of dialect classification. Towards this, we proposed a simple yet effective unsupervised Mahalanobis distance feature-based method to detect out-of-distribution samples. We utilize the latent embeddings from all intermediate layers of a wav2vec 2.0 transformer-based dialect classifier model for multi-task learning. Our proposed approach outperforms other state-of-the-art OOD detection methods significantly.

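Summary
Dialect classification is used in a variety of applications, such as machine translation and speech recognition, to improve overall system performance. In real-world scenarios, a deployed dialect classification model may encounter inputs that differ from the training data distribution, known as out-of-distribution (OOD) samples. These OOD samples can lead to unexpected outputs because their dialects were not seen during model training. OOD detection is a new research area that has received little attention in dialect classification. To address this, we propose a simple yet effective unsupervised method based on Mahalanobis distance features. We use the latent embeddings from all intermediate layers for multi-task learning. The proposed method performs well in comparison with other existing OOD detection methods.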
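
As a rough illustration of Mahalanobis-distance scoring for OOD detection, the sketch below fits a single Gaussian to in-distribution embeddings and scores new samples by their squared distance to it. The random vectors stand in for wav2vec 2.0 embeddings, and the single-Gaussian, single-layer setup is an assumption of this sketch; how the paper combines all intermediate layers is not shown here.

```python
import numpy as np


def fit_gaussian(train_embeddings):
    """Estimate the mean and inverse covariance of in-distribution embeddings."""
    mean = train_embeddings.mean(axis=0)
    cov = np.cov(train_embeddings, rowvar=False)
    # A small ridge term keeps the covariance invertible (sketch assumption).
    cov += 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def mahalanobis_score(x, mean, cov_inv):
    """Squared Mahalanobis distance; larger values suggest an OOD sample."""
    diff = x - mean
    return float(diff @ cov_inv @ diff)


rng = np.random.default_rng(0)
# Stand-in for embeddings of in-distribution training utterances.
train = rng.normal(loc=0.0, scale=1.0, size=(500, 8))
mean, cov_inv = fit_gaussian(train)

in_sample = rng.normal(loc=0.0, scale=1.0, size=8)   # resembles training data
ood_sample = np.full(8, 6.0)                         # far from training data

in_score = mahalanobis_score(in_sample, mean, cov_inv)
ood_score = mahalanobis_score(ood_sample, mean, cov_inv)
print(in_score, ood_score)
```

In practice a threshold on the score, chosen on held-out data, would separate in-distribution from OOD inputs.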

Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists

paper_url: http://arxiv.org/abs/2308.04885
repo_url: https://github.com/uds-lsv/vowel-harmony-from-word-lists
paper_authors: Julius Steuer, Badr Abdullah, Johann-Mattis List, Dietrich Klakow
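for: The study aims to quantify vowel harmony using data-driven computational modeling.
methods: The researchers use phoneme-level language models (PLMs) to define an information-theoretic metric of the degree of vowel harmony.
results: By training on word lists of forms with little or no inflection, the researchers cover more languages, and the models capture vowel harmony patterns in a set of languages.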

Abstract
We present a cross-linguistic study that aims to quantify vowel harmony using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have relied heavily on inflected word-forms in the analysis of vowel harmony. We instead train our models using cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consists of word lists with a maximum of 1000 entries per language. Despite the fact that the data we employ are substantially smaller than previously used corpora, our experiments demonstrate the neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research, and offers new possibilities for future studies on low-resource, under-studied languages.

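Summary
We present a cross-linguistic study that aims to quantify vowel harmony using data-driven computational modeling. Specifically, we define an information-theoretic measure based on the predictability of vowels in a natural language lexicon and estimate it with phoneme-level language models (PLMs). Earlier quantitative studies analyzed vowel harmony mainly through inflected word forms. We instead train our models on cross-linguistically comparable lemma forms, which lets us cover more under-studied languages. Our training data consist of word lists with at most 1000 entries per language. Although these data are smaller than previously used corpora, our experiments show that neural PLMs capture vowel harmony patterns in a set of languages. Our work also shows that word lists are a valuable resource for typological research and opens new possibilities for future studies of low-resource, under-studied languages.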

Emotion-Conditioned Text Generation through Automatic Prompt Optimization

paper_url: http://arxiv.org/abs/2308.04857
repo_url: None
paper_authors: Yarik Menchaca Resendiz, Roman Klinger
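for: The paper proposes a method for automatically generating emotion-conditioned text that reaches competitive results without expensive fine-tuning or training a large language model from scratch.
methods: The method uses an iterative optimization procedure that modifies a given prompt by adding, removing, or replacing tokens. To assess the quality of an optimized prompt, a text classifier determines whether the generated text fulfills the emotion condition.
results: On emotion-conditioned text generation, the optimized prompts reach a macro-average F1 of 0.75, whereas the manually designed seed prompts reach only 0.22.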

Abstract
Conditional natural language generation methods often require either expensive fine-tuning or training a large language model from scratch. Both are unlikely to lead to good results without a substantial amount of data and computational resources. Prompt learning without changing the parameters of a large language model presents a promising alternative. It is a cost-effective approach, while still achieving competitive results. While this procedure is now established for zero- and few-shot text classification and structured prediction, it has received limited attention in conditional text generation. We present the first automatic prompt optimization approach for emotion-conditioned text generation with instruction-fine-tuned models. Our method uses an iterative optimization procedure that changes the prompt by adding, removing, or replacing tokens. As objective function, we only require a text classifier that measures the realization of the conditional variable in the generated text. We evaluate the method on emotion-conditioned text generation with a focus on event reports and compare it to manually designed prompts that also act as the seed for the optimization procedure. The optimized prompts achieve 0.75 macro-average F1 to fulfill the emotion condition in contrast to manually designed seed prompts with only 0.22 macro-average F1.

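Summary
Conditional natural language generation methods often require either expensive fine-tuning or training a large language model from scratch. Neither is likely to give good results without large amounts of data and computational resources. Prompt learning, which leaves the parameters of the large language model unchanged, is a promising alternative. It is already established for zero- and few-shot text classification and structured prediction, but has received limited attention in conditional text generation. We propose the first automatic prompt optimization method for emotion-conditioned text generation; it uses an iterative optimization procedure that updates the prompt by adding, removing, or replacing tokens. As its objective function, the method needs only a text classifier that measures how well the conditional variable is realized. We evaluate emotion-conditioned text generation and compare against manually designed seed prompts. The optimized prompts reach 0.75 macro-average F1 in fulfilling the emotion condition, while the manually designed seed prompts reach only 0.22.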
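
The add/remove/replace loop described above can be sketched as a greedy hill climb over single-token edits. Everything named here is hypothetical: `toy_score` stands in for "generate text from the prompt, then score it with an emotion classifier", the vocabulary is invented, and greedy search is a simple substitute for whatever search strategy the paper actually uses.

```python
def candidate_edits(tokens, vocabulary):
    """Yield every prompt reachable by one add, remove, or replace operation."""
    for i in range(len(tokens) + 1):
        for word in vocabulary:                        # add a token at position i
            yield tokens[:i] + [word] + tokens[i:]
    for i in range(len(tokens)):
        if len(tokens) > 1:                            # remove token i
            yield tokens[:i] + tokens[i + 1:]
        for word in vocabulary:                        # replace token i
            if word != tokens[i]:
                yield tokens[:i] + [word] + tokens[i + 1:]


def optimize_prompt(seed, vocabulary, score_fn, iterations=5):
    """Greedy hill climbing: keep the best single-token edit in each round."""
    best, best_score = list(seed), score_fn(seed)
    for _ in range(iterations):
        improved = False
        for cand in candidate_edits(best, vocabulary):
            score = score_fn(cand)
            if score > best_score:
                best, best_score, improved = cand, score, True
        if not improved:
            break
    return best, best_score


def toy_score(tokens):
    """Hypothetical objective: reward distinct emotion cue words, penalize length.

    A real objective would generate text with a language model and measure,
    with a classifier, how well the target emotion is realized.
    """
    cues = {"joy", "happy"}
    return len(cues & set(tokens)) - 0.1 * len(tokens)


seed_prompt = ["write", "a", "text"]
vocab = ["joy", "happy", "about"]
best_prompt, best_value = optimize_prompt(seed_prompt, vocab, toy_score)
print(best_prompt, best_value)
```

Because the classifier is the only objective, the same loop can be reused for any conditional variable for which a classifier exists.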

TSSR: A Truncated and Signed Square Root Activation Function for Neural Networks

paper_url: http://arxiv.org/abs/2308.04832
repo_url: None
paper_authors: Yuanhao Gong
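for: The paper proposes a new activation function called the Truncated and Signed Square Root (TSSR) function.
methods: The TSSR function is odd, nonlinear, monotone, and differentiable, and its gradient is continuous and always positive.
results: Experiments indicate that the proposed TSSR function performs better than other activation functions, with applications in fields such as computer vision, natural language processing, and speech recognition.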

Abstract
Activation functions are essential components of neural networks. In this paper, we introduce a new activation function called the Truncated and Signed Square Root (TSSR) function. This function is distinctive because it is odd, nonlinear, monotone and differentiable. Its gradient is continuous and always positive. Thanks to these properties, it has the potential to improve the numerical stability of neural networks. Several experiments confirm that the proposed TSSR has better performance than other state-of-the-art activation functions. The proposed function has significant implications for the development of neural network models and can be applied to a wide range of applications in fields such as computer vision, natural language processing, and speech recognition.
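
The abstract lists the function's properties but not its formula. The sketch below uses one piecewise form that satisfies all of them: identity for |x| < alpha, and sign(x) * sqrt(alpha * (2|x| - alpha)) beyond that. This form and the threshold name `alpha` are assumptions made for illustration and should be checked against the paper before use.

```python
import numpy as np


def tssr(x, alpha=1.0):
    """Identity near zero, signed square root in the tails (assumed form)."""
    x = np.asarray(x, dtype=float)
    a = np.abs(x)
    # np.maximum guards the sqrt argument; the branch is only used for |x| >= alpha.
    outer = np.sign(x) * np.sqrt(alpha * np.maximum(2.0 * a - alpha, 0.0))
    return np.where(a < alpha, x, outer)


def tssr_grad(x, alpha=1.0):
    """Derivative of the assumed form: 1 near zero, alpha / sqrt(...) in the tails."""
    x = np.asarray(x, dtype=float)
    a = np.abs(x)
    # For |x| >= alpha, 2|x| - alpha >= alpha, so the guard never changes the value.
    outer = alpha / np.sqrt(alpha * np.maximum(2.0 * a - alpha, alpha))
    return np.where(a < alpha, 1.0, outer)


xs = np.linspace(-5.0, 5.0, 101)
ys = tssr(xs)
grads = tssr_grad(xs)
print(ys[:3], grads[:3])
```

Both branches meet at |x| = alpha with value alpha and slope 1, which is what makes the gradient continuous; the slope then decays but never reaches zero.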

Evaluating the Generation Capabilities of Large Chinese Language Models

paper_url: http://arxiv.org/abs/2308.04823
repo_url: https://github.com/Felixgithub2017/CG-Eval
paper_authors: Hui Zeng, Jingyuan Xue, Meng Hao, Chen Sun, Bin Ning, Na Zhang
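for: The paper evaluates the generation capabilities of large Chinese language models across different academic disciplines.
methods: The paper uses multiple metrics, such as accuracy and relevance, to assess the quality of model generations.
results: The paper finds that the generation ability of large Chinese language models varies across the six disciplines, with the best performance in Science and Engineering and the weakest in the Judicial Examination. It also proposes Gscore, a reproducible index for measuring generation quality.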

Abstract
This paper presents CG-Eval, the first comprehensive evaluation of the generation capabilities of large Chinese language models across a wide range of academic disciplines. The models' performance was assessed based on their ability to generate accurate and relevant responses to different types of questions in six disciplines, namely, Science and Engineering, Humanities and Social Sciences, Mathematical Calculations, Medical Practitioner Qualification Examination, Judicial Examination, and Certified Public Accountant Examination. This paper also presents Gscore, a composite index derived from the weighted sum of multiple metrics to measure the quality of model's generation against a reference. The test data and test results can be found at http://cgeval.besteasy.com/.
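
A composite index built as a weighted sum of metrics can be sketched in a few lines. The metric names and weights below are placeholders, not the ones used for Gscore, and normalizing the weights to sum to 1 is an assumption of this sketch.

```python
def composite_score(metric_values, weights):
    """Weighted sum of per-metric scores; weights are normalized to sum to 1."""
    if set(metric_values) != set(weights):
        raise ValueError("metrics and weights must have the same keys")
    total = sum(weights.values())
    return sum(metric_values[name] * weights[name] / total for name in weights)


# Hypothetical metric names and weights, chosen only for illustration.
scores = {"metric_a": 0.80, "metric_b": 0.60, "metric_c": 0.50}
weights = {"metric_a": 0.5, "metric_b": 0.3, "metric_c": 0.2}
gscore_like = composite_score(scores, weights)
print(gscore_like)
```

Each metric would be computed between a model's generation and the reference answer before being combined.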

CLEVA: Chinese Language Models EVAluation Platform

paper_url: http://arxiv.org/abs/2308.04813
repo_url: None
paper_authors: Yanyang Li, Jianqiao Zhao, Duo Zheng, Zi-Yuan Hu, Zhi Chen, Xiaohui Su, Yongfeng Huang, Shijia Huang, Dahua Lin, Michael R. Lyu, Liwei Wang
for: Evaluating the capabilities of Chinese large language models (LLMs) has become an increasingly significant issue, and the paper addresses it by presenting a comprehensive platform for evaluating Chinese LLMs.
methods: The platform, called CLEVA, employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard. It also curates a significant proportion of new data and develops a sampling strategy to alleviate contamination.
results: Large-scale experiments featuring 23 influential Chinese LLMs have validated CLEVA's efficacy.

Abstract
With the continuous emergence of Chinese Large Language Models (LLMs), how to evaluate a model's capabilities has become an increasingly significant issue. The absence of a comprehensive Chinese benchmark that thoroughly assesses a model's performance, the unstandardized and incomparable prompting procedure, and the prevalent risk of contamination pose major challenges in the current evaluation of Chinese LLMs. We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs. Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard. To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round. Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding. Large-scale experiments featuring 23 influential Chinese LLMs have validated CLEVA's efficacy.

Summary
With the emergence of Chinese large language models (LLMs), how to evaluate a model's capabilities has become increasingly important. Current evaluation of Chinese LLMs faces several challenges: the lack of a comprehensive Chinese benchmark, prompting procedures that are unstandardized and not comparable, and the risk of contamination. To address these challenges, we present CLEVA, a user-friendly platform that holistically evaluates Chinese LLMs. The platform employs a standardized workflow to assess model performance across various dimensions and regularly updates a competitive leaderboard. To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round. Using CLEVA requires only a few mouse clicks and a model API, so users can conduct a thorough evaluation with minimal coding. Large-scale experiments featuring 23 influential Chinese LLMs have validated CLEVA's efficacy.

A Bipartite Graph is All We Need for Enhancing Emotional Reasoning with Commonsense Knowledge

paper_url: http://arxiv.org/abs/2308.04811
repo_url: https://github.com/stevekgyang/bhg
paper_authors: Kailai Yang, Tianlin Zhang, Shaoxiong Ji, Sophia Ananiadou
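for: The study aims to improve the emotional reasoning ability of AI systems, particularly for opinion mining from social media and empathetic dialogue systems.
methods: The approach uses a Bipartite Heterogeneous Graph (BHG) that models context-aware utterance representations and knowledge representations as different node types, and it proposes two new knowledge aggregation node types to filter and interact with knowledge automatically.
results: The approach significantly outperforms existing knowledge infusion methods and can be directly generalized to knowledge sources of different types and granularities.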

Abstract
The context-aware emotional reasoning ability of AI systems, especially in conversations, is of vital importance in applications such as online opinion mining from social media and empathetic dialogue systems. Due to the implicit nature of conveying emotions in many scenarios, commonsense knowledge is widely utilized to enrich utterance semantics and enhance conversation modeling. However, most previous knowledge infusion methods perform empirical knowledge filtering and design highly customized architectures for knowledge interaction with the utterances, which can discard useful knowledge aspects and limit their generalizability to different knowledge sources. Based on these observations, we propose a Bipartite Heterogeneous Graph (BHG) method for enhancing emotional reasoning with commonsense knowledge. In BHG, the extracted context-aware utterance representations and knowledge representations are modeled as heterogeneous nodes. Two more knowledge aggregation node types are proposed to perform automatic knowledge filtering and interaction. BHG-based knowledge infusion can be directly generalized to multi-type and multi-grained knowledge sources. In addition, we propose a Multi-dimensional Heterogeneous Graph Transformer (MHGT) to perform graph reasoning, which can retain unchanged feature spaces and unequal dimensions for heterogeneous node types during inference to prevent unnecessary loss of information. Experiments show that BHG-based methods significantly outperform state-of-the-art knowledge infusion methods and show generalized knowledge infusion ability with higher efficiency. Further analysis proves that previous empirical knowledge filtering methods do not guarantee to provide the most useful knowledge information. Our code is available at: https://github.com/SteveKGYang/BHG.

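Summary
Context-aware emotional reasoning is an important capability of AI systems in conversation, especially for sentiment analysis on social media and empathetic dialogue systems. Because emotions are often conveyed implicitly, commonsense knowledge is commonly used to enrich utterance semantics and conversation modeling. However, most previous knowledge infusion methods rely on empirical knowledge filtering and customized architectures for the interaction between knowledge and utterances, which can discard useful aspects of the knowledge and limit generalization to different knowledge sources. Based on these observations, we propose a Bipartite Heterogeneous Graph (BHG) method to enhance emotional reasoning. In BHG, the extracted context-aware utterance representations and knowledge representations are modeled as heterogeneous nodes of different types. We also propose two new knowledge aggregation node types that perform knowledge filtering and interaction automatically. BHG-based knowledge infusion can be applied directly to multi-type and multi-grained knowledge sources. In addition, we propose a Multi-dimensional Heterogeneous Graph Transformer (MHGT) for graph reasoning, which keeps feature spaces unchanged and dimensions unequal across heterogeneous node types during inference to avoid unnecessary loss of information. Experiments show that BHG-based methods perform strongly on emotional reasoning, with better generalization and efficiency. Further analysis shows that previous empirical knowledge filtering methods do not guarantee the most useful knowledge. Our code is available on GitHub: https://github.com/SteveKGYang/BHG.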

ADMUS: A Progressive Question Answering Framework Adaptable to Multiple Knowledge Sources

paper_url: http://arxiv.org/abs/2308.04800
repo_url: None
paper_authors: Yirui Zhan, Yanzeng Li, Minhao Zhang, Lei Zou
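for: Making KBQA systems more applicable in real-world scenarios by letting them adapt easily to different datasets.
methods: The paper proposes a dataset-independent KBQA system built on deep learning models. Decoupling the architecture of the KBQA system lets it adapt easily to different datasets and support multiple languages and multiple backbone knowledge bases.
results: Experiments on a variety of datasets demonstrate the efficiency and flexibility of ADMUS. An online demonstration is available at https://answer.gstore.cn/pc/index.html.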

Abstract
With the introduction of deep learning models, semantic parsing-based knowledge base question answering (KBQA) systems have achieved high performance in handling complex questions. However, most existing approaches primarily focus on enhancing the model's effectiveness on individual benchmark datasets, disregarding the high costs of adapting the system to disparate datasets in real-world scenarios (e.g., a multi-tenant platform). Therefore, we present ADMUS, a progressive knowledge base question answering framework designed to accommodate a wide variety of datasets, including multiple languages, diverse backbone knowledge bases, and disparate question answering datasets. To accomplish this, we decouple the architecture of conventional KBQA systems and propose this dataset-independent framework. Our framework supports the seamless integration of new datasets with minimal effort, only requiring the creation of a dataset-related micro-service at a negligible cost. To enhance the usability of ADMUS, we design a progressive framework consisting of three stages, ranging from executing exact queries, through generating approximate queries, to retrieving open-domain knowledge from large language models. An online demonstration of ADMUS is available at: https://answer.gstore.cn/pc/index.html

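Summary
With the introduction of deep learning models, knowledge base question answering (KBQA) systems based on semantic parsing have become markedly better at handling complex questions. However, most existing methods focus on improving a model's effectiveness on particular benchmark datasets and overlook the high cost of adapting to different datasets in real-world scenarios (for example, multi-tenant platforms). We therefore present ADMUS, a progressive KBQA framework that adapts to a wide variety of datasets. To this end, we decouple the architecture of conventional KBQA systems into independent components and propose a dataset-independent framework. New datasets can be integrated easily, requiring only a dataset-related micro-service at very low cost. To improve the usability of ADMUS, we design a progressive framework with three stages: executing exact queries, generating approximate queries, and retrieving open-domain knowledge from large language models. An online demonstration is available at: https://answer.gstore.cn/pc/index.html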

Automatically measuring speech fluency in people with aphasia: first achievements using read-speech data

paper_url: http://arxiv.org/abs/2308.04763
repo_url: None
paper_authors: Lionel Fontan, Typhanie Prince, Aleksandra Nowakowska, Halima Sahraoui, Silvia Martinez-Ferreiro
for: This study aims to assess the relevance of a signal processing algorithm for the automatic measurement of speech fluency in people with aphasia (PWA).
methods: The study uses a forward-backward divergence segmentation and a clustering algorithm to compute four automatic predictors of speech fluency, and combines these predictors into multivariate regression models to predict the average SLP ratings of speech fluency.
results: The study finds that the algorithms used can constitute a cost-effective and reliable tool for the assessment of the speech fluency of patients with aphasia in read-aloud tasks, with accurate predictions and high correlation coefficients between the automatic predictions and SLP ratings.

Abstract
Background: Speech and language pathologists (SLPs) often rely on judgements of speech fluency for diagnosing or monitoring patients with aphasia. However, such subjective methods have been criticised for their lack of reliability and their clinical cost in terms of time.
Aims: This study aims at assessing the relevance of a signal processing algorithm, initially developed in the field of language acquisition, for the automatic measurement of speech fluency in people with aphasia (PWA).
Methods & Procedures: Twenty-nine PWA and five control participants were recruited via non-profit organizations and SLP networks. All participants were recorded while reading out loud a set of sentences taken from the French version of the Boston Diagnostic Aphasia Examination. Three trained SLPs assessed the fluency of each sentence on a five-point qualitative scale. A forward-backward divergence segmentation and a clustering algorithm were used to compute, for each sentence, four automatic predictors of speech fluency: pseudo-syllable rate, speech ratio, rate of silent breaks, and standard deviation of pseudo-syllable length. The four predictors were finally combined into multivariate regression models (a multiple linear regression, MLR, and two non-linear models) to predict the average SLP ratings of speech fluency, using a leave-one-speaker-out validation scheme.
Outcomes & Results: All models achieved accurate predictions of speech fluency ratings, with average root-mean-square errors as low as 0.5. The MLR yielded a correlation coefficient of 0.87 with reference ratings at the sentence level, and of 0.93 when aggregating the data for each participant. The inclusion of an additional predictor sensitive to repetitions further improved the predictions, with a correlation coefficient of 0.91 at the sentence level and of 0.96 at the participant level.
Conclusions: The algorithms used in this study can constitute a cost-effective and reliable tool for the assessment of the speech fluency of patients with aphasia in read-aloud tasks. Perspectives for the assessment of spontaneous speech are discussed.

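Summary
Background: Speech and language pathologists (SLPs) often rely on judgements of speech fluency to diagnose or monitor patients with aphasia. These subjective methods have been criticised for their lack of reliability and their clinical cost. Aims: This study assesses the relevance of a signal processing algorithm for measuring speech fluency in people with aphasia (PWA). Methods and procedures: Twenty-nine PWA and five control participants were recruited via non-profit organizations and SLP networks. All participants were recorded while reading aloud sentences from the French version of the Boston Diagnostic Aphasia Examination. Three SLPs rated the fluency of each sentence on a five-point qualitative scale. A forward-backward divergence segmentation and a clustering algorithm were used to compute four automatic predictors of fluency for each sentence: pseudo-syllable rate, speech ratio, rate of silent breaks, and standard deviation of pseudo-syllable length. The four predictors were combined into multivariate regression models (a multiple linear regression and two non-linear models) to predict the average SLP fluency ratings, using a leave-one-speaker-out validation scheme. Outcomes and results: All models predicted fluency ratings accurately, with average root-mean-square errors as low as 0.5. The multiple linear regression reached a correlation coefficient of 0.87 at the sentence level and 0.93 when the data were aggregated per participant. Adding a further predictor improved the predictions, raising the correlation to 0.91 at the sentence level and 0.96 at the participant level. Conclusions: These algorithms can serve as a reliable and cost-effective tool for assessing the speech fluency of patients with aphasia. Perspectives for assessing spontaneous speech are discussed.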
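
The evaluation protocol described above, a multiple linear regression validated with a leave-one-speaker-out scheme and summarized by RMSE and a correlation coefficient, can be sketched with NumPy. The data are synthetic: the four columns only stand in for the four fluency predictors, and the coefficients, noise level, and speaker counts are arbitrary choices for this example.

```python
import numpy as np


def leave_one_speaker_out(features, ratings, speakers):
    """Predict each speaker's sentences with a linear model fit on all other speakers."""
    design = np.column_stack([np.ones(len(features)), features])  # intercept column
    preds = np.empty(len(ratings), dtype=float)
    for spk in np.unique(speakers):
        held_out = speakers == spk
        coef, *_ = np.linalg.lstsq(design[~held_out], ratings[~held_out], rcond=None)
        preds[held_out] = design[held_out] @ coef
    return preds


rng = np.random.default_rng(0)
n_speakers, n_sentences = 10, 6
speakers = np.repeat(np.arange(n_speakers), n_sentences)

# Four synthetic predictors per sentence (placeholders for the real features).
features = rng.normal(size=(n_speakers * n_sentences, 4))
true_coef = np.array([0.8, 0.5, -0.6, -0.3])   # arbitrary, synthetic data only
ratings = 3.0 + features @ true_coef + rng.normal(scale=0.3, size=len(features))

preds = leave_one_speaker_out(features, ratings, speakers)
rmse = float(np.sqrt(np.mean((preds - ratings) ** 2)))
corr = float(np.corrcoef(preds, ratings)[0, 1])
print(rmse, corr)
```

Holding out all sentences of one speaker at a time keeps the model from being tested on a voice it has already seen, which is the point of the scheme.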

Building Interpretable and Reliable Open Information Retriever for New Domains Overnight

paper_url: http://arxiv.org/abs/2308.04756
repo_url: None
paper_authors: Xiaodong Yu, Ben Zhou, Dan Roth
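for: The paper aims to improve the performance of information retrieval (IR) and knowledge retrieval.
methods: The paper builds on dense retrieval models, which represent queries and knowledge passages as dense vectors and learn lexical and semantic similarity.
results: The paper proposes an information retrieval pipeline that uses an entity/event linking model and a query decomposition model to focus more accurately on the different information units in a query. Compared with single dense vectors and end-to-end supervision, the pipeline improves passage coverage and denotation accuracy while being more interpretable and reliable.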

Abstract
Information retrieval (IR), or knowledge retrieval, is a critical component for many downstream tasks such as open-domain question answering (QA). It is also very challenging, as it requires succinctness, completeness, and correctness. In recent works, dense retrieval models have achieved state-of-the-art (SOTA) performance on in-domain IR and QA benchmarks by representing queries and knowledge passages with dense vectors and learning the lexical and semantic similarity. However, using single dense vectors and end-to-end supervision is not always optimal because queries may require attention to multiple aspects and even implicit knowledge. In this work, we propose an information retrieval pipeline that uses an entity/event linking model and a query decomposition model to focus more accurately on different information units of the query. We show that, while being more interpretable and reliable, our proposed pipeline significantly improves passage coverages and denotation accuracies across five IR and QA benchmarks. It will be the go-to system to use for applications that need to perform IR on a new domain without much dedicated effort, because of its superior interpretability and cross-domain performance.

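Summary
Information retrieval (IR), or knowledge retrieval, is a key component of many downstream tasks such as open-domain question answering (QA). It demands succinctness, completeness, and correctness. In recent work, dense retrieval models have reached state-of-the-art (SOTA) performance on in-domain IR and QA benchmarks by representing queries and knowledge passages as dense vectors and learning lexical and semantic similarity. However, single dense vectors and end-to-end supervision are not always optimal, because a query may require attention to multiple aspects and to implicit knowledge. We therefore propose an information retrieval pipeline that uses an entity/event linking model and a query decomposition model to focus more accurately on the different information units of a query. We show that the proposed pipeline improves passage coverage and denotation accuracy across five IR and QA benchmarks. It is intended as the standard system for IR applications in new domains, owing to its superior interpretability and cross-domain performance.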
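
For context, the single-vector dense retrieval baseline that the abstract contrasts with its pipeline ranks passages by vector similarity to the query. The sketch below uses hand-written 3-dimensional vectors and cosine similarity; real systems obtain the vectors from a trained encoder, and the choice of cosine over raw dot product is an assumption of this example. It does not implement the proposed entity/event linking or query decomposition steps.

```python
import numpy as np


def top_k(query_vec, passage_vecs, k=2):
    """Rank passages by cosine similarity to the query; return top-k indices and scores."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    sims = p @ q
    order = np.argsort(-sims)[:k]
    return order.tolist(), sims[order].tolist()


# Toy "embeddings"; a real system would compute these with a trained encoder.
passages = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.2, 0.0])

indices, scores = top_k(query, passages, k=2)
print(indices, scores)
```

A query with several distinct aspects is squeezed into one vector here, which is the limitation the abstract points to when it motivates decomposing the query.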

Slot Induction via Pre-trained Language Model Probing and Multi-level Contrastive Learning

paper_url: http://arxiv.org/abs/2308.04712
repo_url: None
paper_authors: Hoang H. Nguyen, Chenwei Zhang, Ye Liu, Philip S. Yu
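for: The study aims to improve natural language understanding in dialogue systems, in particular intent detection and slot filling in task-oriented dialogue (TOD) systems.
methods: The study uses unsupervised pre-trained language model (PLM) probing and a contrastive learning mechanism, exploiting unsupervised semantic knowledge and sentence-level intent label signals for the slot induction task.
results: The results show that PLM probing and contrastive learning are effective for slot induction and can narrow the gap with token-level supervised models. When generalized to emerging intents, the SI objectives also improve performance on slot filling.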

Abstract
Recent advanced methods in Natural Language Understanding for Task-oriented Dialogue (TOD) Systems (e.g., intent detection and slot filling) require a large amount of annotated data to achieve competitive performance. In reality, token-level annotations (slot labels) are time-consuming and difficult to acquire. In this work, we study the Slot Induction (SI) task whose objective is to induce slot boundaries without explicit knowledge of token-level slot annotations. We propose leveraging Unsupervised Pre-trained Language Model (PLM) Probing and Contrastive Learning mechanism to exploit (1) unsupervised semantic knowledge extracted from PLM, and (2) additional sentence-level intent label signals available from TOD. Our approach is shown to be effective in SI task and capable of bridging the gaps with token-level supervised models on two NLU benchmark datasets. When generalized to emerging intents, our SI objectives also provide enhanced slot label representations, leading to improved performance on the Slot Filling tasks.

Summary
To address this challenge, we propose leveraging Unsupervised Pre-trained Language Model (PLM) Probing and Contrastive Learning mechanisms to extract unsupervised semantic knowledge from the PLM and to use the additional sentence-level intent label signals available from TOD. Our approach is effective on the SI task and can bridge the gap with token-level supervised models on two NLU benchmark datasets. Moreover, our SI objectives provide enhanced slot label representations, leading to improved performance on Slot Filling tasks. This is particularly useful for emerging intents, where traditional slot label representations may not be effective. Our approach offers a promising way to improve the efficiency and accuracy of NLU systems in TOD applications.

Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval

paper_url: http://arxiv.org/abs/2308.04711
repo_url: None
paper_authors: Tim Hartill, Diana Benavides-Prado, Michael Witbrock, Patricia J. Riddle
for: The paper aims to improve the performance of smaller language models on challenging short-answer question-answering tasks by combining rationales generated by a larger language model with longer contexts created from a multi-hop dense retrieval system.
methods: The paper proposes two methods for combining rationales and contexts: Rationale Ranking (RR) and Reasoning with Retrieval-Augmented Training Data (RATD). RR involves training a model to score both generated rationales and retrieved contexts with respect to relevance and truthfulness, and then combining the scores to derive combined contexts. RATD involves training a smaller reasoning model using retrieval-augmented training datasets to utilize relevant information from longer text sequences.
results: The paper finds that both methods are effective, but the RATD method is more straightforward to apply and produces the strongest results in unseen settings. The proposed models also generally outperform direct prompts against much larger models in both few-shot chain-of-thought and few-shot answer-only settings.

Abstract
When provided with sufficient explanatory context, smaller Language Models have been shown to exhibit strong reasoning ability on challenging short-answer question-answering tasks where the questions are unseen in training. We evaluate two methods for further improvement in this setting. Both methods focus on combining rationales generated by a larger Language Model with longer contexts created from a multi-hop dense retrieval system. The first method ($\textit{RR}$) involves training a Rationale Ranking model to score both generated rationales and retrieved contexts with respect to relevance and truthfulness. We then use the scores to derive combined contexts from both knowledge sources using a number of combinatory strategies. For the second method ($\textit{RATD}$) we train a smaller Reasoning model using retrieval-augmented training datasets such that it becomes proficient at utilising relevant information from longer text sequences that may be only partially evidential and frequently contain many irrelevant sentences. Generally we find that both methods are effective but that the $\textit{RATD}$ method is more straightforward to apply and produces the strongest results in the unseen setting on which we focus. Our single best Reasoning model using only 440 million parameters materially improves upon strong comparable prior baselines for unseen evaluation datasets (StrategyQA 58.9 $\rightarrow$ 61.7 acc., CommonsenseQA 63.6 $\rightarrow$ 72.7 acc., ARC-DA 31.6 $\rightarrow$ 52.1 F1, IIRC 25.5 $\rightarrow$ 27.3 F1) and a version utilising our prior knowledge of each type of question in selecting a context combination strategy does even better. Our proposed models also generally outperform direct prompts against much larger models (BLOOM 175B and StableVicuna 13B) in both few-shot chain-of-thought and few-shot answer-only settings.

Summary
The first method, called $\textit{RR}$, trains a Rationale Ranking model to score both generated rationales and retrieved contexts based on relevance and truthfulness. We then use the scores to combine the contexts from both knowledge sources using various strategies. The second method, called $\textit{RATD}$, trains a smaller reasoning model using retrieval-augmented training datasets. This allows the model to learn how to utilize relevant information from longer text sequences, even if they contain many irrelevant sentences. We find that both methods are effective, but the $\textit{RATD}$ method is easier to apply and produces the strongest results in unseen settings. Our best reasoning model, using only 440 million parameters, materially improves upon strong prior baselines (StrategyQA 58.9 $\rightarrow$ 61.7 acc., CommonsenseQA 63.6 $\rightarrow$ 72.7 acc., ARC-DA 31.6 $\rightarrow$ 52.1 F1, IIRC 25.5 $\rightarrow$ 27.3 F1). Our proposed models also generally outperform direct prompts against much larger models (BLOOM 175B and StableVicuna 13B) in both few-shot chain-of-thought and few-shot answer-only settings.

A Comparative Study of Open-Source Large Language Models, GPT-4 and Claude 2: Multiple-Choice Test Taking in Nephrology

paper_url: http://arxiv.org/abs/2308.04709
repo_url: None
paper_authors: Sean Wu, Michael Koo, Lesley Blum, Andy Black, Liyo Kao, Fabien Scalzo, Ira Kurtz
for: This study investigated the medical knowledge capability of large language models (LLMs) in the context of internal medicine subspecialty multiple-choice test-taking ability.
methods: The study compared the performance of several open-source LLMs (Koala 7B, Falcon 7B, Stable-Vicuna 13B, and Orca Mini 13B) to GPT-4 and Claude 2 on multiple-choice questions in the field of Nephrology.
results: The study found that current widely used open-sourced LLMs have poor zero-shot reasoning ability compared to GPT-4 and Claude 2, with an overall success rate of 17.1% - 25.5% in answering nephSAP multiple-choice questions correctly.

Abstract
In recent years, there have been significant breakthroughs in the field of natural language processing, particularly with the development of large language models (LLMs). These LLMs have showcased remarkable capabilities on various benchmarks. In the healthcare field, the exact role LLMs and other future AI models will play remains unclear. There is a potential for these models in the future to be used as part of adaptive physician training, medical co-pilot applications, and digital patient interaction scenarios. The ability of AI models to participate in medical training and patient care will depend in part on their mastery of the knowledge content of specific medical fields. This study investigated the medical knowledge capability of LLMs, specifically in the context of internal medicine subspecialty multiple-choice test-taking ability. We compared the performance of several open-source LLMs (Koala 7B, Falcon 7B, Stable-Vicuna 13B, and Orca Mini 13B), to GPT-4 and Claude 2 on multiple-choice questions in the field of Nephrology. Nephrology was chosen as an example of a particularly conceptually complex subspecialty field within internal medicine. The study was conducted to evaluate the ability of LLM models to provide correct answers to nephSAP (Nephrology Self-Assessment Program) multiple-choice questions. The overall success of open-sourced LLMs in answering the 858 nephSAP multiple-choice questions correctly was 17.1% - 25.5%. In contrast, Claude 2 answered 54.4% of the questions correctly, whereas GPT-4 achieved a score of 73.3%. We show that current widely used open-sourced LLMs do poorly in their ability for zero-shot reasoning when compared to GPT-4 and Claude 2. The findings of this study potentially have significant implications for the future of subspecialty medical training and patient care.

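Summary
Recent years have seen significant breakthroughs in natural language processing, particularly with the development of large language models (LLMs), which have shown remarkable capabilities on various benchmarks. In healthcare, the exact role that LLMs and other future AI models will play remains unclear. Such models may eventually be used in adaptive physician training, medical co-pilot applications, and digital patient interaction scenarios. Their ability to take part in medical training and patient care depends in part on their mastery of the knowledge content of specific medical fields. This study investigated the medical knowledge of LLMs in the context of internal medicine subspecialty multiple-choice test taking. We compared several open-source LLMs (Koala 7B, Falcon 7B, Stable-Vicuna 13B, and Orca Mini 13B) with GPT-4 and Claude 2 on multiple-choice questions in nephrology, a particularly conceptually complex subspecialty of internal medicine. The goal was to evaluate how well LLMs answer nephSAP multiple-choice questions. The open-source LLMs answered 17.1% to 25.5% of the 858 nephSAP questions correctly, whereas Claude 2 answered 54.4% correctly and GPT-4 reached 73.3%. Current widely used open-source LLMs therefore perform far worse at zero-shot reasoning than GPT-4 and Claude 2. These findings may have significant implications for the future of subspecialty medical training and patient care.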

Generating News-Centric Crossword Puzzles As A Constraint Satisfaction and Optimization Problem

paper_url: http://arxiv.org/abs/2308.04688
repo_url: None
paper_authors: Kaito Majima, Shotaro Ishihara
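for: This study aims to automatically generate news-centric crossword puzzles in order to strengthen the educational use of crosswords for engaging with news.
methods: The study formulates puzzle generation as a constraint satisfaction and optimization problem that places as many news-derived words as possible in the crossword.
results: News-centric crossword puzzles can be generated even with few news-derived words; the experiments report generation probabilities and the time required under several conditions.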

Abstract
Crossword puzzles have traditionally served not only as entertainment but also as an educational tool that can be used to acquire vocabulary and language proficiency. One strategy to enhance the educational purpose is personalization, such as including more words on a particular topic. This paper focuses on the case of encouraging people's interest in news and proposes a framework for automatically generating news-centric crossword puzzles. We designed possible scenarios and built a prototype as a constraint satisfaction and optimization problem, that is, containing as many news-derived words as possible. Our experiments reported the generation probabilities and time required under several conditions. The results showed that news-centric crossword puzzles can be generated even with few news-derived words. We summarize the current issues and future research directions through a qualitative evaluation of the prototype. This is the first proposal that a formulation of a constraint satisfaction and optimization problem can be beneficial as an educational application.
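
The formulation in the abstract (satisfy the letter constraints while containing as many news-derived words as possible) can be illustrated with a small backtracking search. This is only a sketch under stated assumptions, not the authors' prototype: the grid size, the function names, and the rule that each later word must share a cell with an earlier one are all hypothetical, and real crossword rules such as adjacency checks are not enforced.

```python
from itertools import product

SIZE = 5  # hypothetical grid size, chosen only for illustration

def cells(word, row, col, horizontal):
    """Return [((row, col), letter), ...] for a placement, or None if off-grid."""
    out = []
    for i, ch in enumerate(word):
        r, c = (row, col + i) if horizontal else (row + i, col)
        if r >= SIZE or c >= SIZE:
            return None
        out.append(((r, c), ch))
    return out

def fits(grid, placement):
    """Constraint: every letter must agree with letters already on the grid."""
    return all(grid.get(pos, ch) == ch for pos, ch in placement)

def best_layout(words, news_words, grid=None, placed=()):
    """Backtracking search; returns (number of news-derived words placed, placed words)."""
    grid = grid or {}
    best = (sum(w in news_words for w in placed), placed)
    for w in words:
        if w in placed:
            continue
        for row, col, horizontal in product(range(SIZE), range(SIZE), (True, False)):
            placement = cells(w, row, col, horizontal)
            if placement is None or not fits(grid, placement):
                continue
            # Simplifying assumption: words after the first must share a cell
            # with a word that is already on the grid.
            if placed and not any(pos in grid for pos, _ in placement):
                continue
            new_grid = {**grid, **dict(placement)}
            cand = best_layout(words, news_words, new_grid, placed + (w,))
            best = max(best, cand, key=lambda t: t[0])
    return best

score, placed = best_layout(["news", "west", "zzz"], {"news", "west"})
print(score, placed)
```

Exhaustive search like this only scales to toy inputs; a practical generator would presumably rely on a dedicated solver or stronger pruning.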

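Summary
Crossword puzzles have traditionally served not only as entertainment but also as an educational tool for building vocabulary and language proficiency. One strategy for strengthening their educational value is personalization, for example including more words on a particular topic. This paper focuses on encouraging people's interest in news and proposes a framework for automatically generating news-centric crossword puzzles. We designed possible scenarios and built a prototype that treats generation as a constraint satisfaction and optimization problem, namely containing as many news-derived words as possible. Our experiments report generation probabilities and the time required under several conditions. The results show that news-centric crossword puzzles can be generated even with few news-derived words. Through a qualitative evaluation of the prototype, we summarize current issues and future research directions. This is the first proposal showing that formulating a constraint satisfaction and optimization problem can be beneficial as an educational application.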

TBIN: Modeling Long Textual Behavior Data for CTR Prediction

paper_url: http://arxiv.org/abs/2308.08483
repo_url: None
paper_authors: Shuwei Chen, Xiang Li, Jian Dong, Jin Zhang, Yongkang Wang, Xingxing Wang
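for: Click-through rate (CTR) prediction plays a pivotal role in recommendation. Inspired by the recent rise of language models (LMs), this line of work organizes user behavior data in a textual format and uses LMs to understand user interest at a semantic level.
methods: The paper proposes a Textual Behavior-based Interest Chunking Network (TBIN), which combines an efficient locality-sensitive hashing algorithm with shifted chunk-based self-attention to address the limitations of truncating long behavior text and of condensing diverse interests into a single feature vector.
results: Offline and online experiments on a real-world food recommendation platform demonstrate the effectiveness of TBIN.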

Abstract
Click-through rate (CTR) prediction plays a pivotal role in the success of recommendations. Inspired by the recent thriving of language models (LMs), a surge of works improve prediction by organizing user behavior data in a \textbf{textual} format and using LMs to understand user interest at a semantic level. While promising, these works have to truncate the textual data to reduce the quadratic computational overhead of self-attention in LMs. However, it has been studied that long user behavior data can significantly benefit CTR prediction. In addition, these works typically condense user diverse interests into a single feature vector, which hinders the expressive capability of the model. In this paper, we propose a \textbf{T}extual \textbf{B}ehavior-based \textbf{I}nterest Chunking \textbf{N}etwork (TBIN), which tackles the above limitations by combining an efficient locality-sensitive hashing algorithm and a shifted chunk-based self-attention. The resulting user diverse interests are dynamically activated, producing user interest representation towards the target item. Finally, the results of both offline and online experiments on real-world food recommendation platform demonstrate the effectiveness of TBIN.
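
To make the role of locality-sensitive hashing more concrete, the sketch below groups behavior embeddings into buckets so that attention could later be restricted to each group. The abstract does not say which LSH scheme TBIN uses; sign random projection is assumed here purely for illustration, and all function names are hypothetical.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vec, hyperplanes):
    """Sign-random-projection hash: one bit per hyperplane."""
    return tuple(
        int(sum(v * h for v, h in zip(vec, plane)) >= 0) for plane in hyperplanes
    )

def bucket_behaviors(embeddings, hyperplanes):
    """Group embedding indices by signature; similar directions tend to collide."""
    buckets = {}
    for idx, vec in enumerate(embeddings):
        buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(idx)
    return buckets

planes = make_hyperplanes(dim=3, n_bits=4)
v = [1.0, 2.0, -0.5]
# A positively scaled copy points in the same direction, so it shares v's bucket.
print(bucket_behaviors([v, [2 * x for x in v]], planes))
```

Bucketing of this kind avoids comparing every pair of behaviors, which is the quadratic cost that the abstract attributes to full self-attention.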

Summary
Click-through rate (CTR) prediction plays an important role in recommendation. Following the recent rise of language models (LMs), a number of works format user behavior data as text and use LMs to understand user interest at a semantic level. Although promising, these works have to truncate the textual data to reduce the quadratic computational overhead of self-attention in LMs. However, it has been shown that long user behavior data can significantly benefit CTR prediction. In addition, these works typically condense a user's diverse interests into a single feature vector, which hinders the expressive capability of the model. In this paper, we propose a Textual Behavior-based Interest Chunking Network (TBIN), which tackles the above limitations by combining an efficient locality-sensitive hashing algorithm and a shifted chunk-based self-attention. The resulting diverse user interests are dynamically activated, producing a user interest representation towards the target item. Finally, the results of both offline and online experiments on a real-world food recommendation platform demonstrate the effectiveness of TBIN.

Sudowoodo: a Chinese Lyric Imitation System with Source Lyrics

paper_url: http://arxiv.org/abs/2308.04665
repo_url: None
paper_authors: Yongzhu Chang, Rongsheng Zhang, Lin Jiang, Qihang Chen, Le Zhang, Jiashu Pu
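for: The paper presents Sudowoodo, a Chinese lyric imitation system that generates new lyrics by imitating existing source lyrics.
methods: The paper proposes a framework that uses a keyword-based lyrics model to construct a parallel training corpus from source lyrics, trains a lyrics imitation model on that corpus, and applies a post-processing module to filter and rank the generated lyrics and select the highest-quality ones.
results: Human evaluation shows that the framework performs better lyric imitation. The system and a demo video are also made available.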

Abstract
Lyrics generation is a well-known application in natural language generation research, with several previous studies focusing on generating accurate lyrics using precise control such as keywords, rhymes, etc. However, lyrics imitation, which involves writing new lyrics by imitating the style and content of the source lyrics, remains a challenging task due to the lack of a parallel corpus. In this paper, we introduce \textbf{\textit{Sudowoodo}}, a Chinese lyrics imitation system that can generate new lyrics based on the text of source lyrics. To address the issue of lacking a parallel training corpus for lyrics imitation, we propose a novel framework to construct a parallel corpus based on a keyword-based lyrics model from source lyrics. Then the pairs \textit{(new lyrics, source lyrics)} are used to train the lyrics imitation model. During the inference process, we utilize a post-processing module to filter and rank the generated lyrics, selecting the highest-quality ones. We incorporated audio information and aligned the lyrics with the audio to form the songs as a bonus. The human evaluation results show that our framework can perform better lyric imitation. Meanwhile, the \textit{Sudowoodo} system and a demo video of the system are available at \href{https://Sudowoodo.apps-hp.danlu.netease.com/}{Sudowoodo} and \href{https://youtu.be/u5BBT_j1L5M}{https://youtu.be/u5BBT\_j1L5M}.

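Summary
This paper introduces Sudowoodo, a new Chinese lyrics imitation system that can generate new lyrics based on the text of source lyrics. Lyrics imitation has remained a challenge because of the lack of a parallel training corpus. We propose a novel framework that uses a keyword-based lyrics model to construct a parallel corpus from source lyrics. The pairs (new lyrics, source lyrics) are then used to train the lyrics imitation model. During inference, a post-processing module filters and ranks the generated lyrics and selects the highest-quality ones. We also incorporated audio information and aligned the lyrics with the audio to form complete songs. Human evaluation results show that our framework achieves better lyric imitation. The Sudowoodo system and a demo video are available at \href{https://Sudowoodo.apps-hp.danlu.netease.com/}{Sudowoodo} and \href{https://youtu.be/u5BBT_j1L5M}{https://youtu.be/u5BBT\_j1L5M}.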

Cross-Lingual Constituency Parsing for Middle High German: A Delexicalized Approach

paper_url: http://arxiv.org/abs/2308.04645
repo_url: None
paper_authors: Ercong Nie, Helmut Schmid, Hinrich Schütze
for: This study aims to build a constituency parser for Middle High German (MHG), a difficult task because no annotated MHG parse data is available for training.
methods: The authors apply cross-lingual transfer: using delexicalization, they train a parser on Modern German (MG) parse datasets and transfer it to MHG parsing.
results: The delexicalized constituency parser achieves an F1 score of 67.3% on the MHG test set, outperforming the best zero-shot cross-lingual baseline by 28.6 percentage points. These results indicate that automatic syntactic analysis is practical for other ancient languages facing similar challenges.

Abstract
Constituency parsing plays a fundamental role in advancing natural language processing (NLP) tasks. However, training an automatic syntactic analysis system for ancient languages solely relying on annotated parse data is a formidable task due to the inherent challenges in building treebanks for such languages. It demands extensive linguistic expertise, leading to a scarcity of available resources. To overcome this hurdle, cross-lingual transfer techniques which require minimal or even no annotated data for low-resource target languages offer a promising solution. In this study, we focus on building a constituency parser for $\mathbf{M}$iddle $\mathbf{H}$igh $\mathbf{G}$erman ($\mathbf{MHG}$) under realistic conditions, where no annotated MHG treebank is available for training. In our approach, we leverage the linguistic continuity and structural similarity between MHG and $\mathbf{M}$odern $\mathbf{G}$erman ($\mathbf{MG}$), along with the abundance of MG treebank resources. Specifically, by employing the $\mathit{delexicalization}$ method, we train a constituency parser on MG parse datasets and perform cross-lingual transfer to MHG parsing. Our delexicalized constituency parser demonstrates remarkable performance on the MHG test set, achieving an F1-score of 67.3%. It outperforms the best zero-shot cross-lingual baseline by a margin of 28.6 percentage points. These encouraging results underscore the practicality and potential for automatic syntactic analysis in other ancient languages that face similar challenges as MHG.
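
A minimal sketch of the idea behind delexicalization follows: word forms are replaced by part-of-speech tags, so a parser trained on Modern German tag sequences can be applied to Middle High German input. The two example sentences, the tag set, and the function name are assumptions made for illustration; they are not taken from the paper's data or code.

```python
def delexicalize(tagged_sentence):
    """Drop the word forms and keep only the POS tag sequence."""
    return [tag for _word, tag in tagged_sentence]

# Hypothetical hand-tagged examples (tags chosen for illustration only).
modern_german = [("der", "DET"), ("König", "NOUN"), ("kam", "VERB")]
middle_high_german = [("der", "DET"), ("künec", "NOUN"), ("kam", "VERB")]

# The spellings differ, but the delexicalized inputs are identical, which is
# what allows a parser trained on one language to be applied to the other.
print(delexicalize(modern_german))
print(delexicalize(middle_high_german))
```

The approach presumably depends on having a usable POS tagger for the target language, since the tags are the only input the parser sees.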

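Summary
Constituency parsing plays a fundamental role in natural language processing (NLP) tasks. However, training an automatic syntactic analysis system for ancient languages solely from annotated parse data is difficult because of the inherent challenges of building treebanks for such languages. Doing so requires extensive linguistic expertise, which leads to a scarcity of resources. Cross-lingual transfer techniques, which need little or no annotated data for low-resource target languages, offer a promising way around this hurdle. In this study, we focus on building a constituency parser for Middle High German (MHG) under realistic conditions, where no annotated MHG treebank is available for training. We exploit the linguistic continuity and structural similarity between MHG and Modern German (MG), together with the abundance of MG treebank resources. Specifically, using the delexicalization method, we train a constituency parser on MG parse datasets and perform cross-lingual transfer to MHG parsing. Our delexicalized parser performs remarkably well on the MHG test set, reaching an F1 score of 67.3% and outperforming the best zero-shot cross-lingual baseline by 28.6 percentage points. These encouraging results underscore the practicality and potential of automatic syntactic analysis for other ancient languages that face similar challenges.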

Single-Sentence Reader: A Novel Approach for Addressing Answer Position Bias

paper_url: http://arxiv.org/abs/2308.04566
repo_url: https://github.com/sonqt/single-sentence-reader
paper_authors: Son Quoc Tran, Matt Kretchmar
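for: This study aims to address the tendency of machine reading comprehension (MRC) models to exploit spurious correlations (also known as dataset bias or annotation artifacts), and thereby improve their robustness.
methods: The study proposes a Single-Sentence Reader to address answer-position bias, implements it with six different models, and thoroughly analyzes their performance.
results: The proposed Single-Sentence Readers nearly match the results of models trained on conventional training sets, which demonstrates their effectiveness.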

Abstract
Machine Reading Comprehension (MRC) models tend to take advantage of spurious correlations (also known as dataset bias or annotation artifacts in the research community). Consequently, these models may perform the MRC task without fully comprehending the given context and question, which is undesirable since it may result in low robustness against distribution shift. This paper delves into the concept of answer-position bias, where a significant percentage of training questions have answers located solely in the first sentence of the context. We propose a Single-Sentence Reader as a new approach for addressing answer position bias in MRC. We implement this approach using six different models and thoroughly analyze their performance. Remarkably, our proposed Single-Sentence Readers achieve results that nearly match those of models trained on conventional training sets, proving their effectiveness. Our study also discusses several challenges our Single-Sentence Readers encounter and proposes a potential solution.
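
The answer-position bias described above can be quantified with a simple measurement: the share of examples whose answer lies entirely within the first sentence of the context. The sketch below is not the paper's code; the field names follow a SQuAD-like layout that is assumed here, and the sentence splitter is deliberately naive.

```python
import re

def first_sentence_end(context):
    """Character offset just past the first sentence (naive punctuation split)."""
    match = re.search(r"[.!?](\s|$)", context)
    return match.end() if match else len(context)

def first_sentence_share(examples):
    """Fraction of examples whose answer span lies entirely in the first sentence."""
    hits = 0
    for ex in examples:
        answer_end = ex["answer_start"] + len(ex["answer_text"])
        if answer_end <= first_sentence_end(ex["context"]):
            hits += 1
    return hits / len(examples)

context = "Paris is the capital of France. It lies on the Seine."
examples = [
    {"context": context, "answer_text": "Paris", "answer_start": context.index("Paris")},
    {"context": context, "answer_text": "Seine", "answer_start": context.index("Seine")},
]
print(first_sentence_share(examples))
```

A high share on a training set would suggest that a model can score well by attending mostly to the first sentence, which is the shortcut the paper sets out to address.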


Ahead of the Text: Leveraging Entity Preposition for Financial Relation Extraction

paper_url: http://arxiv.org/abs/2308.04534
repo_url: None
paper_authors: Stefan Pasch, Dimitrios Petridis
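for: The paper addresses a financial entity relation extraction task on the REFind dataset.
methods: The approach inserts the provided entities at their corresponding locations in the text, fine-tunes the transformer-based language model roberta-large for text classification, and then applies post-processing to handle improbable predictions generated by the model.
results: The approach ranked first on the competition's public leaderboard.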

Abstract
In the context of the ACM KDF-SIGIR 2023 competition, we undertook an entity relation task on a dataset of financial entity relations called REFind. Our top-performing solution involved a multi-step approach. Initially, we inserted the provided entities at their corresponding locations within the text. Subsequently, we fine-tuned the transformer-based language model roberta-large for text classification by utilizing a labeled training set to predict the entity relations. Lastly, we implemented a post-processing phase to identify and handle improbable predictions generated by the model. As a result of our methodology, we achieved the 1st place ranking on the competition's public leaderboard.
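
The first step of the pipeline, inserting the provided entities at their locations in the text, could look roughly like the sketch below. The abstract does not specify the input format or how the inserted entities are marked, so the character-span input, the bracketed marker tokens, and the function name are all hypothetical.

```python
def insert_entities(text, spans):
    """Wrap each (start, end, label) character span in marker tokens.

    Spans are processed from right to left so that earlier offsets stay valid
    after each insertion. Overlapping spans are not handled.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = f"{text[:start]}[{label}] {text[start:end]} [/{label}]{text[end:]}"
    return text

sentence = "Acme acquired Beta."
marked = insert_entities(sentence, [(0, 4, "E1"), (14, 18, "E2")])
print(marked)  # [E1] Acme [/E1] acquired [E2] Beta [/E2].
```

The marked sentence would then be passed to the fine-tuned classifier; the post-processing step for improbable predictions is not sketched because the abstract gives no details about its rules.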

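Summary
For the entity relation task of the ACM KDF-SIGIR 2023 competition, we used a multi-step approach. First, we inserted the provided entities at their corresponding locations in the text. Next, we fine-tuned the transformer-based language model roberta-large for text classification on a labeled training set. Finally, we added a post-processing phase to identify and handle improbable predictions generated by the model. With this approach we ranked first on the competition's public leaderboard.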

Revisiting Disentanglement and Fusion on Modality and Context in Conversational Multimodal Emotion Recognition

paper_url: http://arxiv.org/abs/2308.04502
repo_url: None
paper_authors: Bobo Li, Hao Fei, Lizi Liao, Yu Zhao, Chong Teng, Tat-Seng Chua, Donghong Ji, Fei Li
for: This study targets further improving performance on multimodal emotion recognition in conversation (MM-ERC) by properly modeling both the multimodal features and the conversational context.
methods: The study proposes a Dual-level Disentanglement Mechanism (DDM), a Contribution-aware Fusion Mechanism (CFM), and a Context Refusion Mechanism (CRM), which model multimodal features and conversational context simultaneously during the feature disentanglement and feature fusion stages.
results: On two public MM-ERC datasets the system consistently achieves new state-of-the-art performance. Further analyses show that the proposed mechanisms make full use of multimodal and context features and have the potential to generalize to a broader range of conversational multimodal tasks.

Abstract
It has been a hot research topic to enable machines to understand human emotions in multimodal contexts under dialogue scenarios, which is tasked with multimodal emotion analysis in conversation (MM-ERC). MM-ERC has received consistent attention in recent years, where a diverse range of methods has been proposed for securing better task performance. Most existing works treat MM-ERC as a standard multimodal classification problem and perform multimodal feature disentanglement and fusion for maximizing feature utility. Yet after revisiting the characteristic of MM-ERC, we argue that both the feature multimodality and conversational contextualization should be properly modeled simultaneously during the feature disentanglement and fusion steps. In this work, we target further pushing the task performance by taking full consideration of the above insights. On the one hand, during feature disentanglement, based on the contrastive learning technique, we devise a Dual-level Disentanglement Mechanism (DDM) to decouple the features into both the modality space and utterance space. On the other hand, during the feature fusion stage, we propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration, respectively. They together schedule the proper integrations of multimodal and context features. Specifically, CFM explicitly manages the multimodal feature contributions dynamically, while CRM flexibly coordinates the introduction of dialogue contexts. On two public MM-ERC datasets, our system achieves new state-of-the-art performance consistently. Further analyses demonstrate that all our proposed mechanisms greatly facilitate the MM-ERC task by making full use of the multimodal and context features adaptively. Note that our proposed methods have the great potential to facilitate a broader range of other conversational multimodal tasks.

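Summary
Enabling machines to understand human emotions in multimodal dialogue scenarios, the task of multimodal emotion recognition in conversation (MM-ERC), has long been an active research topic. It has received consistent attention in recent years, and a wide range of methods has been proposed to improve task performance. Most existing work treats MM-ERC as a standard multimodal classification problem and performs multimodal feature disentanglement and fusion to maximize feature utility. After revisiting the characteristics of MM-ERC, however, we argue that feature multimodality and conversational context should be modeled simultaneously. During feature disentanglement, we use contrastive learning to devise a Dual-level Disentanglement Mechanism (DDM) that decouples features into the modality space and the utterance space. During feature fusion, we propose a Contribution-aware Fusion Mechanism (CFM) and a Context Refusion Mechanism (CRM) for multimodal and context integration, respectively. CFM dynamically manages the contributions of the multimodal features, while CRM flexibly coordinates the introduction of dialogue context. On two public MM-ERC datasets, our system consistently achieves new state-of-the-art performance. Further analyses show that all of the proposed mechanisms benefit the MM-ERC task by making adaptive use of multimodal and context features. The methods also have the potential to benefit a broader range of conversational multimodal tasks.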

DialogRE^C+: An Extension of DialogRE to Investigate How Much Coreference Helps Relation Extraction in Dialogs

paper_url: http://arxiv.org/abs/2308.04498
repo_url: https://github.com/palm2333/dialogre_coreference
paper_authors: Yiyun Xiong, Mengwei Dai, Fei Li, Hao Fei, Bobo Li, Shengqiong Wu, Donghong Ji, Chong Teng
for: This paper introduces a new benchmark dataset called DialogRE^C+, which incorporates coreference resolution into the dialogue relation extraction (DRE) task.
methods: The paper manually annotates a total of 5,068 coreference chains over 36,369 argument mentions based on existing DialogRE data, and develops four coreference-enhanced graph-based DRE models.
results: The paper evaluates the effect of automatically extracted coreference chains and demonstrates the practicality of the DialogRE^C+ dataset for other domains and tasks.

Abstract
Dialogue relation extraction (DRE) that identifies the relations between argument pairs in dialogue text, suffers much from the frequent occurrence of personal pronouns, or entity and speaker coreference. This work introduces a new benchmark dataset DialogRE^C+, introducing coreference resolution into the DRE scenario. With the aid of high-quality coreference knowledge, the reasoning of argument relations is expected to be enhanced. In DialogRE^C+ dataset, we manually annotate total 5,068 coreference chains over 36,369 argument mentions based on the existing DialogRE data, where four different coreference chain types namely speaker chain, person chain, location chain and organization chain are explicitly marked. We further develop 4 coreference-enhanced graph-based DRE models, which learn effective coreference representations for improving the DRE task. We also train a coreference resolution model based on our annotations and evaluate the effect of automatically extracted coreference chains demonstrating the practicality of our dataset and its potential to other domains and tasks.

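Summary
Dialogue relation extraction (DRE), which identifies the relations between argument pairs in dialogue text, suffers from the frequent occurrence of personal pronouns and of entity and speaker coreference. This work introduces a new benchmark dataset, DialogRE^C+, which brings coreference resolution into the DRE scenario. With high-quality coreference knowledge, reasoning about argument relations is expected to improve. In the DialogRE^C+ dataset, we manually annotate a total of 5,068 coreference chains over 36,369 argument mentions based on the existing DialogRE data, explicitly marking four coreference chain types: speaker chain, person chain, location chain, and organization chain. We further develop 4 coreference-enhanced graph-based DRE models, which learn effective coreference representations to improve the DRE task. We also train a coreference resolution model on our annotations and evaluate the effect of automatically extracted coreference chains, demonstrating the practicality of our dataset and its potential for other domains and tasks.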

A Bi-directional Multi-hop Inference Model for Joint Dialog Sentiment Classification and Act Recognition

paper_url: http://arxiv.org/abs/2308.04424
repo_url: None
paper_authors: Li Zheng, Fei Li, Yuyang Chai, Chong Teng, Donghong Ji
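for: The paper proposes a new approach to the joint task of dialog sentiment classification (DSC) and dialog act recognition (DAR), which predicts the sentiment label and act label of each utterance in a dialog simultaneously.
methods: The method uses a feature selection network and a bi-directional multi-hop inference network to iteratively extract and integrate rich sentiment and act clues, and employs contrastive learning and dual learning to explicitly model the correlations between sentiment and act labels.
results: Experiments on two widely used datasets show that the model outperforms state-of-the-art baselines by at least 2.6% F1 on DAR and 1.4% F1 on DSC. The model also improves the interpretability of the joint task.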

Abstract
The joint task of Dialog Sentiment Classification (DSC) and Act Recognition (DAR) aims to predict the sentiment label and act label for each utterance in a dialog simultaneously. However, current methods encode the dialog context in only one direction, which limits their ability to thoroughly comprehend the context. Moreover, these methods overlook the explicit correlations between sentiment and act labels, which leads to an insufficient ability to capture rich sentiment and act clues and hinders effective and accurate reasoning. To address these issues, we propose a Bi-directional Multi-hop Inference Model (BMIM) that leverages a feature selection network and a bi-directional multi-hop inference network to iteratively extract and integrate rich sentiment and act clues in a bi-directional manner. We also employ contrastive learning and dual learning to explicitly model the correlations of sentiment and act labels. Our experiments on two widely-used datasets show that BMIM outperforms state-of-the-art baselines by at least 2.6% on F1 score in DAR and 1.4% on F1 score in DSC. Additionally, our proposed model not only improves the performance but also enhances the interpretability of the joint sentiment and act prediction task.

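Summary
The joint task of dialog sentiment classification (DSC) and dialog act recognition (DAR) aims to predict the sentiment label and act label of each utterance simultaneously. However, existing methods encode the dialog context in only one direction, which limits their understanding of the context. They also overlook the explicit correlations between sentiment and act labels, so they capture sentiment and act clues insufficiently, which hinders effective and accurate reasoning. To address these issues, we propose a Bi-directional Multi-hop Inference Model (BMIM), which uses a feature selection network and a bi-directional multi-hop inference network to iteratively extract and integrate rich sentiment and act clues. We also employ contrastive learning and dual learning to explicitly model the correlations between sentiment and act labels. Experiments on two widely used datasets show that BMIM outperforms state-of-the-art baselines by at least 2.6% F1 on DAR and 1.4% F1 on DSC. The proposed model not only improves performance but also enhances the interpretability of the joint task.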

Character-level NMT and language similarity

paper_url: http://arxiv.org/abs/2308.04398
repo_url: None
paper_authors: Josef Jon, Ondřej Bojar
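for: The paper studies character-level neural machine translation with the Transformer architecture, and how language similarity and training data size affect translation quality.
methods: The paper trains character-level neural machine translation (NMT) models based on the Transformer architecture for translation between Czech and Croatian, German, Hungarian, Slovak, and Spanish, and evaluates them with automatic MT metrics.
results: Translation between similar languages benefits from character-level input segmentation, whereas for less related languages the character-level vanilla Transformer-base often lags behind subword-level segmentation. The study confirms earlier findings that the gap can be closed by fine-tuning already trained subword-level models to character level.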

Abstract
We explore the effectiveness of character-level neural machine translation using Transformer architecture for various levels of language similarity and size of the training dataset on translation between Czech and Croatian, German, Hungarian, Slovak, and Spanish. We evaluate the models using automatic MT metrics and show that translation between similar languages benefits from character-level input segmentation, while for less related languages, character-level vanilla Transformer-base often lags behind subword-level segmentation. We confirm previous findings that it is possible to close the gap by finetuning the already trained subword-level models to character-level.
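
The difference between the two input segmentations compared in the abstract can be shown on a short Czech phrase. This is an illustrative sketch only: the greedy longest-match segmenter stands in for whatever subword model the authors used, and the vocabulary, the visible-space symbol, and the function names are assumptions.

```python
def char_segment(sentence, space="▁"):
    """Character-level segmentation: one token per character, spaces made visible."""
    return [space if ch == " " else ch for ch in sentence]

def subword_segment(sentence, vocab, space="▁"):
    """Greedy longest-match segmentation against a given subword vocabulary.

    Falls back to single characters when no vocabulary entry matches.
    """
    text = sentence.replace(" ", space)
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

phrase = "dobrý den"
print(char_segment(phrase))                       # 9 tokens, one per character
print(subword_segment(phrase, {"dobrý", "▁den"}))  # 2 tokens
```

Character-level input yields much longer sequences than subword input, which is one reason the two settings can behave differently as training data size varies.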

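Summary
We study the effectiveness of character-level neural machine translation with the Transformer architecture across different levels of language similarity and training data sizes, for translation between Czech and Croatian, German, Hungarian, Slovak, and Spanish. We evaluate the models with automatic MT metrics and find that translation between similar languages benefits from character-level input segmentation, whereas for less related languages the character-level vanilla Transformer-base often lags behind subword-level segmentation. We confirm previous findings that the gap can be closed by fine-tuning already trained subword-level models to character level.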

Learning Evaluation Models from Large Language Models for Sequence Generation

paper_url: http://arxiv.org/abs/2308.04386
repo_url: https://github.com/djdprogramming/adfa2
paper_authors: Chenglong Wang, Hang Zhou, Kaiyan Chang, Tongran Liu, Chunliang Zhang, Quan Du, Tong Xiao, Jingbo Zhu
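for: The paper aims to transfer the evaluation capability of large language models (LLMs) to relatively lightweight language models, addressing the computational challenge of applying LLM-based evaluation at scale.
methods: The paper proposes ECT, an evaluation capability transfer method, which learns evaluation models from LLMs (here ChatGPT) and uses them as reward models to improve sequence generation models through reinforcement learning and reranking.
results: Experiments show that applying the learned evaluation models to sequence generation models yields better generated sequences, as judged by commonly used metrics and by ChatGPT.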

Abstract
Large language models achieve state-of-the-art performance on sequence generation evaluation, but typically have a large number of parameters. This is a computational challenge as presented by applying their evaluation capability at scale. To overcome the challenge, in this paper, we propose \textbf{ECT}, an \textbf{e}valuation \textbf{c}apability \textbf{t}ransfer method, to transfer the evaluation capability from LLMs to relatively lightweight language models. Based on the proposed ECT, we learn various evaluation models from ChatGPT, and employ them as reward models to improve sequence generation models via reinforcement learning and reranking approaches. Experimental results on machine translation, text style transfer, and summarization tasks demonstrate the effectiveness of our ECT. Notably, applying the learned evaluation models to sequence generation models results in better generated sequences as evaluated by commonly used metrics and ChatGPT.
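
The reranking use of a learned evaluation model can be sketched as follows: several candidate outputs are scored and the highest-scoring one is kept. The scorer below is a toy word-overlap function that merely stands in for a learned evaluation model; it, the reference vocabulary, and the function names are hypothetical and not part of ECT.

```python
def rerank(candidates, evaluator):
    """Return candidates sorted from highest to lowest evaluator score."""
    return sorted(candidates, key=evaluator, reverse=True)

def toy_evaluator(reference_words):
    """Hypothetical stand-in for a learned evaluation model: word-overlap ratio."""
    def score(candidate):
        words = candidate.lower().split()
        return sum(w in reference_words for w in words) / max(len(words), 1)
    return score

evaluator = toy_evaluator({"the", "cat", "sat"})
ranked = rerank(["a dog ran", "the cat sat", "the dog sat"], evaluator)
print(ranked[0])  # the cat sat
```

In the reinforcement learning setting described in the abstract, the same score would presumably serve as the reward signal rather than as a sort key.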

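Summary
Large language models perform very well on sequence generation evaluation, but they typically have a large number of parameters. This creates a computational challenge, because applying their evaluation capability at scale requires substantial computing resources. To address this challenge, we propose ECT, an evaluation capability transfer method that transfers the evaluation capability of large language models to relatively lightweight language models. Based on ECT, we learn various evaluation models from ChatGPT and use them as reward models for reinforcement learning and reranking. Experimental results show that ECT is effective and that applying the learned evaluation models produces better generated sequences, as evaluated by commonly used metrics and by ChatGPT.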