cs.CL - 2023-11-21

Attribution and Alignment: Effects of Local Context Repetition on Utterance Production and Comprehension in Dialogue

  • paper_url: http://arxiv.org/abs/2311.13061
  • repo_url: None
  • paper_authors: Aron Molnar, Jaap Jumelet, Mario Giulianelli, Arabella Sinclair
  • for: Evaluating whether language models produce human-like levels of repetition in dialogue, and how they process lexical repetition during comprehension.
  • methods: Joint analysis of model behaviour in both generation (production) and comprehension.
  • results: Language models produce repetition at levels similar to human dialogue and, during comprehension, rely on mechanisms for lexical re-use that resemble those observed in humans.
    Abstract Language models are often used as the backbone of modern dialogue systems. These models are pre-trained on large amounts of written fluent language. Repetition is typically penalised when evaluating language model generations. However, it is a key component of dialogue. Humans use local and partner specific repetitions; these are preferred by human users and lead to more successful communication in dialogue. In this study, we evaluate (a) whether language models produce human-like levels of repetition in dialogue, and (b) what are the processing mechanisms related to lexical re-use they use during comprehension. We believe that such joint analysis of model production and comprehension behaviour can inform the development of cognitively inspired dialogue generation systems.
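
As a rough illustration of the local, partner-specific repetition the study quantifies, the sketch below computes a simple lexical-overlap rate between an utterance and its recent dialogue context. This is a minimal stand-in measure of our own devising, not the paper's exact metric.

```python
def local_repetition_rate(utterance, context_turns, window=3):
    """Fraction of the utterance's tokens that re-use words from the
    last `window` turns of local context (a crude repetition measure)."""
    context_vocab = {tok for turn in context_turns[-window:]
                     for tok in turn.lower().split()}
    tokens = utterance.lower().split()
    if not tokens:
        return 0.0
    return sum(tok in context_vocab for tok in tokens) / len(tokens)

# Comparing model-generated turns against the human turns at the same dialogue
# positions indicates whether the model's repetition level is human-like.
print(local_repetition_rate("shall we meet at the red cafe",
                            ["the red cafe is nice", "sure, where?"]))  # ~0.43
```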

Beyond Text: Unveiling Multimodal Proficiency of Large Language Models with MultiAPI Benchmark

  • paper_url: http://arxiv.org/abs/2311.13053
  • repo_url: https://github.com/haroldliuj/multiapi
  • paper_authors: Xiao Liu, Jianfeng Lin, Jiawei Zhang
  • for: Advancing large language models' proficiency with multimodal information.
  • methods: Evaluation of tool-augmented LLMs on the MultiAPI benchmark, comprising 235 diverse API calls and 2,038 contextual prompts.
  • results: LLMs perform well at API call decision-making but struggle with domain identification, function selection, and argument generation; surprisingly, auxiliary context can actually impair performance.
    Abstract The proliferation of Large Language Models like ChatGPT has significantly advanced language understanding and generation, impacting a broad spectrum of applications. However, these models predominantly excel in text-based tasks, overlooking the complexity of real-world multimodal information. This study introduces MultiAPI, a pioneering comprehensive large-scale API benchmark dataset aimed at expanding LLMs' proficiency in multimodal contexts. Developed collaboratively through ChatGPT, MultiAPI consists of 235 diverse API calls and 2,038 contextual prompts, offering a unique platform evaluation of tool-augmented LLMs handling multimodal tasks. Through comprehensive experiments, our findings reveal that while LLMs demonstrate proficiency in API call decision-making, they face challenges in domain identification, function selection, and argument generation. What's more, we surprisingly notice that auxiliary context can actually impair the performance. An in-depth error analysis paves the way for a new paradigm to address these challenges, suggesting a potential direction for future LLM research.

Systematic word meta-sense extension

  • paper_url: http://arxiv.org/abs/2311.13029
  • repo_url: https://github.com/jadeleiyu/sworme
  • paper_authors: Lei Yu
  • for: Testing and improving language models' ability to extend word meanings to new semantic domains (also called meta-senses) that bear regular semantic relations with existing senses.
  • methods: A new task, systematic word meta-sense extension (SWORME), together with a novel analogy-based method of word meaning extension.
  • results: Language models favour incremental lexical semantic change toward conceptually similar meta-senses (e.g., logical metonymy) but are much worse at highly non-literal extensions such as metaphor; the analogy-based method effectively improves model systematicity on both gradual and radical meta-sense extensions, and learning them benefits figurative language understanding benchmarks.
    Abstract The meaning of polysemous words often varies in a highly productive yet predictable way. Generalizing the regularity between conventional senses to derive novel word meaning is crucial for automated processing of non-literal language uses such as figurative expressions. We introduce a novel task called systematic word meta-sense extension (SWORME) to test and improve language models' ability to extend word meaning to denote new semantic domains (also called meta-senses) that bear regular semantic relations with existing senses. We found that language models prefer incremental lexical semantic change toward conceptually similar meta-senses such as logical metonymy, and are much worse at predicting highly non-literal meaning extensions such as metaphors. We propose a novel analogy-based method of word meaning extension, and show that it effectively improves language model systematicity in making both gradual and radical types of meta-sense extension. We further demonstrate that learning systematic meta-sense extensions benefits language models on multiple benchmarks of figurative language understanding.

Data Diversity Matters for Robust Instruction Tuning

  • paper_url: http://arxiv.org/abs/2311.14736
  • repo_url: None
  • paper_authors: Alexander Bukharin, Tuo Zhao
  • for: Aligning large language models via instruction tuning, where dataset selection is a central challenge because dataset composition directly affects downstream performance.
  • methods: A new algorithm, Quality-Diversity Instruction Tuning (QDIT), that provides principled control over dataset diversity and quality, enabling an in-depth study of how both affect instruction-following ability.
  • results: On several large-scale instruction tuning datasets, QDIT improves worst-case performance by 18% while maintaining or improving average performance relative to quality-driven baselines.
    Abstract Instruction tuning has emerged as a key step in aligning large language models. One of the central challenges of instruction tuning is dataset selection, as the composition of the instruction tuning dataset can significantly impact downstream performance. In particular, researchers have hypothesized that dataset diversity and dataset quality are important indicators of downstream performance. However, it is not clear how to automatically select high quality and diverse data or how exactly quality and diversity affect instruction following ability. To resolve these issues, we propose a new algorithm, Quality-Diversity Instruction Tuning (QDIT). QDIT provides a principled algorithm to control dataset diversity and quality, allowing us to conduct an in depth study on the effect of diversity and quality on instruction tuning performance. From this study we draw two key insights (1) there is a natural tradeoff between dataset diversity and quality and (2) increasing dataset diversity significantly improves the worst case instruction following performance, therefore improving robustness. We validate the performance of QDIT on several large scale instruction tuning datasets, where we find it can improve worst case performance by 18% while maintaining or improving average performance compared to quality driven baselines.
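
As one concrete illustration of a quality-diversity selection rule in the spirit of QDIT, the sketch below greedily trades off a per-example quality score against a facility-location diversity gain over normalized embeddings. The combination rule, the `alpha` weighting, and the score scales are assumptions for illustration; the paper defines QDIT's actual objective.

```python
import numpy as np

def quality_diversity_select(embeddings, quality, budget, alpha=0.5):
    """Greedy selection maximizing alpha * facility-location diversity gain
    + (1 - alpha) * quality. `embeddings` are L2-normalized rows and
    `quality` is a per-example score in [0, 1]; both scales are illustrative."""
    sim = embeddings @ embeddings.T            # pairwise cosine similarity
    n = sim.shape[0]
    selected = []
    coverage = np.zeros(n)                     # max similarity to selected set
    for _ in range(budget):
        gain = alpha * np.maximum(sim - coverage, 0).mean(axis=1) \
             + (1 - alpha) * quality
        gain[selected] = -np.inf               # never re-select an example
        best = int(np.argmax(gain))
        selected.append(best)
        coverage = np.maximum(coverage, sim[best])
    return selected
```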

A Baseline Analysis of Reward Models’ Ability To Accurately Analyze Foundation Models Under Distribution Shift

  • paper_url: http://arxiv.org/abs/2311.14743
  • repo_url: None
  • paper_authors: Ben Pikus, Will LeVine, Tony Chen, Sean Hendryx
  • for: Measuring how robust the reward models used to align Large Language Models (LLMs) are to distribution shift.
  • methods: In Reinforcement Learning with Human Feedback (RLHF), a reward model is trained to capture desired behaviors and is also used at inference time to score how well LLM responses adhere to them; the paper evaluates reward model accuracy and calibration under out-of-distribution (OOD) prompts and responses.
  • results: Reward model performance degrades under distribution shift, with novel calibration patterns and accuracy drops, and the model is more sensitive to shifts in responses than in prompts. The paper also adapts an OOD detection technique commonly used in classification to the reward model setting to detect such shifts.
    Abstract Foundation models, specifically Large Language Models (LLM's), have lately gained wide-spread attention and adoption. Reinforcement Learning with Human Feedback (RLHF) involves training a reward model to capture desired behaviors, which is then used to align an LLM. These reward models are additionally used at inference-time to estimate how well LLM responses adhere to those desired behaviors. However, there is little work measuring how robust these reward models are to distribution shifts. In this work, we evaluate how reward model performance - measured via accuracy and calibration (i.e. alignment between accuracy and confidence) - is affected by distribution shift. We show novel calibration patterns and accuracy drops due to OOD prompts and responses, and that the reward model is more sensitive to shifts in responses than prompts. Additionally, we adapt an OOD detection technique commonly used in classification to the reward model setting in order to detect these distribution shifts in prompts and responses.
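
The abstract does not name the classification OOD detector it adapts; as one representative example of that family (an assumption, not necessarily the paper's choice), the sketch below scores reward-model embeddings with a Mahalanobis distance fit on in-distribution data.

```python
import numpy as np

class MahalanobisOOD:
    """Fit a Gaussian to in-distribution reward-model embeddings and score
    new inputs by Mahalanobis distance (higher = more out-of-distribution)."""

    def fit(self, feats):                       # feats: (n, d) ID embeddings
        self.mu = feats.mean(axis=0)
        cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
        self.precision = np.linalg.inv(cov)
        return self

    def score(self, feats):                     # feats: (m, d) test embeddings
        diff = feats - self.mu
        return np.einsum("nd,de,ne->n", diff, self.precision, diff)

# Usage: fit on embeddings of training prompts/responses, then flag test
# inputs whose score exceeds a threshold chosen on a validation split.
```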

LowResource at BLP-2023 Task 2: Leveraging BanglaBert for Low Resource Sentiment Analysis of Bangla Language

  • paper_url: http://arxiv.org/abs/2311.12735
  • repo_url: https://github.com/aunabil4602/bnlp-workshop-task2-2023
  • paper_authors: Aunabil Chakma, Masum Hasan
  • for: Describing the LowResource Team's system for BLP-2023 Task 2, sentiment analysis of public posts and comments from diverse social media platforms.
  • methods: Several strategies for using BanglaBert, a BERT model pre-trained on a large Bangla corpus, including fine-tuning, dropping random tokens, and using several external datasets.
  • results: The final model, an ensemble of the three best BanglaBert variants, scored 0.718 on the test set, ranking 3rd among 30 participating teams. The paper also discusses promising systems that did not perform well, including task-adaptive pretraining and paraphrasing with BanglaT5.
    Abstract This paper describes the system of the LowResource Team for Task 2 of BLP-2023, which involves conducting sentiment analysis on a dataset composed of public posts and comments from diverse social media platforms. Our primary aim is to utilize BanglaBert, a BERT model pre-trained on a large Bangla corpus, using various strategies including fine-tuning, dropping random tokens, and using several external datasets. Our final model is an ensemble of the three best BanglaBert variations. Our system has achieved overall 3rd in the Test Set among 30 participating teams with a score of 0.718. Additionally, we discuss the promising systems that didn't perform well, namely task-adaptive pretraining and paraphrasing using BanglaT5. Training codes and external datasets which are used for our system are publicly available at https://github.com/Aunabil4602/bnlp-workshop-task2-2023

Soft Random Sampling: A Theoretical and Empirical Analysis

  • paper_url: http://arxiv.org/abs/2311.12727
  • repo_url: None
  • paper_authors: Xiaodong Cui, Ashish Mittal, Songtao Lu, Wei Zhang, George Saon, Brian Kingsbury
  • for: Efficient training of large-scale deep neural networks on massive data.
  • methods: Soft Random Sampling (SRS), which in each epoch selects a subset uniformly at random with replacement from the full dataset.
  • results: A theoretical and empirical analysis of SRS covering (1) its sampling dynamics, including data coverage and occupancy, (2) its convergence rate with non-convex objective functions, and (3) its generalization performance. Empirically (image recognition on CIFAR10, speech recognition on Librispeech and an in-house payload dataset), SRS offers a better accuracy-efficiency trade-off than existing coreset-based data selection methods, with significant speedup on real-world industrial-scale datasets at almost no additional computing cost.
    Abstract Soft random sampling (SRS) is a simple yet effective approach for efficient training of large-scale deep neural networks when dealing with massive data. SRS selects a subset uniformly at random with replacement from the full data set in each epoch. In this paper, we conduct a theoretical and empirical analysis of SRS. First, we analyze its sampling dynamics including data coverage and occupancy. Next, we investigate its convergence with non-convex objective functions and give the convergence rate. Finally, we provide its generalization performance. We empirically evaluate SRS for image recognition on CIFAR10 and automatic speech recognition on Librispeech and an in-house payload dataset to demonstrate its effectiveness. Compared to existing coreset-based data selection methods, SRS offers a better accuracy-efficiency trade-off. Especially on real-world industrial scale data sets, it is shown to be a powerful training strategy with significant speedup and competitive performance with almost no additional computing cost.
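
The core mechanism is simple enough to state in a few lines. Below is a minimal PyTorch sketch of per-epoch soft random sampling, assuming the subset size is a fixed fraction of the dataset; `sample_ratio` is an illustrative knob, not a value from the paper.

```python
import numpy as np
from torch.utils.data import DataLoader, Subset

def srs_epoch_loader(dataset, sample_ratio=0.5, batch_size=32, seed=None):
    """Build a DataLoader over a soft-random-sampled subset for one epoch:
    draw floor(ratio * N) indices uniformly at random *with replacement*."""
    rng = np.random.default_rng(seed)
    n = len(dataset)
    idx = rng.integers(low=0, high=n, size=int(sample_ratio * n))
    return DataLoader(Subset(dataset, idx.tolist()),
                      batch_size=batch_size, shuffle=True)

# Usage: rebuild the loader each epoch so a fresh subset is drawn.
# for epoch in range(num_epochs):
#     for batch in srs_epoch_loader(train_set, sample_ratio=0.5, seed=epoch):
#         train_step(batch)
```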

Fair Text Classification with Wasserstein Independence

  • paper_url: http://arxiv.org/abs/2311.12689
  • repo_url: https://github.com/letenothibaud/wasserstein_fair_classification
  • paper_authors: Thibaud Leteno, Antoine Gourru, Charlotte Laclau, Rémi Emonet, Christophe Gravier
  • for: Improving the fairness of text classification models, in particular achieving fair treatment across sensitive groups (e.g., women vs. men).
  • methods: An adversarially inspired, architecture-agnostic approach that induces Wasserstein independence between the representations learned to predict the target label and those learned to predict a sensitive attribute; unlike existing methods, it requires no sensitive-attribute annotations in either training or test data.
  • results: A fairness-accuracy trade-off comparable to or better than existing methods.
    Abstract Group fairness is a central research topic in text classification, where reaching fair treatment between sensitive groups (e.g. women vs. men) remains an open challenge. This paper presents a novel method for mitigating biases in neural text classification, agnostic to the model architecture. Considering the difficulty to distinguish fair from unfair information in a text encoder, we take inspiration from adversarial training to induce Wasserstein independence between representations learned to predict our target label and the ones learned to predict some sensitive attribute. Our approach provides two significant advantages. Firstly, it does not require annotations of sensitive attributes in both testing and training data. This is more suitable for real-life scenarios compared to existing methods that require annotations of sensitive attributes at train time. Second, our approach exhibits a comparable or better fairness-accuracy trade-off compared to existing methods.
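
A minimal sketch of the WGAN-style mechanics the abstract implies: a critic estimates the Wasserstein distance between the joint distribution of (task, sensitive) representations and the product of their marginals, and the encoder is trained to shrink it. The architecture and the shuffling trick for marginal samples are assumptions, not the authors' exact procedure, and a Lipschitz constraint on the critic (e.g., a gradient penalty) is omitted for brevity.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Scores representation pairs; trained to separate joint samples from
    product-of-marginals samples (Kantorovich-Rubinstein dual of W1)."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, z_task, z_sens):
        return self.net(torch.cat([z_task, z_sens], dim=-1))

def wasserstein_independence_penalty(critic, z_task, z_sens):
    """Aligned pairs approximate the joint; shuffling one side approximates
    the product of marginals. Critic maximizes this; encoder minimizes it."""
    joint = critic(z_task, z_sens).mean()
    perm = torch.randperm(z_sens.size(0), device=z_sens.device)
    marginals = critic(z_task, z_sens[perm]).mean()
    return joint - marginals
```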

MathGloss: Building mathematical glossaries from text

  • paper_url: http://arxiv.org/abs/2311.12649
  • repo_url: None
  • paper_authors: Lucy Horowitz, Valeria de Paiva
  • for: Automatically building a knowledge graph (KG) of undergraduate mathematics from text, using modern NLP tools and resources already available on the web, so that every mathematician can tailor their learning to their own preferences.
  • methods: Linking five resources: (i) Wikidata; (ii) terms covered in mathematics courses at the University of Chicago; (iii) the syllabus of the French undergraduate mathematics curriculum, with hyperlinks to the Lean 4 theorem prover; (iv) MuLiMa, a multilingual dictionary of mathematics curated by mathematicians; and (v) the nLab, a category theory wiki also curated by mathematicians.
  • results: A linked database of undergraduate mathematical concepts that brings learning resources together and, by organizing them alongside resources for formal mathematics, helps mathematicians and experts in formal tools (theorem provers, computer algebra systems, etc.) understand each other, breaking down some of the barriers to formal math.
    Abstract MathGloss is a project to create a knowledge graph (KG) for undergraduate mathematics from text, automatically, using modern natural language processing (NLP) tools and resources already available on the web. MathGloss is a linked database of undergraduate concepts in mathematics. So far, it combines five resources: (i) Wikidata, a collaboratively edited, multilingual knowledge graph hosted by the Wikimedia Foundation, (ii) terms covered in mathematics courses at the University of Chicago, (iii) the syllabus of the French undergraduate mathematics curriculum which includes hyperlinks to the automated theorem prover Lean 4, (iv) MuLiMa, a multilingual dictionary of mathematics curated by mathematicians, and (v) the nLab, a wiki for category theory also curated by mathematicians. MathGloss's goal is to bring together resources for learning mathematics and to allow every mathematician to tailor their learning to their own preferences. Moreover, by organizing different resources for learning undergraduate mathematics alongside those for learning formal mathematics, we hope to make it easier for mathematicians and formal tools (theorem provers, computer algebra systems, etc) experts to "understand" each other and break down some of the barriers to formal math.

Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks

  • paper_url: http://arxiv.org/abs/2311.12534
  • repo_url: None
  • paper_authors: Simone Filice, Jason Ingyu Choi, Giuseppe Castellucci, Eugene Agichtein, Oleg Rokhlenko
  • for: Proposing and evaluating metrics for natural language generation (NLG) settings that require multiple output texts, such as Synthetic Traffic Generation (STG) for training and evaluating QA systems and conversational agents.
  • methods: The paper shows that common NLG metrics such as BLEU are unsuitable for evaluating STG, and proposes metrics that compare the generated traffic to the distribution of real user texts, validated both with an automatic procedure and with human annotations.
  • results: Across three tasks (Shopping Utterance Generation, Product Question Generation, and Query Auto Completion), the proposed metrics are effective for evaluating STG and improve agreement with human judgement by up to 20% over common NLG metrics, suggesting they can better estimate the representativeness of synthetic text data.
    Abstract Many Natural Language Generation (NLG) tasks aim to generate a single output text given an input prompt. Other settings require the generation of multiple texts, e.g., for Synthetic Traffic Generation (STG). This generation task is crucial for training and evaluating QA systems as well as conversational agents, where the goal is to generate multiple questions or utterances resembling the linguistic variability of real users. In this paper, we show that common NLG metrics, like BLEU, are not suitable for evaluating STG. We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts. We validate our metrics with an automatic procedure to verify whether they capture different types of quality issues of generated data; we also run human annotations to verify the correlation with human judgements. Experiments on three tasks, i.e., Shopping Utterance Generation, Product Question Generation and Query Auto Completion, demonstrate that our metrics are effective for evaluating STG tasks, and improve the agreement with human judgement up to 20% with respect to common NLG metrics. We believe these findings can pave the way towards better solutions for estimating the representativeness of synthetic text data.
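
The paper's exact metric definitions are not given in this abstract; as one illustration of a distribution-level comparison between generated traffic and real user texts (not necessarily the authors' metric), the sketch below computes the Jensen-Shannon divergence between unigram distributions.

```python
from collections import Counter
import math

def unigram_dist(texts):
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two unigram distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

real = ["where is my order", "cancel my order please"]
generated = ["track my order", "where is my package"]
# Lower JSD -> the generated traffic better matches the real distribution.
print(js_divergence(unigram_dist(real), unigram_dist(generated)))
```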

Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages

  • paper_url: http://arxiv.org/abs/2311.12489
  • repo_url: None
  • paper_authors: Viktor Hangya, Silvia Severini, Radoslav Ralev, Alexander Fraser, Hinrich Schütze
  • for: Improving multilingual NLP performance for very low-resource (<5M tokens) and moderately low-resource (<50M tokens) languages, whose cross-lingual word representations are otherwise poor.
  • methods: A language-chain-based approach that builds multilingual word embeddings (MWEs) one language at a time, starting from a resource-rich source and sequentially adding intermediate related languages until the target is reached; a semi-joint bilingual approach is extended to multiple languages so the target is anchored in the multilingual space, eliminating the main weakness of prior work, namely independently trained monolingual embeddings.
  • results: Improved bilingual lexicon induction across 4 language families, covering 4 very low-resource and 4 moderately low-resource target languages; the analysis also shows the importance of good-quality embeddings for intermediate languages and of leveraging anchor points from all languages in the multilingual space.
    Abstract Very low-resource languages, having only a few million tokens worth of data, are not well-supported by multilingual NLP approaches due to poor quality cross-lingual word representations. Recent work showed that good cross-lingual performance can be achieved if a source language is related to the low-resource target language. However, not all language pairs are related. In this paper, we propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach, that incorporates intermediate related languages to bridge the gap between the distant source and target. We build MWEs one language at a time by starting from the resource rich source and sequentially adding each language in the chain till we reach the target. We extend a semi-joint bilingual approach to multiple languages in order to eliminate the main weakness of previous works, i.e., independently trained monolingual embeddings, by anchoring the target language around the multilingual space. We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (<5M tokens) and 4 moderately low-resource (<50M) target languages, showing improved performance in both categories. Additionally, our analysis reveals the importance of good quality embeddings for intermediate languages as well as the importance of leveraging anchor points from all languages in the multilingual space.
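
The semi-joint training the paper describes is richer than a simple mapping pipeline, but the chain idea can be illustrated with successive Procrustes alignments: map each language into its more-resourced neighbour's space and compose the hops from target back to source. The data layout and seed-dictionary handling below are assumptions for illustration.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F, from aligned seed rows."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def chain_align(embeddings, seed_pairs):
    """Map the target language into the source space by composing one
    Procrustes hop per adjacent language pair along the chain.

    embeddings: list of (n_i, d) arrays ordered source -> ... -> target.
    seed_pairs: per adjacent pair (i, i+1), a tuple (idx_i, idx_next) of
    row indices giving seed translations (hypothetical data layout).
    """
    hops = []
    for i, (idx_i, idx_next) in enumerate(seed_pairs):
        # Map language i+1 into language i's space via the seed dictionary.
        hops.append(procrustes(embeddings[i + 1][idx_next],
                               embeddings[i][idx_i]))
    aligned = embeddings[-1]
    for W in reversed(hops):     # target -> ... -> source
        aligned = aligned @ W
    return aligned
```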

Speaker-Adapted End-to-End Visual Speech Recognition for Continuous Spanish

  • paper_url: http://arxiv.org/abs/2311.12480
  • repo_url: None
  • paper_authors: David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos
  • for: Improving visual speech recognition (VSR) quality, particularly in the speaker-dependent setting.
  • methods: Different adaptation strategies based on fine-tuning, studied on the Spanish LIP-RTVE database, with a pre-trained CTC/Attention architecture as the baseline.
  • results: A two-step fine-tuning process that first adapts the VSR system to the task domain yields significant improvements when speaker adaptation is then addressed; results comparable to the state of the art are reached even with limited data.
    Abstract Different studies have shown the importance of visual cues throughout the speech perception process. In fact, the development of audiovisual approaches has led to advances in the field of speech technologies. However, although noticeable results have recently been achieved, visual speech recognition remains an open research problem. It is a task in which, by dispensing with the auditory sense, challenges such as visual ambiguities and the complexity of modeling silence must be faced. Nonetheless, some of these challenges can be alleviated when the problem is approached from a speaker-dependent perspective. Thus, this paper studies, using the Spanish LIP-RTVE database, how the estimation of specialized end-to-end systems for a specific person could affect the quality of speech recognition. First, different adaptation strategies based on the fine-tuning technique were proposed. Then, a pre-trained CTC/Attention architecture was used as a baseline throughout our experiments. Our findings showed that a two-step fine-tuning process, where the VSR system is first adapted to the task domain, provided significant improvements when the speaker adaptation was addressed. Furthermore, results comparable to the current state of the art were reached even when only a limited amount of data was available.

CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews

  • paper_url: http://arxiv.org/abs/2311.12474
  • repo_url: https://github.com/wojciechkusa/systematic-review-datasets
  • paper_authors: Wojciech Kusa, Oscar E. Mendoza, Matthias Samwald, Petr Knoth, Allan Hanbury
  • for: This paper aims to address the lack of standardized evaluation datasets for automated literature screening systems in systematic literature reviews (SLRs).
  • methods: The authors analyze citation screening evaluation datasets and introduce a new meta-dataset called CSMeD, which consolidates nine publicly released collections of SLRs from medicine and computer science.
  • results: The authors introduce a new dataset called CSMeD-FT for evaluating full text publication screening tasks and conduct experiments to demonstrate the utility of CSMeD.
    Abstract Systematic literature reviews (SLRs) play an essential role in summarising, synthesising and validating scientific evidence. In recent years, there has been a growing interest in using machine learning techniques to automate the identification of relevant studies for SLRs. However, the lack of standardised evaluation datasets makes comparing the performance of such automated literature screening systems difficult. In this paper, we analyse the citation screening evaluation datasets, revealing that many of the available datasets are either too small, suffer from data leakage or have limited applicability to systems treating automated literature screening as a classification task, as opposed to, for example, a retrieval or question-answering task. To address these challenges, we introduce CSMeD, a meta-dataset consolidating nine publicly released collections, providing unified access to 325 SLRs from the fields of medicine and computer science. CSMeD serves as a comprehensive resource for training and evaluating the performance of automated citation screening models. Additionally, we introduce CSMeD-FT, a new dataset designed explicitly for evaluating the full text publication screening task. To demonstrate the utility of CSMeD, we conduct experiments and establish baselines on new datasets.

Analysis of Visual Features for Continuous Lipreading in Spanish

  • paper_url: http://arxiv.org/abs/2311.12468
  • repo_url: None
  • paper_authors: David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos
  • for: Improving automatic visual speech recognition by analysing different visual speech features to identify which best captures the nature of lip movements for continuous Spanish.
  • methods: A traditional system based on Hidden Markov Models with Gaussian Mixture Models, comparing visual features including eigenlips and deep features.
  • results: Although the task is challenging, under restricted conditions the combination of eigenlips and deep features is the best visual approach, yielding the strongest recognition results.
    Abstract During a conversation, our brain is responsible for combining information obtained from multiple senses in order to improve our ability to understand the message we are perceiving. Different studies have shown the importance of presenting visual information in these situations. Nevertheless, lipreading is a complex task whose objective is to interpret speech when audio is not available. By dispensing with a sense as crucial as hearing, it will be necessary to be aware of the challenge that this lack presents. In this paper, we propose an analysis of different speech visual features with the intention of identifying which of them is the best approach to capture the nature of lip movements for natural Spanish and, in this way, dealing with the automatic visual speech recognition task. In order to estimate our system, we present an audiovisual corpus compiled from a subset of the RTVE database, which has been used in the Albayz\'in evaluations. We employ a traditional system based on Hidden Markov Models with Gaussian Mixture Models. Results show that, although the task is difficult, in restricted conditions we obtain recognition results which determine that using eigenlips in combination with deep features is the best visual approach.

LIP-RTVE: An Audiovisual Database for Continuous Spanish in the Wild

  • paper_url: http://arxiv.org/abs/2311.12457
  • repo_url: https://github.com/david-gimeno/lip-rtve
  • paper_authors: David Gimeno-Gómez, Carlos-D. Martínez-Hinarejos
  • for: Improving the robustness of automatic speech recognition systems by combining audio and visual cues to represent speech.
  • methods: Hidden Markov Models, a traditional paradigm widely used in speech technologies, evaluated in both speaker-dependent and speaker-independent scenarios.
  • results: A semi-automatically annotated audiovisual database providing 13 hours of unconstrained natural Spanish extracted from Spanish television, with baseline results reported for both scenarios.
    Abstract Speech is considered as a multi-modal process where hearing and vision are two fundamentals pillars. In fact, several studies have demonstrated that the robustness of Automatic Speech Recognition systems can be improved when audio and visual cues are combined to represent the nature of speech. In addition, Visual Speech Recognition, an open research problem whose purpose is to interpret speech by reading the lips of the speaker, has been a focus of interest in the last decades. Nevertheless, in order to estimate these systems in the currently Deep Learning era, large-scale databases are required. On the other hand, while most of these databases are dedicated to English, other languages lack sufficient resources. Thus, this paper presents a semi-automatically annotated audiovisual database to deal with unconstrained natural Spanish, providing 13 hours of data extracted from Spanish television. Furthermore, baseline results for both speaker-dependent and speaker-independent scenarios are reported using Hidden Markov Models, a traditional paradigm that has been widely used in the field of Speech Technologies.

Visual Analytics for Generative Transformer Models

  • paper_url: http://arxiv.org/abs/2311.12418
  • repo_url: None
  • paper_authors: Raymond Li, Ruixin Yang, Wen Xiao, Ahmed AbuRaed, Gabriel Murray, Giuseppe Carenini
  • for: Supporting the interpretability analysis of Transformer-based generative models.
  • methods: A visual analytics framework with interactive visualizations that let users explore different facets of a model; in contrast to prior work focused on encoder-based models, it is among the first dedicated to encoder-decoder and decoder-only models for generative and classification tasks.
  • results: Three detailed case studies based on real-world NLP research problems demonstrate the framework's feasibility and usefulness.
    Abstract While transformer-based models have achieved state-of-the-art results in a variety of classification and generation tasks, their black-box nature makes them challenging for interpretability. In this work, we present a novel visual analytical framework to support the analysis of transformer-based generative networks. In contrast to previous work, which has mainly focused on encoder-based models, our framework is one of the first dedicated to supporting the analysis of transformer-based encoder-decoder models and decoder-only models for generative and classification tasks. Hence, we offer an intuitive overview that allows the user to explore different facets of the model through interactive visualization. To demonstrate the feasibility and usefulness of our framework, we present three detailed case studies based on real-world NLP research problems.

IndoRobusta: Towards Robustness Against Diverse Code-Mixed Indonesian Local Languages

  • paper_url: http://arxiv.org/abs/2311.12405
  • repo_url: None
  • paper_authors: Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata, Pascale Fung, Ayu Purwarianti
  • for: Exploring code-mixing in Indonesian with four embedded languages: English, Sundanese, Javanese, and Malay.
  • methods: The IndoRobusta framework for evaluating and improving code-mixing robustness.
  • results: Pre-training corpus bias leads models to handle Indonesian-English code-mixing better than code-mixing with the other local languages, despite the latter's higher language diversity in daily conversation.
    Abstract Significant progress has been made on Indonesian NLP. Nevertheless, exploration of the code-mixing phenomenon in Indonesian is limited, despite many languages being frequently mixed with Indonesian in daily conversation. In this work, we explore code-mixing in Indonesian with four embedded languages, i.e., English, Sundanese, Javanese, and Malay; and introduce IndoRobusta, a framework to evaluate and improve the code-mixing robustness. Our analysis shows that the pre-training corpus bias affects the model's ability to better handle Indonesian-English code-mixing when compared to other local languages, despite having higher language diversity.

InterPrompt: Interpretable Prompting for Interrelated Interpersonal Risk Factors in Reddit Posts

  • paper_url: http://arxiv.org/abs/2311.12404
  • repo_url: None
  • paper_authors: MSVPJ Sathvik, Surjodeep Sarkar, Chandni Saxena, Sunghwan Sohn, Muskan Garg
  • for: AI-supported prediction and early detection of mental health disorders driven by interpersonal risk factors (IRFs) in personal narratives: Thwarted Belongingness (TBe) and Perceived Burdensomeness (PBu).
  • methods: N-shot learning with GPT-3 on an IRF dataset, plus an Interpretable Prompting (InterPrompt) method that fine-tunes GPT-3 to boost the attention mechanism and capture the context-specific sensitivity and interconnectedness of textual cues representing both IRFs.
  • results: All four GPT-3 variants fine-tuned with InterPrompt perform considerably better than baseline methods in both classification and explanation generation, improving system-level explainability and trustworthiness.
    Abstract Mental health professionals and clinicians have observed the upsurge of mental disorders due to Interpersonal Risk Factors (IRFs). To simulate the human-in-the-loop triaging scenario for early detection of mental health disorders, we recognized textual indications to ascertain these IRFs : Thwarted Belongingness (TBe) and Perceived Burdensomeness (PBu) within personal narratives. In light of this, we use N-shot learning with GPT-3 model on the IRF dataset, and underscored the importance of fine-tuning GPT-3 model to incorporate the context-specific sensitivity and the interconnectedness of textual cues that represent both IRFs. In this paper, we introduce an Interpretable Prompting (InterPrompt)} method to boost the attention mechanism by fine-tuning the GPT-3 model. This allows a more sophisticated level of language modification by adjusting the pre-trained weights. Our model learns to detect usual patterns and underlying connections across both the IRFs, which leads to better system-level explainability and trustworthiness. The results of our research demonstrate that all four variants of GPT-3 model, when fine-tuned with InterPrompt, perform considerably better as compared to the baseline methods, both in terms of classification and explanation generation.

A Survey of Graph Meets Large Language Model: Progress and Future Directions

  • paper_url: http://arxiv.org/abs/2311.12399
  • repo_url: https://github.com/yhLeeee/Awesome-LLMs-in-Graph-tasks
  • paper_authors: Yuhan Li, Zhixun Li, Peisong Wang, Jia Li, Xiangguo Sun, Hong Cheng, Jeffrey Xu Yu
  • for: A comprehensive review and analysis of existing methods that integrate large language models (LLMs) with graphs, organized under a newly proposed taxonomy.
  • methods: Existing methods are systematically surveyed and grouped into three categories based on the role the LLM plays in graph-related tasks: enhancer, predictor, or alignment component.
  • results: The survey discusses remaining limitations of existing studies, highlights promising avenues for future research, and maintains a continuously updated list of relevant papers at https://github.com/yhLeeee/Awesome-LLMs-in-Graph-tasks.
    Abstract Graph plays a significant role in representing and analyzing complex relationships in real-world applications such as citation networks, social networks, and biological data. Recently, Large Language Models (LLMs), which have achieved tremendous success in various domains, have also been leveraged in graph-related tasks to surpass traditional Graph Neural Networks (GNNs) based methods and yield state-of-the-art performance. In this survey, we first present a comprehensive review and analysis of existing methods that integrate LLMs with graphs. First of all, we propose a new taxonomy, which organizes existing methods into three categories based on the role (i.e., enhancer, predictor, and alignment component) played by LLMs in graph-related tasks. Then we systematically survey the representative methods along the three categories of the taxonomy. Finally, we discuss the remaining limitations of existing studies and highlight promising avenues for future research. The relevant papers are summarized and will be consistently updated at: https://github.com/yhLeeee/Awesome-LLMs-in-Graph-tasks.

Problems of Non-equivalent Words in Technical Translation

  • paper_url: http://arxiv.org/abs/2311.12395
  • repo_url: None
  • paper_authors: Mohammad Ibrahim Qani
  • for: The issue of non-equivalent words in translation, specifically from English to Russian; the authors aim to provide solutions and rules for rendering such words accurately in the target language.
  • methods: A combination of linguistic analysis and examples illustrating the challenges of translating non-equivalent words, with suggestions for overcoming them and finding appropriate equivalents in the target language.
  • results: The paper highlights the importance of understanding the cultural and historical context of non-equivalent words in order to translate them accurately, and provides common non-equivalent words with their Russian equivalents, useful for translators and linguists working in this field.
    Abstract Translating words which do not have equivalent in target language is not easy and finding proper equivalent of those words are very important to render correctly and understandably, the article defines some thoughts and ideas of scientists on the common problems of non-equivalent words from English to Russian language and includes English and Russian examples and ideas of certain scientist. The English language is worldwide spoken and there are 1.35 billion English speakers and over 258 million Russian speakers according to the 2021s statistics. Inevitably, these billions of speakers around the world have connection and they may have deal in different criteria. In order to understand one another they need to have a pure and fully-understood language. These pure languages understanding directly relates to translation knowledge where linguists and translators need to work and research to eradicate misunderstanding. Misunderstandings mostly appear in non-equivalent words because there are different local and internal words like food, garment, cultural and traditional words and others in every notion. Truly, most of these words do not have equivalent in the target language and these words need to be worked and find their equivalent in the target language to fully understand the both languages. However, some of these non-equivalent words are already professionally rendered to the target language but still there many other words to be rendered. Hence, this research paper includes different ways and rules of rendering non-equivalent words from source language to the target language.

The Obscure Limitation of Modular Multilingual Language Models

  • paper_url: http://arxiv.org/abs/2311.12375
  • repo_url: None
  • paper_authors: Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Ayu Purwarianti
  • for: Exposing the limitation of modular multilingual language models (MLMs) in multilingual inference scenarios with unknown languages, where a language identification (LID) module must be involved.
  • methods: Adding an LID module to the evaluation of existing modular MLMs, whose involvement prior evaluations excluded, obscuring real-case multilingual performance.
  • results: Including LID changes the measured multilingual performance of modular MLMs, and a performance gap remains due to the pipelined combination of LID and modular MLMs; the paper discusses how to close it.
    Abstract We expose the limitation of modular multilingual language models (MLMs) in multilingual inference scenarios with unknown languages. Existing evaluations of modular MLMs exclude the involvement of language identification (LID) modules, which obscures the performance of real-case multilingual scenarios of modular MLMs. In this work, we showcase the effect of adding LID on the multilingual evaluation of modular MLMs and provide discussions for closing the performance gap of caused by the pipelined approach of LID and modular MLMs.

Beyond Turing: A Comparative Analysis of Approaches for Detecting Machine-Generated Text

  • paper_url: http://arxiv.org/abs/2311.12373
  • repo_url: None
  • paper_authors: Muhammad Farid Adilazuarda, Nikolaos Nektarios Arkoulis, Oleksii Chumakov
  • for: Evaluating three approaches to distinguishing human-written from machine-generated text: traditional shallow learning, language model (LM) fine-tuning, and multilingual model fine-tuning.
  • methods: The three approaches are rigorously tested on a wide range of machine-generated texts, providing a benchmark of their competence at separating human-authored and machine-authored linguistic constructs.
  • results: Performance differs considerably across methods, underscoring the continued need for advancement in this crucial area of NLP.
    Abstract Significant progress has been made on text generation by pre-trained language models (PLMs), yet distinguishing between human and machine-generated text poses an escalating challenge. This paper offers an in-depth evaluation of three distinct methods used to address this task: traditional shallow learning, Language Model (LM) fine-tuning, and Multilingual Model fine-tuning. These approaches are rigorously tested on a wide range of machine-generated texts, providing a benchmark of their competence in distinguishing between human-authored and machine-authored linguistic constructs. The results reveal considerable differences in performance across methods, thus emphasizing the continued need for advancement in this crucial area of NLP. This study offers valuable insights and paves the way for future research aimed at creating robust and highly discriminative models.
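
As a concrete instance of the "traditional shallow learning" family the paper evaluates (the exact features and classifier are assumptions, not the paper's specification), a character n-gram TF-IDF detector with logistic regression looks like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; a real study would use a large labeled corpus.
texts = ["an example human-written sentence.", "a sampled model continuation."]
labels = [0, 1]  # 0 = human, 1 = machine

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), max_features=50000),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict_proba(["another passage to score"])[:, 1])  # P(machine)
```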

Utilizing Language Models for Tour Itinerary Recommendation

  • paper_url: http://arxiv.org/abs/2311.12355
  • repo_url: None
  • paper_authors: Ngai Lam Ho, Kwan Hui Lim
  • for: Researchers and practitioners in operations research and recommendation systems, specifically those interested in tour itinerary recommendation and planning.
  • methods: Language models, specifically Word2Vec and GloVe for learning POI embeddings, and transformer-based techniques like BERT for generating itineraries.
  • results: A discussion of the effectiveness of these approaches in recommending personalized POIs relevant to users and planning them as an itinerary that satisfies various constraints.
    Abstract Tour itinerary recommendation involves planning a sequence of relevant Point-of-Interest (POIs), which combines challenges from the fields of both Operations Research (OR) and Recommendation Systems (RS). As an OR problem, there is the need to maximize a certain utility (e.g., popularity of POIs in the tour) while adhering to some constraints (e.g., maximum time for the tour). As a RS problem, it is heavily related to problem or filtering or ranking a subset of POIs that are relevant to a user and recommending it as part of an itinerary. In this paper, we explore the use of language models for the task of tour itinerary recommendation and planning. This task has the unique requirement of recommending personalized POIs relevant to users and planning these POIs as an itinerary that satisfies various constraints. We discuss some approaches in this area, such as using word embedding techniques like Word2Vec and GloVe for learning POI embeddings and transformer-based techniques like BERT for generating itineraries.
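
To make the POI-embedding idea concrete, here is a minimal sketch that treats each user's visit sequence as a "sentence" of POI identifiers and trains Word2Vec over them; the sequences, identifiers, and hyperparameters are illustrative, not from the paper.

```python
from gensim.models import Word2Vec

visit_sequences = [
    ["museum_12", "cafe_3", "park_7", "gallery_9"],
    ["park_7", "gallery_9", "museum_12"],
]

model = Word2Vec(sentences=visit_sequences, vector_size=64, window=3,
                 min_count=1, sg=1, epochs=50)

# POIs that are close in embedding space tend to co-occur in real visit
# sequences, which is the signal an itinerary recommender can rank against.
print(model.wv.most_similar("park_7", topn=2))
```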

Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey

  • paper_url: http://arxiv.org/abs/2311.12351
  • repo_url: https://github.com/strivin0311/long-llms-learning
  • paper_authors: Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma
  • for: Surveying advances in model architecture for Transformer-based large language models (LLMs) that optimize long-context capabilities across all stages from pre-training to inference, since many current LLMs are pre-trained on shorter texts and handle the longer-context prompts common in real-world settings less effectively.
  • methods: The survey first delineates and analyzes the problems of handling long-context input and output with current Transformer-based models, then offers a holistic taxonomy of architecture upgrades that address them.
  • results: An investigation of evaluation necessities tailored to long-context LLMs (datasets, metrics, and baseline models), a review of optimization toolkits (libraries, systems, and compilers) that improve LLM efficiency and efficacy, and a discussion of predominant challenges and future research avenues; a curated literature repository with real-time updates is maintained at https://github.com/Strivin0311/long-llms-learning.
    Abstract With the bomb ignited by ChatGPT, Transformer-based Large Language Models (LLMs) have paved a revolutionary path toward Artificial General Intelligence (AGI) and have been applied in diverse areas as knowledge bases, human interfaces, and dynamic agents. However, a prevailing limitation exists: many current LLMs, constrained by resources, are primarily pre-trained on shorter texts, rendering them less effective for longer-context prompts, commonly encountered in real-world settings. In this paper, we present a comprehensive survey focusing on the advancement of model architecture in Transformer-based LLMs to optimize long-context capabilities across all stages from pre-training to inference. We firstly delineate and analyze the problems of handling long-context input and output with the current Transformer-based models. Then, we mainly offer a holistic taxonomy to navigate the landscape of Transformer upgrades on architecture to solve these problems. Afterward, we provide the investigation on wildly used evaluation necessities tailored for long-context LLMs, including datasets, metrics, and baseline models, as well as some amazing optimization toolkits like libraries, systems, and compilers to augment LLMs' efficiency and efficacy across different stages. Finally, we further discuss the predominant challenges and potential avenues for future research in this domain. Additionally, we have established a repository where we curate relevant literature with real-time updates at https://github.com/Strivin0311/long-llms-learning.

Modeling Political Orientation of Social Media Posts: An Extended Analysis

  • paper_url: http://arxiv.org/abs/2311.12323
  • repo_url: None
  • paper_authors: Sadia Kamal, Brenner Little, Jade Gullic, Trevor Harms, Kristin Olofsson, Arunkumar Bagavathi
  • for: Characterizing political polarization on online social media platforms at the level of the posts themselves, rather than only at the community level.
  • methods: Two heuristic labeling methods that leverage news media bias and post content to label the political orientation of social media posts.
  • results: The heuristics yield high-quality labeled data, and existing machine learning models, in both traditional supervised learning and few-shot learning setups, show improved performance in predicting the political orientation of posts collected from two forums with diverse political ideologies, Gab and Twitter.
    Abstract Developing machine learning models to characterize political polarization on online social media presents significant challenges. These challenges mainly stem from various factors such as the lack of annotated data, presence of noise in social media datasets, and the sheer volume of data. The common research practice typically examines the biased structure of online user communities for a given topic or qualitatively measuring the impacts of polarized topics on social media. However, there is limited work focusing on analyzing polarization at the ground-level, specifically in the social media posts themselves. Such existing analysis heavily relies on annotated data, which often requires laborious human labeling, offers labels only to specific problems, and lacks the ability to determine the near-future bias state of a social media conversations. Understanding the degree of political orientation conveyed in social media posts is crucial for quantifying the bias of online user communities and investigating the spread of polarized content. In this work, we first introduce two heuristic methods that leverage on news media bias and post content to label social media posts. Next, we compare the efficacy and quality of heuristically labeled dataset with a randomly sampled human-annotated dataset. Additionally, we demonstrate that current machine learning models can exhibit improved performance in predicting political orientation of social media posts, employing both traditional supervised learning and few-shot learning setups. We conduct experiments using the proposed heuristic methods and machine learning approaches to predict the political orientation of posts collected from two social media forums with diverse political ideologies: Gab and Twitter.
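
The abstract does not spell out the two heuristics; as a hypothetical example of a news-media-bias heuristic of the kind described (the domain names and bias scores below are invented, and the paper's actual rules may differ), a post can inherit the mean bias rating of the outlets it links to:

```python
from urllib.parse import urlparse

# Hypothetical outlet bias ratings in [-1, 1] (left to right).
OUTLET_BIAS = {"examplenews.com": -0.8, "dailysample.org": 0.9}

def label_post_by_links(urls, threshold=0.3):
    """Label a post by the mean bias rating of the news outlets it links to."""
    scores = []
    for url in urls:
        domain = urlparse(url).netloc.removeprefix("www.")
        if domain in OUTLET_BIAS:
            scores.append(OUTLET_BIAS[domain])
    if not scores:
        return "unlabeled"
    mean = sum(scores) / len(scores)
    if mean <= -threshold:
        return "left"
    if mean >= threshold:
        return "right"
    return "center"

print(label_post_by_links(["https://www.examplenews.com/story/123"]))  # left
```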

AcademicGPT: Empowering Academic Research

  • paper_url: http://arxiv.org/abs/2311.12315
  • repo_url: None
  • paper_authors: Shufa Wei, Xiaolong Xu, Xianbiao Qi, Xi Yin, Jun Xia, Jingyi Ren, Peijun Tang, Yuxiang Zhong, Yihao Chen, Xiaoqin Ren, Yuxin Liang, Liankai Huang, Kai Xie, Weikang Gui, Wei Tan, Shuanglong Sun, Yongquan Hu, Qinxian Liu, Nanjin Li, Chihao Dai, Lihua Wang, Xiaohui Liu, Lei Zhang, Yutao Xie
  • for: Introducing AcademicGPT, a large language model designed specifically to empower academic research.
  • methods: A continual-training model derived from LLaMA2-70B, trained on a corpus consisting mainly of academic papers, theses, content from academic domains, high-quality Chinese data, and other sources.
  • results: Evaluations on established public benchmarks (MMLU and CEval) and on specialized academic benchmarks (PubMedQA, SCIEval, and the newly created ComputerScienceQA) demonstrate the model's general knowledge, Chinese, and academic abilities; several academic applications were built on its foundation model, including General Academic Question Answering, AI-assisted Paper Reading, Paper Review, and AI-assisted Title and Abstract Generation.
    Abstract Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks. Yet, many of these advanced LLMs are tailored for broad, general-purpose applications. In this technical report, we introduce AcademicGPT, designed specifically to empower academic research. AcademicGPT is a continual training model derived from LLaMA2-70B. Our training corpus mainly consists of academic papers, thesis, content from some academic domain, high-quality Chinese data and others. While it may not be extensive in data scale, AcademicGPT marks our initial venture into a domain-specific GPT tailored for research area. We evaluate AcademicGPT on several established public benchmarks such as MMLU and CEval, as well as on some specialized academic benchmarks like PubMedQA, SCIEval, and our newly-created ComputerScienceQA, to demonstrate its ability from general knowledge ability, to Chinese ability, and to academic ability. Building upon AcademicGPT's foundation model, we also developed several applications catered to the academic area, including General Academic Question Answering, AI-assisted Paper Reading, Paper Review, and AI-assisted Title and Abstract Generation.

Enabling On-Device Large Language Model Personalization with Self-Supervised Data Selection and Synthesis

  • paper_url: http://arxiv.org/abs/2311.12275
  • repo_url: None
  • paper_authors: Ruiyang Qin, Jun Xia, Zhenge Jia, Meng Jiang, Ahmed Abbasi, Peipei Zhou, Jingtong Hu, Yiyu Shi
  • for: Enabling on-device personalization of large language models (LLMs) on edge devices, where user annotations are necessarily sparse, on-device storage is limited, and uploading sensitive user data to the cloud is undesirable.
  • methods: A novel self-supervised framework that selects and stores the most representative user-generated data online with a small memory footprint, requests user annotations only infrequently, and enhances fine-tuning quality by using the LLM to generate multiple semantically similar pairs of question texts and expected responses.
  • results: Experiments show the best user-specific content-generating capability (accuracy) and fine-tuning speed (performance) compared with vanilla baselines; to the authors' knowledge, this is the first on-device LLM personalization framework.
    Abstract After a large language model (LLM) is deployed on edge devices, it is desirable for these devices to learn from user-generated conversation data to generate user-specific and personalized responses in real-time. However, user-generated data usually contains sensitive and private information, and uploading such data to the cloud for annotation is not preferred if not prohibited. While it is possible to obtain annotation locally by directly asking users to provide preferred responses, such annotations have to be sparse to not affect user experience. In addition, the storage of edge devices is usually too limited to enable large-scale fine-tuning with full user-generated data. It remains an open question how to enable on-device LLM personalization, considering sparse annotation and limited on-device storage. In this paper, we propose a novel framework to select and store the most representative data online in a self-supervised way. Such data has a small memory footprint and allows infrequent requests of user annotations for further fine-tuning. To enhance fine-tuning quality, multiple semantically similar pairs of question texts and expected responses are generated using the LLM. Our experiments show that the proposed framework achieves the best user-specific content-generating capability (accuracy) and fine-tuning speed (performance) compared with vanilla baselines. To the best of our knowledge, this is the very first on-device LLM personalization framework.
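
The paper's self-supervised selection criterion is not detailed in the abstract; as one plausible illustration (an assumption, not the authors' method), the sketch below keeps a small representative buffer by retaining the user utterances closest to k-means centroids of their embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representative(embeddings, texts, budget=8, seed=0):
    """Keep the one utterance nearest each of `budget` k-means centroids."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(embeddings)
    keep = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c],
                               axis=1)
        keep.append(texts[members[np.argmin(dists)]])
    return keep  # small buffer to store and, occasionally, ask the user to label
```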