cs.CL - 2023-11-08

Deep Learning Brasil at ABSAPT 2022: Portuguese Transformer Ensemble Approaches

  • paper_url: http://arxiv.org/abs/2311.05051
  • repo_url: https://github.com/ju-resplande/dlb_absapt2022
  • paper_authors: Juliana Resplande Santanna Gomes, Eduardo Augusto Santos Garcia, Adalberto Ferreira Barbosa Junior, Ruan Chaves Rodrigues, Diogo Fernandes Costa Silva, Dyonnatan Ferreira Maia, Nádia Félix Felipe da Silva, Arlindo Rodrigues Galvão Filho, Anderson da Silva Soares
  • for: The paper addresses Aspect-Based Sentiment Analysis (ABSA), whose objective is to classify the individual sentiment polarity of each aspect in a sentence.
  • methods: The ABSA task is decomposed into two subtasks, Aspect Term Extraction (ATE) and Sentiment Orientation Extraction (SOE), tackled with Portuguese transformer ensembles.
  • results: The authors submitted the best-performing systems at IberLEF 2022, achieving new state-of-the-art results on both subtasks.
    Abstract Aspect-based Sentiment Analysis (ABSA) is a task whose objective is to classify the individual sentiment polarity of all entities, called aspects, in a sentence. The task is composed of two subtasks: Aspect Term Extraction (ATE), identifying all aspect terms in a sentence; and Sentiment Orientation Extraction (SOE), where, given a sentence and its aspect terms, the task is to determine the sentiment polarity of each aspect term (positive, negative or neutral). In this article we present our participation in Aspect-Based Sentiment Analysis in Portuguese (ABSAPT) 2022 at IberLEF 2022. We submitted the best performing systems, achieving new state-of-the-art results on both subtasks.

DeepLearningBrasil@LT-EDI-2023: Exploring Deep Learning Techniques for Detecting Depression in Social Media Text

  • paper_url: http://arxiv.org/abs/2311.05047
  • repo_url: None
  • paper_authors: Eduardo Garcia, Juliana Gomes, Adalberto Barbosa Júnior, Cardeque Borges, Nádia da Silva
  • for: The paper describes the strategy that secured the DeepLearningBrasil team first place in the DepSign-LT-EDI@RANLP-2023 shared task: classifying social media texts into three levels of depression.
  • methods: RoBERTa and DeBERTa models were further pre-trained on a curated Reddit mental-health dataset to sharpen their grasp of nuanced mental-health language. Long texts were truncated to their beginnings and endings, sample weights in the loss handled class imbalance, and cross-validation with ensembling combined the k-fold trained models.
  • results: The approach achieved a 47.0% Macro F1-Score, a 2.4% advantage over the runner-up.
    Abstract In this paper, we delineate the strategy employed by our team, DeepLearningBrasil, which secured us the first place in the shared task DepSign-LT-EDI@RANLP-2023, achieving a 47.0% Macro F1-Score and a notable 2.4% advantage. The task was to classify social media texts into three distinct levels of depression - "not depressed," "moderately depressed," and "severely depressed." Leveraging the power of the RoBERTa and DeBERTa models, we further pre-trained them on a collected Reddit dataset, specifically curated from mental health-related Reddit's communities (Subreddits), leading to an enhanced understanding of nuanced mental health discourse. To address lengthy textual data, we used truncation techniques that retained the essence of the content by focusing on its beginnings and endings. Our model was robust against unbalanced data by incorporating sample weights into the loss. Cross-validation and ensemble techniques were then employed to combine our k-fold trained models, delivering an optimal solution. The accompanying code is made available for transparency and further development.
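
Since the abstract describes keeping the beginnings and endings of long posts, a minimal sketch of such head-and-tail truncation is shown below; the 512-token budget and the head/tail split are illustrative assumptions, not values from the paper.

```python
# Head-and-tail truncation: keep the first `head` and last `tail` tokens of
# an over-long input (budget and split sizes are assumed, not the paper's).
from transformers import AutoTokenizer

def truncate_head_tail(text: str, tokenizer, max_len: int = 512,
                       head: int = 256, tail: int = 254) -> list[int]:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if len(ids) + 2 > max_len:  # +2 for the <s>/</s> special tokens
        ids = ids[:head] + ids[-tail:]
    return tokenizer.build_inputs_with_special_tokens(ids)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
input_ids = truncate_head_tail("a very long Reddit post ...", tokenizer)
```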

First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.05020
  • repo_url: None
  • paper_authors: Naomi Saphra, Eve Fleisig, Kyunghyun Cho, Adam Lopez
  • for: The paper aims to provide guidance for NLP researchers in the wake of the success of ChatGPT and other large language models (LLMs), and to identify areas where NLP researchers can continue to make meaningful contributions.
  • methods: The paper takes a historical lens, looking back at the first era of LLMs, which began in 2005 with large $n$-gram models for machine translation, to identify durable lessons and evergreen problems in NLP research.
  • results: The paper argues that disparities in scale are transient, that data is still a bottleneck for many meaningful applications, that meaningful evaluation informed by actual use is still an open problem, and that there is still room for speculative approaches in NLP research.
    Abstract Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large $n$-gram models for machine translation. We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. Among these lessons, we discuss the primacy of hardware advancement in shaping the availability and importance of scale, as well as the urgent challenge of quality evaluation, both automated and human. We argue that disparities in scale are transient and that researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many meaningful applications; that meaningful evaluation informed by actual use is still an open problem; and that there is still room for speculative approaches.

On the steerability of large language models toward data-driven personas

  • paper_url: http://arxiv.org/abs/2311.04978
  • repo_url: None
  • paper_authors: Junyi Li, Ninareh Mehrabi, Charith Peris, Palash Goyal, Kai-Wei Chang, Aram Galstyan, Richard Zemel, Rahul Gupta
  • for: The paper aims to align language models with specific personas, be they groups of users or individuals, so that the models can accommodate a broad spectrum of perspectives.
  • methods: A data-driven persona definition built on collaborative filtering: users are embedded into a continuous vector space based on their opinions and clustered into cohorts that hold coherent views across specific inquiries. An efficient steering method then learns a soft-prompting model that maps a user's continuous representation into a sequence of virtual tokens prepended to the LLM input.
  • results: The steering model outperforms a collection of baselines at producing responses aligned with a given user.
    Abstract The recent surge in Large Language Model (LLM) related applications has led to a concurrent escalation in expectations for LLMs to accommodate a myriad of personas and encompass a broad spectrum of perspectives. An important first step towards addressing this demand is to align language models with specific personas, be it groups of users or individuals. Towards this goal, we first present a new conceptualization of a persona. Moving beyond the traditional reliance on demographics like age, gender, or political party affiliation, we introduce a data-driven persona definition methodology built on collaborative-filtering. In this methodology, users are embedded into a continuous vector space based on their opinions and clustered into cohorts that manifest coherent views across specific inquiries. This methodology allows for a more nuanced understanding of different latent social groups present in the overall population (as opposed to simply using demographic groups) and enhances the applicability of model steerability. Finally, we present an efficient method to steer LLMs towards a particular persona. We learn a soft-prompting model to map the continuous representation of users into sequences of virtual tokens which, when prepended to the LLM input, enables the LLM to produce responses aligned with a given user. Our results show that our steerability algorithm is superior in performance compared to a collection of baselines.
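
A minimal sketch of the soft-prompting idea the abstract describes: a small network maps a continuous user embedding to virtual token embeddings that are prepended to the LLM's input embeddings. The dimensions and projection architecture are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class PersonaPrompter(nn.Module):
    """Map a continuous user representation to virtual prompt embeddings."""
    def __init__(self, user_dim: int, n_virtual: int, hidden_dim: int):
        super().__init__()
        self.n_virtual, self.hidden_dim = n_virtual, hidden_dim
        self.proj = nn.Sequential(nn.Linear(user_dim, n_virtual * hidden_dim),
                                  nn.Tanh())

    def forward(self, user_vec: torch.Tensor,
                input_embeds: torch.Tensor) -> torch.Tensor:
        # user_vec: (batch, user_dim); input_embeds: (batch, seq, hidden_dim)
        virtual = self.proj(user_vec).view(-1, self.n_virtual, self.hidden_dim)
        return torch.cat([virtual, input_embeds], dim=1)  # prepend

prompter = PersonaPrompter(user_dim=64, n_virtual=10, hidden_dim=768)
embeds = prompter(torch.randn(2, 64), torch.randn(2, 20, 768))  # (2, 30, 768)
```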

How Abstract Is Linguistic Generalization in Large Language Models? Experiments with Argument Structure

  • paper_url: http://arxiv.org/abs/2311.04900
  • repo_url: https://github.com/clay-lab/structural-alternations
  • paper_authors: Michael Wilson, Jackson Petty, Robert Frank
  • for: The paper investigates whether large language models (LLMs) represent relationships between contexts encoded in linguistic knowledge, focusing on argument structure.
  • methods: Pre-trained Transformer-based LLMs are tested on how well they generalize the distribution of a novel noun argument between related contexts.
  • results: LLMs generalize well between related contexts seen during pre-training, but fail between related contexts unseen during pre-training that instantiate more abstract structural generalizations; there they fall back on linear order, pointing to a limitation of current models.
    Abstract Language models are typically evaluated on their success at predicting the distribution of specific words in specific contexts. Yet linguistic knowledge also encodes relationships between contexts, allowing inferences between word distributions. We investigate the degree to which pre-trained Transformer-based large language models (LLMs) represent such relationships, focusing on the domain of argument structure. We find that LLMs perform well in generalizing the distribution of a novel noun argument between related contexts that were seen during pre-training (e.g., the active object and passive subject of the verb spray), succeeding by making use of the semantically-organized structure of the embedding space for word embeddings. However, LLMs fail at generalizations between related contexts that have not been observed during pre-training, but which instantiate more abstract, but well-attested structural generalizations (e.g., between the active object and passive subject of an arbitrary verb). Instead, in this case, LLMs show a bias to generalize based on linear order. This finding points to a limitation with current models and points to a reason for which their training is data-intensive. The materials reported here are available at https://github.com/clay-lab/structural-alternations.

Future Lens: Anticipating Subsequent Tokens from a Single Hidden State

  • paper_url: http://arxiv.org/abs/2311.04897
  • repo_url: https://github.com/KoyenaPal/future-lens
  • paper_authors: Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace, David Bau
  • for: The paper tests whether the hidden state vector for a single input token in a Transformer encodes enough information to predict several tokens ahead.
  • methods: Linear approximation and causal intervention methods are applied to GPT-J-6B to measure how much signal individual hidden states carry about future hidden states and, ultimately, token outputs.
  • results: At some layers, a single hidden state predicts subsequent tokens with more than 48% accuracy relative to the model's own predictions. A "Future Lens" visualization built on these methods offers a new view of transformer states.
    Abstract We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead. More concretely, in this paper we ask: Given a hidden (internal) representation of a single token at position $t$ in an input, can we reliably anticipate the tokens that will appear at positions $\geq t + 2$? To test this, we measure linear approximation and causal intervention methods in GPT-J-6B to evaluate the degree to which individual hidden states in the network contain signal rich enough to predict future hidden states and, ultimately, token outputs. We find that, at some layers, we can approximate a model's output with more than 48% accuracy with respect to its prediction of subsequent tokens through a single hidden state. Finally we present a "Future Lens" visualization that uses these methods to create a new view of transformer states.
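
A minimal sketch of a linear future-token probe in the spirit of the abstract: regress from the hidden state at position t onto the vocabulary distribution of the token at t+2. The model (GPT-2 instead of GPT-J-6B), the probed layer, and the probe design are assumptions for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2",
                                             output_hidden_states=True).eval()
probe = torch.nn.Linear(model.config.n_embd, model.config.vocab_size)

ids = tokenizer("The quick brown fox jumps over the lazy dog",
                return_tensors="pt")["input_ids"]
with torch.no_grad():
    hidden = model(ids).hidden_states[6]  # (1, seq, n_embd) at layer 6

inputs = hidden[0, :-2]   # states that have a gold token two positions ahead
targets = ids[0, 2:]      # the token at t+2 for the state at t
loss = torch.nn.functional.cross_entropy(probe(inputs), targets)
loss.backward()           # one illustrative probe-training step
```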

Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs

  • paper_url: http://arxiv.org/abs/2311.04892
  • repo_url: https://github.com/allenai/persona-bias
  • paper_authors: Shashank Gupta, Vaishnavi Shrivastava, Ameet Deshpande, Ashwin Kalyan, Peter Clark, Ashish Sabharwal, Tushar Khot
  • for: This paper aims to study the unintended side effects of assigning personas to large-scale language models (LLMs) and how it affects their ability to perform basic reasoning tasks.
  • methods: The paper uses ChatGPT, a popular LLM, and experiments with 24 reasoning datasets and 16 diverse personas that span five socio-demographic groups: race, gender, religion, disability, and political affiliation.
  • results: The study finds that ChatGPT exhibits deep-rooted biases against various socio-demographic groups, resulting in a substantial drop in performance on reasoning tasks. The biases are ubiquitous, significant, and can be especially harmful for certain groups. Further analysis shows that these persona-induced errors can be hard to discern and avoid.
    Abstract Recent works have showcased the ability of large-scale language models (LLMs) to embody diverse personas in their responses, exemplified by prompts like 'You are Yoda. Explain the Theory of Relativity.' While this ability allows personalization of LLMs and enables human behavior simulation, its effect on LLMs' capabilities remain unclear. To fill this gap, we present the first extensive study of the unintended side-effects of persona assignment on the ability of LLMs, specifically ChatGPT, to perform basic reasoning tasks. Our study covers 24 reasoning datasets and 16 diverse personas spanning 5 socio-demographic groups: race, gender, religion, disability, and political affiliation. Our experiments unveil that ChatGPT carries deep rooted bias against various socio-demographics underneath a veneer of fairness. While it overtly rejects stereotypes when explicitly asked ('Are Black people less skilled at mathematics?'), it manifests stereotypical and often erroneous presumptions when prompted to answer questions while taking on a persona. These can be observed as abstentions in the model responses, e.g., 'As a Black person, I am unable to answer this question as it requires math knowledge', and generally result in a substantial drop in performance on reasoning tasks. We find that this inherent deep bias is ubiquitous - 80% of our personas demonstrated bias; it is significant - certain datasets had relative drops in performance of 70%+; and can be especially harmful for certain groups - certain personas had stat. sign. drops on more than 80% of the datasets. Further analysis shows that these persona-induced errors can be hard-to-discern and hard-to-avoid. Our findings serve as a cautionary tale that the practice of assigning personas to LLMs - a trend on the rise - can surface their deep-rooted biases and have unforeseeable and detrimental side-effects.

Profiling Irony & Stereotype: Exploring Sentiment, Topic, and Lexical Features

  • paper_url: http://arxiv.org/abs/2311.04885
  • repo_url: None
  • paper_authors: Tibor L. R. Krols, Marie Mortensen, Ninell Oldenburg
  • for: The study builds a system for detecting irony in tweets by Twitter users.
  • methods: TF-IDF and topic models are combined with lexical and sentiment features, with sub-features chosen through a thorough feature-selection process.
  • results: The model reaches an F1-score of 0.84, above the baseline. Lexical features, especially TF-IDF, contribute the most to performance, while sentiment and topic-modeling features contribute less.
    Abstract Social media has become a very popular source of information. With this popularity comes an interest in systems that can classify the information produced. This study tries to create such a system, detecting irony in Twitter users. Recent work emphasizes the importance of lexical features, sentiment features and the contrast herein, along with TF-IDF and topic models. Based on a thorough feature selection process, the resulting model contains specific sub-features from these areas. Our model reaches an F1-score of 0.84, which is above the baseline. We find that lexical features, especially TF-IDF, contribute the most to our models while sentiment and topic modeling features contribute less to overall performance. Lastly, we highlight multiple interesting and important paths for further exploration.
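
A minimal sketch of a TF-IDF based irony classifier along the lines the abstract describes; the classifier choice and toy data are illustrative assumptions, and the paper's full feature set also includes sentiment and topic-model features not shown here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["Oh great, another Monday...", "Lovely weather today!"]
labels = [1, 0]  # 1 = ironic, 0 = not ironic (toy data)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(tweets, labels)
print(clf.predict(["Sure, because that always works."]))
```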

Hierarchically Gated Recurrent Neural Network for Sequence Modeling

  • paper_url: http://arxiv.org/abs/2311.04823
  • repo_url: https://github.com/opennlplab/hgrn
  • paper_authors: Zhen Qin, Songlin Yang, Yiran Zhong
  • for: The paper proposes the Hierarchically Gated Recurrent Neural Network (HGRN), a gated linear RNN that aims to combine efficiency with strong sequence modeling.
  • methods: Forget gates are added to the linear recurrence, lower bounded by a learnable value that increases monotonically when moving up layers, so upper layers model long-term dependencies while lower layers model more local, short-term ones.
  • results: Experiments on language modeling, image classification, and Long Range Arena benchmarks demonstrate the efficiency and effectiveness of the model.
    Abstract Transformers have surpassed RNNs in popularity due to their superior abilities in parallel training and long-term dependency modeling. Recently, there has been a renewed interest in using linear RNNs for efficient sequence modeling. These linear RNNs often employ gating mechanisms in the output of the linear recurrence layer while ignoring the significance of using forget gates within the recurrence. In this paper, we propose a gated linear RNN model dubbed Hierarchically Gated Recurrent Neural Network (HGRN), which includes forget gates that are lower bounded by a learnable value. The lower bound increases monotonically when moving up layers. This allows the upper layers to model long-term dependencies and the lower layers to model more local, short-term dependencies. Experiments on language modeling, image classification, and long-range arena benchmarks showcase the efficiency and effectiveness of our proposed model. The source code is available at https://github.com/OpenNLPLab/HGRN.
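
A minimal sketch of the core idea the abstract describes: forget gates whose lower bound is learnable and increases monotonically with layer depth. The cumulative-softmax parameterization below is one plausible way to enforce monotonicity and is an assumption, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class BoundedForgetGates(nn.Module):
    def __init__(self, n_layers: int, hidden_dim: int):
        super().__init__()
        # n_layers + 1 logits: the extra "slack" slot keeps all bounds < 1.
        self.gamma = nn.Parameter(torch.zeros(n_layers + 1))
        self.proj = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim)
                                  for _ in range(n_layers))

    def forward(self, layer: int, x: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(self.gamma, dim=0)
        bounds = torch.cumsum(probs, dim=0)[:-1]  # increasing, in (0, 1)
        lo = bounds[layer]
        raw = torch.sigmoid(self.proj[layer](x))
        return lo + (1.0 - lo) * raw  # forget gate confined to [lo, 1)

gates = BoundedForgetGates(n_layers=4, hidden_dim=8)
g = gates(2, torch.randn(1, 8))  # layer-2 gate, lower-bounded by bounds[2]
```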

Determination of toxic comments and unintended model bias minimization using Deep learning approach

  • paper_url: http://arxiv.org/abs/2311.04789
  • repo_url: None
  • paper_authors: Md Azim Khan
  • for: The study aims to detect toxic comments and reduce unintended bias concerning identity features such as race, gender, and religion.
  • methods: An attention-based model, BERT (Bidirectional Encoder Representations from Transformers), is fine-tuned with a weighted loss to address the imbalance in the training data.
  • results: The fine-tuned BERT model reaches 89% accuracy, against 57.1% for a Logistic Regression baseline with a TF-IDF vectorizer.
    Abstract Online conversations can be toxic and subjected to threats, abuse, or harassment. To identify toxic text comments, several deep learning and machine learning models have been proposed throughout the years. However, recent studies demonstrate that because of the imbalances in the training data, some models are more likely to show unintended biases including gender bias and identity bias. In this research, our aim is to detect toxic comments and reduce the unintended bias concerning identity features such as race, gender, sex, and religion by fine-tuning an attention-based model called BERT (Bidirectional Encoder Representations from Transformers). We apply a weighted loss to address the issue of unbalanced data and compare the performance of a fine-tuned BERT model with a traditional Logistic Regression model in terms of classification and bias minimization. The Logistic Regression model with the TF-IDF vectorizer achieves 57.1% accuracy, while the fine-tuned BERT model's accuracy is 89%. Code is available at https://github.com/zim10/Determine_Toxic_comment_and_identity_bias.git
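
A minimal sketch of the weighted-loss idea from the abstract: class weights passed to cross-entropy so the minority (toxic) class is not drowned out. The inverse-frequency weighting scheme is a common convention and an assumption, not necessarily the paper's exact choice.

```python
import torch
import torch.nn as nn

labels = torch.tensor([0, 0, 0, 0, 1])           # imbalanced toy batch
counts = torch.bincount(labels, minlength=2).float()
weights = counts.sum() / (2.0 * counts)          # inverse-frequency weights

criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(5, 2, requires_grad=True)   # stand-in for BERT outputs
loss = criterion(logits, labels)
loss.backward()
```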

Using large language models to study human memory for meaningful narratives

  • paper_url: http://arxiv.org/abs/2311.04742
  • repo_url: https://github.com/mkatkov/llm-narrative-analysis
  • paper_authors: Antonios Georgiou, Tankut Can, Mikhail Katkov, Misha Tsodyks
  • for: The paper studies human memory for meaningful material.
  • methods: Language models serve as a scientific instrument: the authors build a pipeline for designing large-scale memory experiments and analyzing the results, and run online experiments collecting recognition and recall data for narratives of different lengths.
  • results: Both recall and recognition performance scale linearly with narrative length. With scrambled versions of the stories, recall declines significantly while recognition remains largely unaffected, and recalls follow the original narrative order rather than the scrambled presentation, pointing to a contextual reconstruction of the story in memory.
    Abstract One of the most impressive achievements of the AI revolution is the development of large language models that can generate meaningful text and respond to instructions in plain English with no additional training necessary. Here we show that language models can be used as a scientific instrument for studying human memory for meaningful material. We developed a pipeline for designing large scale memory experiments and analyzing the obtained results. We performed online memory experiments with a large number of participants and collected recognition and recall data for narratives of different lengths. We found that both recall and recognition performance scale linearly with narrative length. Furthermore, in order to investigate the role of narrative comprehension in memory, we repeated these experiments using scrambled versions of the presented stories. We found that even though recall performance declined significantly, recognition remained largely unaffected. Interestingly, recalls in this condition seem to follow the original narrative order rather than the scrambled presentation, pointing to a contextual reconstruction of the story in memory.

Evaluating Generative Ad Hoc Information Retrieval

  • paper_url: http://arxiv.org/abs/2311.04694
  • repo_url: None
  • paper_authors: Lukas Gienapp, Harrisen Scells, Niklas Deckers, Janek Bevendorff, Shuai Wang, Johannes Kiesel, Shahbaz Syed, Maik Fröbe, Guido Zuccon, Benno Stein, Matthias Hagen, Martin Potthast
  • for: This paper aims to provide a foundation and new insights for the evaluation of generative ad hoc retrieval systems.
  • methods: The paper surveys the relevant information retrieval and natural language processing literature, identifies search tasks and system architectures in generative retrieval, and develops a corresponding user model.
  • results: The paper provides a theoretical analysis of generative ad hoc retrieval systems and studies its operationalization.
    Abstract Recent advances in large language models have enabled the development of viable generative information retrieval systems. A generative retrieval system returns a grounded generated text in response to an information need instead of the traditional document ranking. Quantifying the utility of these types of responses is essential for evaluating generative retrieval systems. As the established evaluation methodology for ranking-based ad hoc retrieval may seem unsuitable for generative retrieval, new approaches for reliable, repeatable, and reproducible experimentation are required. In this paper, we survey the relevant information retrieval and natural language processing literature, identify search tasks and system architectures in generative retrieval, develop a corresponding user model, and study its operationalization. This theoretical analysis provides a foundation and new insights for the evaluation of generative ad hoc retrieval systems.

Speech language models lack important brain-relevant semantics

  • paper_url: http://arxiv.org/abs/2311.04664
  • repo_url: None
  • paper_authors: Subba Reddy Oota, Emin Çelik, Fatma Deniz, Mariya Toneva
  • for: The study investigates what types of information language models truly predict in the brain.
  • methods: A direct approach: information related to specific low-level stimulus features (textual, speech, and visual) is removed from language-model representations, and the effect on alignment with fMRI recordings acquired while participants read versus listened to the same naturalistic stories is measured.
  • results: Text-based language models keep aligning with later language regions even after low-level features are removed, while speech-based models lose most of their alignment, suggesting that speech-based models can be further improved to better reflect brain-like language processing.
    Abstract Despite known differences between reading and listening in the brain, recent work has shown that text-based language models predict both text-evoked and speech-evoked brain activity to an impressive degree. This poses the question of what types of information language models truly predict in the brain. We investigate this question via a direct approach, in which we eliminate information related to specific low-level stimulus features (textual, speech, and visual) in the language model representations, and observe how this intervention affects the alignment with fMRI brain recordings acquired while participants read versus listened to the same naturalistic stories. We further contrast our findings with speech-based language models, which would be expected to predict speech-evoked brain activity better, provided they model language processing in the brain well. Using our direct approach, we find that both text-based and speech-based language models align well with early sensory regions due to shared low-level features. Text-based models continue to align well with later language regions even after removing these features, while, surprisingly, speech-based models lose most of their alignment. These findings suggest that speech-based models can be further improved to better reflect brain-like language processing.
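
A minimal sketch of one way to "remove" a low-level stimulus feature from model representations: predict the representations from the feature with ridge regression and keep the residuals. This residualization scheme is an assumption for illustration, not necessarily the authors' exact procedure.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
reps = rng.normal(size=(200, 768))     # model representations (time x dim)
feature = rng.normal(size=(200, 13))   # e.g., low-level acoustic features

ridge = Ridge(alpha=1.0).fit(feature, reps)
residual_reps = reps - ridge.predict(feature)  # feature-removed representations
```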

Massive Editing for Large Language Models via Meta Learning

  • paper_url: http://arxiv.org/abs/2311.04661
  • repo_url: https://github.com/chenmientan/malmen
  • paper_authors: Chenmien Tan, Ge Zhang, Jie Fu
  • for: The paper aims to rectify incorrect or outdated knowledge in large language models (LLMs) after training, at scale, via the proposed MAssive Language Model Editing Network (MALMEN).
  • methods: A hyper-network generates parameter shifts, and the aggregation of parameter shifts is formulated as a least-squares problem solved with the normal equation. Computation on the hyper-network and the LM is separated, allowing arbitrary batch sizes on both networks under limited memory budgets.
  • results: Evaluated on knowledge-intensive NLP tasks (closed-book fact-checking and question answering) across BERT-base, GPT-2, T5-XL (2.8B), and GPT-J (6B), MALMEN edits hundreds of times more facts than strong baselines with the same hyper-network architecture and outperforms an editor specifically designed for GPT.
    Abstract While large language models (LLMs) have enabled learning knowledge from the pre-training corpora, the acquired knowledge may be fundamentally incorrect or outdated over time, which necessitates rectifying the knowledge of the language model (LM) after the training. A promising approach involves employing a hyper-network to generate parameter shift, whereas existing hyper-networks suffer from inferior scalability in synchronous editing operation amount. To mitigate the problem, we propose the MAssive Language Model Editing Network (MALMEN), which formulates the parameter shift aggregation as the least square problem, subsequently updating the LM parameters using the normal equation. To accommodate editing multiple facts simultaneously with limited memory budgets, we separate the computation on the hyper-network and LM, enabling arbitrary batch size on both neural networks. Our method is evaluated by editing up to thousands of facts on LMs with different architectures, i.e., BERT-base, GPT-2, T5-XL (2.8B), and GPT-J (6B), across various knowledge-intensive NLP tasks, i.e., closed book fact-checking and question answering. Remarkably, MALMEN is capable of editing hundreds of times more facts than strong baselines with the identical hyper-network architecture and outperforms editor specifically designed for GPT. Our code is available at https://github.com/ChenmienTan/malmen.
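
A minimal sketch of the least-squares formulation the abstract mentions: aggregating several desired parameter shifts into one update via the normal equation. The shapes, the key/value framing, and the damping term are illustrative assumptions.

```python
import torch

d, m = 16, 8                 # parameter dimension, number of edits
K = torch.randn(m, d)        # per-edit input directions ("keys", assumed)
V = torch.randn(m, d)        # per-edit desired parameter shifts (assumed)

# Solve min_W ||K W - V||^2 via the (damped) normal equation:
#   W = (K^T K + lambda I)^{-1} K^T V
lam = 1e-2
W = torch.linalg.solve(K.T @ K + lam * torch.eye(d), K.T @ V)
```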

Investigating the Nature of Disagreements on Mid-Scale Ratings: A Case Study on the Abstractness-Concreteness Continuum

  • paper_url: http://arxiv.org/abs/2311.04563
  • repo_url: None
  • paper_authors: Urban Knupleš, Diego Frassinelli, Sabine Schulte im Walde
  • for: The study examines the reliability of linguistic rating norms, specifically the disagreement raters show on mid-scale words along the abstractness-concreteness continuum.
  • methods: Correlations and supervised classification identify salient multi-modal characteristics of mid-scale words, and hard clustering uncovers patterns of systematic disagreement across raters.
  • results: The findings suggest either fine-tuning or filtering mid-scale target words before utilising them.
    Abstract Humans tend to strongly agree on ratings on a scale for extreme cases (e.g., a CAT is judged as very concrete), but judgements on mid-scale words exhibit more disagreement. Yet, collected rating norms are heavily exploited across disciplines. Our study focuses on concreteness ratings and (i) implements correlations and supervised classification to identify salient multi-modal characteristics of mid-scale words, and (ii) applies a hard clustering to identify patterns of systematic disagreement across raters. Our results suggest to either fine-tune or filter mid-scale target words before utilising them.
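
A minimal sketch of hard clustering of raters by their rating patterns, in the spirit of the abstract; the toy data and the number of clusters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# rows = raters, columns = mid-scale words, values = concreteness ratings 1-5
ratings = rng.integers(1, 6, size=(30, 50)).astype(float)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ratings)
print(km.labels_)  # cluster per rater; clusters expose systematic disagreement
```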

Assessing Distractors in Multiple-Choice Tests

  • paper_url: http://arxiv.org/abs/2311.04554
  • repo_url: None
  • paper_authors: Vatsal Raina, Adian Liusie, Mark Gales
  • for: The paper proposes automated metrics for assessing the quality of distractor options in multiple-choice reading comprehension tests.
  • methods: Quality is defined in terms of incorrectness, plausibility, and diversity. Incorrectness is assessed with a binary multiple-choice reading comprehension classifier; plausibility via the distractor confidence, i.e., the probability mass a standard multi-class system assigns to each distractor; and diversity via pairwise comparison with an embedding-based equivalence metric.
  • results: The plausibility metric is further validated against candidate distributions over multiple-choice questions and agreement with a ChatGPT model's interpretation of distractor plausibility and diversity.
    Abstract Multiple-choice tests are a common approach for assessing candidates' comprehension skills. Standard multiple-choice reading comprehension exams require candidates to select the correct answer option from a discrete set based on a question in relation to a contextual passage. For appropriate assessment, the distractor answer options must by definition be incorrect but plausible and diverse. However, generating good quality distractors satisfying these criteria is a challenging task for content creators. We propose automated assessment metrics for the quality of distractors in multiple-choice reading comprehension tests. Specifically, we define quality in terms of the incorrectness, plausibility and diversity of the distractor options. We assess incorrectness using the classification ability of a binary multiple-choice reading comprehension system. Plausibility is assessed by considering the distractor confidence - the probability mass associated with the distractor options for a standard multi-class multiple-choice reading comprehension system. Diversity is assessed by pairwise comparison of an embedding-based equivalence metric between the distractors of a question. To further validate the plausibility metric we compare against candidate distributions over multiple-choice questions and agreement with a ChatGPT model's interpretation of distractor plausibility and diversity.
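
A minimal sketch of the "distractor confidence" quantity the abstract defines: the probability mass a multi-class multiple-choice model assigns to each distractor option. The stand-in scores are illustrative assumptions.

```python
import torch

def distractor_confidence(option_logits: torch.Tensor,
                          correct_idx: int) -> dict[int, float]:
    """Softmax over all options; return the probability mass per distractor."""
    probs = torch.softmax(option_logits, dim=-1)
    return {i: probs[i].item()
            for i in range(len(probs)) if i != correct_idx}

scores = torch.tensor([2.1, 0.3, -0.5, 0.9])  # stand-in scores; option 0 is gold
print(distractor_confidence(scores, correct_idx=0))
```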

Large GPT-like Models are Bad Babies: A Closer Look at the Relationship between Linguistic Competence and Psycholinguistic Measures

  • paper_url: http://arxiv.org/abs/2311.04547
  • repo_url: None
  • paper_authors: Julius Steuer, Marius Mosbach, Dietrich Klakow
  • for: The study explores the cognitive plausibility of language models (LMs), asking how LM training relates to formal and functional linguistic competence and to developmental plausibility.
  • methods: A series of GPT-like language models of different sizes is trained on the strict version of the BabyLM pretraining corpus and evaluated on the challenge tasks (BLiMP, GLUE, MSGS) and an additional reading time prediction task.
  • results: LM size correlates positively with performance on all three challenge tasks, with different preferences for model width and depth per task, but negatively with the reading time fit of linear mixed-effects models using LM surprisal as a predictor; the second-smallest LM achieves the largest log-likelihood reduction over a baseline without surprisal, suggesting that modelling processing effort may require a different approach.
    Abstract Research on the cognitive plausibility of language models (LMs) has so far mostly concentrated on modelling psycholinguistic response variables such as reading times, gaze durations and N400/P600 EEG signals, while mostly leaving out the dimension of what Mahowald et al. (2023) described as formal and functional linguistic competence, and developmental plausibility. We address this gap by training a series of GPT-like language models of different sizes on the strict version of the BabyLM pretraining corpus, evaluating on the challenge tasks (BLiMP, GLUE, MSGS) and an additional reading time prediction task. We find a positive correlation between LM size and performance on all three challenge tasks, with different preferences for model width and depth in each of the tasks. In contrast, a negative correlation was found between LM size and reading time fit of linear mixed-effects models using LM surprisal as a predictor, with the second-smallest LM achieving the largest log-likelihood reduction over a baseline model without surprisal. This suggests that modelling processing effort and linguistic competence may require an approach different from training GPT-like LMs on a developmentally plausible corpus.
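
A minimal sketch of computing per-token LM surprisal, the predictor used in the reading-time analysis the abstract describes; the model choice (GPT-2) and the conversion to bits are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tokenizer("The cat sat on the mat", return_tensors="pt")["input_ids"]
with torch.no_grad():
    log_probs = torch.log_softmax(model(ids).logits, dim=-1)

# Surprisal of token t+1 given tokens up to t, converted to bits.
nats = -log_probs[0, :-1].gather(1, ids[0, 1:, None])[:, 0]
bits = nats / torch.log(torch.tensor(2.0))
for tok, s in zip(tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist()), bits):
    print(f"{tok:>10s}  {s.item():.2f}")
```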

Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token Based ASR

  • paper_url: http://arxiv.org/abs/2311.04534
  • repo_url: None
  • paper_authors: Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Yukun Ma, Hai Yu, Jiaqing Liu, Chong Zhang
  • for: The paper studies decoder-only speech-text language models (in the style of SpeechGPT, VioLA, and AudioPaLM) and how to train them for discrete-token based ASR.
  • methods: These models discretize continuous speech into tokens, merge speech and text tokens into a shared vocabulary, and train a single decoder-only Transformer; the conventional recipe applies Loss Masking to the input speech tokens for the ASR task.
  • results: Applying the standard cross-entropy loss to input speech tokens does not consistently beat Loss Masking; the proposed Smoothed Label Distillation (SLD), a KL-divergence loss with smoothed labels on the input speech tokens, consistently outperforms Loss Masking across different speech discretization methods.
    Abstract Recently, unified speech-text models, such as SpeechGPT, VioLA, and AudioPaLM, have achieved remarkable performance on speech tasks. These models convert continuous speech signals into discrete tokens (speech discretization) and merge text and speech tokens into a shared vocabulary. Then they train a single decoder-only Transformer on a mixture of speech tasks. Specifically, all these models utilize Loss Masking on the input speech tokens for the ASR task, which means that these models do not explicitly model the dependency between the speech tokens. In this paper, we attempt to model the sequence of speech tokens in an autoregressive manner like text. However, we find that applying the conventional cross-entropy loss on input speech tokens does not consistently improve the ASR performance over Loss Masking. Therefore, we propose a novel approach denoted Smoothed Label Distillation (SLD), which introduces a KL divergence loss with smoothed labels on the input speech tokens to effectively model speech tokens. Experiments demonstrate that our SLD approach alleviates the limitations of the cross-entropy loss and consistently outperforms Loss Masking for decoder-only Transformer based ASR using different speech discretization methods.
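
A minimal sketch of smoothed label distillation (SLD) as the abstract describes it: a KL-divergence loss between the model's predictive distribution and a label-smoothed target distribution over the input speech tokens. The smoothing factor is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def sld_loss(logits: torch.Tensor, targets: torch.Tensor,
             eps: float = 0.1) -> torch.Tensor:
    # logits: (batch, vocab); targets: (batch,) gold speech-token ids
    vocab = logits.size(-1)
    smoothed = torch.full_like(logits, eps / (vocab - 1))
    smoothed.scatter_(1, targets.unsqueeze(1), 1.0 - eps)
    return F.kl_div(F.log_softmax(logits, dim=-1), smoothed,
                    reduction="batchmean")

loss = sld_loss(torch.randn(4, 100), torch.randint(0, 100, (4,)))
```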

Conversation Understanding using Relational Temporal Graph Neural Networks with Auxiliary Cross-Modality Interaction

  • paper_url: http://arxiv.org/abs/2311.04507
  • repo_url: None
  • paper_authors: Cam-Van Thi Nguyen, Anh-Tuan Mai, The-Son Le, Hai-Dang Kieu, Duc-Trong Le
  • for: The work targets emotion recognition in conversation, predicting the emotion label of every utterance by exploiting conversation-level cross-modality interactions and utterance-level temporal information.
  • methods: The proposed CORECT framework, a relational temporal graph neural network with auxiliary cross-modality interaction, captures conversation-level cross-modality interactions and utterance-level temporal dependencies in a modality-specific manner.
  • results: Extensive experiments demonstrate the effectiveness of CORECT, with state-of-the-art results on the IEMOCAP and CMU-MOSEI datasets for the multimodal ERC task.
    Abstract Emotion recognition is a crucial task for human conversation understanding. It becomes more challenging with the notion of multimodal data, e.g., language, voice, and facial expressions. As a typical solution, the global- and the local context information are exploited to predict the emotional label for every single sentence, i.e., utterance, in the dialogue. Specifically, the global representation could be captured via modeling of cross-modal interactions at the conversation level. The local one is often inferred using the temporal information of speakers or emotional shifts, which neglects vital factors at the utterance level. Additionally, most existing approaches take fused features of multiple modalities in an unified input without leveraging modality-specific representations. Motivating from these problems, we propose the Relational Temporal Graph Neural Network with Auxiliary Cross-Modality Interaction (CORECT), an novel neural network framework that effectively captures conversation-level cross-modality interactions and utterance-level temporal dependencies with the modality-specific manner for conversation understanding. Extensive experiments demonstrate the effectiveness of CORECT via its state-of-the-art results on the IEMOCAP and CMU-MOSEI datasets for the multimodal ERC task.

Multi-label and Multi-target Sampling of Machine Annotation for Computational Stance Detection

  • paper_url: http://arxiv.org/abs/2311.04495
  • repo_url: https://github.com/seq-to-mind/Stance_MA
  • paper_authors: Zhengyuan Liu, Hai Leong Chieu, Nancy F. Chen
  • for: The study asks whether large language models can replace human annotators for computational stance detection.
  • methods: Machine annotation with large language models, plus a multi-label and multi-target sampling strategy to optimize annotation quality.
  • results: Experiments on benchmark stance detection corpora show that the method significantly improves performance and learning efficacy.
    Abstract Data collection from manual labeling provides domain-specific and task-aligned supervision for data-driven approaches, and a critical mass of well-annotated resources is required to achieve reasonable performance in natural language processing tasks. However, manual annotations are often challenging to scale up in terms of time and budget, especially when domain knowledge, capturing subtle semantic features, and reasoning steps are needed. In this paper, we investigate the efficacy of leveraging large language models on automated labeling for computational stance detection. We empirically observe that while large language models show strong potential as an alternative to human annotators, their sensitivity to task-specific instructions and their intrinsic biases pose intriguing yet unique challenges in machine annotation. We introduce a multi-label and multi-target sampling strategy to optimize the annotation quality. Experimental results on the benchmark stance detection corpora show that our method can significantly improve performance and learning efficacy.

CLearViD: Curriculum Learning for Video Description

  • paper_url: http://arxiv.org/abs/2311.04480
  • repo_url: https://github.com/yueyue0401/CLV
  • paper_authors: Cheng-Yu Chuang, Pooyan Fazli
  • for: The paper targets generating coherent natural language sentences that narrate the content of a given video.
  • methods: A transformer-based model, CLearViD, leverages curriculum learning and the Mish activation function for video description generation. Two curriculum strategies are used: progressively exposing the model to more challenging samples by gradually applying Gaussian noise to the video data, and gradually reducing the capacity of the network through dropout during training.
  • results: Extensive experiments and ablation studies show that CLearViD significantly outperforms existing state-of-the-art models on both accuracy and diversity metrics on two datasets, ActivityNet Captions and YouCook2.
    Abstract Video description entails automatically generating coherent natural language sentences that narrate the content of a given video. We introduce CLearViD, a transformer-based model for video description generation that leverages curriculum learning to accomplish this task. In particular, we investigate two curriculum strategies: (1) progressively exposing the model to more challenging samples by gradually applying a Gaussian noise to the video data, and (2) gradually reducing the capacity of the network through dropout during the training process. These methods enable the model to learn more robust and generalizable features. Moreover, CLearViD leverages the Mish activation function, which provides non-linearity and non-monotonicity and helps alleviate the issue of vanishing gradients. Our extensive experiments and ablation studies demonstrate the effectiveness of the proposed model. The results on two datasets, namely ActivityNet Captions and YouCook2, show that CLearViD significantly outperforms existing state-of-the-art models in terms of both accuracy and diversity metrics.
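
A minimal sketch of the two curriculum strategies the abstract names: Gaussian noise on the video features that grows over training, and dropout whose rate grows to shrink network capacity. The linear schedules and magnitudes are illustrative assumptions; Mish is shown since the model relies on it (torch.nn.Mish is also available in recent PyTorch versions).

```python
import torch
import torch.nn.functional as F

def curriculum_noise(video_feats: torch.Tensor, step: int, total: int,
                     max_sigma: float = 0.1) -> torch.Tensor:
    sigma = max_sigma * step / total           # harder samples over time
    return video_feats + sigma * torch.randn_like(video_feats)

def curriculum_dropout(x: torch.Tensor, step: int, total: int,
                       max_p: float = 0.3) -> torch.Tensor:
    p = max_p * step / total                   # capacity shrinks over time
    return F.dropout(x, p=p, training=True)

mish = lambda x: x * torch.tanh(F.softplus(x))  # Mish activation
```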

Twitter Sentiment Analysis of Covid Vaccines

  • paper_url: http://arxiv.org/abs/2311.04479
  • repo_url: None
  • paper_authors: Wenbo Zhu, Tiechuan Hu
  • for: The researchers use sorting and ranking of tweets to better understand user sentiment toward COVID vaccines, since whether an opinion is surfaced or lost affects people's decisions and behavior.
  • methods: Natural language processing techniques are applied to a database of tweets sorted by keywords indicative of vaccine sentiment, using two different ranking scales.
  • results: Opinions about COVID vaccines are determined and categorized, aiming for the highest possible accuracy.
    Abstract In this paper, we look at a database of tweets sorted by various keywords that could indicate the users sentiment towards covid vaccines. With social media becoming such a prevalent source of opinion, sorting and ranking tweets that hold important information such as opinions on covid vaccines is of utmost importance. Two different ranking scales were used, and ranking a tweet in this way could represent the difference between an opinion being lost and an opinion being featured on the site, which affects the decisions and behavior of people, and why researchers were interested in it. Using natural language processing techniques, our aim is to determine and categorize opinions about covid vaccines with the highest accuracy possible.

Lewis’s Signaling Game as beta-VAE For Natural Word Lengths and Segments

  • paper_url: http://arxiv.org/abs/2311.04453
  • repo_url: None
  • paper_authors: Ryo Ueda, Tadahiro Taniguchi
  • for: The paper studies Lewis's signaling game, a frequently used setting in emergent communication (EC), and asks how reformulating its objective can control the statistical properties of emergent languages.
  • methods: The signaling game is reinterpreted as beta-VAE and its objective function reformulated as ELBO, clarifying the existence of prior distributions over emergent languages.
  • results: Experiments show that with an appropriate prior distribution, more natural segments emerge, whereas the conventional objective prevents the languages from following Zipf's law of abbreviation (ZLA) and Harris's articulation scheme (HAS).
    Abstract As a sub-discipline of evolutionary and computational linguistics, emergent communication (EC) studies communication protocols, called emergent languages, arising in simulations where agents communicate. A key goal of EC is to give rise to languages that share statistical properties with natural languages. In this paper, we reinterpret Lewis's signaling game, a frequently used setting in EC, as beta-VAE and reformulate its objective function as ELBO. Consequently, we clarify the existence of prior distributions of emergent languages and show that the choice of the priors can influence their statistical properties. Specifically, we address the properties of word lengths and segmentation, known as Zipf's law of abbreviation (ZLA) and Harris's articulation scheme (HAS), respectively. It has been reported that the emergent languages do not follow them when using the conventional objective. We experimentally demonstrate that by selecting an appropriate prior distribution, more natural segments emerge, while suggesting that the conventional one prevents the languages from following ZLA and HAS.
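
For reference, a standard beta-VAE objective (ELBO) of the kind the paper reformulates the signaling game into is shown below; the notation follows the beta-VAE literature, and the mapping onto the game is a gloss of the abstract rather than the paper's exact formulation:

$$
\mathcal{L}_{\beta}(x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
$$

Here $x$ is the state the sender observes, $z$ the emitted message, $q_\phi$ the sender, $p_\theta$ the receiver, and $p(z)$ the prior over messages whose choice, per the abstract, determines whether the emergent language follows ZLA and HAS.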

Recursion in Recursion: Two-Level Nested Recursion for Length Generalization with Scalability

  • paper_url: http://arxiv.org/abs/2311.04449
  • repo_url: https://github.com/jrc1995/beamrecursionfamily
  • paper_authors: Jishnu Ray Chowdhury, Cornelia Caragea
  • for: The paper seeks a tree-structured model that combines the computational efficiency of balanced-tree RvNNs with the structure sensitivity of models like Beam Tree RvNNs.
  • methods: The Recursion in Recursion (RIR) framework uses two-level nested recursion: the outer recursion is a $k$-ary balanced tree model whose cell function is implemented by an inner recursive model, chosen to be a Beam Tree RvNN (BT-RvNN) adjusted with a novel beam alignment strategy. The total recursive depth is upper-bounded by $k \log_k n$.
  • results: The best RIR-based model is the first to reach high ($\geq 90\%$) length-generalization performance on ListOps while remaining scalable enough to train on long Long Range Arena (LRA) inputs, and it performs competitively with Structured State Space Models (SSMs) on the LRA language tasks, outperforming Transformers by a large margin.
    Abstract Binary Balanced Tree RvNNs (BBT-RvNNs) enforce sequence composition according to a preset balanced binary tree structure. Thus, their non-linear recursion depth is just $\log_2 n$ ($n$ being the sequence length). Such logarithmic scaling makes BBT-RvNNs efficient and scalable on long sequence tasks such as Long Range Arena (LRA). However, such computational efficiency comes at a cost because BBT-RvNNs cannot solve simple arithmetic tasks like ListOps. On the flip side, RvNNs (e.g., Beam Tree RvNN) that do succeed on ListOps (and other structure-sensitive tasks like formal logical inference) are generally several times more expensive than even RNNs. In this paper, we introduce a novel framework -- Recursion in Recursion (RIR) to strike a balance between the two sides - getting some of the benefits from both worlds. In RIR, we use a form of two-level nested recursion - where the outer recursion is a $k$-ary balanced tree model with another recursive model (inner recursion) implementing its cell function. For the inner recursion, we choose Beam Tree RvNNs (BT-RvNN). To adjust BT-RvNNs within RIR we also propose a novel strategy of beam alignment. Overall, this entails that the total recursive depth in RIR is upper-bounded by $k \log_k n$. Our best RIR-based model is the first model that demonstrates high ($\geq 90\%$) length-generalization performance on ListOps while at the same time being scalable enough to be trainable on long sequence inputs from LRA. Moreover, in terms of accuracy in the LRA language tasks, it performs competitively with Structured State Space Models (SSMs) without any special initialization - outperforming Transformers by a large margin. On the other hand, while SSMs can marginally outperform RIR on LRA, they (SSMs) fail to length-generalize on ListOps. Our code is available at: \url{https://github.com/JRC1995/BeamRecursionFamily/}.
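
A minimal sketch of the two-level nested recursion the abstract describes: an outer $k$-ary balanced split whose cell function is itself a recursive combiner over its $k$ children. The inner "model" below is a plain chained combiner standing in for a Beam Tree RvNN; everything here is an illustrative assumption. The outer depth is about $\log_k n$ and each inner chain runs at most $k$ steps, matching the $k \log_k n$ bound.

```python
import torch
import torch.nn as nn

class RIRCell(nn.Module):
    def __init__(self, dim: int, k: int):
        super().__init__()
        self.k = k
        self.combine = nn.Linear(2 * dim, dim)  # inner recursion's step fn

    def inner(self, children: list) -> torch.Tensor:
        h = children[0]
        for c in children[1:]:                  # inner recursion over <= k items
            h = torch.tanh(self.combine(torch.cat([h, c], dim=-1)))
        return h

    def forward(self, xs: list) -> torch.Tensor:
        if len(xs) == 1:
            return xs[0]
        # Outer recursion: split into <= k balanced groups and recurse.
        step = max(1, -(-len(xs) // self.k))    # ceil(len / k)
        groups = [self.forward(xs[i:i + step]) for i in range(0, len(xs), step)]
        return self.inner(groups)

cell = RIRCell(dim=8, k=4)
out = cell([torch.randn(8) for _ in range(10)])  # one vector for the sequence
```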