results: This study shows that supervised learning and deep learning algorithms can accurately detect vulgar remarks on social media, but the NN algorithms require more data to achieve higher accuracy.
Abstract
The negative effects of online bullying and harassment are increasing with Internet popularity, especially in social media. One solution is using natural language processing (NLP) and machine learning (ML) methods for the automatic detection of harmful remarks, but these methods are limited in low-resource languages like the Chittagonian dialect of Bangla. This study focuses on detecting vulgar remarks in social media using supervised ML and deep learning algorithms. Logistic Regression achieved promising accuracy (0.91), while a simple RNN with Word2vec and fastText embeddings had lower accuracy (0.84-0.90), highlighting the issue that NN algorithms require more data.
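The supervised baseline reported above (Logistic Regression over text features) can be sketched as a minimal scikit-learn pipeline. This is an illustrative sketch only: the toy English comments and labels below are placeholders, not the paper's Chittagonian dataset, and TF-IDF features are an assumption since the exact feature extraction is not specified here.

```python
# Minimal sketch of a supervised vulgar-remark classifier:
# TF-IDF features + Logistic Regression (toy placeholder data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy comments: 1 = vulgar, 0 = non-vulgar.
texts = [
    "you are an idiot", "what a stupid fool", "shut up loser",
    "have a nice day", "great post thanks", "see you tomorrow",
    "you idiot shut up", "thanks for the nice day",
]
labels = [1, 1, 1, 0, 0, 0, 1, 0]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("lr", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)

print(clf.predict(["you stupid idiot", "have a great day"]))
```

In practice the pipeline would be trained on the annotated Chittagonian corpus and evaluated with held-out accuracy, as in the 0.91 figure quoted above.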
Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability
paper_authors: Tyler A. Chang, Zhuowen Tu, Benjamin K. Bergen
for: How do language models learn to make predictions during pre-training?
methods: Learning curves are extracted from five autoregressive English language model pre-training runs, for 1M tokens in context.
results: Characterizes the final surprisal, within-run variability, age of acquisition, forgettability, and cross-run variability of per-token learning curves, informing the deployment of stable language models.
Abstract
How do language models learn to make predictions during pre-training? To study this question, we extract learning curves from five autoregressive English language model pre-training runs, for 1M tokens in context. We observe that the language models generate short repetitive phrases before learning to generate longer and more coherent text. We quantify the final surprisal, within-run variability, age of acquisition, forgettability, and cross-run variability of learning curves for individual tokens in context. More frequent tokens reach lower final surprisals, exhibit less variability within and across pre-training runs, are learned earlier, and are less likely to be "forgotten" during pre-training. Higher n-gram probabilities further accentuate these effects. Independent of the target token, shorter and more frequent contexts correlate with marginally more stable and quickly acquired predictions. Effects of part-of-speech are also small, although nouns tend to be acquired later and less stably than verbs, adverbs, and adjectives. Our work contributes to a better understanding of language model pre-training dynamics and informs the deployment of stable language models in practice.
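The surprisal tracked by these learning curves is the negative log probability a model assigns to a token given its context, surprisal(w) = -log2 P(w | context). A toy sketch with a unigram model standing in for the autoregressive LM (the corpus is an illustrative assumption, not the paper's data) shows the frequency effect described above: more frequent tokens reach lower surprisal.

```python
import math
from collections import Counter

# Toy corpus standing in for pre-training data.
corpus = "the cat sat on the mat the cat ran".split()
counts = Counter(corpus)
total = sum(counts.values())

def unigram_surprisal(token: str) -> float:
    """surprisal(w) = -log2 P(w), with add-one smoothing so
    unseen tokens get a finite (high) surprisal."""
    p = (counts[token] + 1) / (total + len(counts) + 1)
    return -math.log2(p)

# Frequent "the" gets lower surprisal than rare "ran",
# mirroring the reported frequency effect.
print(unigram_surprisal("the"), unigram_surprisal("ran"))
```

In the paper's setting the probability comes from the partially trained autoregressive model at each checkpoint, so plotting surprisal of a fixed token-in-context across checkpoints yields its learning curve.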