cs.CL - 2023-07-12

Ashaar: Automatic Analysis and Generation of Arabic Poetry Using Deep Learning Approaches

  • paper_url: http://arxiv.org/abs/2307.06218
  • repo_url: https://github.com/arbml/ashaar
  • paper_authors: Zaid Alyafeai, Maged S. Al-Shaibani, Moataz Ahmed
  • for: This work aims to develop a framework called \textit{Ashaar} for the analysis and generation of Arabic poetry.
  • methods: The framework bundles datasets and pre-trained models for several aspects of poetry, supporting meter, theme, and era classification as well as automatic diacritization and conditional poem generation with a character-based GPT model (a toy generation sketch follows the abstract below).
  • results: With this framework, different aspects of a poem can be detected and classified automatically, and poems can be generated conditioned on theme and era; four datasets are also released: one for poetry generation, one for diacritization, and two for Arudi-style prediction.
    Abstract Poetry holds immense significance within the cultural and traditional fabric of any nation. It serves as a vehicle for poets to articulate their emotions, preserve customs, and convey the essence of their culture. Arabic poetry is no exception, having played a cherished role in the heritage of the Arabic community throughout history and maintaining its relevance in the present era. Typically, comprehending Arabic poetry necessitates the expertise of a linguist who can analyze its content and assess its quality. This paper presents the introduction of a framework called \textit{Ashaar} https://github.com/ARBML/Ashaar, which encompasses a collection of datasets and pre-trained models designed specifically for the analysis and generation of Arabic poetry. The pipeline established within our proposed approach encompasses various aspects of poetry, such as meter, theme, and era classification. It also incorporates automatic poetry diacritization, enabling more intricate analyses like automated extraction of the \textit{Arudi} style. Additionally, we explore the feasibility of generating conditional poetry through the pre-training of a character-based GPT model. Furthermore, as part of this endeavor, we provide four datasets: one for poetry generation, another for diacritization, and two for Arudi-style prediction. These datasets aim to facilitate research and development in the field of Arabic poetry by enabling researchers and enthusiasts to delve into the nuances of this rich literary tradition.
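
The conditional-generation step can be pictured as prepending control tokens (for example a meter or theme tag) to the character sequence before sampling from a character-level GPT. The sketch below illustrates that pattern with a tiny untrained GPT-2-style model; the vocabulary, control-token format, and model size are assumptions for illustration and are not taken from the Ashaar repository.

```python
# Hedged sketch of conditional character-level generation with a small GPT-2
# style model built from scratch (untrained, for illustration only). The
# control tokens and vocabulary are assumptions; Ashaar's real checkpoints
# and token format live in the linked repository.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Character-level vocabulary plus control tokens for meter and theme.
chars = list("ابتثجحخدذرزسشصضطظعغفقكلمنهوي ") + ["<meter:taweel>", "<theme:love>", "<pad>"]
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

model = GPT2LMHeadModel(GPT2Config(vocab_size=len(chars), n_layer=2,
                                   n_head=2, n_embd=64, n_positions=128))

# Conditional generation: prepend control tokens, then sample character by character.
prompt = ["<meter:taweel>", "<theme:love>"]
input_ids = torch.tensor([[stoi[t] for t in prompt]])
out = model.generate(input_ids, max_new_tokens=20, do_sample=True,
                     pad_token_id=stoi["<pad>"])
print("".join(itos[int(i)] for i in out[0][len(prompt):]))  # gibberish: the model is untrained
```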

Detecting the Presence of COVID-19 Vaccination Hesitancy from South African Twitter Data Using Machine Learning

  • paper_url: http://arxiv.org/abs/2307.15072
  • repo_url: None
  • paper_authors: Nicholas Perikli, Srimoy Bhattacharya, Blessing Ogbuokiri, Zahra Movahedi Nia, Benjamin Lieberman, Nidhi Tripathi, Salah-Eddine Dahbi, Finn Stevenson, Nicola Bragazzi, Jude Kong, Bruce Mellado
  • for: This study applies sentiment analysis to South African tweets about vaccine hesitancy, with the aim of training AI-mediated classification models and assessing their reliability in categorizing user-generated content.
  • methods: LSTM, bi-LSTM, SVM, BERT-base-cased, and RoBERTa-base models were used, with hyperparameters carefully tuned on the WandB platform; two pre-processing approaches were compared, one semantics-based and one corpus-based.
  • results: All models had low F1-scores in the 45%-55% range, except BERT and RoBERTa, which reached significantly higher overall F1-scores of 60% and 61%; LDA topic modelling was applied to the RoBERTa model's misclassified tweets to understand how to further improve accuracy.
    Abstract Very few social media studies have been done on South African user-generated content during the COVID-19 pandemic and even fewer using hand-labelling over automated methods. Vaccination is a major tool in the fight against the pandemic, but vaccine hesitancy jeopardizes any public health effort. In this study, sentiment analysis on South African tweets related to vaccine hesitancy was performed, with the aim of training AI-mediated classification models and assessing their reliability in categorizing UGC. A dataset of 30000 tweets from South Africa was extracted and hand-labelled into one of three sentiment classes: positive, negative, neutral. The machine learning models used were LSTM, bi-LSTM, SVM, BERT-base-cased and the RoBERTa-base models, whereby their hyperparameters were carefully chosen and tuned using the WandB platform. We used two different approaches when we pre-processed our data for comparison: one was semantics-based, while the other was corpus-based. The pre-processing of the tweets in our dataset was performed using both methods, respectively. All models were found to have low F1-scores within a range of 45$\%$-55$\%$, except for BERT and RoBERTa which both achieved significantly better measures with overall F1-scores of 60$\%$ and 61$\%$, respectively. Topic modelling using an LDA was performed on the misclassified tweets of the RoBERTa model to gain insight on how to further improve model accuracy.

Sumformer: A Linear-Complexity Alternative to Self-Attention for Speech Recognition

  • paper_url: http://arxiv.org/abs/2307.07421
  • repo_url: None
  • paper_authors: Titouan Parcollet, Rogier van Dalen, Shucong Zhang, Sourav Bhattacharya
  • for: Improve the efficiency and scalability of speech recognition systems.
  • methods: A linear-time alternative to self-attention is proposed: the whole utterance is summarised by the mean over the vectors of all time steps, and this single summary is then combined with time-specific information ("Summary Mixing"; a minimal sketch follows the abstract below).
  • results: Introducing Summary Mixing into state-of-the-art ASR models preserves or exceeds previous speech recognition performance while lowering training and inference times by up to 27% and cutting the memory budget in half.
    Abstract Modern speech recognition systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference as well as training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but fail to consistently reach the same level of accuracy. In practice, however, the self-attention weights of trained speech recognizers take the form of a global average over time. This paper, therefore, proposes a linear-time alternative to self-attention for speech recognition. It summarises a whole utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method ``Summary Mixing''. Introducing Summary Mixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while lowering the training and inference times by up to 27% and reducing the memory budget by a factor of two.
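
The core idea lends itself to a short sketch: compute a per-utterance mean over time, transform it, and combine it with a per-frame transformation. The module below is a minimal PyTorch illustration of that pattern, not the authors' implementation; layer sizes and the exact combination function are assumptions.

```python
# Hedged sketch of the Summary Mixing idea: replace token mixing with a
# global mean over time combined with per-frame features. This is an
# illustrative approximation, not the paper's reference implementation.
import torch
import torch.nn as nn

class SummaryMixing(nn.Module):
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.local = nn.Linear(dim, hidden)    # time-specific branch
        self.summary = nn.Linear(dim, hidden)  # branch feeding the global mean
        self.combine = nn.Linear(2 * hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        local = torch.relu(self.local(x))                  # per-frame features
        summary = torch.relu(self.summary(x)).mean(dim=1)  # (batch, hidden)
        summary = summary.unsqueeze(1).expand_as(local)    # broadcast over time
        return self.combine(torch.cat([local, summary], dim=-1))

# Linear in sequence length: one mean instead of a T x T attention matrix.
frames = torch.randn(8, 200, 256)
print(SummaryMixing(256)(frames).shape)  # torch.Size([8, 200, 256])
```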

Enhancing Portuguese Sign Language Animation with Dynamic Timing and Mouthing

  • paper_url: http://arxiv.org/abs/2307.06124
  • repo_url: None
  • paper_authors: Inês Lacerda, Hugo Nicolau, Luisa Coheur
  • for: This paper proposes a new dynamic approach to transitions between signs, focusing on mouthing animations for Portuguese Sign Language.
  • methods: Animations with dynamic versus standard transitions, and with versus without mouthing, were evaluated with native signers and novice sign language learners.
  • results: Native signers preferred animations with dynamic transitions, although no significant differences were found in comprehension and perceived-naturalness scores for that comparison; including mouthing behaviors improved comprehension and perceived naturalness for novice learners. The results have implications for computational linguistics, human-computer interaction, and synthetic animation of signing avatars.
    Abstract Current signing avatars are often described as unnatural as they cannot accurately reproduce all the subtleties of synchronized body behaviors of a human signer. In this paper, we propose a new dynamic approach for transitions between signs, focusing on mouthing animations for Portuguese Sign Language. Although native signers preferred animations with dynamic transitions, we did not find significant differences in comprehension and perceived naturalness scores. On the other hand, we show that including mouthing behaviors improved comprehension and perceived naturalness for novice sign language learners. Results have implications in computational linguistics, human-computer interaction, and synthetic animation of signing avatars.

Interpreting deep embeddings for disease progression clustering

  • paper_url: http://arxiv.org/abs/2307.06060
  • repo_url: None
  • paper_authors: Anna Munoz-Farre, Antonios Poulakakis-Daktylidis, Dilini Mahesha Kothalawala, Andrea Rodriguez-Martinez
  • for: Patient clustering analysis for type 2 diabetes.
  • methods: A novel approach for interpreting deep embeddings in the context of patient clustering, evaluated on a UK Biobank dataset.
  • results: The analysis yields clinically meaningful insights into disease progression patterns.
    Abstract We propose a novel approach for interpreting deep embeddings in the context of patient clustering. We evaluate our approach on a dataset of participants with type 2 diabetes from the UK Biobank, and demonstrate clinically meaningful insights into disease progression patterns.

A Study on the Appropriate size of the Mongolian general corpus

  • paper_url: http://arxiv.org/abs/2307.06050
  • repo_url: None
  • paper_authors: Sunsoo Choi, Ganbat Tsend
  • for: This paper aims to determine the appropriate size of a Mongolian general corpus.
  • methods: The study uses the Heaps function and the Type-Token Ratio (TTR) to determine the appropriate corpus size (a small fitting sketch follows the abstract below).
  • results: Observing how the number of types and the TTR change as the number of tokens grows, the study finds that an appropriate size for a Mongolian general corpus is 39 to 42 million tokens.
    Abstract This study aims to determine the appropriate size of the Mongolian general corpus. This study used the Heaps function and Type Token Ratio to determine the appropriate size of the Mongolian general corpus. The sample corpus of 906,064 tokens comprised texts from 10 domains of newspaper politics, economy, society, culture, sports, world articles and laws, middle and high school literature textbooks, interview articles, and podcast transcripts. First, we estimated the Heaps function with this sample corpus. Next, we observed changes in the number of types and TTR values while increasing the number of tokens by one million using the estimated Heaps function. As a result of observation, we found that the TTR value hardly changed when the number of tokens exceeded from 39 to 42 million. Thus, we conclude that an appropriate size for a Mongolian general corpus is from 39 to 42 million tokens.
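
Heaps' law models vocabulary growth as V(N) = K * N^beta, and the Type-Token Ratio is simply V/N. Below is a small illustration of fitting the Heaps function to (token count, type count) observations and extrapolating the TTR in 1-million-token steps; the sample numbers and the use of `scipy.optimize.curve_fit` are assumptions for illustration, not the authors' procedure.

```python
# Hedged sketch: fit a Heaps function V(N) = K * N**beta to observed
# (tokens, types) pairs, then watch how the Type-Token Ratio (TTR = V/N)
# flattens as the corpus grows. Sample observations are made up.
import numpy as np
from scipy.optimize import curve_fit

def heaps(n, k, beta):
    return k * n ** beta

# Hypothetical observations: cumulative token counts and distinct-type counts.
tokens = np.array([100_000, 300_000, 600_000, 906_064], dtype=float)
types = np.array([18_000, 38_000, 60_000, 78_000], dtype=float)

(k, beta), _ = curve_fit(heaps, tokens, types, p0=(10.0, 0.5))

# Extrapolate types and TTR in 1-million-token steps, as in the study's design.
for n in range(1_000_000, 45_000_001, 1_000_000):
    v = heaps(n, k, beta)
    if n % 10_000_000 == 0 or n in (39_000_000, 42_000_000):
        print(f"{n:>11,d} tokens -> {v:,.0f} types, TTR = {v / n:.4f}")
```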

Pluggable Neural Machine Translation Models via Memory-augmented Adapters

  • paper_url: http://arxiv.org/abs/2307.06029
  • repo_url: https://github.com/urvashik/knnmt
  • paper_authors: Yuzhuang Xu, Shuo Wang, Peng Li, Xuebo Liu, Xiaolong Wang, Weidong Liu, Yang Liu
  • for: Steering the generation behavior of neural machine translation (NMT) models to satisfy the requirements of different users.
  • methods: A memory-augmented adapter that can be plugged into pre-trained NMT models: a multi-granular memory is built from user-provided text samples, a new adapter architecture combines the model representations with the retrieved results, and memory dropout during training reduces spurious dependencies between the NMT model and the memory (a generic retrieve-and-combine sketch follows the abstract below).
  • results: Outperforms several representative pluggable baselines in both style- and domain-specific experiments.
    Abstract Although neural machine translation (NMT) models perform well in the general domain, it remains rather challenging to control their generation behavior to satisfy the requirement of different users. Given the expensive training cost and the data scarcity challenge of learning a new model from scratch for each user requirement, we propose a memory-augmented adapter to steer pretrained NMT models in a pluggable manner. Specifically, we construct a multi-granular memory based on the user-provided text samples and propose a new adapter architecture to combine the model representations and the retrieved results. We also propose a training strategy using memory dropout to reduce spurious dependencies between the NMT model and the memory. We validate our approach on both style- and domain-specific experiments and the results indicate that our method can outperform several representative pluggable baselines.
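
The pluggable pattern can be sketched as: retrieve the nearest memory entries for the current decoder state and gate them into the model representation. The snippet below is a rough illustration of that retrieve-and-combine step under assumed shapes and a simple gating function; it does not reproduce the paper's multi-granular memory or adapter architecture.

```python
# Hedged sketch of a retrieve-and-combine adapter step: look up the memory
# entries closest to the current decoder state and gate them into the hidden
# representation. Shapes and the gating function are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAdapter(nn.Module):
    def __init__(self, dim: int, memory: torch.Tensor, k: int = 4):
        super().__init__()
        self.memory = memory          # (num_entries, dim), built from user samples
        self.k = k
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, dim) decoder states
        scores = hidden @ self.memory.t()                   # similarity to memory
        topk = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk.values, dim=-1)            # (batch, k)
        retrieved = (weights.unsqueeze(-1) * self.memory[topk.indices]).sum(1)
        g = torch.sigmoid(self.gate(torch.cat([hidden, retrieved], dim=-1)))
        return g * hidden + (1 - g) * retrieved             # gated combination

memory = torch.randn(1000, 256)
adapter = MemoryAdapter(256, memory)
print(adapter(torch.randn(8, 256)).shape)  # torch.Size([8, 256])
```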

PolyLM: An Open Source Polyglot Large Language Model

  • paper_url: http://arxiv.org/abs/2307.06018
  • repo_url: None
  • paper_authors: Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, Jun Xie
  • for: This work aims to improve the multilingual capabilities of large language models (LLMs) and presents PolyLM, a multilingual model trained on 640 billion tokens and released in two sizes, 1.7B and 13B parameters.
  • methods: Two strategies strengthen multilinguality: 1) integrating bilingual data into the training data; and 2) a curriculum learning strategy that raises the proportion of non-English data from 30% in the first stage of pre-training to 60% in the final stage (a toy schedule sketch follows the abstract below).
  • results: Experiments show that PolyLM outperforms other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English.
    Abstract Large language models (LLMs) demonstrate remarkable ability to comprehend, reason, and generate following natural language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English, thereby limiting their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, including multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: \url{https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation}.
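
The curriculum can be pictured as a schedule over the sampling probability of non-English data during pre-training. Below is a toy sketch of such a schedule and of sampling batches according to it; the linear interpolation and the two-way English/non-English split are assumptions, since the paper only states the 30% and 60% endpoints.

```python
# Hedged sketch of a data-sampling curriculum that raises the share of
# non-English data from 30% to 60% over pre-training. The linear schedule
# and the two-way English / non-English split are illustrative assumptions.
import random

def non_english_ratio(step: int, total_steps: int,
                      start: float = 0.30, end: float = 0.60) -> float:
    """Linearly interpolate the non-English sampling probability."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress

def sample_language_pool(step: int, total_steps: int) -> str:
    p = non_english_ratio(step, total_steps)
    return "non_english" if random.random() < p else "english"

random.seed(0)
total = 100_000
for step in (0, 50_000, 100_000):
    draws = [sample_language_pool(step, total) for _ in range(10_000)]
    frac = draws.count("non_english") / len(draws)
    print(f"step {step:>7}: target {non_english_ratio(step, total):.2f}, empirical {frac:.2f}")
```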

DDNAS: Discretized Differentiable Neural Architecture Search for Text Classification

  • paper_url: http://arxiv.org/abs/2307.06005
  • repo_url: https://github.com/ddnas/ddnas
  • paper_authors: Kuan-Chun Chen, Cheng-Te Li, Kuo-Jung Lee
  • for: This paper targets Neural Architecture Search (NAS) for text representation learning and classification.
  • methods: A new NAS method, Discretized Differentiable Neural Architecture Search (DDNAS), uses a continuous relaxation of the architecture representation so the search can be optimized by gradient descent, together with a novel discretization layer based on mutual information maximization that is imposed on every search node to model the latent hierarchical categorization in text (a generic mixed-operation sketch follows the abstract below).
  • results: Experiments on eight diverse real datasets show that DDNAS consistently outperforms state-of-the-art NAS methods. Although DDNAS uses only three basic operations (convolution, pooling, and none) as candidate building blocks, its performance is promising and can be extended by adding more operations.
    Abstract Neural Architecture Search (NAS) has shown promising capability in learning text representation. However, existing text-based NAS neither performs a learnable fusion of neural operations to optimize the architecture, nor encodes the latent hierarchical categorization behind text input. This paper presents a novel NAS method, Discretized Differentiable Neural Architecture Search (DDNAS), for text representation learning and classification. With the continuous relaxation of architecture representation, DDNAS can use gradient descent to optimize the search. We also propose a novel discretization layer via mutual information maximization, which is imposed on every search node to model the latent hierarchical categorization in text representation. Extensive experiments conducted on eight diverse real datasets exhibit that DDNAS can consistently outperform the state-of-the-art NAS methods. While DDNAS relies on only three basic operations, i.e., convolution, pooling, and none, to be the candidates of NAS building blocks, its promising performance is noticeable and extensible to obtain further improvement by adding more different operations.
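
The continuous relaxation underlying differentiable NAS can be illustrated with a mixed operation: each candidate op's output is weighted by a softmax over learnable architecture parameters, so the choice of operation becomes differentiable. The sketch below shows this generic pattern with the three candidate ops named in the paper (convolution, pooling, none); it omits DDNAS's mutual-information discretization layer and is not the authors' code.

```python
# Hedged sketch of the continuous relaxation used in differentiable NAS:
# a softmax over architecture parameters mixes candidate operations
# (here: convolution, pooling, none). DDNAS's mutual-information
# discretization layer is not modelled here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Zero(nn.Module):
    """The 'none' operation: contributes nothing to the mixed output."""
    def forward(self, x):
        return torch.zeros_like(x)

class MixedOp(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),   # convolution
            nn.AvgPool1d(kernel_size=3, stride=1, padding=1),          # pooling
            Zero(),                                                    # none
        ])
        # One architecture parameter per candidate op, learned by gradient descent.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

x = torch.randn(4, 64, 50)            # (batch, channels, sequence length)
print(MixedOp(64)(x).shape)           # torch.Size([4, 64, 50])
```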

Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models

  • paper_url: http://arxiv.org/abs/2307.05972
  • repo_url: None
  • paper_authors: James O’ Neill, Sourav Dutta
  • for: Study the effects of post-training quantization and quantization-aware training on the generalization of Transformer language models, and propose a new self-distilled quantization (SDQ) method that minimizes accumulative quantization errors.
  • methods: SDQ is applied to the multilingual models XLM-R-Base and InfoXLM-Base, showing that both can be reduced from 32-bit floating-point weights to 8-bit integer weights while maintaining a high level of performance on the XGLUE benchmark (a generic quantization-plus-distillation sketch follows the abstract below).
  • results: The results also highlight the challenges of quantizing multilingual models, which must generalize to languages they were not fine-tuned on.
    Abstract We investigate the effects of post-training quantization and quantization-aware training on the generalization of Transformer language models. We present a new method called self-distilled quantization (SDQ) that minimizes accumulative quantization errors and outperforms baselines. We apply SDQ to multilingual models XLM-R-Base and InfoXLM-Base and demonstrate that both models can be reduced from 32-bit floating point weights to 8-bit integer weights while maintaining a high level of performance on the XGLUE benchmark. Our results also highlight the challenges of quantizing multilingual models, which must generalize to languages they were not fine-tuned on.
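
A generic way to picture quantization-aware training with a self-distillation signal is to keep a full-precision copy of the model as the teacher and train the quantized student against both the task labels and the teacher's outputs. The sketch below shows fake-quantization of weights plus a distillation loss; the exact loss and quantization scheme of SDQ are not specified here, so these details are assumptions.

```python
# Hedged sketch: fake-quantize weights to 8-bit integers during the forward
# pass and add a distillation term against a full-precision teacher. This is
# a generic QAT-plus-distillation pattern, not the SDQ method itself.
import torch
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    # Straight-through estimator: quantized values forward, identity gradient back.
    return w + (q * scale - w).detach()

def distilled_quantization_loss(student_logits, teacher_logits, labels,
                                alpha: float = 0.5, T: float = 2.0):
    task = F.cross_entropy(student_logits, labels)
    distill = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - alpha) * task + alpha * distill
```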

Prototypical Contrastive Transfer Learning for Multimodal Language Understanding

  • paper_url: http://arxiv.org/abs/2307.05942
  • repo_url: None
  • paper_authors: Seitaro Otsuki, Shintaro Ishikawa, Komei Sugiura
  • for: Help domestic service robots understand free-form natural language instructions so they can interact with people more smoothly.
  • methods: A novel transfer learning approach for multimodal language understanding, Prototypical Contrastive Transfer Learning (PCTL), which uses a new contrastive loss called Dual ProtoNCE (a sketch of the underlying loss family follows the abstract below).
  • results: Experiments show that PCTL outperforms existing methods, reaching an accuracy of 78.1% versus 73.4% for simple fine-tuning.
    Abstract Although domestic service robots are expected to assist individuals who require support, they cannot currently interact smoothly with people through natural language. For example, given the instruction "Bring me a bottle from the kitchen," it is difficult for such robots to specify the bottle in an indoor environment. Most conventional models have been trained on real-world datasets that are labor-intensive to collect, and they have not fully leveraged simulation data through a transfer learning framework. In this study, we propose a novel transfer learning approach for multimodal language understanding called Prototypical Contrastive Transfer Learning (PCTL), which uses a new contrastive loss called Dual ProtoNCE. We introduce PCTL to the task of identifying target objects in domestic environments according to free-form natural language instructions. To validate PCTL, we built new real-world and simulation datasets. Our experiment demonstrated that PCTL outperformed existing methods. Specifically, PCTL achieved an accuracy of 78.1%, whereas simple fine-tuning achieved an accuracy of 73.4%.
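
As background, a prototype-based contrastive loss of the InfoNCE family scores each embedding against class prototypes and maximizes the probability of its own prototype. The snippet below sketches that generic idea; the paper's Dual ProtoNCE formulation is not spelled out here, so treat this as an illustration of the loss family rather than the method itself.

```python
# Hedged sketch of a prototype-based InfoNCE loss: each embedding should be
# closest to the prototype (mean embedding) of its own class. This shows the
# loss family; it is not the paper's Dual ProtoNCE formulation.
import torch
import torch.nn.functional as F

def proto_nce_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                   temperature: float = 0.1) -> torch.Tensor:
    # embeddings: (batch, dim), labels: (batch,)
    embeddings = F.normalize(embeddings, dim=-1)
    classes = labels.unique()                                    # sorted class ids
    prototypes = torch.stack(
        [embeddings[labels == c].mean(dim=0) for c in classes])
    prototypes = F.normalize(prototypes, dim=-1)
    logits = embeddings @ prototypes.t() / temperature           # (batch, num_classes)
    targets = torch.searchsorted(classes, labels)                # class id -> prototype index
    return F.cross_entropy(logits, targets)

emb = torch.randn(16, 128, requires_grad=True)
lab = torch.randint(0, 4, (16,))
print(proto_nce_loss(emb, lab))
```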

Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding

  • paper_url: http://arxiv.org/abs/2307.05908
  • repo_url: None
  • paper_authors: Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee
  • for: Speed up greedy decoding in Large Language Models (LLMs) without changing the output.
  • methods: Additional compute resources are used to start decoding subsequent tokens in parallel with the current token decoding, reducing decoding latency (a toy simulation of the trade-off follows the abstract below).
  • results: The analysis shows that extra computational resources can accelerate LLM greedy decoding, with the potential latency reduction estimated analytically from the match rate, p_correct.
    Abstract This paper presents "Predictive Pipelined Decoding (PPD)," an approach that speeds up greedy decoding in Large Language Models (LLMs) while maintaining the exact same output as the original decoding. Unlike conventional strategies, PPD employs additional compute resources to parallelize the initiation of subsequent token decoding during the current token decoding. This innovative method reduces decoding latency and reshapes the understanding of trade-offs in LLM decoding strategies. We have developed a theoretical framework that allows us to analyze the trade-off between computation and latency. Using this framework, we can analytically estimate the potential reduction in latency associated with our proposed method, achieved through the assessment of the match rate, represented as p_correct. The results demonstrate that the use of extra computational resources has the potential to accelerate LLM greedy decoding.
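
The compute-latency trade-off can be explored with a toy simulation: at each decoding step, the speculatively launched next-token computation is useful with probability p_correct, in which case a step of sequential latency is saved. The simulation below is an illustrative assumption about how such savings accumulate, not the paper's analytical framework.

```python
# Hedged toy simulation: with probability p_correct the speculatively started
# next-token computation matches the greedy token and saves one step of latency.
# This illustrates the compute-latency trade-off; it is not the paper's model.
import random

def simulated_latency(num_tokens: int, p_correct: float, trials: int = 1000) -> float:
    total = 0
    for _ in range(trials):
        steps, t = 0, 0
        while t < num_tokens:
            steps += 1
            # The parallel, speculatively started step is correct with p_correct,
            # so the next token comes "for free" within the same latency slot.
            if t + 1 < num_tokens and random.random() < p_correct:
                t += 2
            else:
                t += 1
        total += steps
    return total / trials

for p in (0.3, 0.5, 0.7, 0.9):
    lat = simulated_latency(128, p)
    print(f"p_correct={p:.1f}: ~{lat:.0f} sequential steps for 128 tokens "
          f"({128 / lat:.2f}x speedup)")
```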

Exploring the Emotional and Mental Well-Being of Individuals with Long COVID Through Twitter Analysis

  • paper_url: http://arxiv.org/abs/2307.07558
  • repo_url: None
  • paper_authors: Guocheng Feng, Huaiyu Cai, Wei Quan
  • for: Understand the emotional and mental well-being of individuals with Long COVID and the topics that concern them most.
  • methods: Tweets are classified into four categories based on content, the presence of six basic emotions is detected, and prevalent topics are extracted.
  • results: Negative emotions dominated throughout the study period, with two peaks during critical periods such as the outbreak of new COVID variants.
    Abstract The COVID-19 pandemic has led to the emergence of Long COVID, a cluster of symptoms that persist after infection. Long COVID patients may also experience mental health challenges, making it essential to understand individuals' emotional and mental well-being. This study aims to gain a deeper understanding of Long COVID individuals' emotional and mental well-being, identify the topics that most concern them, and explore potential correlations between their emotions and social media activity. Specifically, we classify tweets into four categories based on the content, detect the presence of six basic emotions, and extract prevalent topics. Our analyses reveal that negative emotions dominated throughout the study period, with two peaks during critical periods, such as the outbreak of new COVID variants. The findings of this study have implications for policy and measures for addressing the mental health challenges of individuals with Long COVID and provide a foundation for future work.

Improved POS tagging for spontaneous, clinical speech using data augmentation

  • paper_url: http://arxiv.org/abs/2307.05796
  • repo_url: None
  • paper_authors: Seth Kulick, Neville Ryant, David J. Irwin, Naomi Nevler, Sunghye Cho
  • for: Improve POS tagging of transcripts of spontaneous speech from clinical populations.
  • methods: Instead of training on an in-domain treebank, the parser is trained on an out-of-domain newswire treebank, using data augmentation techniques to make those structures resemble natural, spontaneous speech.
  • results: Gains from training with the augmented data are confirmed on manually validated POS tags in clinical speech produced by patients with various neurodegenerative conditions.
    Abstract This paper addresses the problem of improving POS tagging of transcripts of speech from clinical populations. In contrast to prior work on parsing and POS tagging of transcribed speech, we do not make use of an in domain treebank for training. Instead, we train on an out of domain treebank of newswire using data augmentation techniques to make these structures resemble natural, spontaneous speech. We trained a parser with and without the augmented data and tested its performance using manually validated POS tags in clinical speech produced by patients with various types of neurodegenerative conditions.

Large Language Models

  • paper_url: http://arxiv.org/abs/2307.05782
  • repo_url: https://github.com/lm-sys/FastChat
  • paper_authors: Michael R. Douglas
  • for: These lectures, written for readers with a background in mathematics or physics, introduce the development and current state of large language models (LLMs) and how such models manage to perform tasks beyond next-word prediction.
  • methods: The lectures give a brief history and survey of the state of the art and describe the underlying transformer architecture in detail (a textbook attention sketch follows the abstract below).
  • results: They then explore current ideas on how LLMs work and how models trained to predict the next word in a text are able to perform other tasks displaying intelligence.
    Abstract Artificial intelligence is making spectacular progress, and one of the best examples is the development of large language models (LLMs) such as OpenAI's GPT series. In these lectures, written for readers with a background in mathematics or physics, we give a brief history and survey of the state of the art, and describe the underlying transformer architecture in detail. We then explore some current ideas on how LLMs work and how models trained to predict the next word in a text are able to perform other tasks displaying intelligence.
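
The central building block of the transformer architecture the lectures describe is scaled dot-product self-attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A compact NumPy rendering of that standard formula is given below for orientation; it is a textbook illustration, not code from the lectures.

```python
# Textbook illustration of scaled dot-product self-attention.
import numpy as np

def self_attention(x: np.ndarray, wq: np.ndarray, wk: np.ndarray, wv: np.ndarray):
    q, k, v = x @ wq, x @ wk, x @ wv          # project tokens to queries/keys/values
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)           # pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                        # each token mixes all values

rng = np.random.default_rng(0)
tokens, d_model, d_k = 5, 16, 8
x = rng.normal(size=(tokens, d_model))
wq, wk, wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)    # (5, 8)
```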

Neural Machine Translation Data Generation and Augmentation using ChatGPT

  • paper_url: http://arxiv.org/abs/2307.05779
  • repo_url: None
  • paper_authors: Wayne Yang, Garrett Nicolai
  • for: An alternative to manually created parallel corpora, allowing machine translation models to be trained more quickly and cost-effectively.
  • methods: Hallucinated parallel corpora are created with generative language models, which are themselves trained on parallel data but can leverage a multilingual vector space to create new data.
  • results: Experiments show that the hallucinated data improves the translation signal, even when its domain clashes with the original dataset.
    Abstract Neural models have revolutionized the field of machine translation, but creating parallel corpora is expensive and time-consuming. We investigate an alternative to manual parallel corpora - hallucinated parallel corpora created by generative language models. Although these models are themselves trained on parallel data, they can leverage a multilingual vector space to create data, and may be able to supplement small manually-procured corpora. Our experiments highlight two key findings - despite a lack of diversity in their output, the hallucinated data improves the translation signal, even when the domain clashes with the original dataset.

Towards Robust and Efficient Continual Language Learning

  • paper_url: http://arxiv.org/abs/2307.05741
  • repo_url: None
  • paper_authors: Adam Fisch, Amal Rannen-Triki, Razvan Pascanu, Jörg Bornschein, Angeliki Lazaridou, Elena Gribovskaya, Marc’Aurelio Ranzato
  • for: This paper studies how to quickly adapt models to new tasks from a continual learning perspective: models trained on past tasks keep being fine-tuned on new tasks, with the goal of transferring relevant knowledge while avoiding negative transfer.
  • methods: A new benchmark of task sequences is constructed that targets different possible transfer scenarios, such as sequences with high potential for positive transfer, high potential for negative transfer, no expected effect, or a mixture of each. An ideal learner should maximally exploit information from all tasks with any potential for positive transfer while avoiding the negative effects of distracting tasks.
  • results: A simple yet effective learner is proposed that satisfies many of these desiderata simply by selectively initializing new models from past task checkpoints (a generic selection sketch follows the abstract below). Limitations remain, and the benchmark is intended to help the community build and analyze better learners.
    Abstract As the application space of language models continues to evolve, a natural question to ask is how we can quickly adapt models to new tasks. We approach this classic question from a continual learning perspective, in which we aim to continue fine-tuning models trained on past tasks on new tasks, with the goal of "transferring" relevant knowledge. However, this strategy also runs the risk of doing more harm than good, i.e., negative transfer. In this paper, we construct a new benchmark of task sequences that target different possible transfer scenarios one might face, such as a sequence of tasks with high potential of positive transfer, high potential for negative transfer, no expected effect, or a mixture of each. An ideal learner should be able to maximally exploit information from all tasks that have any potential for positive transfer, while also avoiding the negative effects of any distracting tasks that may confuse it. We then propose a simple, yet effective, learner that satisfies many of our desiderata simply by leveraging a selective strategy for initializing new models from past task checkpoints. Still, limitations remain, and we hope this benchmark can help the community to further build and analyze such learners.
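
One way to picture a selective initialization strategy is to score every past-task checkpoint on a small validation sample of the new task and fine-tune from the best one. The sketch below illustrates that generic idea; the selection criterion used in the paper is not spelled out here, so this scoring rule is an assumption.

```python
# Hedged sketch: pick which past-task checkpoint to initialize from by scoring
# each one on a small sample of the new task. The scoring rule (held-out loss)
# is an assumption for illustration, not the paper's exact selection strategy.
from typing import Any, Callable, Dict

def select_initialization(
    checkpoints: Dict[str, Any],
    evaluate: Callable[[Any], float],   # returns loss of a checkpoint on a new-task sample
) -> str:
    scores = {name: evaluate(ckpt) for name, ckpt in checkpoints.items()}
    return min(scores, key=scores.get)  # lower held-out loss = more promising transfer

# Usage with made-up checkpoint names and pre-computed dummy losses:
dummy_losses = {"task_A.ckpt": 0.72, "task_B.ckpt": 0.41, "scratch.ckpt": 0.55}
print(select_initialization(dummy_losses, lambda loss: loss))  # -> task_B.ckpt
```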

Stack More Layers Differently: High-Rank Training Through Low-Rank Updates

  • paper_url: http://arxiv.org/abs/2307.05695
  • repo_url: https://github.com/guitaricet/peft_pretraining
  • paper_authors: Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, Anna Rumshisky
  • for: This paper explores low-rank training techniques as a way to reduce the computational resources required to train large neural networks.
  • methods: A new low-rank training method, ReLoRA, uses low-rank updates to train high-rank networks (a simplified sketch follows the abstract below).
  • results: ReLoRA is applied to pre-training transformer language models with up to 350M parameters and achieves performance comparable to regular neural network training. Its efficiency increases with model size, making it a promising approach for training multi-billion-parameter networks efficiently; the findings shed light on the potential of low-rank training techniques and their implications for scaling laws.
    Abstract Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparametrized models remains poorly understood, and alternative approaches do not necessarily make it cheaper to train high-performance models. In this paper, we explore low-rank training techniques as an alternative approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to pre-training transformer language models with up to 350M parameters and demonstrate comparable performance to regular neural network training. Furthermore, we observe that the efficiency of ReLoRA increases with model size, making it a promising approach for training multi-billion-parameter networks efficiently. Our findings shed light on the potential of low-rank training techniques and their implications for scaling laws.
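
The way low-rank updates can accumulate into a high-rank change can be sketched as a LoRA-style factorized update that is periodically merged into the frozen weight and then re-initialized, so each restart contributes a new low-rank direction. The module below is a simplified reading of that idea, not the authors' released implementation (see the linked repository for that).

```python
# Hedged sketch: a LoRA-style low-rank update (W + B A) that is periodically
# merged into the base weight and re-initialized, so successive low-rank
# updates can accumulate into a high-rank change. Simplified illustration only.
import torch
import torch.nn as nn

class ReLoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)     # frozen base weight
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.weight + self.B @ self.A).t()

    @torch.no_grad()
    def merge_and_restart(self):
        # Fold the current low-rank update into the base weight, then restart
        # the factors so the next phase learns a fresh low-rank direction.
        self.weight += self.B @ self.A
        self.A.normal_(std=0.02)
        self.B.zero_()

layer = ReLoRALinear(256, 256)
# ...train layer.A and layer.B for a while, then:
layer.merge_and_restart()
```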

Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features

  • paper_url: http://arxiv.org/abs/2307.05454
  • repo_url: https://github.com/google-research/multi-morph-checklist
  • paper_authors: Ester Hlavnova, Sebastian Ruder
  • for: This work investigates how NLP systems generalize to typological differences across the world's languages.
  • methods: M2C, a morphologically-aware framework for behavioral testing of NLP models, is used to generate tests that probe model behavior in light of specific linguistic features in 12 typologically diverse languages.
  • results: State-of-the-art language models excel at most tests in English but show generalization failures on specific typological characteristics, such as temporal expressions in Swahili and compounding possessives in Finnish; these findings motivate the development of models that address these blind spots.
    Abstract A challenge towards developing NLP systems for the world's languages is understanding how they generalize to typological differences relevant for real-world applications. To this end, we propose M2C, a morphologically-aware framework for behavioral testing of NLP models. We use M2C to generate tests that probe models' behavior in light of specific linguistic features in 12 typologically diverse languages. We evaluate state-of-the-art language models on the generated tests. While models excel at most tests in English, we highlight generalization failures to specific typological characteristics such as temporal expressions in Swahili and compounding possessives in Finnish. Our findings motivate the development of models that address these blind spots.

Duncode Characters Shorter

  • paper_url: http://arxiv.org/abs/2307.05414
  • repo_url: https://github.com/laohur/duncode
  • paper_authors: Changshang Xue
  • for: This work investigates the use of various encoders for text transformation, converting characters into bytes.
  • methods: Local encoders such as ASCII and GB-2312, which encode specific characters into shorter byte sequences, are contrasted with universal encoders such as UTF-8 and UTF-16, which cover the complete Unicode set at a greater space cost; other encoders, including SCSU, BOCU-1, and binary encoders, lack self-synchronizing capabilities (a byte-length comparison of these baselines follows the abstract below).
  • results: Duncode is introduced as a new encoding method that aims to encode the entire Unicode character set with high space efficiency, similar to local encoders, by compressing multiple characters of a string into a Duncode unit using fewer bytes. Although it offers less self-synchronizing identification information, Duncode surpasses UTF-8 in space efficiency. The application is available at https://github.com/laohur/duncode, and a benchmark covering 179 languages for evaluating character encoders is available at https://github.com/laohur/wiki2txt.
    Abstract This paper investigates the employment of various encoders in text transformation, converting characters into bytes. It discusses local encoders such as ASCII and GB-2312, which encode specific characters into shorter bytes, and universal encoders like UTF-8 and UTF-16, which can encode the complete Unicode set with greater space requirements and are gaining widespread acceptance. Other encoders, including SCSU, BOCU-1, and binary encoders, however, lack self-synchronizing capabilities. Duncode is introduced as an innovative encoding method that aims to encode the entire Unicode character set with high space efficiency, akin to local encoders. It has the potential to compress multiple characters of a string into a Duncode unit using fewer bytes. Despite offering less self-synchronizing identification information, Duncode surpasses UTF8 in terms of space efficiency. The application is available at \url{https://github.com/laohur/duncode}. Additionally, we have developed a benchmark for evaluating character encoders across different languages. It encompasses 179 languages and can be accessed at \url{https://github.com/laohur/wiki2txt}.
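
The space trade-off between local and universal encoders is easy to see directly in Python: the same Chinese string costs fewer bytes in a local encoding such as GB-2312 than in UTF-8, while ASCII text costs one byte per character in both. The comparison below only illustrates the baselines the paper measures against; Duncode itself is not shown, since its unit format is described only in the linked repository.

```python
# Byte-length comparison between a local encoder (GB-2312) and universal
# encoders (UTF-8 / UTF-16) -- the baseline trade-off the paper discusses.
samples = {"ASCII": "hello world", "Chinese": "统一码字符集"}

for name, text in samples.items():
    print(f"{name}: {len(text)} characters")
    for codec in ("ascii", "gb2312", "utf-8", "utf-16-le"):
        try:
            print(f"  {codec:>9}: {len(text.encode(codec))} bytes")
        except UnicodeEncodeError:
            print(f"  {codec:>9}: not representable")
```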

BLUEX: A benchmark based on Brazilian Leading Universities Entrance eXams

  • paper_url: http://arxiv.org/abs/2307.05410
  • repo_url: https://github.com/portuguese-benchmark-datasets/bluex
  • paper_authors: Thales Sales Almeida, Thiago Laitz, Giovana K. Bonás, Rodrigo Nogueira
  • for: The paper addresses the lack of high-quality datasets for evaluating natural language processing (NLP) models in Portuguese and provides a new dataset, BLUEX, for advancing the state of the art in Portuguese NLP.
  • methods: BLUEX consists of entrance exams from two leading Brazilian universities, UNICAMP and USP, with annotated metadata for evaluating NLP models on a variety of subjects. It also includes recently administered exams that are unlikely to appear in the training data of many popular LMs as of 2023, as well as annotations of the position of images in each question, supporting multimodal research.
  • results: The paper establishes a benchmark through experiments with state-of-the-art LMs, demonstrating BLUEX's potential for advancing the state of the art in natural language understanding and reasoning in Portuguese.
    Abstract One common trend in recent studies of language models (LMs) is the use of standardized tests for evaluation. However, despite being the fifth most spoken language worldwide, few such evaluations have been conducted in Portuguese. This is mainly due to the lack of high-quality datasets available to the community for carrying out evaluations in Portuguese. To address this gap, we introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP. The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects. Furthermore, BLUEX includes a collection of recently administered exams that are unlikely to be included in the training data of many popular LMs as of 2023. The dataset is also annotated to indicate the position of images in each question, providing a valuable resource for advancing the state-of-the-art in multimodal language understanding and reasoning. We describe the creation and characteristics of BLUEX and establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing the state-of-the-art in natural language understanding and reasoning in Portuguese. The data and relevant code can be found at https://github.com/Portuguese-Benchmark-Datasets/BLUEX