2023-11-12

cs.CL

SELF-EXPLAIN: Teaching Large Language Models to Reason Complex Questions by Themselves

paper_url: http://arxiv.org/abs/2311.06985
repo_url: None
paper_authors: Jiachen Zhao, Zonghai Yao, Zhichao Yang, Hong Yu
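for: Improving the reliability of LLM reasoning without human-crafted chain-of-thought demonstrations.
methods: SELF-EXPLAIN, which has the LLM generate its own chain-of-thought (CoT) examples as self-explanations.
results: Self-explanations make LLMs more confident, better calibrated, and less biased. On several complex question answering datasets, prompting with them matches or even significantly outperforms human-crafted CoT examples.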

Abstract
Large language models (LLMs) can generate intermediate reasoning steps. To elicit reliable reasoning, the common practice is to employ few-shot chain-of-thought prompting, where several in-context demonstrations for reasoning are prepended to the question. However, such chain-of-thought examples are expensive to craft, especially for professional domains, and can have high variance depending on human annotators. Therefore, this work investigates whether LLMs can teach themselves to reason without human-crafted demonstrations. We propose SELF-EXPLAIN, which has LLMs generate CoT examples, inspired by "encoding specificity" in human memory retrieval. We find that using self-explanations makes LLMs more confident, more calibrated, and less biased when answering complex questions. Moreover, we find that prompting with self-explanations can even significantly outperform using human-crafted CoTs on several complex question answering datasets.
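
The few-shot chain-of-thought setup described above, where reasoning demonstrations are prepended to the question, can be sketched in a few lines of Python. This is only an illustration of the prompt layout, not the SELF-EXPLAIN procedure itself: the `Q:`/`A:` template, the function name `build_cot_prompt`, and the example questions are all assumptions made for the sketch.

```python
from typing import List, Tuple


def build_cot_prompt(demos: List[Tuple[str, str, str]], question: str) -> str:
    """Prepend (question, reasoning, answer) demonstrations to a new question."""
    blocks = []
    for demo_question, reasoning, answer in demos:
        # Each demonstration shows the reasoning before the final answer.
        blocks.append(f"Q: {demo_question}\nA: {reasoning} The answer is {answer}.")
    # The last block leaves the answer open so the model continues with its own reasoning.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# Hypothetical demonstration and test question.
demos = [
    (
        "If there are 3 cars and 2 more arrive, how many cars are there?",
        "There are 3 cars and 2 more arrive, so 3 + 2 = 5.",
        "5",
    )
]
prompt = build_cot_prompt(demos, "A farmer has 4 hens and buys 6 more. How many hens are there?")
print(prompt)
```

Whether the demonstrations come from human annotators or from the model itself, the prompt layout stays the same; only the source of the `reasoning` strings changes.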

Retrieval and Generative Approaches for a Pregnancy Chatbot in Nepali with Stemmed and Non-Stemmed Data : A Comparative Study

paper_url: http://arxiv.org/abs/2311.06898
repo_url: None
paper_authors: Sujan Poudel, Nabin Ghimire, Bipesh Subedi, Saugat Singh
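for: This study aims to develop a health-domain chatbot, built on natural language processing (NLP) techniques, that provides pregnancy-related information in Nepali.
methods: Two NLP approaches are compared: a multiclass classification-based retrieval approach using multilingual BERT and multilingual DistilBERT, and a transformer-based generative chatbot.
results: BERT-based pre-trained models perform well on non-stemmed data, while transformer models trained from scratch perform better on stemmed data. On the non-stemmed dataset, DistilBERT achieved the highest training and validation accuracy and a testing accuracy of 0.9165. In the generative approach, the transformer achieved 1-gram and 2-gram BLEU scores of 0.3570 and 0.1413.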

Abstract
The field of Natural Language Processing, which involves the use of artificial intelligence to support human languages, has seen tremendous growth due to its high-quality features. Its applications, such as language translation, chatbots, virtual assistants, search autocomplete, and autocorrect, are widely used in various domains including healthcare, advertising, and customer service. To provide pregnancy-related information, a health-domain chatbot has been proposed, and this work explores two different NLP-based approaches for developing the chatbot. The first approach is a multiclass classification-based retrieval approach using the BERT-based models multilingual BERT and multilingual DistilBERT, while the other approach employs a transformer-based generative chatbot for pregnancy-related information. The performance on both stemmed and non-stemmed datasets in the Nepali language has been analyzed for each approach. The experimental results indicate that BERT-based pre-trained models perform well on non-stemmed data, whereas transformer models trained from scratch perform better on stemmed data. Among the models tested, DistilBERT achieved the highest training and validation accuracy, and a testing accuracy of 0.9165, with the retrieval-based architecture on the non-stemmed dataset. Similarly, the generative transformer architecture achieved 1-gram and 2-gram BLEU scores of 0.3570 and 0.1413, respectively.
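
The BLEU figures quoted above are n-gram overlap scores. As a rough illustration of what 1-gram and 2-gram BLEU measure, here is a minimal single-reference sketch in Python. It assumes cumulative BLEU with uniform weights and a brevity penalty, with no smoothing; the abstract does not say which BLEU variant or tokenizer the authors used, so the function names and the example sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram precision: each candidate n-gram counts at most as often as in the reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate: List[str], reference: List[str], max_n: int) -> float:
    """Cumulative BLEU up to max_n with uniform weights and brevity penalty (assumed variant)."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_mean)


# Hypothetical English example; the paper works on Nepali text.
candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(bleu(candidate, reference, 1), bleu(candidate, reference, 2))
```

Because 2-gram BLEU multiplies in the bigram precision, it can never exceed the 1-gram score under this definition, which is consistent with the ordering of the two reported values.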

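Summary
Natural language processing (NLP), which applies artificial intelligence to human language, has grown rapidly. Its applications, including translation, chatbots, virtual assistants, search autocomplete, and autocorrect, are widely used in healthcare, advertising, customer service, and other domains. To provide pregnancy-related information, this work proposes a health-domain chatbot and explores two NLP-based ways to build it. The first is a BERT-based multiclass classification retrieval approach using multilingual BERT and multilingual DistilBERT; the second is a transformer-based generative approach. Both are analyzed on stemmed and non-stemmed Nepali data. BERT-based pre-trained models perform well on non-stemmed data, while transformers trained from scratch perform better on stemmed data. On the non-stemmed data, DistilBERT reached the highest training and validation accuracy and a testing accuracy of 0.9165; the generative transformer reached 1-gram and 2-gram BLEU scores of 0.3570 and 0.1413.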

DialMAT: Dialogue-Enabled Transformer with Moment-Based Adversarial Training

paper_url: http://arxiv.org/abs/2311.06855
repo_url: https://github.com/keio-smilab23/dialmat
paper_authors: Kanta Kaneda, Ryosuke Korekata, Yuiga Wada, Shunya Nagashima, Motonari Kambara, Yui Iioka, Haruka Matsuo, Yuto Imai, Takayuki Nishimura, Komei Sugiura
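for: This study targets the DialFRED task, an embodied instruction-following task in which the agent can actively ask questions about the task.
methods: The study proposes DialMAT, which uses Moment-based Adversarial Training to add adversarial perturbations to the latent space of language, image, and action. It also introduces a crossmodal parallel feature extraction mechanism that applies foundation models to both language and image.
results: Evaluated on a dataset constructed from the DialFRED dataset, the model outperformed the baseline method in success rate and path-weighted success rate. It also took first place in the DialFRED Challenge held at the CVPR 2023 Embodied AI workshop.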

Abstract
This paper focuses on the DialFRED task, which is the task of embodied instruction following in a setting where an agent can actively ask questions about the task. To address this task, we propose DialMAT. DialMAT introduces Moment-based Adversarial Training, which incorporates adversarial perturbations into the latent space of language, image, and action. Additionally, it introduces a crossmodal parallel feature extraction mechanism that applies foundation models to both language and image. We evaluated our model using a dataset constructed from the DialFRED dataset and demonstrated superior performance compared to the baseline method in terms of success rate and path weighted success rate. The model secured the top position in the DialFRED Challenge, which took place at the CVPR 2023 Embodied AI workshop.

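Summary
This paper focuses on the DialFRED task, an embodied instruction-following task in a setting where the agent can actively ask questions about the task. To address it, we propose DialMAT. DialMAT introduces Moment-based Adversarial Training, which incorporates adversarial perturbations into the latent space of language, image, and action. It also introduces a crossmodal parallel feature extraction mechanism that applies foundation models to both language and image. Evaluated on a dataset constructed from the DialFRED dataset, our model shows a clear advantage over the baseline method in success rate and path-weighted success rate. It took first place in the DialFRED Challenge held at the CVPR 2023 Embodied AI workshop.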

Automatic Textual Normalization for Hate Speech Detection

paper_url: http://arxiv.org/abs/2311.06851
repo_url: https://github.com/anhhoang0529/small-lexnormvihsd
paper_authors: Anh Thi-Hoang Nguyen, Dung Ha Nguyen, Nguyet Thi Nguyen, Khanh Thanh-Duy Ho, Kiet Van Nguyen
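for: This study aims to improve the handling of non-standard words (NSW) in social media data so that NLP tools work better on it.
methods: A single sequence-to-sequence (Seq2Seq) model performs textual normalization. The study also provides a dataset of 2,181 human-annotated comments with an inter-annotator agreement of 0.9014.
results: Textual normalization with the Seq2Seq model improves the accuracy of the Hate Speech Detection (HSD) task by approximately 2%, which suggests it can benefit complex NLP tasks more broadly. The dataset is available for research purposes.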

Abstract
Social media data is a valuable resource for research, yet it contains a wide range of non-standard words (NSW). These irregularities hinder the effective operation of NLP tools. Current state-of-the-art methods for the Vietnamese language address this issue as a problem of lexical normalization, involving the creation of manual rules or the implementation of multi-staged deep learning frameworks, which necessitate extensive efforts to craft intricate rules. In contrast, our approach is straightforward, employing solely a sequence-to-sequence (Seq2Seq) model. In this research, we provide a dataset for textual normalization, comprising 2,181 human-annotated comments with an inter-annotator agreement of 0.9014. By leveraging the Seq2Seq model for textual normalization, our results reveal that the accuracy achieved falls slightly short of 70%. Nevertheless, textual normalization enhances the accuracy of the Hate Speech Detection (HSD) task by approximately 2%, demonstrating its potential to improve the performance of complex NLP tasks. Our dataset is accessible for research purposes.

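Summary
Social media data is a valuable resource for research, but it contains a wide range of non-standard words (NSW). These irregularities hinder the effective operation of NLP tools. Current state-of-the-art methods for Vietnamese treat this as a lexical normalization problem and rely on manual rules or multi-stage deep learning frameworks, which take extensive effort to build. In contrast, our approach is simple and uses only a sequence-to-sequence (Seq2Seq) model. We provide a textual normalization dataset of 2,181 human-annotated comments with an inter-annotator agreement of 0.9014. With the Seq2Seq model, normalization accuracy falls slightly short of 70%. Even so, textual normalization improves the accuracy of the Hate Speech Detection (HSD) task by approximately 2%, which shows its potential to improve complex NLP tasks. Our dataset is available for research purposes.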

GIELLM: Japanese General Information Extraction Large Language Model Utilizing Mutual Reinforcement Effect

paper_url: http://arxiv.org/abs/2311.06838
repo_url: None
paper_authors: Chengguang Gan, Qinghao Zhang, Tatsunori Mori
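for: This study aims to develop a single general model, the General Information Extraction Large Language Model (GIELLM), that handles multiple information extraction (IE) subtasks at once instead of relying on specialized models.
methods: A uniform input-output schema integrates text classification, sentiment analysis, named entity recognition, relation extraction, and event extraction. The model also leverages the Mutual Reinforcement Effect (MRE), which improves performance on the integrated tasks.
results: GIELLM achieves State-of-the-Art (SOTA) results on five of six Japanese mixed datasets, significantly surpassing GPT-3.5-Turbo. An independent evaluation on the new Text Classification Relation and Event Extraction (TCREE) dataset further corroborates the benefit of MRE. The authors conclude that most IE subtasks can be subsumed under a single LLM framework.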

Abstract
Information Extraction (IE) stands as a cornerstone in natural language processing, traditionally segmented into distinct sub-tasks. The advent of Large Language Models (LLMs) heralds a paradigm shift, suggesting the feasibility of a singular model addressing multiple IE subtasks. In this vein, we introduce the General Information Extraction Large Language Model (GIELLM), which integrates Text Classification, Sentiment Analysis, Named Entity Recognition, Relation Extraction, and Event Extraction using a uniform input-output schema. This innovation marks the first instance of a model simultaneously handling such a diverse array of IE subtasks. Notably, the GIELLM leverages the Mutual Reinforcement Effect (MRE), enhancing performance in integrated tasks compared to their isolated counterparts. Our experiments demonstrate State-of-the-Art (SOTA) results in five out of six Japanese mixed datasets, significantly surpassing GPT-3.5-Turbo. Further, an independent evaluation using the novel Text Classification Relation and Event Extraction (TCREE) dataset corroborates the synergistic advantages of MRE in text and word classification. This breakthrough paves the way for most IE subtasks to be subsumed under a singular LLM framework. Specialized, fine-tuned task-specific models are no longer needed.

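Summary
Information Extraction (IE) is a cornerstone of natural language processing and has traditionally been divided into distinct subtasks. The arrival of Large Language Models (LLMs) marks a paradigm shift and suggests that a single model can handle multiple IE subtasks. We therefore introduce the General Information Extraction Large Language Model (GIELLM), which integrates text classification, sentiment analysis, named entity recognition, relation extraction, and event extraction under a uniform input-output schema. This is the first model to handle such a diverse set of IE subtasks simultaneously. Notably, GIELLM leverages the Mutual Reinforcement Effect (MRE), which improves performance on integrated tasks relative to the same tasks in isolation. Our experiments show State-of-the-Art (SOTA) results on Japanese mixed datasets, clearly surpassing GPT-3.5-Turbo. An independent evaluation on the new Text Classification Relation and Event Extraction (TCREE) dataset also confirms the synergistic advantage of MRE in text and word classification. This breakthrough may allow most IE subtasks to be subsumed under a single LLM framework, with no need for specialized fine-tuned task-specific models.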

Cricket Player Profiling: Unraveling Strengths and Weaknesses Using Text Commentary Data

paper_url: http://arxiv.org/abs/2311.06818
repo_url: None
paper_authors: Swarup Ranjan Behera, Vijaya V. Saradhi
for: This paper aims to develop computational models to extract the rules governing cricket players’ strengths and weaknesses, with the goal of devising player-specific strategies.
methods: The paper utilizes unstructured data from cricket text commentary to construct comprehensive strength and weakness rules for cricket players, and employs dimensionality reduction techniques to simplify the rule-building process.
results: The paper conducts an in-depth analysis of cricket player strengths and weaknesses using a vast corpus of over one million text commentaries, and validates the constructed rules through two distinct methodologies: intrinsic and extrinsic. The results are made openly accessible, including the collected data, source code, and results for over 250 cricket players.

Abstract
Devising player-specific strategies in cricket necessitates a meticulous understanding of each player's unique strengths and weaknesses. Nevertheless, the absence of a definitive computational approach to extract such insights from cricket players poses a significant challenge. This paper seeks to address this gap by establishing computational models designed to extract the rules governing player strengths and weaknesses, thereby facilitating the development of tailored strategies for individual players. The complexity of this endeavor lies in several key areas: the selection of a suitable dataset, the precise definition of strength and weakness rules, the identification of an appropriate learning algorithm, and the validation of the derived rules. To tackle these challenges, we propose the utilization of unstructured data, specifically cricket text commentary, as a valuable resource for constructing comprehensive strength and weakness rules for cricket players. We also introduce computationally feasible definitions for the construction of these rules, and present a dimensionality reduction technique for the rule-building process. In order to showcase the practicality of this approach, we conduct an in-depth analysis of cricket player strengths and weaknesses using a vast corpus of more than one million text commentaries. Furthermore, we validate the constructed rules through two distinct methodologies: intrinsic and extrinsic. The outcomes of this research are made openly accessible, including the collected data, source code, and results for over 250 cricket players, which can be accessed at https://bit.ly/2PKuzx8.

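Summary
Devising player-specific strategies in cricket requires a detailed understanding of each player's unique strengths and weaknesses. However, there is no definitive computational method for extracting such insights about cricket players. This paper addresses the gap by building computational models that extract the rules governing player strengths and weaknesses, so that strategies can be tailored to individual players. The difficulty lies in several key areas: selecting a suitable dataset, precisely defining strength and weakness rules, choosing an appropriate learning algorithm, and validating the derived rules. To tackle these challenges, we propose using unstructured data, specifically cricket text commentary, as a valuable resource for constructing comprehensive strength and weakness rules. We also introduce computationally feasible definitions for constructing the rules and present a dimensionality reduction technique for the rule-building process. To demonstrate the practicality of the approach, we analyze player strengths and weaknesses in depth using more than one million text commentaries. We further validate the constructed rules in two ways: intrinsically and extrinsically. The results are openly accessible, including the collected data, source code, and results, at https://bit.ly/2PKuzx8.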

Evaluation of GPT-4 for chest X-ray impression generation: A reader study on performance and perception

paper_url: http://arxiv.org/abs/2311.06815
repo_url: None
paper_authors: Sebastian Ziegelmayer, Alexander W. Marka, Nicolas Lenhart, Nadja Nehls, Stefan Reischl, Felix Harder, Andreas Sauter, Marcus Makowski, Markus Graf, Joshua Gawlitza
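for: This study examines whether GPT-4 can generate high-quality impressions for chest X-rays.
methods: GPT-4 was given three input modalities (image, text, or text and image) and generated a corresponding impression for each. Radiologists then rated the impressions in a blinded reading and classified each as human-written or AI-generated.
results: The human-written impressions scored highest, though not significantly differently from the text-based impressions. Automated evaluation metrics correlated moderately to substantially with the radiological score for image impressions, but individual scores diverged widely across inputs. Detection of AI-generated impressions varied by input. Impressions classified as AI-generated received significantly worse scores, even when a radiologist had written them.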

Abstract
The remarkable generative capabilities of multimodal foundation models are currently being explored for a variety of applications. Generating radiological impressions is a challenging task that could significantly reduce the workload of radiologists. In our study we explored and analyzed the generative abilities of GPT-4 for chest X-ray impression generation. To generate and evaluate impressions of chest X-rays based on different input modalities (image, text, text and image), a blinded radiological report was written for 25 cases of the publicly available NIH dataset. GPT-4 was given the image, the findings section, or both sequentially to generate an input-dependent impression. In a blind randomized reading, 4 radiologists rated the impressions and were asked to classify the impression origin (Human, AI), providing justification for their decision. Lastly, text model evaluation metrics and their correlation with the radiological score (summation of the 4 dimensions) were assessed. According to the radiological score, the human-written impression was rated highest, although not significantly different from text-based impressions. The automated evaluation metrics showed moderate to substantial correlations with the radiological score for the image impressions; however, individual scores were highly divergent among inputs, indicating insufficient representation of radiological quality. Detection of AI-generated impressions varied by input and was 61% for text-based impressions. Impressions classified as AI-generated had significantly worse radiological scores even when written by a radiologist, indicating potential bias. Our study revealed significant discrepancies between a radiological assessment and common automatic evaluation metrics depending on the model input. The detection of AI-generated findings is subject to the bias that highly rated impressions are perceived as human-written.

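Summary
The generative capabilities of multimodal foundation models are being explored for a variety of applications. Generating radiological impressions is a challenging task that could reduce the workload of radiologists. In this study we explored and analyzed the ability of GPT-4 to generate chest X-ray impressions. To generate and evaluate impressions from different input modalities (image, text, text and image), a blinded radiological report was written for 25 cases from the NIH dataset. GPT-4 received the image, the findings section, or both, and generated an input-dependent impression. In a blinded randomized reading, 4 radiologists rated the impressions, classified the origin of each impression (human, AI), and justified their decision. Text model evaluation metrics showed moderate to substantial correlation with the radiological score (the sum of the four dimensions), but individual scores diverged widely across inputs, which indicates that the metrics represent radiological quality insufficiently. Detection of AI-generated impressions varied by input and was 61% for text-based impressions. Impressions classified as AI-generated received significantly worse radiological scores even when a radiologist had written them, which indicates potential bias. Our study reveals discrepancies between radiological assessment and automatic evaluation metrics, particularly across input modalities.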

On the Robustness of Question Rewriting Systems to Questions of Varying Hardness

paper_url: http://arxiv.org/abs/2311.06807
repo_url: https://github.com/nusnlp/diffqre
paper_authors: Hai Ye, Hwee Tou Ng, Wenjuan Han
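for: This paper studies how robust question rewriting (QR) systems are to questions of varying rewriting hardness, and how to improve that robustness.
methods: The paper proposes a heuristic method that automatically classifies questions into subsets of varying hardness, and uses human evaluation to annotate rewriting hardness. Finally, it proposes a learning framework that trains a QR model independently on each hardness subset and then combines these models into one joint model for inference.
results: Experiments on two datasets show that the framework improves overall performance compared to the baselines.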

Abstract
In conversational question answering (CQA), the task of question rewriting (QR) in context aims to rewrite a context-dependent question into an equivalent self-contained question that gives the same answer. In this paper, we are interested in the robustness of a QR system to questions varying in rewriting hardness or difficulty. Since there is a lack of questions classified based on their rewriting hardness, we first propose a heuristic method to automatically classify questions into subsets of varying hardness, by measuring the discrepancy between a question and its rewrite. To find out what makes questions hard or easy for rewriting, we then conduct a human evaluation to annotate the rewriting hardness of questions. Finally, to enhance the robustness of QR systems to questions of varying hardness, we propose a novel learning framework for QR that first trains a QR model independently on each subset of questions of a certain level of hardness, then combines these QR models as one joint model for inference. Experimental results on two datasets show that our framework improves the overall performance compared to the baselines.
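
To make the idea of classifying questions by the discrepancy between a question and its rewrite more concrete, here is a small Python sketch. It is not the authors' heuristic: the abstract does not name the discrepancy measure, so this sketch assumes a token-level Jaccard distance, and the thresholds, bucket names, and example questions are hypothetical.

```python
from typing import Tuple


def discrepancy(question: str, rewrite: str) -> float:
    """Assumed measure: 1 minus the Jaccard overlap of lowercase whitespace tokens."""
    q_tokens = set(question.lower().split())
    r_tokens = set(rewrite.lower().split())
    union = q_tokens | r_tokens
    if not union:
        return 0.0
    return 1.0 - len(q_tokens & r_tokens) / len(union)


def hardness_bucket(question: str, rewrite: str,
                    thresholds: Tuple[float, float] = (0.3, 0.6)) -> str:
    """Map the discrepancy to a hardness subset; the thresholds are illustrative only."""
    d = discrepancy(question, rewrite)
    if d < thresholds[0]:
        return "easy"
    if d < thresholds[1]:
        return "medium"
    return "hard"


# Hypothetical (question, rewrite) pairs.
pairs = [
    ("Who wrote Hamlet?", "Who wrote Hamlet?"),
    ("Where was he born?", "Where was Barack Obama born?"),
    ("And then?", "What did Marie Curie do after winning the Nobel Prize?"),
]
for question, rewrite in pairs:
    print(hardness_bucket(question, rewrite), "|", question)
```

The intuition is that a question needing little change is already nearly self-contained, while a question that shares almost nothing with its rewrite depends heavily on the conversation context. A separate QR model could then be trained on each bucket, as the framework describes.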

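Summary
In conversational question answering (CQA), the task of question rewriting (QR) is to rewrite a context-dependent question into an equivalent self-contained question that yields the same answer. In this paper, we are interested in the robustness of QR systems to questions of varying rewriting hardness. Because no questions are classified by rewriting hardness, we first propose a heuristic method that uses the discrepancy between a question and its rewrite to classify questions automatically into subsets of varying hardness. We then conduct a human evaluation to annotate the rewriting hardness of questions. Finally, we propose a new learning framework to make QR systems more robust to question hardness: we first train a QR model independently on each hardness level, then combine these QR models into one joint model for inference. Experimental results show that our framework improves overall performance compared to the baselines.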

Tunable Soft Prompts are Messengers in Federated Learning

paper_url: http://arxiv.org/abs/2311.06805
repo_url: https://github.com/alibaba/federatedscope
paper_authors: Chenhe Dong, Yuexiang Xie, Bolin Ding, Ying Shen, Yaliang Li
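for: The paper proposes a new federated learning (FL) training approach that protects the privacy of the global model while reducing communication and computation costs.
methods: Participants exchange information through tunable soft prompts that are updated and transmitted between the server and clients. The soft prompts assume the role of the global model parameters and carry useful knowledge from the local data and the global model.
results: Experiments against several baselines show that the approach is effective: it reduces the communication and computation costs of FL while protecting the global model.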

Abstract
Federated learning (FL) enables multiple participants to collaboratively train machine learning models using decentralized data sources, alleviating privacy concerns that arise from directly sharing local data. However, the lack of model privacy protection in FL becomes a non-negligible challenge, especially when people want to fine-tune models in a federated manner based on a proprietary large language model. In this study, we propose a novel FL training approach that accomplishes information exchange among participants via tunable soft prompts. These soft prompts, updated and transmitted between the server and clients, assume the role of the global model parameters and serve as messengers to deliver useful knowledge from the local data and global model. As the global model itself is not required to be shared and the local training is conducted based on an auxiliary model with fewer parameters than the global model, the proposed approach provides protection for the global model while reducing communication and computation costs in FL. Extensive experiments show the effectiveness of the proposed approach compared to several baselines. We have released the source code at https://github.com/alibaba/FederatedScope/tree/fedsp/federatedscope/nlp/fedsp.
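
The key point above is that only the soft prompts travel between server and clients, not the model weights. A server-side aggregation step could look like the following Python sketch. It assumes a FedAvg-style weighted mean over client prompt matrices, which the abstract does not confirm; the function name, the list-of-lists prompt representation, and the example numbers are hypothetical and unrelated to the FederatedScope API.

```python
from typing import List

Prompt = List[List[float]]  # assumed shape: prompt_length x embedding_dim


def aggregate_soft_prompts(client_prompts: List[Prompt], client_sizes: List[int]) -> Prompt:
    """Weighted mean of client soft prompts, weighted by local sample count (assumed rule)."""
    if not client_prompts or len(client_prompts) != len(client_sizes):
        raise ValueError("need one sample count per client prompt")
    total = sum(client_sizes)
    rows = len(client_prompts[0])
    cols = len(client_prompts[0][0])
    aggregated = [[0.0] * cols for _ in range(rows)]
    for prompt, size in zip(client_prompts, client_sizes):
        weight = size / total
        for i in range(rows):
            for j in range(cols):
                aggregated[i][j] += weight * prompt[i][j]
    return aggregated


# Two hypothetical clients, each holding a 1 x 2 soft prompt.
global_prompt = aggregate_soft_prompts(
    client_prompts=[[[1.0, 2.0]], [[3.0, 4.0]]],
    client_sizes=[1, 3],
)
print(global_prompt)
```

In such a scheme each round transmits only `prompt_length x embedding_dim` numbers per client, which is where the communication savings described in the abstract would come from.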

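Summary
Federated learning (FL) allows multiple participants to train machine learning models collaboratively on decentralized data sources, which alleviates the privacy concerns of sharing local data directly. However, the lack of model privacy protection in FL becomes a challenge that cannot be ignored, especially when people want to fine-tune models in a federated manner on top of a proprietary large language model. In this study, we propose a new FL training approach in which participants exchange information through tunable soft prompts. These soft prompts are updated and transmitted between the server and clients, assume the role of the global model parameters, and deliver useful knowledge from the local data and global model. Because the global model itself does not need to be shared, and local training runs on an auxiliary model with fewer parameters, the approach protects the global model while reducing the communication and computation costs of FL. Experiments show the effectiveness of the approach compared with several baselines. We have released the source code at https://github.com/alibaba/FederatedScope/tree/fedsp/federatedscope/nlp/fedsp.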

CLAMP: A Contrastive Language And Molecule Pre-training Network

paper_url: http://arxiv.org/abs/2311.07617
repo_url: https://github.com/neelr/clamp
paper_authors: Neel Redkar
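for: The paper explores a new approach to material generation: a language-to-material generation architecture that draws on millions of untapped data points.
methods: A contrastive model is trained with a convolutional graph neural network encoder and a language encoder. This allows unsupervised zero-shot classification that takes advantage of linguistic structure.
results: Without any specific training data, the model reached ~82% accuracy, and ~75% accuracy for photocatalyst prediction with an extremely small dataset. The network could in principle be applied to any reaction that can be described in text, opening new ways to think about 3D chemical framework generation.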

Abstract
This paper highlights a shift in how to approach material generation. Instead of material-to-material, we propose a language-to-material generation architecture that utilizes millions of untapped data points. Using a web scraper to collect crystal-text pairs from open-source research papers, a contrastive model can be trained using a convolutional graph neural network encoder and a language encoder. This would allow unsupervised zero-shot classification which can be trained by taking advantage of linguistic structure. Without any specific training data, an ~82% accuracy was achieved and ~75% accuracy for photocatalyst prediction with an extremely small dataset. This novel network could ideally be cross-applied to any reaction that can be described via text, opening completely new methods to think about 3D chemical framework generation. In the full experiment diffusion models would likely be incorporated to fully exploit the latent space.
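
A contrastive model over paired crystal and text embeddings is commonly trained with a symmetric cross-entropy objective over a similarity matrix. The Python sketch below shows that objective in plain Python. It assumes a CLIP-style loss with cosine similarity and a temperature of 0.07; the abstract does not state the exact loss, so these choices, the function names, and the toy embeddings are assumptions, and the two encoders are replaced by hand-written vectors.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Vector) -> Vector:
    """Scale to unit length; assumes a non-zero vector."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def symmetric_contrastive_loss(crystal_emb: List[Vector], text_emb: List[Vector],
                               temperature: float = 0.07) -> float:
    """Pair i of crystals and texts is the positive; all other pairs in the batch are negatives."""
    crystals = [normalize(v) for v in crystal_emb]
    texts = [normalize(v) for v in text_emb]
    n = len(crystals)
    logits = [[dot(crystals[i], texts[j]) / temperature for j in range(n)] for i in range(n)]
    crystal_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_crystal = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (crystal_to_text + text_to_crystal)


# Toy embeddings: aligned pairs give a low loss, swapped pairs a high one.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

Once trained this way, zero-shot classification amounts to embedding each candidate label as text and picking the label whose embedding is most similar to the crystal embedding.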

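Summary
This paper describes a shift in how to approach material generation. Instead of the traditional material-to-material approach, we propose a language-to-material generation architecture that draws on millions of untapped data points. Using a web scraper to collect crystal-text pairs from open-source research papers, a contrastive model can be trained with a convolutional graph neural network encoder and a language encoder. This allows unsupervised zero-shot classification that takes advantage of linguistic structure. Without any specific training data, ~82% accuracy was achieved, and ~75% accuracy for photocatalyst prediction with an extremely small dataset. In principle the network could be applied to any reaction that can be described in text, opening completely new ways to think about 3D chemical framework generation. In the full experiment, diffusion models would likely be incorporated to fully exploit the latent space.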

Learning Knowledge-Enhanced Contextual Language Representations for Domain Natural Language Understanding

paper_url: http://arxiv.org/abs/2311.06761
repo_url: None
paper_authors: Ruyao Xu, Taolin Zhang, Chengyu Wang, Zhongjie Duan, Cen Chen, Minghui Qiu, Dawei Cheng, Xiaofeng He, Weining Qian
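for: Improving performance on closed-domain NLP tasks, both knowledge-aware and general.
methods: The framework captures the implicit graph structure among entities in the knowledge graph and fuses knowledge using hyperbolic embeddings of deep hierarchical entity-class structures alongside shallow relational representations of triples. Subgraph contrastive learning serves as a data augmentation strategy that constructs higher-quality hard negative samples.
results: In closed domains the method significantly outperforms other KEPLM training paradigms, in both full and few-shot learning settings.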

Abstract
Knowledge-Enhanced Pre-trained Language Models (KEPLMs) improve the performance of various downstream NLP tasks by injecting knowledge facts from large-scale Knowledge Graphs (KGs). However, existing methods for pre-training KEPLMs with relational triples are difficult to adapt to closed domains due to the lack of sufficient domain graph semantics. In this paper, we propose a Knowledge-enhanced lANGuAge Representation learning framework for various clOsed dOmains (KANGAROO) via capturing the implicit graph structure among the entities. Specifically, since the entity coverage rates of closed-domain KGs can be relatively low and may exhibit the global sparsity phenomenon for knowledge injection, we consider not only the shallow relational representations of triples but also the hyperbolic embeddings of deep hierarchical entity-class structures for effective knowledge fusion. Moreover, as two closed-domain entities under the same entity-class often have locally dense neighbor subgraphs counted by max point biconnected component, we further propose a data augmentation strategy based on contrastive learning over subgraphs to construct hard negative samples of higher quality. It makes the underlying KEPLMs better distinguish the semantics of these neighboring entities to further complement the global semantic sparsity. In the experiments, we evaluate KANGAROO over various knowledge-aware and general NLP tasks in both full and few-shot learning settings, significantly outperforming various KEPLM training paradigms in closed domains.
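
Hyperbolic embeddings suit hierarchies such as entity-class structures because distances grow quickly toward the boundary of the space, leaving room for many leaves. As a small worked example, the Python sketch below computes the geodesic distance in the Poincaré ball, d(u, v) = arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))). Choosing the Poincaré ball is an assumption: the abstract says only that hyperbolic embeddings are used, and the function name and sample points are hypothetical.

```python
import math
from typing import Sequence


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Geodesic distance between two points strictly inside the unit ball."""
    norm_u_sq = sum(x * x for x in u)
    norm_v_sq = sum(x * x for x in v)
    if norm_u_sq >= 1.0 or norm_v_sq >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    argument = 1.0 + 2.0 * diff_sq / ((1.0 - norm_u_sq) * (1.0 - norm_v_sq))
    return math.acosh(argument)


# A root-like point at the origin and two points at increasing radius.
root = (0.0, 0.0)
mid_level = (0.5, 0.0)
near_boundary = (0.9, 0.0)
print(poincare_distance(root, mid_level))
print(poincare_distance(mid_level, near_boundary))
```

The second Euclidean step (0.4) is shorter than the first (0.5), yet its hyperbolic length is larger, which is the property that lets deep levels of a hierarchy spread out near the boundary.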

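Summary
Knowledge-Enhanced Pre-trained Language Models (KEPLMs) improve downstream NLP task performance by injecting knowledge facts from large-scale Knowledge Graphs (KGs). However, existing KEPLM pre-training methods are hard to adapt to closed domains because closed-domain knowledge graphs lack sufficient graph semantics. In this paper, we propose a Knowledge-enhanced lANGuAge Representation learning framework for various clOsed dOmains (KANGAROO), which captures the implicit graph structure among entities. Specifically, because entity coverage in closed-domain KGs can be relatively low and can exhibit global sparsity, we consider not only the shallow relational representations of triples but also the hyperbolic embeddings of deep hierarchical entity-class structures. In addition, two closed-domain entities under the same entity-class often have locally dense neighbor subgraphs, so we propose a data augmentation strategy based on contrastive learning to construct higher-quality hard negative samples. This helps the underlying KEPLMs better distinguish the semantics of neighboring entities and further compensates for the global semantic sparsity. In the experiments, we evaluate KANGAROO on various knowledge-aware and general NLP tasks, where it significantly outperforms other KEPLM training paradigms in closed domains.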

paper_url: http://arxiv.org/abs/2311.06758
repo_url: None
paper_authors: Tingfeng Cao, Chengyu Wang, Chuanqi Tan, Jun Huang, Jinhui Zhu
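for: The paper proposes a new approach to cross-lingual machine reading comprehension (MRC) that improves the transferability of models across languages.
methods: The approach, X-STA, uses an attentive teacher to subtly transfer the answer spans of the source language to the answer output space of the target language, and a Gradient-Disentangled Knowledge Sharing technique as an improved cross-attention block.
results: Experiments on three multilingual MRC datasets show that X-STA is effective and outperforms state-of-the-art approaches.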

Abstract
In cross-lingual language understanding, machine translation is often utilized to enhance the transferability of models across languages, either by translating the training data from the source language to the target, or from the target to the source to aid inference. However, in cross-lingual machine reading comprehension (MRC), it is difficult to perform a deep level of assistance to enhance cross-lingual transfer because of the variation of answer span positions in different languages. In this paper, we propose X-STA, a new approach for cross-lingual MRC. Specifically, we leverage an attentive teacher to subtly transfer the answer spans of the source language to the answer output space of the target. A Gradient-Disentangled Knowledge Sharing technique is proposed as an improved cross-attention block. In addition, we force the model to learn semantic alignments from multiple granularities and calibrate the model outputs with teacher guidance to enhance cross-lingual transferability. Experiments on three multi-lingual MRC datasets show the effectiveness of our method, outperforming state-of-the-art approaches.

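Summary
In cross-lingual language understanding, machine translation is often used to improve the transferability of models across languages, either by translating the training data from the source language to the target, or by translating from the target back to the source to aid inference. However, in cross-lingual machine reading comprehension (MRC), answer span positions vary across languages, which makes it difficult to provide deep assistance for cross-lingual transfer. In this paper, we propose X-STA, a new approach for cross-lingual MRC. Specifically, we use an attentive teacher to subtly transfer the answer spans of the source language to the answer output space of the target language. We also propose a Gradient-Disentangled Knowledge Sharing technique as an improved cross-attention block. In addition, we force the model to learn semantic alignments at multiple granularities and calibrate the model outputs with teacher guidance to improve cross-lingual transferability. Experimental results show that our method is effective and outperforms previous approaches.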

From Complex to Simple: Unraveling the Cognitive Tree for Reasoning with Small Language Models

paper_url: http://arxiv.org/abs/2311.06754
repo_url: None
paper_authors: Junbing Yan, Chengyu Wang, Taolin Zhang, Xiaofeng He, Jun Huang, Wei Zhang
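for: The paper investigates how language models can perform complex logical reasoning, using the dual process theory to model their cognitive reasoning process.
methods: An iterative method constructs a Cognitive Tree (CogTree) whose root node is the initial query and whose leaf nodes are simple questions that can be answered directly. It has two main components: an implicit extraction module (the intuitive system) and an explicit reasoning module (the reflective system).
results: Experiments show that a language model with <=7B parameters, fewer than 5% of the 175B parameters of GPT-3.5, can reach performance comparable to GPT-3.5.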

Abstract
Reasoning is a distinctive human capacity, enabling us to address complex problems by breaking them down into a series of manageable cognitive steps. Yet, complex logical reasoning is still cumbersome for language models. Based on the dual process theory in cognitive science, we are the first to unravel the cognitive reasoning abilities of language models. Our framework employs an iterative methodology to construct a Cognitive Tree (CogTree). The root node of this tree represents the initial query, while the leaf nodes consist of straightforward questions that can be answered directly. This construction involves two main components: the implicit extraction module (referred to as the intuitive system) and the explicit reasoning module (referred to as the reflective system). The intuitive system rapidly generates multiple responses by utilizing in-context examples, while the reflective system scores these responses using comparative learning. The scores guide the intuitive system in its subsequent generation step. Our experimental results on two popular and challenging reasoning tasks indicate that it is possible to achieve a performance level comparable to that of GPT-3.5 (with 175B parameters), using a significantly smaller language model that contains fewer parameters (<=7B) than 5% of GPT-3.5.
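
The tree construction described above, with one component proposing decompositions and another scoring them, can be sketched as a short recursive procedure in Python. This is a simplified illustration rather than the authors' algorithm: the intuitive and reflective systems are replaced by plain callables, the sketch keeps only the top-scoring decomposition at each node instead of feeding scores back into generation, and all names (`build_cog_tree`, `propose`, `score`, `is_simple`) and the toy string-splitting stand-ins are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    question: str
    children: List["Node"] = field(default_factory=list)


def build_cog_tree(
    query: str,
    propose: Callable[[str], List[List[str]]],   # stand-in for the intuitive system
    score: Callable[[str, List[str]], float],    # stand-in for the reflective system
    is_simple: Callable[[str], bool],            # True when a question is directly answerable
    max_depth: int = 3,
) -> Node:
    """Recursively decompose a query until every leaf is simple or the depth limit is hit."""
    node = Node(query)
    if max_depth == 0 or is_simple(query):
        return node
    candidates = propose(query)
    if not candidates:
        return node
    # Keep the decomposition that the scorer rates highest.
    best = max(candidates, key=lambda subs: score(query, subs))
    node.children = [
        build_cog_tree(sub, propose, score, is_simple, max_depth - 1) for sub in best
    ]
    return node


def leaf_questions(node: Node) -> List[str]:
    """Collect the leaf questions from left to right."""
    if not node.children:
        return [node.question]
    leaves: List[str] = []
    for child in node.children:
        leaves.extend(leaf_questions(child))
    return leaves


# Toy stand-ins: split on " and ", prefer decompositions with more parts.
tree = build_cog_tree(
    "a and b and c",
    propose=lambda q: [q.split(" and ", 1), [q]],
    score=lambda q, subs: float(len(subs)),
    is_simple=lambda q: " and " not in q,
)
print(leaf_questions(tree))
```

In a real system `propose` would call a language model with in-context examples and `score` would be a trained scorer; the `max_depth` guard is there because a proposer can return the query unchanged.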

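Summary
Reasoning is a distinctive human capacity that lets us solve complex problems by breaking them into a series of manageable cognitive steps. Yet complex logical reasoning remains a bottleneck for language models. Based on the dual process theory in cognitive science, we are the first to unravel the cognitive reasoning abilities of language models. Our framework builds a Cognitive Tree (CogTree) iteratively. The root node represents the initial query, and the leaf nodes contain simple questions that can be answered directly. The construction has two main components: the implicit extraction module (the intuitive system) and the explicit reasoning module (the reflective system). The intuitive system rapidly generates multiple responses using in-context examples, while the reflective system scores these responses using comparative learning. The scores guide the intuitive system in its next generation step. Experiments on two well-known and challenging reasoning tasks show that a smaller language model (<=7B parameters) can reach performance comparable to GPT-3.5 (with 175B parameters).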

BeautifulPrompt: Towards Automatic Prompt Engineering for Text-to-Image Synthesis

paper_url: http://arxiv.org/abs/2311.06752
repo_url: https://github.com/NicholasCao/BeautifulPrompt
paper_authors: Tingfeng Cao, Chengyu Wang, Bingyan Liu, Ziheng Wu, Jinhui Zhu, Jun Huang
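for: Improving the quality of text-to-image synthesis with diffusion-based deep generative models.
methods: The BeautifulPrompt model generates high-quality prompts from simple raw descriptions and is further fine-tuned with Reinforcement Learning with Visual AI Feedback.
results: Learning from visual AI feedback improves the quality of the generated prompts and images. BeautifulPrompt is also integrated into a cloud-native AI platform to provide a better text-to-image generation service.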

Abstract
Recently, diffusion-based deep generative models (e.g., Stable Diffusion) have shown impressive results in text-to-image synthesis. However, current text-to-image models often require multiple passes of prompt engineering by humans in order to produce satisfactory results for real-world applications. We propose BeautifulPrompt, a deep generative model to produce high-quality prompts from very simple raw descriptions, which enables diffusion-based models to generate more beautiful images. In our work, we first fine-tuned the BeautifulPrompt model over collected pairs of low-quality and high-quality prompts. Then, to ensure that our generated prompts can generate more beautiful images, we further propose a Reinforcement Learning with Visual AI Feedback technique to fine-tune our model to maximize the reward values of the generated prompts, where the reward values are calculated based on the PickScore and the Aesthetic Scores. Our results demonstrate that learning from visual AI feedback has the potential to improve the quality of generated prompts and images significantly. We further showcase the integration of BeautifulPrompt into a cloud-native AI platform to provide a better text-to-image generation service in the cloud.

Are LLMs Rigorous Logical Reasoner? Empowering Natural Language Proof Generation with Contrastive Stepwise Decoding

paper_url: http://arxiv.org/abs/2311.06736
repo_url: None
paper_authors: Ying Su, Xiaojin Fu, Mingwen Liu, Zhijiang Guo
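for: This study evaluates large language models (LLMs) on logical reasoning tasks that involve proof planning, in particular when they use advanced chain-of-thought (CoT) strategies.
methods: The authors fine-tune a smaller-scale language model to decompose proof objectives into more manageable subgoals, and apply contrastive decoding to stepwise proof generation, using negative reasoning paths to strengthen logical deduction.
results: Experiments on EntailmentBank show that the method improves the proof planning abilities of language models.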

Abstract
Logical reasoning remains a pivotal component within the realm of artificial intelligence. The recent evolution of large language models (LLMs) has marked significant progress in this domain. The adoption of strategies like chain-of-thought (CoT) has enhanced the performance of LLMs across diverse reasoning tasks. Nonetheless, logical reasoning that involves proof planning, specifically those that necessitate the validation of explanation accuracy, continues to present stumbling blocks. In this study, we first evaluate the efficacy of LLMs with advanced CoT strategies concerning such tasks. Our analysis reveals that LLMs still struggle to navigate complex reasoning chains, which demand the meticulous linkage of premises to derive a cogent conclusion. To address this issue, we finetune a smaller-scale language model, equipping it to decompose proof objectives into more manageable subgoals. We also introduce contrastive decoding to stepwise proof generation, making use of negative reasoning paths to strengthen the model's capacity for logical deduction. Experiments on EntailmentBank underscore the success of our method in augmenting the proof planning abilities of language models.

Comprehending Lexical and Affective Ontologies in the Demographically Diverse Spatial Social Media Discourse

paper_url: http://arxiv.org/abs/2311.06729
repo_url: None
paper_authors: Salim Sazzed
for: This study aims to understand the linguistic and socio-demographic features of online social media reviews, including English language styles, conveyed sentiments, and lexical diversity.
methods: The study uses a case study approach, extracting and examining statistical, grammatical, and sentimental features from two demographically diverse groups. Machine learning (ML) classifiers are then leveraged to differentiate between the groups based on these features.
results: The study finds significant disparities in linguistic attributes between the two groups, which can be effectively used to distinguish them with a macro F1 score of approximately 0.85. Additionally, the study compares the performance of linguistic features with word n-gram-based lexical features and finds that the latter, combined with fine-tuned transformer-based models, achieve higher accuracy (over 95%) and macro F1 scores (over 0.96). The findings provide valuable guidelines for future research on analyzing demographic patterns in textual content across social media platforms.

Abstract
This study aims to comprehend linguistic and socio-demographic features, encompassing English language styles, conveyed sentiments, and lexical diversity within spatial online social media review data. To this end, we undertake a case study that scrutinizes reviews composed by two distinct and demographically diverse groups. Our analysis entails the extraction and examination of various statistical, grammatical, and sentimental features from these two groups. Subsequently, we leverage these features with machine learning (ML) classifiers to discern their potential in effectively differentiating between the groups. Our investigation unveils substantial disparities in certain linguistic attributes between the two groups. When integrated into ML classifiers, these attributes exhibit a marked efficacy in distinguishing the groups, yielding a macro F1 score of approximately 0.85. Furthermore, we conduct a comparative evaluation of these linguistic features with word n-gram-based lexical features in discerning demographically diverse review data. As expected, the n-gram lexical features, coupled with fine-tuned transformer-based models, show superior performance, attaining accuracies surpassing 95% and macro F1 scores exceeding 0.96. Our meticulous analysis and comprehensive evaluations substantiate the efficacy of linguistic and sentimental features in effectively discerning demographically diverse review data. The findings of this study provide valuable guidelines for future research endeavors concerning the analysis of demographic patterns in textual content across various social media platforms.
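For readers unfamiliar with the macro F1 score reported above, here is a small sketch of how it is computed: the F1 score is calculated per class and the per-class scores are averaged with equal weight. The labels and predictions are invented toy data, not the study's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Toy two-group example (labels "A" and "B" are placeholders).
y_true = ["A", "A", "A", "B", "B"]
y_pred = ["A", "A", "B", "B", "B"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))
```

Because every class counts equally, macro F1 does not reward a classifier for doing well only on the larger group.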


Controllable Topic-Focused Abstractive Summarization

paper_url: http://arxiv.org/abs/2311.06724
repo_url: None
paper_authors: Seyed Ali Bahrainian, Martin Jaggi, Carsten Eickhoff
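for: This paper proposes a new Transformer-based architecture for generating summaries focused on specific topics.
methods: The method modifies the cross-attention mechanism of the Transformer to bring topic-focus control to the generation process, without adding any parameters to the model.
results: The model sets a new state of the art for topic-focused abstractive summarization on the NEWTS dataset. Extensive experiments also show that the proposed topical cross-attention mechanism can be plugged into BART and T5 via fine-tuning, without training from scratch, and improves their performance on the CNN/Dailymail and XSum summarization benchmarks.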

Abstract
Controlled abstractive summarization focuses on producing condensed versions of a source article to cover specific aspects by shifting the distribution of generated text towards a desired style, e.g., a set of topics. Subsequently, the resulting summaries may be tailored to user-defined requirements. This paper presents a new Transformer-based architecture capable of producing topic-focused summaries. The architecture modifies the cross-attention mechanism of the Transformer to bring topic-focus control to the generation process while not adding any further parameters to the model. We show that our model sets a new state of the art on the NEWTS dataset in terms of topic-focused abstractive summarization as well as a topic-prevalence score. Moreover, we show via extensive experiments that our proposed topical cross-attention mechanism can be plugged into various Transformer models, such as BART and T5, improving their performance on the CNN/Dailymail and XSum benchmark datasets for abstractive summarization. This is achieved via fine-tuning, without requiring training from scratch. Finally, we show through human evaluation that our model generates more faithful summaries outperforming the state-of-the-art Frost model.

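Summary
Controlled abstractive summarization aims to steer the distribution of the generated text, for example by focusing a summary on certain topics. This paper presents a new Transformer-based summarization model that can control topic focus during generation. We modify the cross-attention mechanism of the Transformer to achieve this control without adding any parameters. Our model sets a new state of the art for topic-focused abstractive summarization on the NEWTS dataset, and also reaches a new best topic-prevalence score. Extensive experiments further show that the proposed topical cross-attention mechanism can be applied to different Transformer models, such as BART and T5, improving their abstractive summarization performance on the CNN/Dailymail and XSum datasets. This is achieved through fine-tuning, without training from scratch. Finally, human evaluation shows that our model generates more faithful summaries than the state-of-the-art Frost model.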

Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer

paper_url: http://arxiv.org/abs/2311.06720
repo_url: https://github.com/tanyuqian/cappy
paper_authors: Bowen Tan, Yun Zhu, Lijuan Liu, Eric Xing, Zhiting Hu, Jindong Chen
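for: Improving the performance and efficiency of multi-task large language models (LLMs), and making them easier to adapt to downstream applications.
methods: The paper introduces Cappy, a pretrained small scorer that can either work independently on classification tasks or serve as an auxiliary component that boosts the performance of LLMs.
results: Cappy performs strongly on 11 language understanding tasks, and it can work together with other LLM adaptation methods, such as finetuning and in-context learning, for additional performance gains.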

Abstract
Large language models (LLMs) such as T0, FLAN, and OPT-IML, excel in multi-tasking under a unified instruction-following paradigm, where they also exhibit remarkable generalization abilities to unseen tasks. Despite their impressive performance, these LLMs, with sizes ranging from several billion to hundreds of billions of parameters, demand substantial computational resources, making their training and inference expensive and inefficient. Furthermore, adapting these models to downstream applications, particularly complex tasks, is often unfeasible due to the extensive hardware requirements for finetuning, even when utilizing parameter-efficient approaches such as prompt tuning. Additionally, the most powerful multi-task LLMs, such as OPT-IML-175B and FLAN-PaLM-540B, are not publicly accessible, severely limiting their customization potential. To address these challenges, we introduce a pretrained small scorer, Cappy, designed to enhance the performance and efficiency of multi-task LLMs. With merely 360 million parameters, Cappy functions either independently on classification tasks or serve as an auxiliary component for LLMs, boosting their performance. Moreover, Cappy enables efficiently integrating downstream supervision without requiring LLM finetuning nor the access to their parameters. Our experiments demonstrate that, when working independently on 11 language understanding tasks from PromptSource, Cappy outperforms LLMs that are several orders of magnitude larger. Besides, on 45 complex tasks from BIG-Bench, Cappy boosts the performance of the advanced multi-task LLM, FLAN-T5, by a large margin. Furthermore, Cappy is flexible to cooperate with other LLM adaptations, including finetuning and in-context learning, offering additional performance enhancement.
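One way to picture a small scorer acting as an auxiliary component is candidate reranking: the LLM proposes several responses and the scorer picks the one it rates highest. The sketch below illustrates that pattern only. The `scorer` signature and the word-overlap `toy_scorer` are hypothetical stand-ins, not Cappy's actual interface or model.

```python
from typing import Callable, Sequence

def pick_best_response(
    instruction: str,
    candidates: Sequence[str],
    scorer: Callable[[str, str], float],
) -> str:
    """Return the candidate that the scorer rates highest for this instruction."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda response: scorer(instruction, response))

def toy_scorer(instruction: str, response: str) -> float:
    """Placeholder scorer based on word overlap; NOT the Cappy model."""
    instruction_words = set(instruction.lower().split())
    response_words = set(response.lower().split())
    return len(instruction_words & response_words) / max(len(response_words), 1)

# Candidates would come from sampling an LLM; here they are hard-coded.
instruction = "Name the capital of France"
candidates = ["The capital of France is Paris", "Bananas are yellow"]
best = pick_best_response(instruction, candidates, toy_scorer)
print(best)
```

A design point worth noting: this pattern needs only the LLM's outputs, which matches the abstract's claim that downstream supervision can be integrated without finetuning the LLM or accessing its parameters.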

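Summary
Large language models (LLMs) such as T0, FLAN, and OPT-IML perform well under a unified instruction-following paradigm and generalize remarkably well to unseen tasks. Despite this, they demand substantial resources, which makes training and inference expensive. Adapting them to downstream applications is also difficult, especially for complex tasks. The most powerful multi-task LLMs, such as OPT-IML-175B and FLAN-PaLM-540B, are not publicly accessible, which severely limits their customization potential. To address these challenges, we introduce Cappy, a pretrained small scorer that enhances the performance and efficiency of multi-task LLMs. With only 360 million parameters, Cappy can work independently on classification tasks or serve as an auxiliary component for LLMs to boost their performance. Cappy can also integrate downstream supervision efficiently, without finetuning the LLM or accessing its parameters. In our experiments on 11 language understanding tasks, Cappy outperforms LLMs that are several orders of magnitude larger, and on 45 complex tasks it improves FLAN-T5 by a large margin. Cappy can also be combined with other LLM adaptation methods, including finetuning and in-context learning, for further gains.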

What factors influence the popularity of user-generated text in the creative domain? A case study of book reviews

paper_url: http://arxiv.org/abs/2311.06714
repo_url: None
paper_authors: Salim Sazzed
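for: This study examines psychological, lexical, semantic, and readability features of book reviews to identify the factors behind their popularity.
methods: The authors run statistical analyses of features of book reviews, including the types and frequency of opinion and emotion-conveying terms, connectives, character mentions, word uniqueness, commonness, and sentence structure. They also use two readability tests to explore whether reading ease is associated with review popularity.
results: Apart from a few features (such as review length, emotions, and word uniqueness), most features show no significant differences between popular and non-popular review groups. Machine learning classifiers using word n-gram features also perform poorly, which reflects how hard it is to determine popularity in creative domains. Overall, the study offers insights into the factors behind review popularity and highlights the need for further research in the creative domain.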

Abstract
This study investigates a range of psychological, lexical, semantic, and readability features of book reviews to elucidate the factors underlying their perceived popularity. To this end, we conduct statistical analyses of various features, including the types and frequency of opinion and emotion-conveying terms, connectives, character mentions, word uniqueness, commonness, and sentence structure, among others. Additionally, we utilize two readability tests to explore whether reading ease is positively associated with review popularity. Finally, we employ traditional machine learning classifiers and transformer-based fine-tuned language models with n-gram features to automatically determine review popularity. Our findings indicate that, with the exception of a few features (e.g., review length, emotions, and word uniqueness), most attributes do not exhibit significant differences between popular and non-popular review groups. Furthermore, the poor performance of machine learning classifiers using the word n-gram feature highlights the challenges associated with determining popularity in creative domains. Overall, our study provides insights into the factors underlying review popularity and highlights the need for further research in this area, particularly in the creative realm.

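Summary
This study investigates a range of psychological, lexical, semantic, and readability features to explore the factors behind the popularity of book reviews. To this end, we carry out statistical analyses covering the types and frequency of opinion and emotion-conveying terms, connectives, character mentions, word uniqueness, commonness, and sentence structure. We also use two readability tests to explore whether reading ease is associated with review popularity. Finally, we use traditional machine learning classifiers and fine-tuned transformer-based language models with n-gram features to determine review popularity automatically. Our findings indicate that, apart from a few features (such as review length, emotions, and word uniqueness), most features show no significant differences between popular and non-popular review groups. The poor performance of machine learning classifiers using word n-gram features reflects the challenge of determining popularity in creative domains. Overall, our study offers insights into the factors behind review popularity and highlights the need for further research in the creative domain.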

Trusted Source Alignment in Large Language Models

paper_url: http://arxiv.org/abs/2311.06697
repo_url: None
paper_authors: Vasilisa Bashlovkina, Zhaobin Kuang, Riley Matthews, Edward Clifford, Yennie Jun, William W. Cohen, Simon Baumgartner
for: This paper proposes measuring the trusted source alignment (TSA) property of large language models (LLMs) and presents FactCheckQA, a dataset for evaluating TSA.
methods: The paper describes a simple protocol for evaluating TSA and analyzes design considerations including response extraction, claim contextualization, and bias in prompt formulation.
results: Applying the protocol to PaLM-2, the authors find that as model size scales up, performance on FactCheckQA improves from near-random to up to 80% balanced accuracy in aligning with trusted sources.

Abstract
Large language models (LLMs) are trained on web-scale corpora that inevitably include contradictory factual information from sources of varying reliability. In this paper, we propose measuring an LLM property called trusted source alignment (TSA): the model's propensity to align with content produced by trusted publishers in the face of uncertainty or controversy. We present FactCheckQA, a TSA evaluation dataset based on a corpus of fact checking articles. We describe a simple protocol for evaluating TSA and offer a detailed analysis of design considerations including response extraction, claim contextualization, and bias in prompt formulation. Applying the protocol to PaLM-2, we find that as we scale up the model size, the model performance on FactCheckQA improves from near-random to up to 80% balanced accuracy in aligning with trusted sources.
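The headline number above is a balanced accuracy, which is the mean of per-class recall. The sketch below shows how it differs from plain accuracy on an imbalanced label set. The labels are invented toy data and say nothing about the actual class balance of FactCheckQA.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(y_pred[i] == label for i in indices)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)

# Toy data: 8 claims labeled True, 2 labeled False; the model always says True.
y_true = [True] * 8 + [False] * 2
y_pred = [True] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, balanced)
```

A model that always gives the same answer scores 0.5 on a two-class problem under this metric, which is why "near-random" corresponds to roughly 50% balanced accuracy.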


Simple and Effective Input Reformulations for Translation

paper_url: http://arxiv.org/abs/2311.06696
repo_url: https://github.com/bri25yu/languagemodelexperimentation
paper_authors: Brian Yu, Hansen Lillemark, Kurt Keutzer
for: This paper aims to improve the performance of language models on challenging translation tasks by reformulating inputs during finetuning.
methods: The paper proposes simple data-level modifications to the finetuning inputs, which require no additional training data and no modification of data at inference time.
results: The proposed methods achieve significant performance improvements of up to 3.5 chrF++ on the Flores200 translation benchmark.

Abstract
Foundation language models learn from their finetuning input context in different ways. In this paper, we reformulate inputs during finetuning for challenging translation tasks, leveraging model strengths from pretraining in novel ways to improve downstream performance. These reformulations are simple data level modifications, require no additional collection of training data or modification of data at inference time. They can be applied either on single language pair translation tasks or massively multilingual translation tasks. Experiments with these techniques demonstrate significant performance improvements up to 3.5 chrF++ on the Flores200 translation benchmark. We hope our research accessibly improves finetuning data efficiency, enabling more effective training to scalably improve state-of-the-art performance. Our code is released at https://github.com/bri25yu/LanguageModelExperimentation.

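Summary
Foundation language models learn from their finetuning inputs in different ways. In this paper, we reformulate the finetuning inputs to exploit strengths the model acquired during pretraining and so improve downstream performance. These reformulations are simple data-level modifications that require no additional training data and no modification of data at inference time. They can be applied to single language pair translation tasks or to massively multilingual translation tasks. Experiments show significant performance improvements of up to 3.5 chrF++ on the Flores200 translation benchmark. We hope our research improves finetuning data efficiency, enabling more effective training that advances state-of-the-art performance. Our code is available at https://github.com/bri25yu/LanguageModelExperimentation.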
