cs.CL - 2023-11-28

Does VLN Pretraining Work with Nonsensical or Irrelevant Instructions?

  • paper_url: http://arxiv.org/abs/2311.17280
  • repo_url: None
  • paper_authors: Wang Zhu, Ishika Singh, Yuan Huang, Robin Jia, Jesse Thomason
  • for: Studies how data augmentation during pretraining affects Vision-and-Language Navigation (VLN) models.
  • methods: Examines back-translation data augmentation, whose generated instructions are noisy, and concocts an efficient alternative, Unigram + Object, that generates nonsensical instructions (see the sketch after this entry).
  • results: Pretraining on noisy instructions has little effect on downstream performance for both HAMT and VLN-BERT on R2R and still beats using only clean, human-written instructions, suggesting that what matters for VLN R2R pretraining is the quantity of visual trajectories, not the quality of instructions.
    Abstract Data augmentation via back-translation is common when pretraining Vision-and-Language Navigation (VLN) models, even though the generated instructions are noisy. But: does that noise matter? We find that nonsensical or irrelevant language instructions during pretraining can have little effect on downstream performance for both HAMT and VLN-BERT on R2R, and is still better than only using clean, human data. To underscore these results, we concoct an efficient augmentation method, Unigram + Object, which generates nonsensical instructions that nonetheless improve downstream performance. Our findings suggest that what matters for VLN R2R pretraining is the quantity of visual trajectories, not the quality of instructions.
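The Unigram + Object augmentation is only described at a high level; the following Python sketch shows one plausible reading, where an instruction is built from i.i.d. unigram samples of the R2R instruction vocabulary spliced with names of objects seen along the trajectory. The function name, vocabulary format, and placement rule are assumptions, not the paper's implementation.

```python
import random

def unigram_object_instruction(vocab_counts, trajectory_objects, length=12):
    # Sample words i.i.d. from the unigram distribution of R2R instructions.
    words, counts = zip(*vocab_counts.items())
    tokens = random.choices(list(words), weights=list(counts), k=length)
    # Splice in names of objects visible along the visual trajectory;
    # the placement rule here is an assumption.
    for obj in trajectory_objects:
        tokens.insert(random.randrange(len(tokens) + 1), obj)
    return " ".join(tokens)

# A nonsensical yet object-grounded instruction for a pretraining pair.
print(unigram_object_instruction(
    {"walk": 40, "the": 90, "turn": 30, "left": 25, "past": 15},
    ["sofa", "staircase"]))
```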

RETSim: Resilient and Efficient Text Similarity

  • paper_url: http://arxiv.org/abs/2311.17264
  • repo_url: None
  • paper_authors: Marina Zhang, Owen Vallis, Aysegul Bumin, Tanay Vakharia, Elie Bursztein
  • for: Proposes a lightweight, multilingual deep learning model for robust and efficient text similarity, serving near-duplicate retrieval, clustering, and dataset deduplication.
  • methods: Introduces RETSim (Resilient and Efficient Text Similarity), a deep-learning model trained to produce robust metric embeddings of text (a generic deduplication sketch follows this entry).
  • results: RETSim is more robust and accurate than MinHash and neural text embeddings on dataset deduplication, adversarial text retrieval, and spam clustering; the authors also release the W4NT3D benchmark for evaluating multilingual near-duplicate retrieval under adversarial settings.
    Abstract This paper introduces RETSim (Resilient and Efficient Text Similarity), a lightweight, multilingual deep learning model trained to produce robust metric embeddings for near-duplicate text retrieval, clustering, and dataset deduplication tasks. We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings, achieving new state-of-the-art performance on dataset deduplication, adversarial text retrieval benchmarks, and spam clustering tasks. We also introduce the W4NT3D benchmark (Wiki-40B 4dversarial Near-T3xt Dataset) for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings. RETSim and the W4NT3D benchmark are open-sourced under the MIT License at https://github.com/google/unisim.
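RETSim's real interface is the open-sourced unisim package linked in the abstract. As a hedged illustration of the underlying pattern only, the sketch below deduplicates with thresholded cosine similarity over metric embeddings; `embed` stands in for any encoder returning L2-normalized vectors, and the 0.9 cutoff is an assumption.

```python
import numpy as np

def dedup(texts, embed, threshold=0.9):
    """Greedy near-duplicate removal: keep a text only if its embedding
    stays below `threshold` cosine similarity to everything kept so far."""
    vecs = embed(texts)  # (n, d) array; rows assumed L2-normalized
    kept = []
    for i, v in enumerate(vecs):
        if all(float(v @ vecs[j]) < threshold for j in kept):
            kept.append(i)
    return [texts[i] for i in kept]
```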

General-Purpose vs. Domain-Adapted Large Language Models for Extraction of Data from Thoracic Radiology Reports

  • paper_url: http://arxiv.org/abs/2311.17213
  • repo_url: None
  • paper_authors: Ali H. Dhanaliwala, Rikhiya Ghosh, Sanjeev Kumar Karn, Poikavila Ullaskrishnan, Oladimeji Farri, Dorin Comaniciu, Charles E. Kahn
  • for: Extracting common data elements (CDEs) from thoracic radiology reports.
  • methods: Compares a domain-adapted language model (RadLing) against a general-purpose large language model (GPT-4) for identifying CDEs and assigning their values.
  • results: The RadLing system outperformed the GPT-4 system overall: its precision was slightly lower (96% vs 99%) but its recall far higher (94% vs 70%), and it offers operational advantages such as local deployment and reduced runtime costs.
    Abstract Radiologists produce unstructured data that could be valuable for clinical care when consumed by information systems. However, variability in style limits usage. Study compares performance of system using domain-adapted language model (RadLing) and general-purpose large language model (GPT-4) in extracting common data elements (CDE) from thoracic radiology reports. Three radiologists annotated a retrospective dataset of 1300 thoracic reports (900 training, 400 test) and mapped to 21 pre-selected relevant CDEs. RadLing was used to generate embeddings for sentences and identify CDEs using cosine-similarity, which were mapped to values using light-weight mapper. GPT-4 system used OpenAI's general-purpose embeddings to identify relevant CDEs and used GPT-4 to map to values. The output CDE:value pairs were compared to the reference standard; an identical match was considered true positive. Precision (positive predictive value) was 96% (2700/2824) for RadLing and 99% (2034/2047) for GPT-4. Recall (sensitivity) was 94% (2700/2876) for RadLing and 70% (2034/2887) for GPT-4; the difference was statistically significant (P<.001). RadLing's domain-adapted embeddings were more sensitive in CDE identification (95% vs 71%) and its light-weight mapper had comparable precision in value assignment (95.4% vs 95.0%). RadLing system exhibited higher performance than GPT-4 system in extracting CDEs from radiology reports. RadLing system's domain-adapted embeddings outperform general-purpose embeddings from OpenAI in CDE identification and its light-weight value mapper achieves comparable precision to large GPT-4. RadLing system offers operational advantages including local deployment and reduced runtime costs. Domain-adapted RadLing system surpasses GPT-4 system in extracting common data elements from radiology reports, while providing benefits of local deployment and lower costs.
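The abstract says RadLing identifies CDEs via cosine similarity between sentence embeddings and CDE representations. A minimal sketch of that matching step follows; the prototype construction and the 0.7 cutoff are assumptions, since the paper does not publish them.

```python
import numpy as np

def identify_cde(sentence_vec, cde_vecs, cde_names, min_sim=0.7):
    """Match one report sentence against the 21 CDE prototype vectors;
    return None when no prototype clears the assumed similarity cutoff."""
    sims = (cde_vecs @ sentence_vec) / (
        np.linalg.norm(cde_vecs, axis=1) * np.linalg.norm(sentence_vec))
    best = int(np.argmax(sims))
    return cde_names[best] if sims[best] >= min_sim else None
```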

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

  • paper_url: http://arxiv.org/abs/2311.17049
  • repo_url: None
  • paper_authors: Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel
  • for: This paper aims to improve the efficiency of CLIP models for deployment on mobile devices while maintaining their zero-shot performance.
  • methods: The proposed method, called Multi-modal Reinforced Training, leverages knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. The approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset.
  • results: The proposed MobileCLIP model sets a new state-of-the-art latency-accuracy tradeoff for zero-shot classification and retrieval tasks on several datasets. The MobileCLIP-S2 variant is 2.3 times faster while more accurate compared to the previous best CLIP model based on ViT-B/16. The multi-modal reinforced training approach achieves +2.9% average performance improvement on 38 evaluation benchmarks compared to the previous best.
    Abstract Contrastive pretraining of image-text foundation models, such as CLIP, demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However, these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work, we introduce MobileCLIP -- a new family of efficient image-text models optimized for runtime performance along with a novel and efficient training approach, namely multi-modal reinforced training. The proposed training approach leverages knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset. MobileCLIP sets a new state-of-the-art latency-accuracy tradeoff for zero-shot classification and retrieval tasks on several datasets. Our MobileCLIP-S2 variant is 2.3$\times$ faster while more accurate compared to previous best CLIP model based on ViT-B/16. We further demonstrate the effectiveness of our multi-modal reinforced training by training a CLIP model based on ViT-B/16 image backbone and achieving +2.9% average performance improvement on 38 evaluation benchmarks compared to the previous best. Moreover, we show that the proposed approach achieves 10$\times$-1000$\times$ improved learning efficiency when compared with non-reinforced CLIP training.
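The reinforced dataset that moves teacher compute offline can be pictured as a one-time preprocessing pass; in the sketch below the record schema, encoder interface, and caption count are all assumptions rather than the paper's storage format.

```python
def build_reinforced_dataset(pairs, teacher_encoders, captioner, n_captions=5):
    """One-time offline pass: cache ensemble-teacher embeddings and synthetic
    captions next to each image-text pair, so later student training pays
    no extra teacher compute."""
    records = []
    for image, text in pairs:
        records.append({
            "image": image,
            "text": text,
            "synthetic_captions": captioner(image, n=n_captions),
            "teacher_image_embs": [enc.encode_image(image)
                                   for enc in teacher_encoders],
            "teacher_text_embs": [enc.encode_text(text)
                                  for enc in teacher_encoders],
        })
    return records
```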

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

  • paper_url: http://arxiv.org/abs/2311.17043
  • repo_url: https://github.com/dvlab-research/llama-vid
  • paper_authors: Yanwei Li, Chengyao Wang, Jiaya Jia
  • for: Tackles the token-generation burden in Vision Language Models (VLMs) to improve video and image understanding.
  • methods: LLaMA-VID represents each frame with two distinct tokens: a context token that encodes the overall image context based on user input, and a content token that captures the visual cues in each frame. This dual-token strategy greatly reduces the overload of long videos while preserving critical information (see the sketch after this entry).
  • results: LLaMA-VID surpasses previous methods on most video- and image-based benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID.
    Abstract In this work, we present a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive visual tokens. LLaMA-VID addresses this issue by representing each frame with two distinct tokens, namely context token and content token. The context token encodes the overall image context based on user input, whereas the content token encapsulates visual cues in each frame. This dual-token strategy significantly reduces the overload of long videos while preserving critical information. Generally, LLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token. It is proved to surpass previous methods on most of video- or image-based benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID.
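A minimal sketch of the dual-token idea: the context token summarizes a frame conditioned on the user query via attention, while the content token is a query-agnostic pool of the frame's features. The pooling and scoring choices are assumptions, not the paper's exact design.

```python
import torch

def frame_to_two_tokens(frame_feats, text_query):
    """frame_feats: (n_patches, d) visual features for one frame;
    text_query: (d,) embedding of the user instruction."""
    attn = torch.softmax(frame_feats @ text_query, dim=0)  # (n_patches,)
    context_token = attn @ frame_feats    # query-conditioned summary
    content_token = frame_feats.mean(0)   # query-agnostic summary
    return context_token, content_token

# Two tokens per frame instead of hundreds of patch tokens.
ctx, cnt = frame_to_two_tokens(torch.randn(256, 64), torch.randn(64))
```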

Scalable Extraction of Training Data from (Production) Language Models

  • paper_url: http://arxiv.org/abs/2311.17035
  • repo_url: None
  • paper_authors: Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee
  • for: Studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model, without prior knowledge of the training set.
  • methods: Applies existing techniques from the literature to attack unaligned models, and develops a new divergence attack that causes an aligned model to break out of its chatbot-style generations and emit training data at a 150x higher rate (a membership-check sketch follows this entry).
  • results: Practical attacks recover gigabytes of training data from open-source, semi-open, and closed language models, showing that current alignment techniques do not eliminate memorization.
    Abstract This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
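Deciding whether emitted text is memorized reduces to exact-match lookup against the training corpus. The paper matches 50-token spans with a suffix array; the character-level set lookup below is a simplified stand-in for that check.

```python
def is_memorized(generation: str, corpus_index: set, k: int = 50) -> bool:
    """True if the generation reproduces any length-k character span
    verbatim from the training set. `corpus_index` holds every length-k
    corpus substring (a simplified stand-in for a suffix-array match)."""
    return any(generation[i:i + k] in corpus_index
               for i in range(len(generation) - k + 1))

# Build the index once:
# corpus_index = {doc[i:i + 50] for doc in corpus for i in range(len(doc) - 49)}
```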

ChatGPT’s One-year Anniversary: Are Open-Source Large Language Models Catching up?

  • paper_url: http://arxiv.org/abs/2311.16989
  • repo_url: None
  • paper_authors: Hailin Chen, Fangkai Jiao, Xingxuan Li, Chengwei Qin, Mathieu Ravaut, Ruochen Zhao, Caiming Xiong, Shafiq Joty
  • for: This paper provides an exhaustive overview of the success of open-source large language models (LLMs) in achieving parity or better performance than closed-source models like ChatGPT on various tasks.
  • methods: The paper surveys all tasks where open-source LLMs have claimed to be on par or better than ChatGPT, using supervised fine-tuning and reinforcement learning from human feedback.
  • results: The paper shows that open-source LLMs have made rapid progress in the past year, with some achieving parity or better performance than ChatGPT on certain tasks, and highlights the crucial implications of this progress on both research and business.
    Abstract Upon its release in late 2022, ChatGPT has brought a seismic shift in the entire landscape of AI, both in research and commerce. Through instruction-tuning a large language model (LLM) with supervised fine-tuning and reinforcement learning from human feedback, it showed that a model could answer human questions and follow instructions on a broad panel of tasks. Following this success, interests in LLMs have intensified, with new LLMs flourishing at frequent interval across academia and industry, including many start-ups focused on LLMs. While closed-source LLMs (e.g., OpenAI's GPT, Anthropic's Claude) generally outperform their open-source counterparts, the progress on the latter has been rapid with claims of achieving parity or even better on certain tasks. This has crucial implications not only on research but also on business. In this work, on the first anniversary of ChatGPT, we provide an exhaustive overview of this success, surveying all tasks where an open-source LLM has claimed to be on par or better than ChatGPT.

Assessing the influence of attractor-verb distance on grammatical agreement in humans and language models

  • paper_url: http://arxiv.org/abs/2311.16978
  • repo_url: None
  • paper_authors: Christos-Nikolaos Zacharopoulos, Théo Desbordes, Mathias Sablé-Meyer
  • for: investigate the effect of subject-verb agreement on the presence of an attractor noun
  • methods: parametrically modulate the distance between the attractor and the verb while keeping the length of the sentence equal, and evaluate the performance of both humans and two artificial neural network models
  • results: humans make more mistakes when the attractor is closer to the verb, while neural networks get close to the chance level; additionally, there is a linear effect of attractor distance on reaction times.
    Abstract Subject-verb agreement in the presence of an attractor noun located between the main noun and the verb elicits complex behavior: judgments of grammaticality are modulated by the grammatical features of the attractor. For example, in the sentence "The girl near the boys likes climbing", the attractor (boys) disagrees in grammatical number with the verb (likes), creating a locally implausible transition probability. Here, we parametrically modulate the distance between the attractor and the verb while keeping the length of the sentence equal. We evaluate the performance of both humans and two artificial neural network models: both make more mistakes when the attractor is closer to the verb, but neural networks get close to the chance level while humans are mostly able to overcome the attractor interference. Additionally, we report a linear effect of attractor distance on reaction times. We hypothesize that a possible reason for the proximity effect is the calculation of transition probabilities between adjacent words. Nevertheless, classical models of attraction such as the cue-based model might suffice to explain this phenomenon, thus paving the way for new research. Data and analyses available at https://osf.io/d4g6k

Natural Language Processing Through Transfer Learning: A Case Study on Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2311.16965
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Aman Yadav, Abhishek Vichare
  • for: Explores transfer learning for sentiment classification, claiming that pre-trained BERT models can increase accuracy compared to training models from scratch.
  • methods: Fine-tunes a BERT-based model on the IMDb dataset of sentimentally labelled movie reviews; pre-processing includes tokenization and encoding of the text data (see the sketch after this entry).
  • results: The transfer-learned BERT model reaches 100% accuracy on the sentiment classification task, though the authors caution that this may reflect overfitting or a lack of generalization and call for further analysis on diverse, unseen data.
    Abstract Artificial intelligence and machine learning have significantly bolstered the technological world. This paper explores the potential of transfer learning in natural language processing focusing mainly on sentiment analysis. The models trained on the big data can also be used where data are scarce. The claim is that, compared to training models from scratch, transfer learning, using pre-trained BERT models, can increase sentiment classification accuracy. The study adopts a sophisticated experimental design that uses the IMDb dataset of sentimentally labelled movie reviews. Pre-processing includes tokenization and encoding of text data, making it suitable for NLP models. The dataset is used on a BERT based model, measuring its performance using accuracy. The result comes out to be 100 per cent accurate. Although the complete accuracy could appear impressive, it might be the result of overfitting or a lack of generalization. Further analysis is required to ensure the model's ability to handle diverse and unseen data. The findings underscore the effectiveness of transfer learning in NLP, showcasing its potential to excel in sentiment analysis tasks. However, the research calls for a cautious interpretation of perfect accuracy and emphasizes the need for additional measures to validate the model's generalization.
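A minimal sketch of the transfer-learning setup described above, using the Hugging Face transformers library; the exact model variant and hyperparameters are assumptions, since the paper does not list its configuration.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # positive / negative sentiment

# Encode one IMDb-style review; fine-tune on the labeled data before
# trusting these logits.
enc = tok("A gripping, beautifully acted film.",
          return_tensors="pt", truncation=True, max_length=512)
logits = model(**enc).logits
```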

Optimisation-Based Multi-Modal Semantic Image Editing

  • paper_url: http://arxiv.org/abs/2311.16882
  • repo_url: None
  • paper_authors: Bowen Li, Yongxin Yang, Steven McDonagh, Shifeng Zhang, Petru-Daniel Tudosiu, Sarah Parisot
  • for: Improves the precision and accuracy of semantic image editing and extends editing beyond text to multiple instruction types (e.g., spatial layouts, pose, scribbles, edge maps).
  • methods: Proposes an inference-time editing optimisation that disentangles editing into two competing subtasks, successful local image modification and global content-consistency preservation, each guided by a dedicated loss function; adjusting the influence of each loss yields a flexible editing solution tunable to user preferences (see the sketch after this entry).
  • results: Evaluated under text, pose, and scribble edit conditions, with qualitative and quantitative experiments showing the method achieves complex edits.
    Abstract Image editing affords increased control over the aesthetics and content of generated images. Pre-existing works focus predominantly on text-based instructions to achieve desired image modifications, which limit edit precision and accuracy. In this work, we propose an inference-time editing optimisation, designed to extend beyond textual edits to accommodate multiple editing instruction types (e.g. spatial layout-based; pose, scribbles, edge maps). We propose to disentangle the editing task into two competing subtasks: successful local image modifications and global content consistency preservation, where subtasks are guided through two dedicated loss functions. By allowing to adjust the influence of each loss function, we build a flexible editing solution that can be adjusted to user preferences. We evaluate our method using text, pose and scribble edit conditions, and highlight our ability to achieve complex edits, through both qualitative and quantitative experiments.
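The two competing subtasks translate naturally into a weighted two-term objective optimized at inference time. The sketch below shows one such step over a diffusion latent; the concrete loss functions, weights, and step size are assumptions. Raising `w_local` favors aggressive local edits, while raising `w_global` favors fidelity to the source image, which is the user-preference dial the entry describes.

```python
import torch

def edit_step(latent, local_loss_fn, global_loss_fn,
              w_local=1.0, w_global=0.5, lr=0.05):
    """One inference-time optimisation step: trade off successful local
    modification against global content preservation (weights assumed)."""
    latent = latent.detach().requires_grad_(True)
    loss = w_local * local_loss_fn(latent) + w_global * global_loss_fn(latent)
    loss.backward()
    with torch.no_grad():
        latent = latent - lr * latent.grad
    return latent.detach()
```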

A Benchmark for Evaluating Machine Translation Metrics on Dialects Without Standard Orthography

  • paper_url: http://arxiv.org/abs/2311.16865
  • repo_url: None
  • paper_authors: Noëmi Aepli, Chantal Amrhein, Florian Schottmann, Rico Sennrich
  • for: Assesses the limitations of existing machine translation evaluation metrics on non-standardized language varieties, i.e., dialects without a standard orthography such as Swiss German.
  • methods: Collects a dataset of human translations and human judgments of automatic translations from English into two Swiss German dialects, and builds a challenge set for dialect variation to benchmark existing metrics.
  • results: Existing metrics cannot reliably evaluate Swiss German text generation outputs, especially at the segment level; initial design adaptations increase robustness to non-standardized dialects, though much room for improvement remains.
    Abstract For sensible progress in natural language processing, it is important that we are aware of the limitations of the evaluation metrics we use. In this work, we evaluate how robust metrics are to non-standardized dialects, i.e. spelling differences in language varieties that do not have a standard orthography. To investigate this, we collect a dataset of human translations and human judgments for automatic machine translations from English to two Swiss German dialects. We further create a challenge set for dialect variation and benchmark existing metrics' performances. Our results show that existing metrics cannot reliably evaluate Swiss German text generation outputs, especially on segment level. We propose initial design adaptations that increase robustness in the face of non-standardized dialects, although there remains much room for further improvement. The dataset, code, and models are available here: https://github.com/textshuttle/dialect_eval

RELIC: Investigating Large Language Model Responses using Self-Consistency

  • paper_url: http://arxiv.org/abs/2311.16842
  • repo_url: None
  • paper_authors: Furui Cheng, Vilém Zouhar, Simran Arora, Mrinmaya Sachan, Hendrik Strobelt, Mennatallah El-Assady
  • for: Helps users assess the reliability of text generated by large language models, so that fact is not confused with fiction.
  • methods: Proposes RELIC, an interactive system that uses the self-consistency of multiple samples generated by the same LLM to estimate the model's confidence in individual claims, letting users investigate semantic-level variations across long-form responses (see the sketch after this entry).
  • results: A user study with ten participants shows the approach helps users better verify the reliability of generated text; the paper also distills design implications and lessons for future work on reliable human-LLM interaction.
    Abstract Large Language Models (LLMs) are notorious for blending fact with fiction and generating non-factual content, known as hallucinations. To tackle this challenge, we propose an interactive system that helps users obtain insights into the reliability of the generated text. Our approach is based on the idea that the self-consistency of multiple samples generated by the same LLM relates to its confidence in individual claims in the generated texts. Using this idea, we design RELIC, an interactive system that enables users to investigate and verify semantic-level variations in multiple long-form responses. This allows users to recognize potentially inaccurate information in the generated text and make necessary corrections. From a user study with ten participants, we demonstrate that our approach helps users better verify the reliability of the generated text. We further summarize the design implications and lessons learned from this research for inspiring future studies on reliable human-LLM interactions.
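The core signal is cheap to compute: resample the LLM several times and score each claim by how many samples agree with it. In this sketch, `supports` stands in for whatever semantic check (e.g., an NLI model) implements agreement; that interface is an assumption about the system's internals.

```python
def self_consistency_score(claim, samples, supports):
    """Fraction of resampled generations that support `claim`; low scores
    flag statements the model itself is unsure about."""
    votes = sum(bool(supports(sample, claim)) for sample in samples)
    return votes / len(samples)

# samples = [llm(prompt) for _ in range(10)]  # same prompt, temperature > 0
```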

Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization

  • paper_url: http://arxiv.org/abs/2311.16839
  • repo_url: None
  • paper_authors: Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, Conghui He
  • for: addresses the “hallucination problem” in multimodal large language models, where the models generate textual descriptions that contain inaccurate or non-existent content from the image.
  • methods: introduces a novel strategy called Hallucination-Aware Direct Preference Optimization (HA-DPO), which treats the hallucination problem as a unique preference selection issue and trains the model to favor the non-hallucinating response when presented with two responses of the same image.
  • results: the paper shows a significant reduction in the hallucination problem and an enhancement in the models’ generalization capabilities with HA-DPO. Specifically, the MiniGPT-4 model demonstrates a 34.5% absolute improvement in POPE accuracy and a 41% relative improvement in MME score.
    Abstract Multimodal large language models have made significant advancements in recent years, yet they still suffer from a common issue known as the "hallucination problem" where the models generate textual descriptions that contain inaccurate or non-existent content from the image. To address this issue, this paper introduces a novel strategy: Hallucination-Aware Direct Preference Optimization (HA-DPO). Our approach treats the hallucination problem as a unique preference selection issue, where the model is trained to favor the non-hallucinating response when presented with two responses of the same image (one accurate and one hallucinating). This paper also presents an efficient process for constructing hallucination sample pairs to ensure high-quality, style-consistent pairs for stable HA-DPO training. We applied this strategy to two mainstream multimodal models, and the results showed a significant reduction in the hallucination problem and an enhancement in the models' generalization capabilities. With HA-DPO, the MiniGPT-4 model demonstrates significant advancements: POPE accuracy increases from 51.13% to 85.66% (34.5% absolute improvement), and the MME score escalates from 968.58 to 1365.76 (41% relative improvement). The code, models, and datasets will be made publicly available.
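HA-DPO applies direct preference optimization to (non-hallucinating, hallucinating) response pairs for the same image. The sketch below writes out the standard DPO loss in that setting; the beta value and the log-probability interface are assumptions.

```python
import torch.nn.functional as F

def ha_dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """logp_* are sequence log-probs of the preferred (non-hallucinating)
    and rejected (hallucinating) responses under the trained policy and
    a frozen reference model."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -F.logsigmoid(margin).mean()
```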

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

  • paper_url: http://arxiv.org/abs/2311.17126
  • repo_url: None
  • paper_authors: Xiaohui Chen, Yongfei Liu, Yingxiang Yang, Jianbo Yuan, Quanzeng You, Li-Ping Liu, Hongxia Yang
  • for: Improves the compositional ability of text-to-image (T2I) diffusion models by using large language models (LLMs) as layout generators.
  • methods: Leverages Chain-of-Thought prompting of LLMs to interpret text and generate spatially reasonable object layouts, then integrates the layout information into stable diffusion models through an efficient cross-attention-based adapter (a prompt sketch follows this entry).
  • results: Experiments demonstrate significant improvements in image quality and layout accuracy, showcasing the potential of LLMs to augment generative image models.
    Abstract Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals based on text prompts. Despite the advancement, these diffusion models sometimes struggle to translate the semantic content from the text into images entirely. While conditioning on the layout has shown to be effective in improving the compositional ability of T2I diffusion models, they typically require manual layout input. In this work, we introduce a novel approach to improving T2I diffusion models using Large Language Models (LLMs) as layout generators. Our method leverages the Chain-of-Thought prompting of LLMs to interpret text and generate spatially reasonable object layouts. The generated layout is then used to enhance the generated images' composition and spatial accuracy. Moreover, we propose an efficient adapter based on a cross-attention mechanism, which explicitly integrates the layout information into the stable diffusion models. Our experiments demonstrate significant improvements in image quality and layout accuracy, showcasing the potential of LLMs in augmenting generative image models.
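The layout-generation step amounts to asking the LLM for object boxes before diffusion runs. A hedged sketch of such a prompt follows; the wording and the JSON box format are illustrative assumptions, not the paper's exact prompt.

```python
LAYOUT_PROMPT = """You will lay out a scene for an image generator.
Think step by step about which objects the caption mentions and where they
should plausibly go, then output one JSON object per line in the form
{{"object": "<name>", "box": [x0, y0, x1, y1]}} with coordinates in [0, 1].

Caption: {caption}
Layout:"""

def request_layout(llm, caption):
    # `llm` is any text-completion callable; output parsing is omitted.
    return llm(LAYOUT_PROMPT.format(caption=caption))
```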

Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop

  • paper_url: http://arxiv.org/abs/2311.16822
  • repo_url: None
  • paper_authors: Martin Briesch, Dominik Sobania, Franz Rothlauf
  • for: Studies what happens when LLM-generated content posted across the internet is reused to train the next generation of LLMs, creating a self-consuming training loop.
  • methods: Empirically studies this loop using a novel dataset, analytically and accurately measuring the quality and diversity of generated outputs (a toy simulation follows this entry).
  • results: The self-consuming training loop initially improves both quality and diversity, but after a few generations the output inevitably degenerates in diversity, at a rate that depends on the proportion of real to generated data.
    Abstract Large language models (LLM) have become state of the art in many benchmarks and conversational LLM applications like ChatGPT are now widely used by the public. Those LLMs can be used to generate large amounts of content which is posted on the internet to various platforms. As LLMs are trained on datasets usually collected from the internet, this LLM-generated content might be used to train the next generation of LLMs. Therefore, a self-consuming training loop emerges in which new LLM generations are trained on the output from the previous generations. We empirically study this self-consuming training loop using a novel dataset to analytically and accurately measure quality and diversity of generated outputs. We find that this self-consuming training loop initially improves both quality and diversity. However, after a few generations the output inevitably degenerates in diversity. We find that the rate of degeneration depends on the proportion of real and generated data.
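A toy simulation of the loop under study, with `train` and `generate` left abstract. The fixed per-generation mixing fraction mirrors the real-to-generated proportion the paper identifies as driving the degeneration rate, but the scheme itself is an assumption.

```python
def self_consuming_loop(real_data, train, generate,
                        generations=5, real_fraction=0.5):
    """Each generation trains on a mix of real data and the previous
    model's outputs, then produces the data for the next generation."""
    n = len(real_data)
    data = list(real_data)
    model = None
    for _ in range(generations):
        model = train(data)
        n_real = int(real_fraction * n)
        data = list(real_data[:n_real]) + generate(model, n - n_real)
    return model
```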

Evaluating Optimal Reference Translations

  • paper_url: http://arxiv.org/abs/2311.16787
  • repo_url: https://github.com/ufal/optimal-reference-translations
  • paper_authors: Vilém Zouhar, Věra Kloudová, Martin Popel, Ondřej Bojar
  • for: Raising the bar for the reference translations used to evaluate machine translation systems in high-resource settings.
  • methods: Proposes a methodology for creating more reliable document-level human reference translations, called "optimal reference translations."
  • results: Evaluation of the obtained document-level optimal reference translations confirms a significant quality increase over "standard" ones and documents the relationship between evaluation and translation editing.
    Abstract The overall translation quality reached by current machine translation (MT) systems for high-resourced language pairs is remarkably good. Standard methods of evaluation are not suitable nor intended to uncover the many translation errors and quality deficiencies that still persist. Furthermore, the quality of standard reference translations is commonly questioned and comparable quality levels have been reached by MT alone in several language pairs. Navigating further research in these high-resource settings is thus difficult. In this article, we propose a methodology for creating more reliable document-level human reference translations, called "optimal reference translations," with the simple aim to raise the bar of what should be deemed "human translation quality." We evaluate the obtained document-level optimal reference translations in comparison with "standard" ones, confirming a significant quality increase and also documenting the relationship between evaluation and translation editing.

Radiology-Aware Model-Based Evaluation Metric for Report Generation

  • paper_url: http://arxiv.org/abs/2311.16764
  • repo_url: None
  • paper_authors: Amos Calamida, Farhad Nooralahzadeh, Morteza Rohanian, Koji Fujimoto, Mizuho Nishio, Michael Krauthammer
  • for: This paper proposes a new automated evaluation metric for machine-generated radiology reports.
  • methods: The paper uses the successful COMET architecture adapted for the radiology domain, and trains and publishes four medically-oriented model checkpoints, including one trained on RadGraph, a radiology knowledge graph.
  • results: The proposed metric correlates moderately to highly with established metrics such as BERTscore, BLEU, and CheXbert scores. One checkpoint also exhibits a high correlation with human judgment, assessed with publicly available annotations from six board-certified radiologists on a set of 200 reports and with the authors' own annotations from two radiologists on a collection of 100 reports.
    Abstract We propose a new automated evaluation metric for machine-generated radiology reports using the successful COMET architecture adapted for the radiology domain. We train and publish four medically-oriented model checkpoints, including one trained on RadGraph, a radiology knowledge graph. Our results show that our metric correlates moderately to high with established metrics such as BERTscore, BLEU, and CheXbert scores. Furthermore, we demonstrate that one of our checkpoints exhibits a high correlation with human judgment, as assessed using the publicly available annotations of six board-certified radiologists, using a set of 200 reports. We also performed our own analysis gathering annotations with two radiologists on a collection of 100 reports. The results indicate the potential effectiveness of our method as a radiology-specific evaluation metric. The code, data, and model checkpoints to reproduce our findings will be publicly available.

Entity-Aspect-Opinion-Sentiment Quadruple Extraction for Fine-grained Sentiment Analysis

  • paper_url: http://arxiv.org/abs/2311.16678
  • repo_url: None
  • paper_authors: Dan Ma, Jun Xu, Zongyu Wang, Xuezhi Cao, Yunsen Xian
  • for: Improves the comprehensiveness and fairness of aspect-based sentiment analysis (ABSA), where product reviews often contain implicit aspects and object-attribute co-existence cases that existing work overlooks.
  • methods: Proposes a new task, Entity-Aspect-Opinion-Sentiment Quadruple Extraction (EASQE), which hierarchically decomposes aspect terms into entities and aspects to avoid information loss, non-exclusive annotations, and opinion misunderstandings; four datasets are constructed from the SemEval Restaurant and Laptop data, and a two-stage sequence-tagging Trigger-Opinion framework is proposed as the baseline.
  • results: The Trigger-Opinion framework produces satisfactory EASQE results, applies to other ABSA tasks, and significantly outperforms state-of-the-art methods; the datasets and source code are publicly available.
    Abstract Product reviews often contain a large number of implicit aspects and object-attribute co-existence cases. Unfortunately, many existing studies in Aspect-Based Sentiment Analysis (ABSA) have overlooked this issue, which can make it difficult to extract opinions comprehensively and fairly. In this paper, we propose a new task called Entity-Aspect-Opinion-Sentiment Quadruple Extraction (EASQE), which aims to hierarchically decompose aspect terms into entities and aspects to avoid information loss, non-exclusive annotations, and opinion misunderstandings in ABSA tasks. To facilitate research in this new task, we have constructed four datasets (Res14-EASQE, Res15-EASQE, Res16-EASQE, and Lap14-EASQE) based on the SemEval Restaurant and Laptop datasets. We have also proposed a novel two-stage sequence-tagging based Trigger-Opinion framework as the baseline for the EASQE task. Empirical evaluations show that our Trigger-Opinion framework can generate satisfactory EASQE results and can also be applied to other ABSA tasks, significantly outperforming state-of-the-art methods. We have made the four datasets and source code of Trigger-Opinion publicly available to facilitate further research in this area.

A Distribution-Based Threshold for Determining Sentence Similarity

  • paper_url: http://arxiv.org/abs/2311.16675
  • repo_url: None
  • paper_authors: Gioele Cadamuro, Marco Gruppo
  • for: Solves a semantic textual similarity (STS) problem in which two sentences must be matched whose only distinguishing factor is highly specific information (names, addresses, identification codes), requiring a definition of when they are similar and when they are not.
  • methods: Uses a neural network with a siamese architecture to build the distributions of distances between similar and dissimilar sentence pairs, from which a "threshold" is derived: a well-defined quantity separating the vector distances of similar pairs from those of dissimilar pairs in new predictions and later analyses. A scoring method combines features of the distributions with properties of the distance function (see the sketch after this entry).
  • results: Applying the system to a well-known, widely used STS benchmark dataset shows the results generalize and transfer to a wider range of domains.
    Abstract We hereby present a solution to a semantic textual similarity (STS) problem in which it is necessary to match two sentences containing, as the only distinguishing factor, highly specific information (such as names, addresses, identification codes), and from which we need to derive a definition for when they are similar and when they are not. The solution revolves around the use of a neural network, based on the siamese architecture, to create the distributions of the distances between similar and dissimilar pairs of sentences. The goal of these distributions is to find a discriminating factor, that we call "threshold", which represents a well-defined quantity that can be used to distinguish vector distances of similar pairs from vector distances of dissimilar pairs in new predictions and later analyses. In addition, we developed a way to score the predictions by combining attributes from both the distributions' features and the way the distance function works. Finally, we generalize the results showing that they can be transferred to a wider range of domains by applying the system discussed to a well-known and widely used benchmark dataset for STS problems.
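Once the two empirical distance distributions are in hand, the threshold can be found by a simple sweep. The selection criterion below, maximizing balanced accuracy over candidate cuts, is one reasonable instantiation rather than necessarily the paper's exact rule.

```python
import numpy as np

def find_threshold(similar_dists, dissimilar_dists):
    """Pick the cut that best separates the distance distributions of
    similar and dissimilar sentence pairs."""
    candidates = np.sort(np.concatenate([similar_dists, dissimilar_dists]))

    def balanced_acc(t):
        return 0.5 * ((similar_dists <= t).mean()
                      + (dissimilar_dists > t).mean())

    return max(candidates, key=balanced_acc)
```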

Text2Tree: Aligning Text Representation to the Label Tree Hierarchy for Imbalanced Medical Classification

  • paper_url: http://arxiv.org/abs/2311.16650
  • repo_url: https://github.com/jyansir/text2tree
  • paper_authors: Jiahuan Yan, Haojun Gao, Zhang Kai, Weize Liu, Danny Chen, Jian Wu, Jintai Chen
  • for: Tackles medical text classification, where deep learning remains unstable because samples are often extremely imbalanced and scarce.
  • methods: Proposes Text2Tree, a framework-agnostic algorithm that uses only the internal label hierarchy when training deep models, embedding the ICD code tree of labels into cascade attention modules for hierarchy-aware label representations. Two new learning schemes, Similarity Surrogate Learning (SSL) and Dissimilarity Mixup Learning (DML), boost classification by reusing and distinguishing samples of other labels along the label hierarchy.
  • results: On authoritative public datasets and real-world medical records, the approach stably outperforms classical and advanced imbalanced classification methods.
    Abstract Deep learning approaches exhibit promising performances on various text tasks. However, they are still struggling on medical text classification since samples are often extremely imbalanced and scarce. Different from existing mainstream approaches that focus on supplementary semantics with external medical information, this paper aims to rethink the data challenges in medical texts and present a novel framework-agnostic algorithm called Text2Tree that only utilizes internal label hierarchy in training deep learning models. We embed the ICD code tree structure of labels into cascade attention modules for learning hierarchy-aware label representations. Two new learning schemes, Similarity Surrogate Learning (SSL) and Dissimilarity Mixup Learning (DML), are devised to boost text classification by reusing and distinguishing samples of other labels following the label representation hierarchy, respectively. Experiments on authoritative public datasets and real-world medical records show that our approach stably achieves superior performances over classical and advanced imbalanced classification methods.

Scaling Political Texts with ChatGPT

  • paper_url: http://arxiv.org/abs/2311.16639
  • repo_url: None
  • paper_authors: Gaël Le Mens, Aina Gallego
  • for: Obtains position estimates of political texts in continuous spaces using GPT-4.
  • methods: Develops and validates a new approach by positioning British party manifestos on the economic, social, and immigration policy dimensions and tweets by members of the US Congress on the left-right ideological spectrum (a querying sketch follows this entry).
  • results: For party manifestos, the correlation between GPT-4 positions and expert placements is 93% or higher, similar to or better than crowdsourced estimates; for individual tweets, the correlation with crowdsourced estimates is 91%; and for senators of the 117th US Congress, correlations reach 97% with roll-call-based estimates and 96% with campaign-funding-based ones.
    Abstract We use GPT-4 to obtain position estimates of political texts in continuous spaces. We develop and validate a new approach by positioning British party manifestos on the economic, social, and immigration policy dimensions and tweets by members of the US Congress on the left-right ideological spectrum. For the party manifestos, the correlation between the positions produced by GPT-4 and experts is 93% or higher, a performance similar to or better than that obtained with crowdsourced position estimates. For individual tweets, the positions obtained with GPT-4 achieve a correlation of 91% with crowdsourced position estimates. For senators of the 117th US Congress, the positions obtained with GPT-4 achieve a correlation of 97% with estimates based on roll call votes and of 96% with those based on campaign funding. Correlations are also substantial within party, indicating that position estimates produced with GPT-4 capture within-party differences between senators. Overall, using GPT-4 for ideological scaling is fast, cost-efficient, and reliable. This approach provides a viable alternative to scaling by both expert raters and crowdsourcing.
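A hedged sketch of the querying step with the OpenAI Python client; the paper's actual prompt wording, scale endpoints, and aggregation differ from this illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def position_estimate(text: str) -> float:
    prompt = ("Where does the author of the following text stand on a "
              "left-right ideological scale from 0 (most left) to 10 "
              "(most right)? Reply with a single number.\n\n" + text)
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return float(resp.choices[0].message.content.strip())
```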

On the Long Range Abilities of Transformers

  • paper_url: http://arxiv.org/abs/2311.16620
  • repo_url: None
  • paper_authors: Itamar Zimerman, Lior Wolf
  • for: Investigates why transformer architectures underperform on long-range tasks and shows that minimal modifications can narrow the gap with layers designed for that purpose.
  • methods: Drawing on key attributes of long-range layers (state-space layers, linear RNN layers, and global convolution layers), identifies two principles for long-range tasks, an inductive bias towards smoothness and locality, and integrates them into the attention mechanism with negligible additional computation and no additional trainable parameters (see the sketch after this entry).
  • results: The modified transformers significantly improve on the Long Range Arena (LRA) benchmark; theory and experiments also illuminate why transformers struggle with long-range dependencies and which properties are essential for capturing them.
    Abstract Despite their dominance in modern DL and, especially, NLP domains, transformer architectures exhibit sub-optimal performance on long-range tasks compared to recent layers that are specifically designed for this purpose. In this work, drawing inspiration from key attributes of long-range layers, such as state-space layers, linear RNN layers, and global convolution layers, we demonstrate that minimal modifications to the transformer architecture can significantly enhance performance on the Long Range Arena (LRA) benchmark, thus narrowing the gap with these specialized layers. We identify that two key principles for long-range tasks are (i) incorporating an inductive bias towards smoothness, and (ii) locality. As we show, integrating these ideas into the attention mechanism improves results with a negligible amount of additional computation and without any additional trainable parameters. Our theory and experiments also shed light on the reasons for the inferior performance of transformers on long-range tasks and identify critical properties that are essential for successfully capturing long-range dependencies.
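One concrete way to read the two principles into attention is a distance-dependent penalty on the logits, reminiscent of ALiBi-style biases: it adds locality (and, through the smooth decay, smoothness) without any new trainable parameters. The exact form used in the paper is not reproduced here, so treat this as an illustrative sketch.

```python
import torch

def biased_attention(scores, decay=0.1):
    """scores: (n, n) raw attention logits. Subtracting a term that grows
    with token distance injects a locality bias with zero new parameters;
    the linear form and decay value are assumptions."""
    n = scores.size(-1)
    pos = torch.arange(n)
    dist = (pos[:, None] - pos[None, :]).abs().to(scores.dtype)
    return torch.softmax(scores - decay * dist, dim=-1)
```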

MedGen: A Python Natural Language Processing Toolkit for Medical Text Processing

  • paper_url: http://arxiv.org/abs/2311.16588
  • repo_url: https://github.com/yale-lily/medgen
  • paper_authors: Rui Yang, Qingcheng Zeng, Keen You, Yujie Qiao, Lucas Huang, Chia-Chun Hsieh, Benjamin Rosand, Jeremy Goldwasser, Amisha D Dave, Tiarnan D. L. Keenan, Emily Y Chew, Dragomir Radev, Zhiyong Lu, Hua Xu, Qingyu Chen, Irene Li
  • for: Biomedical researchers and healthcare professionals who need an easy-to-use toolkit for medical text processing.
  • methods: Introduces MedGen, a comprehensive natural language processing (NLP) toolkit that includes four advanced generative functions (question answering, text summarization, text simplification, and machine translation) and 12 essential NLP functions (such as word tokenization and sentence segmentation).
  • results: The authors fine-tuned 32 domain-specific language models, evaluated them thoroughly on 24 established benchmarks, and conducted manual reviews with clinicians; they also expanded the toolkit with query and search functions and standardized and integrated functions from third-party libraries. The toolkit, models, and associated data are publicly available at https://github.com/Yale-LILY/MedGen.
    Abstract This study introduces MedGen, a comprehensive natural language processing (NLP) toolkit designed for medical text processing. MedGen is tailored for biomedical researchers and healthcare professionals with an easy-to-use, all-in-one solution that requires minimal programming expertise. It includes (1) Generative Functions: For the first time, MedGen includes four advanced generative functions: question answering, text summarization, text simplification, and machine translation; (2) Basic NLP Functions: MedGen integrates 12 essential NLP functions such as word tokenization and sentence segmentation; and (3) Query and Search Capabilities: MedGen provides user-friendly query and search functions on text corpora. We fine-tuned 32 domain-specific language models, evaluated them thoroughly on 24 established benchmarks and conducted manual reviews with clinicians. Additionally, we expanded our toolkit by introducing query and search functions, while also standardizing and integrating functions from third-party libraries. The toolkit, its models, and associated data are publicly available via https://github.com/Yale-LILY/MedGen.

Recognizing Conditional Causal Relationships about Emotions and Their Corresponding Conditions

  • paper_url: http://arxiv.org/abs/2311.16579
  • repo_url: None
  • paper_authors: Xinhong Chen, Zongxi Li, Yaowei Wang, Haoran Xie, Jianping Wang, Qing Li
  • for: Studies causal relationships between emotions and causes in text, emphasizing that such relationships can be valid only under specific context clauses.
  • methods: Proposes a new task: determining whether an input emotion-cause pair has a valid causal relationship under different contexts and extracting the specific context clauses that participate in it. A benchmark dataset is manually annotated (with context-clause types reusable elsewhere), negative sampling balances documents with and without causal relationships, and an end-to-end multi-task framework with a context masking module and a prediction aggregation module handles the task's two goals.
  • results: Extensive comparative experiments and ablation studies demonstrate the effectiveness and generality of the proposed framework.
    Abstract The study of causal relationships between emotions and causes in texts has recently received much attention. Most works focus on extracting causally related clauses from documents. However, none of these works has considered that the causal relationships among the extracted emotion and cause clauses can only be valid under some specific context clauses. To highlight the context in such special causal relationships, we propose a new task to determine whether or not an input pair of emotion and cause has a valid causal relationship under different contexts and extract the specific context clauses that participate in the causal relationship. Since the task is new for which no existing dataset is available, we conduct manual annotation on a benchmark dataset to obtain the labels for our tasks and the annotations of each context clause's type that can also be used in some other applications. We adopt negative sampling to construct the final dataset to balance the number of documents with and without causal relationships. Based on the constructed dataset, we propose an end-to-end multi-task framework, where we design two novel and general modules to handle the two goals of our task. Specifically, we propose a context masking module to extract the context clauses participating in the causal relationships. We propose a prediction aggregation module to fine-tune the prediction results according to whether the input emotion and causes depend on specific context clauses. Results of extensive comparative experiments and ablation studies demonstrate the effectiveness and generality of our proposed framework.

Evaluation of dynamic characteristics of power grid based on GNN and application on knowledge graph

  • paper_url: http://arxiv.org/abs/2311.16522
  • repo_url: None
  • paper_authors: Hao Pei, Si Lin, Chuanfu Li, Che Wang, Haoming Chen, Sizhe Li
  • for: Enhances intelligent fault diagnosis for power grid operation and maintenance.
  • methods: Uses a graph neural network (GNN) to identify faulty nodes in the power grid, coupling a specialized electrical feature extraction model with a knowledge graph; incorporating temporal data, the method leverages node states from preceding and subsequent time periods to aid current fault detection (see the sketch after this entry).
  • results: Experiments show the method locates fault nodes in simulated scenarios with 99.53% accuracy, and the GNN's feature modeling allows a qualitative analysis of how faults spread across nodes.
    Abstract A novel method for detecting faults in power grids using a graph neural network (GNN) has been developed, aimed at enhancing intelligent fault diagnosis in network operation and maintenance. This GNN-based approach identifies faulty nodes within the power grid through a specialized electrical feature extraction model coupled with a knowledge graph. Incorporating temporal data, the method leverages the status of nodes from preceding and subsequent time periods to aid in current fault detection. To validate the effectiveness of this GNN in extracting node features, a correlation analysis of the output features from each node within the neural network layer was conducted. The results from experiments show that this method can accurately locate fault nodes in simulated scenarios with a remarkable 99.53% accuracy. Additionally, the graph neural network's feature modeling allows for a qualitative examination of how faults spread across nodes, providing valuable insights for analyzing fault nodes.
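Fault-node detection here is node classification over the grid graph. The minimal message-passing layer below illustrates the GNN backbone only; the paper's actual architecture, electrical feature extraction, and knowledge-graph coupling are not reproduced.

```python
import torch

def gnn_layer(node_feats, adj, weight):
    """node_feats: (n, d_in) per-node features; adj: (n, n) 0/1 grid
    topology with self-loops; weight: (d_in, d_out). Mean-aggregates
    neighbor features, then applies a learned transform."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
    return torch.relu(((adj @ node_feats) / deg) @ weight)

# Stacking such layers and ending with a per-node sigmoid yields a
# faulty-vs-healthy score for every node in the grid.
```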

StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models

  • paper_url: http://arxiv.org/abs/2311.16509
  • repo_url: None
  • paper_authors: Kazuki Yamauchi, Yusuke Ijima, Yuki Saito
  • for: Generating natural language descriptions of the speaking styles that appear in speech.
  • methods: StyleCap trains neural networks on paired speech and natural-language descriptions to predict, from a speech representation vector, prefix vectors that are fed into a large language model (LLM)-based text decoder (see the sketch after this entry).
  • results: Using richer LLMs for the text decoder, speech self-supervised learning (SSL) features, and sentence-rephrasing augmentation improves the accuracy and diversity of the generated speaking-style captions.
    Abstract We propose StyleCap, a method to generate natural language descriptions of speaking styles appearing in speech. Although most of conventional techniques for para-/non-linguistic information recognition focus on the category classification or the intensity estimation of pre-defined labels, they cannot provide the reasoning of the recognition result in an interpretable manner. As a first step towards an end-to-end method for generating speaking-style prompts from speech, i.e., automatic speaking-style captioning, StyleCap uses paired data of speech and natural language descriptions to train neural networks that predict prefix vectors fed into a large language model (LLM)-based text decoder from a speech representation vector. We explore an appropriate text decoder and speech feature representation suitable for this new task. The experimental results demonstrate that our StyleCap leveraging richer LLMs for the text decoder, speech self-supervised learning (SSL) features, and sentence rephrasing augmentation improves the accuracy and diversity of generated speaking-style captions. Samples of speaking-style captions generated by our StyleCap are publicly available.
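The bridge between the speech encoder and the LLM decoder can be sketched as pooling SSL features and mapping them to a fixed number of prefix vectors. The mean pooling, the mapper interface, and the dimensions below are assumptions.

```python
import torch

def prefix_from_speech(ssl_feats, mapper, prefix_len=10, d_llm=4096):
    """ssl_feats: (T, d_speech) frame-level SSL features; `mapper` is a
    trained module projecting the pooled vector to prefix_len * d_llm
    dimensions. The rows are prepended to the decoder as soft tokens."""
    pooled = ssl_feats.mean(dim=0)   # (d_speech,)
    prefix = mapper(pooled)          # (prefix_len * d_llm,)
    return prefix.view(prefix_len, d_llm)
```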

Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine

  • paper_url: http://arxiv.org/abs/2311.16452
  • repo_url: None
  • paper_authors: Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, Renqian Luo, Scott Mayer McKinney, Robert Osazuwa Ness, Hoifung Poon, Tao Qin, Naoto Usuyama, Chris White, Eric Horvitz
  • for: This paper investigates how capable the generalist foundation model GPT-4 is in the medical domain, and how prompt engineering alone can unlock specialist-level capability without domain-specific fine-tuning.
  • methods: The paper systematically explores prompt engineering, composing several general-purpose prompting strategies, none requiring domain expertise or expert-curated content, into a method called Medprompt (a hedged sketch follows the abstract).
  • results: With Medprompt, GPT-4 achieves state-of-the-art results on all nine benchmark datasets in the MultiMedQA suite, outperforming specialist models such as Med-PaLM 2 with an order of magnitude fewer model calls, cutting the MedQA error rate by 27%, and surpassing a score of 90% for the first time. Medprompt also generalizes beyond medicine, as shown on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.
    Abstract Generalist foundation models such as GPT-4 have displayed surprising capabilities in a wide variety of domains and tasks. Yet, there is a prevalent assumption that they cannot match specialist capabilities of fine-tuned models. For example, most explorations to date on medical competency benchmarks have leveraged domain-specific training, as exemplified by efforts on BioGPT and Med-PaLM. We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training. Rather than using simple prompting to highlight the model's out-of-the-box capabilities, we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks. The prompting methods we explore are general purpose, and make no specific use of domain expertise, removing the need for expert-curated content. Our experimental design carefully controls for overfitting during the prompt engineering process. We introduce Medprompt, based on a composition of several prompting strategies. With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite. The method outperforms leading specialist models such as Med-PaLM 2 by a significant margin with an order of magnitude fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27% reduction in error rate on the MedQA dataset over the best methods to date achieved with specialist models and surpasses a score of 90% for the first time. Beyond medical problems, we show the power of Medprompt to generalize to other domains and provide evidence for the broad applicability of the approach via studies of the strategy on exams in electrical engineering, machine learning, philosophy, accounting, law, nursing, and clinical psychology.
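    As one illustration of how general-purpose prompting strategies can be composed, below is a hedged sketch of a Medprompt-style ensemble: retrieved few-shot examples, a chain-of-thought instruction, and choice-shuffling with majority voting to reduce position bias. The `llm` callable, the exact prompt wording, and the naive answer parsing are hypothetical; consult the paper for the actual composition.

```python
import random
from collections import Counter

def medprompt_answer(question: str, choices: list[str], llm, k_shots: list[str],
                     n_votes: int = 5) -> str:
    """Sketch of a Medprompt-style ensemble (details assumed).

    k_shots: few-shot exemplars selected by the caller, e.g. nearest
    neighbors of the question in an embedding space.
    llm: hypothetical callable, prompt string in, answer letter out.
    """
    votes = []
    for _ in range(n_votes):
        order = random.sample(range(len(choices)), len(choices))  # shuffle options
        lettered = [f"{chr(ord('A') + i)}. {choices[j]}" for i, j in enumerate(order)]
        prompt = (
            "\n\n".join(k_shots)
            + f"\n\nQuestion: {question}\n" + "\n".join(lettered)
            + "\nLet's think step by step, then answer with a single letter."
        )
        letter = llm(prompt).strip()[0].upper()  # naive parse, e.g. "B"
        votes.append(choices[order[ord(letter) - ord("A")]])  # undo the shuffle
    return Counter(votes).most_common(1)[0][0]  # majority vote
```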

Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos

  • paper_url: http://arxiv.org/abs/2311.16444
  • repo_url: None
  • paper_authors: Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta, Atsushi Hashimoto, Yoshitaka Ushiku, Yoichi Sato
  • for: This paper proposes a new benchmark for cross-view transfer in dense video captioning, adapting models from exocentric web instructional videos to egocentric video. The transfer is motivated by the scarcity of egocentric data and the abundance of web video, and the paper addresses it through transfer learning between the two.
  • methods: A view-invariant learning method bridges the view gap between web and egocentric videos via adversarial training, applied in both the pre-training and fine-tuning stages to learn invariant features (a minimal sketch appears after the abstract).
  • results: Transfer-learning experiments between the two datasets show the method effectively transfers knowledge and mitigates the view gap. The paper also releases a real-life egocentric dataset (EgoYC2), whose captions are shared with YouCook2, to support further research on egocentric video captioning.
    Abstract We propose a novel benchmark for cross-view knowledge transfer of dense video captioning, adapting models from web instructional videos with exocentric views to an egocentric view. While dense video captioning (predicting time segments and their captions) is primarily studied with exocentric videos (e.g., YouCook2), benchmarks with egocentric videos are restricted due to data scarcity. To overcome the limited video availability, transferring knowledge from abundant exocentric web videos is a practical approach. However, learning the correspondence between exocentric and egocentric views is difficult due to their dynamic view changes. The web videos contain mixed views focusing on either human body actions or close-up hand-object interactions, while the egocentric view is constantly shifting as the camera wearer moves. This necessitates the in-depth study of cross-view transfer under complex view changes. In this work, we first create a real-life egocentric dataset (EgoYC2) whose captions are shared with YouCook2, enabling transfer learning between these datasets assuming their ground-truth is accessible. To bridge the view gaps, we propose a view-invariant learning method using adversarial training in both the pre-training and fine-tuning stages. While the pre-training is designed to learn invariant features against the mixed views in the web videos, the view-invariant fine-tuning further mitigates the view gaps between both datasets. We validate our proposed method by studying how effectively it overcomes the view change problem and efficiently transfers the knowledge to the egocentric domain. Our benchmark pushes the study of the cross-view transfer into a new task domain of dense video captioning and will envision methodologies to describe egocentric videos in natural language.
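    A common way to implement the adversarial view invariance the abstract describes is a view discriminator trained through a gradient reversal layer: the discriminator tries to tell exocentric from egocentric clip features, while the reversed gradient pushes the shared encoder to fool it. Whether the paper uses gradient reversal specifically, along with all sizes below, is an assumption.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity forward, negated (scaled) gradient backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class ViewInvariantHead(nn.Module):
    """Sketch of adversarial view-invariant feature learning (sizes assumed)."""
    def __init__(self, feat_dim: int = 512, lam: float = 0.1):
        super().__init__()
        self.lam = lam
        self.disc = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 2)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: [batch, feat_dim] from the shared video encoder
        return self.disc(GradReverse.apply(feats, self.lam))

# Assumed training signal, combined with the usual captioning objective:
#   total_loss = captioning_loss + cross_entropy(view_logits, view_labels)
#   where view_labels: 0 = exocentric (web), 1 = egocentric.
```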

PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2311.17086
  • repo_url: https://github.com/OPPO-Mente-Lab/PEA-Diffusion
  • paper_authors: Jian Ma, Chen Chen, Qingsong Xie, Haonan Lu
  • for: This paper proposes a simple plug-and-play language transfer method to extend text-to-image generation to non-English prompts.
  • methods: Based on knowledge distillation: a lightweight, MLP-like parameter-efficient adapter (PEA) with only 6M parameters is trained under teacher distillation on a small parallel corpus (a hedged sketch follows the abstract).
  • results: The method drastically reduces the training data required and enables cross-lingual text-to-image generation, closely approaching the English model's performance on a general prompt evaluation set. Code will be available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion.
    Abstract Text-to-image diffusion models are well-known for their ability to generate realistic images based on textual prompts. However, the existing works have predominantly focused on English, lacking support for non-English text-to-image models. The most commonly used translation methods cannot solve the generation problem related to language culture, while training from scratch on a specific language dataset is prohibitively expensive. In this paper, we are inspired to propose a simple plug-and-play language transfer method based on knowledge distillation. All we need to do is train a lightweight MLP-like parameter-efficient adapter (PEA) with only 6M parameters under teacher knowledge distillation along with a small parallel data corpus. We are surprised to find that freezing the parameters of UNet can still achieve remarkable performance on the language-specific prompt evaluation set, demonstrating that PEA can stimulate the potential generation ability of the original UNet. Additionally, it closely approaches the performance of the English text-to-image model on a general prompt evaluation set. Furthermore, our adapter can be used as a plugin to achieve significant results in downstream tasks in cross-lingual text-to-image generation. Code will be available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion
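    A minimal sketch of the adapter-plus-distillation recipe, under assumed dimensions: a small MLP maps a non-English text encoder's prompt embedding into the space the frozen English pipeline expects, trained so that adapted embeddings match the English teacher's embeddings for the same parallel prompt. The pooled-embedding formulation and the plain MSE objective are simplifying assumptions, not the paper's exact losses.

```python
import torch
import torch.nn as nn

class PEAAdapter(nn.Module):
    """Sketch of a parameter-efficient adapter for language transfer.

    Dimensions are illustrative; the paper's adapter has ~6M parameters.
    """
    def __init__(self, src_dim: int = 1024, tgt_dim: int = 768, hidden: int = 1280):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(src_dim, hidden), nn.GELU(), nn.Linear(hidden, tgt_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, src_dim] pooled non-English prompt embedding
        # (pooling is an assumption to keep teacher/student shapes aligned)
        return self.net(x)

def distillation_loss(adapter: PEAAdapter, src_emb: torch.Tensor,
                      teacher_emb: torch.Tensor) -> torch.Tensor:
    """Match adapted non-English embeddings to the English teacher's
    embeddings for the same parallel prompt; the UNet stays frozen."""
    return nn.functional.mse_loss(adapter(src_emb), teacher_emb)
```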

CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language Models

  • paper_url: http://arxiv.org/abs/2311.16421
  • repo_url: https://github.com/astrodrew/cdeval
  • paper_authors: Yuhang Wang, Yanxu Zhu, Chao Kong, Shuyu Wei, Xiaoyuan Yi, Xing Xie, Jitao Sang
  • for: This study proposes CDEval, a new benchmark for measuring the cultural dimensions of large language models (LLMs), so that cultural factors can be better accounted for in LLM development and evaluation.
  • methods: CDEval is built by combining GPT-4's automated generation with human verification, covering six cultural dimensions across seven domains (an evaluation-loop sketch follows the abstract).
  • results: Experiments reveal both consistencies and variations in the cultural dimensions of mainstream LLMs, underscoring the need to account for culture when developing and evaluating LLMs for diverse domains and cultural settings.
    Abstract As the scaling of Large Language Models (LLMs) has dramatically enhanced their capabilities, there has been a growing focus on the alignment problem to ensure their responsible and ethical use. While existing alignment efforts predominantly concentrate on universal values such as the HHH principle, the aspect of culture, which is inherently pluralistic and diverse, has not received adequate attention. This work introduces a new benchmark, CDEval, aimed at evaluating the cultural dimensions of LLMs. CDEval is constructed by incorporating both GPT-4's automated generation and human verification, covering six cultural dimensions across seven domains. Our comprehensive experiments provide intriguing insights into the culture of mainstream LLMs, highlighting both consistencies and variations across different dimensions and domains. The findings underscore the importance of integrating cultural considerations in LLM development, particularly for applications in diverse cultural settings. Through CDEval, we aim to broaden the horizon of LLM alignment research by including cultural dimensions, thus providing a more holistic framework for the future development and evaluation of LLMs. This benchmark serves as a valuable resource for cultural studies in LLMs, paving the way for more culturally aware and sensitive models.
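    For illustration only, a sketch of how scores from such a benchmark might be aggregated per cultural dimension and domain; the item format and the `model` callable below are hypothetical, not CDEval's actual schema.

```python
from collections import defaultdict

def score_cultural_dimensions(model, items):
    """Sketch: aggregate a CDEval-style benchmark (format assumed).

    Each item is assumed to look like
      {"dimension": "individualism", "domain": "education",
       "question": "...", "options": ["option A ...", "option B ..."]}
    and `model` is a hypothetical callable returning the chosen option index.
    The score per (dimension, domain) is the fraction of picks of the option
    coded as orientation 1 -- one simple way to surface consistencies and
    variations across dimensions and domains.
    """
    picks = defaultdict(list)
    for item in items:
        choice = model(item["question"], item["options"])  # 0 or 1
        picks[(item["dimension"], item["domain"])].append(choice)
    return {key: sum(v) / len(v) for key, v in picks.items()}
```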