cs.CL - 2023-08-19

An Empirical Study of CLIP for Text-based Person Search

  • paper_url: http://arxiv.org/abs/2308.10045
  • repo_url: https://github.com/flame-chasers/tbps-clip
  • paper_authors: Min Cao, Yang Bai, Ziyin Zeng, Mang Ye, Min Zhang
  • for: This paper aims to explore the potential of the vision-language pre-training model CLIP for downstream Text-Based Person Search (TBPS) tasks.
  • methods: The paper conducts a comprehensive empirical study of CLIP for TBPS, revisiting critical design considerations such as data augmentation and the loss function, and applying practical training tricks.
  • results: The model achieves satisfactory performance without any sophisticated modules, and probing experiments demonstrate the effectiveness of TBPS-CLIP from various aspects, providing empirical insights and highlighting future research directions.
    Abstract Text-based Person Search (TBPS) aims to retrieve person images using natural language descriptions. Recently, Contrastive Language Image Pretraining (CLIP), a universal large cross-modal vision-language pre-training model, has performed remarkably on various cross-modal downstream tasks due to its powerful cross-modal semantic learning capacity. TBPS, as a fine-grained cross-modal retrieval task, is likewise seeing a rise in CLIP-based research. To explore the potential of the vision-language pre-training model for downstream TBPS tasks, this paper makes the first attempt to conduct a comprehensive empirical study of CLIP for TBPS and thus contributes a straightforward, incremental, yet strong TBPS-CLIP baseline to the TBPS community. We revisit critical design considerations under CLIP, including data augmentation and loss function. The model, with the aforementioned designs and practical training tricks, can attain satisfactory performance without any sophisticated modules. We also conduct probing experiments on TBPS-CLIP in model generalization and model compression, demonstrating the effectiveness of TBPS-CLIP from various aspects. This work is expected to provide empirical insights and highlight future CLIP-based TBPS research.
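
As a concrete reference point, here is a minimal numpy sketch of the symmetric CLIP-style contrastive (InfoNCE) objective the paper builds on; the batch size, embedding dimension, and temperature are illustrative, and TBPS-CLIP's actual loss variants and augmentations differ in detail:

```python
import numpy as np

def info_nce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings:
    a minimal sketch of the CLIP-style contrastive objective."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature   # (B, B); matches on diagonal
    labels = np.arange(len(logits))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)     # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# toy usage: 4 image/text pairs with 8-dim embeddings
rng = np.random.default_rng(0)
loss = info_nce_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(f"contrastive loss: {loss:.3f}")
```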

GameEval: Evaluating LLMs on Conversational Games

  • paper_url: http://arxiv.org/abs/2308.10032
  • repo_url: None
  • paper_authors: Dan Qiao, Chenfei Wu, Yaobo Liang, Juntao Li, Nan Duan
  • for: This paper aims to evaluate large language models (LLMs) through goal-driven conversational games, addressing the limitations of existing evaluation methods.
  • methods: The proposed approach, called GameEval, treats LLMs as game players and assigns them distinct roles with specific goals achieved through conversations of various forms, such as discussion, question answering, and voting.
  • results: Extensive experiments show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems.
    Abstract The rapid advancements in large language models (LLMs) have presented challenges in evaluating those models. Existing evaluation methods are either reference-based or preference-based, which inevitably need human intervention or introduce test bias caused by evaluator models. In this paper, we propose GameEval, a novel approach to evaluating LLMs through goal-driven conversational games, overcoming the limitations of previous methods. GameEval treats LLMs as game players and assigns them distinct roles with specific goals achieved by launching conversations of various forms, including discussion, question answering, and voting. We design three unique games with cooperative or adversarial objectives, accompanied by corresponding evaluation metrics, to show how this new paradigm comprehensively evaluates model performance. Through extensive experiments, we show that GameEval can effectively differentiate the capabilities of various LLMs, providing a comprehensive assessment of their integrated abilities to solve complex problems. Our public anonymous code is available at https://github.com/GameEval/GameEval.
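
The game-as-evaluation idea can be sketched as a simple loop; the stub agents and goal checker below are placeholders for real LLM players and for GameEval's actual games and metrics:

```python
def play_game(agents, goal_checker, max_rounds=10):
    """Minimal sketch of a GameEval-style loop (details are illustrative).

    `agents` maps a role name to a callable that takes the transcript and
    returns that role's next utterance; `goal_checker` scores the finished
    transcript. Both stand in for real LLM calls and game metrics."""
    transcript = []
    for _ in range(max_rounds):
        for role, speak in agents.items():
            transcript.append((role, speak(transcript)))
    return goal_checker(transcript)

# toy usage with stub agents instead of real LLM players
agents = {
    "persuader": lambda t: "You should vote for option A.",
    "voter": lambda t: "I vote A." if len(t) > 2 else "I am undecided.",
}
won = play_game(agents, goal_checker=lambda t: t[-1][1] == "I vote A.", max_rounds=2)
print("persuader reached its goal:", won)
```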

ControlRetriever: Harnessing the Power of Instructions for Controllable Retrieval

  • paper_url: http://arxiv.org/abs/2308.10025
  • repo_url: None
  • paper_authors: Kaihang Pan, Juncheng Li, Hongye Song, Hao Fei, Wei Ji, Shuo Zhang, Jun Lin, Xiaozhong Liu, Siliang Tang
  • for: This work aims to improve the performance of dense retrieval models across diverse retrieval tasks and to let the models adapt to different search intents.
  • methods: The study introduces ControlRetriever, a method for controlling dense retrieval models that builds on the foundation of ControlNet, unifying different retrieval models into a single system, and trains it with a novel LLM-guided instruction synthesizing and iterative training strategy so that it adapts to varied retrieval tasks.
  • results: Experimental results on the BEIR benchmark show that, without task-specific tuning, ControlRetriever achieves clear improvements over baseline methods and also reaches state-of-the-art zero-shot performance.
    Abstract Recent studies have shown that dense retrieval models, lacking dedicated training data, struggle to perform well across diverse retrieval tasks, as different retrieval tasks often entail distinct search intents. To address this challenge, in this work we introduce ControlRetriever, a generic and efficient approach with a parameter isolated architecture, capable of controlling dense retrieval models to directly perform varied retrieval tasks, harnessing the power of instructions that explicitly describe retrieval intents in natural language. Leveraging the foundation of ControlNet, which has proven powerful in text-to-image generation, ControlRetriever imbues different retrieval models with the new capacity of controllable retrieval, all while being guided by task-specific instructions. Furthermore, we propose a novel LLM guided Instruction Synthesizing and Iterative Training strategy, which iteratively tunes ControlRetriever based on extensive automatically-generated retrieval data with diverse instructions, capitalizing on the advances of large language models. Extensive experiments show that in the BEIR benchmark, with only natural language descriptions of specific retrieval intent for each task, ControlRetriever, as a unified multi-task retrieval system without task-specific tuning, significantly outperforms baseline methods designed with task-specific retrievers and also achieves state-of-the-art zero-shot performance.
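
A toy sketch of instruction-conditioned retrieval scoring; the hashed-trigram encoder and plain instruction-query concatenation are stand-ins, since ControlRetriever actually injects instructions through parameter-isolated ControlNet-style modules:

```python
import numpy as np

def embed(text, dim=64):
    """Stand-in encoder: hash character trigrams into a dense vector.
    A real system would use the (instruction-tuned) dense retriever here."""
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(instruction, query, docs, top_k=2):
    """Score documents against an instruction-conditioned query.
    Illustrates the interface only, not the parameter-isolated method."""
    q = embed(f"{instruction} {query}")
    scores = [(doc, float(embed(doc) @ q)) for doc in docs]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]

docs = ["how to cite a paper", "citation network of papers", "papers citing BERT"]
print(retrieve("Retrieve documents that answer the question.", "how do I cite?", docs))
```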

HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding

  • paper_url: http://arxiv.org/abs/2308.09985
  • repo_url: https://github.com/albertan017/hicl
  • paper_authors: Hanzhuo Tan, Chunpu Xu, Jing Li, Yuqun Zhang, Zeyang Fang, Zeyu Chen, Baohua Lai
  • for: addresses the issue of compromised performance in existing natural language understanding (NLU) models when faced with short and noisy social media content.
  • methods: leverages in-context learning (ICL) and a novel hashtag-driven in-context learning (HICL) framework, which pre-trains a model #Encoder using hashtags to drive BERT-based pre-training through contrastive learning, and employs a gradient-based method to identify trigger terms useful in fusing information from both sources.
  • results: substantially advances the previous state-of-the-art results on seven downstream tasks; combining the source input with a top-retrieved post from #Encoder is more effective than using semantically similar posts, and trigger words largely help in merging context from the source and retrieved posts.
    Abstract Natural language understanding (NLU) is integral to various social media applications. However, existing NLU models rely heavily on context for semantic learning, resulting in compromised performance when faced with short and noisy social media content. To address this issue, we leverage in-context learning (ICL), wherein language models learn to make inferences by conditioning on a handful of demonstrations to enrich the context and propose a novel hashtag-driven in-context learning (HICL) framework. Concretely, we pre-train a model #Encoder, which employs #hashtags (user-annotated topic labels) to drive BERT-based pre-training through contrastive learning. Our objective here is to enable #Encoder to gain the ability to incorporate topic-related semantic information, which allows it to retrieve topic-related posts to enrich contexts and enhance social media NLU with noisy contexts. To further integrate the retrieved context with the source text, we employ a gradient-based method to identify trigger terms useful in fusing information from both sources. For empirical studies, we collected 45M tweets to set up an in-context NLU benchmark, and the experimental results on seven downstream tasks show that HICL substantially advances the previous state-of-the-art results. Furthermore, we conducted extensive analyses and found that: (1) combining source input with a top-retrieved post from #Encoder is more effective than using semantically similar posts; (2) trigger words can largely benefit in merging context from the source and retrieved posts.
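
The hashtag-driven pairing behind #Encoder pre-training can be sketched as follows; only the positive-pair construction is shown, and feeding these pairs into the BERT-based contrastive objective is omitted:

```python
from itertools import combinations

def hashtag_pairs(posts):
    """Build positive pairs for contrastive pre-training: two posts count
    as a positive pair when they share a user-annotated #hashtag. A sketch
    of the data side of #Encoder pre-training only."""
    by_tag = {}
    for post in posts:
        for token in post.split():
            if token.startswith("#"):
                by_tag.setdefault(token.lower(), []).append(post)
    pairs = set()
    for group in by_tag.values():
        for a, b in combinations(group, 2):
            pairs.add((a, b))
    return sorted(pairs)

posts = [
    "big win tonight #nba",
    "what a game #NBA #playoffs",
    "new phone who dis #tech",
]
for a, b in hashtag_pairs(posts):
    print(f"positive pair: {a!r} <-> {b!r}")
```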

FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models

  • paper_url: http://arxiv.org/abs/2308.09975
  • repo_url: https://github.com/sufe-aiflm-lab/fineval
  • paper_authors: Liwen Zhang, Weige Cai, Zhaowei Liu, Zhi Yang, Wei Dai, Yujie Liao, Qianru Qin, Yifei Li, Xingyu Liu, Zhiqiang Liu, Zhoufan Zhu, Anbo Wu, Xin Guo, Yun Chen
  • for: This paper aims to evaluate the financial domain knowledge of large language models (LLMs) and to provide a dedicated evaluation benchmark.
  • methods: The benchmark employs a range of prompt types, including zero-shot, few-shot, answer-only, and chain-of-thought prompts, to evaluate state-of-the-art Chinese and English LLMs on financial domain knowledge.
  • results: Only GPT-4 achieved an accuracy close to 70% across the different prompt settings, indicating significant growth potential for LLMs in financial domain knowledge.
    Abstract Large language models (LLMs) have demonstrated exceptional performance in various natural language processing tasks, yet their efficacy in more challenging and domain-specific tasks remains largely unexplored. This paper presents FinEval, a benchmark specifically designed for the financial domain knowledge in the LLMs. FinEval is a collection of high-quality multiple-choice questions covering Finance, Economy, Accounting, and Certificate. It includes 4,661 questions spanning 34 different academic subjects. To ensure a comprehensive model performance evaluation, FinEval employs a range of prompt types, including zero-shot and few-shot prompts, as well as answer-only and chain-of-thought prompts. Evaluating state-of-the-art Chinese and English LLMs on FinEval shows that only GPT-4 achieved an accuracy close to 70% in different prompt settings, indicating significant growth potential for LLMs in the financial domain knowledge. Our work offers a more comprehensive financial knowledge evaluation benchmark, utilizing mock exam data and covering a wide range of evaluated LLMs.
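
A FinEval-style multiple-choice harness might look like the following sketch; `answer_fn` stands in for prompting an LLM under any of the zero-shot, few-shot, answer-only, or chain-of-thought settings, and the sample question is invented for illustration:

```python
def evaluate_mc(questions, answer_fn):
    """Score a model on multiple-choice questions. `answer_fn` is a
    placeholder for an LLM call that returns an option letter."""
    correct = 0
    for q in questions:
        prompt = q["question"] + "\n" + "\n".join(
            f"{letter}. {choice}" for letter, choice in zip("ABCD", q["choices"])
        )
        if answer_fn(prompt) == q["answer"]:
            correct += 1
    return correct / len(questions)

questions = [
    {"question": "Which statement reports cash flows?",
     "choices": ["Balance sheet", "Cash flow statement",
                 "Income statement", "Equity statement"],
     "answer": "B"},
]
# stub model that always answers "B"
print("accuracy:", evaluate_mc(questions, answer_fn=lambda prompt: "B"))
```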

Utilizing Semantic Textual Similarity for Clinical Survey Data Feature Selection

  • paper_url: http://arxiv.org/abs/2308.09892
  • repo_url: https://github.com/bcwarner/sts-select
  • paper_authors: Benjamin C. Warner, Ziqi Xu, Simon Haroutounian, Thomas Kannampallil, Chenyang Lu
  • for: This paper addresses the problem that survey data often contain many features but comparatively few examples, which causes machine learning models that predict outcomes from such data to overfit.
  • methods: The paper uses feature selection to address this problem, in particular evaluating the textual names of features to judge which features are relevant to a target outcome.
  • results: The study finds that features selected with semantic textual similarity (STS) scores can yield higher-performing models than traditional feature selection algorithms.
    Abstract Survey data can contain a high number of features while having a comparatively low quantity of examples. Machine learning models that attempt to predict outcomes from survey data under these conditions can overfit and result in poor generalizability. One remedy to this issue is feature selection, which attempts to select an optimal subset of features to learn upon. A relatively unexplored source of information in the feature selection process is the usage of textual names of features, which may be semantically indicative of which features are relevant to a target outcome. The relationships between feature names and target names can be evaluated using language models (LMs) to produce semantic textual similarity (STS) scores, which can then be used to select features. We examine the performance using STS to select features directly and in the minimal-redundancy-maximal-relevance (mRMR) algorithm. The performance of STS as a feature selection metric is evaluated against preliminary survey data collected as a part of a clinical study on persistent post-surgical pain (PPSP). The results suggest that features selected with STS can result in higher performance models compared to traditional feature selection algorithms.
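
The STS-based selection step can be sketched as below; the hashed bag-of-words encoder is a stand-in for a real sentence-embedding language model, and plugging the scores into mRMR is omitted:

```python
import numpy as np

def embed_name(name, dim=32):
    """Toy text embedding via hashed word features; stands in for a real
    sentence-embedding language model producing STS scores."""
    vec = np.zeros(dim)
    for word in name.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def sts_select(feature_names, target_name, k=2):
    """Rank features by cosine similarity between feature-name and
    target-name embeddings and keep the top k."""
    target = embed_name(target_name)
    scored = [(name, float(embed_name(name) @ target)) for name in feature_names]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

features = ["pain intensity at rest", "pain during movement", "favorite color"]
print(sts_select(features, target_name="persistent post-surgical pain"))
```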

Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi

  • paper_url: http://arxiv.org/abs/2308.09862
  • repo_url: None
  • paper_authors: Maithili Sabane, Onkar Litake, Aman Chadha
  • for: developing a Question Answering dataset for low-resource languages Hindi and Marathi
  • methods: novel approach for translating the SQuAD 2.0 dataset into Hindi and Marathi
  • results: releases the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples, along with the best-performing models for both Hindi and Marathi
    Abstract The recent advances in deep-learning have led to the development of highly sophisticated systems with an unquenchable appetite for data. On the other hand, building good deep-learning models for low-resource languages remains a challenging task. This paper focuses on developing a Question Answering dataset for two such languages- Hindi and Marathi. Despite Hindi being the 3rd most spoken language worldwide, with 345 million speakers, and Marathi being the 11th most spoken language globally, with 83.2 million speakers, both languages face limited resources for building efficient Question Answering systems. To tackle the challenge of data scarcity, we have developed a novel approach for translating the SQuAD 2.0 dataset into Hindi and Marathi. We release the largest Question-Answering dataset available for these languages, with each dataset containing 28,000 samples. We evaluate the dataset on various architectures and release the best-performing models for both Hindi and Marathi, which will facilitate further research in these languages. Leveraging similarity tools, our method holds the potential to create datasets in diverse languages, thereby enhancing the understanding of natural language across varied linguistic contexts. Our fine-tuned models, code, and dataset will be made publicly available.
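
One step any such translation pipeline must handle is re-locating answer spans after machine translation; this sketch shows an exact-match-with-fuzzy-fallback approach, which is illustrative rather than the paper's exact similarity-tool method:

```python
import difflib

def align_answer(translated_context, translated_answer):
    """Locate a translated answer span inside a translated context:
    exact match first, then a sliding-window fuzzy fallback."""
    start = translated_context.find(translated_answer)
    if start != -1:
        return start  # exact match survived translation
    n = len(translated_answer)
    best_start, best_ratio = -1, 0.0
    for i in range(len(translated_context) - n + 1):
        ratio = difflib.SequenceMatcher(
            None, translated_context[i:i + n], translated_answer).ratio()
        if ratio > best_ratio:
            best_start, best_ratio = i, ratio
    return best_start if best_ratio > 0.8 else -1  # threshold is an assumption

context = "ताजमहल आगरा में स्थित है।"
print(align_answer(context, "आगरा"))  # index of the answer span, or -1
```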

Black-box Adversarial Attacks against Dense Retrieval Models: A Multi-view Contrastive Learning Method

  • paper_url: http://arxiv.org/abs/2308.09861
  • repo_url: None
  • paper_authors: Yu-An Liu, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, Yixing Fan, Xueqi Cheng
  • for: This paper focuses on the robustness of dense retrieval (DR) models.
  • methods: The paper examines existing adversarial attack methods and proposes a new contrastive learning-based method for attacking DR models.
  • results: Experiments show that the new method attacks DR models effectively, misleading them into returning incorrect results with small, imperceptible text perturbations.
    Abstract Neural ranking models (NRMs) and dense retrieval (DR) models have given rise to substantial improvements in overall retrieval performance. In addition to their effectiveness, and motivated by the proven lack of robustness of deep learning-based approaches in other areas, there is growing interest in the robustness of deep learning-based approaches to the core retrieval problem. Adversarial attack methods that have so far been developed mainly focus on attacking NRMs, with very little attention being paid to the robustness of DR models. In this paper, we introduce the adversarial retrieval attack (AREA) task. The AREA task is meant to trick DR models into retrieving a target document that is outside the initial set of candidate documents retrieved by the DR model in response to a query. We consider the decision-based black-box adversarial setting, which is realistic in real-world search engines. To address the AREA task, we first employ existing adversarial attack methods designed for NRMs. We find that the promising results previously reported for attacking NRMs do not generalize to DR models: these methods underperform a simple term spamming method. We attribute the observed lack of generalizability to the interaction-focused architecture of NRMs, which emphasizes fine-grained relevance matching. DR models follow a different representation-focused architecture that prioritizes coarse-grained representations. We propose to formalize attacks on DR models as a contrastive learning problem in a multi-view representation space. The core idea is to encourage the consistency between each view representation of the target document and its corresponding viewer via view-wise supervision signals. Experimental results demonstrate that the proposed method can significantly outperform existing attack strategies in misleading the DR model with small indiscernible text perturbations.
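
A loose, illustrative reading of the multi-view contrastive formulation, written as a white-box objective over embeddings; the actual attack operates decision-based and black-box against the retrieval system:

```python
import numpy as np

def view_attack_objective(target_views, neg_docs, query, temperature=0.1):
    """Objective an attacker could minimize when perturbing the target
    document: every view (e.g., sentence) of the target should beat the
    competing documents on similarity to the query. Illustrative only."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    views, negs, q = norm(target_views), norm(neg_docs), norm(query)
    pos_sims = views @ q / temperature          # (num_views,)
    neg_sims = negs @ q / temperature           # (num_negs,)
    # InfoNCE per view: each positive contrasted against all negatives
    losses = -pos_sims + np.log(np.exp(pos_sims) + np.exp(neg_sims).sum())
    return float(losses.mean())

rng = np.random.default_rng(1)
print(view_attack_objective(rng.normal(size=(3, 16)),   # 3 views of target doc
                            rng.normal(size=(5, 16)),   # 5 competing docs
                            rng.normal(size=16)))       # query embedding
```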

How susceptible are LLMs to Logical Fallacies?

  • paper_url: http://arxiv.org/abs/2308.09853
  • repo_url: https://github.com/Amir-pyh/LOGICOM
  • paper_authors: Amirreza Payandeh, Dan Pluth, Jordan Hosier, Xuesu Xiao, Vijay K. Gurbani
  • for: This paper investigates the rational thinking capability of large language models (LLMs) in multi-round argumentative debates, focusing on the impact of logical fallacies.
  • methods: The paper presents the Logic Competence Measurement Benchmark (LOGICOM), a diagnostic benchmark for assessing the robustness of LLMs against logical fallacies. LOGICOM involves two agents, a persuader and a debater, engaging in a multi-round debate on a controversial topic, where the persuader tries to convince the debater of the correctness of its claim.
  • results: LLMs can revise their opinions through reasoning, but when confronted with logical fallacies, GPT-3.5 and GPT-4 are misled 41% and 69% more often, respectively, than when logical reasoning is used. The paper also contributes a new dataset of over 5k pairs of logical vs. fallacious arguments, with the source code and dataset publicly released.
    Abstract This paper investigates the rational thinking capability of Large Language Models (LLMs) in multi-round argumentative debates by exploring the impact of fallacious arguments on their logical reasoning performance. More specifically, we present Logic Competence Measurement Benchmark (LOGICOM), a diagnostic benchmark to assess the robustness of LLMs against logical fallacies. LOGICOM involves two agents: a persuader and a debater engaging in a multi-round debate on a controversial topic, where the persuader tries to convince the debater of the correctness of its claim. First, LOGICOM assesses the potential of LLMs to change their opinions through reasoning. Then, it evaluates the debater's performance in logical reasoning by contrasting the scenario where the persuader employs logical fallacies against one where logical reasoning is used. We use this benchmark to evaluate the performance of GPT-3.5 and GPT-4 using a dataset containing controversial topics, claims, and reasons supporting them. Our findings indicate that both GPT-3.5 and GPT-4 can adjust their opinion through reasoning. However, when presented with logical fallacies, GPT-3.5 and GPT-4 are erroneously convinced 41% and 69% more often, respectively, compared to when logical reasoning is used. Finally, we introduce a new dataset containing over 5k pairs of logical vs. fallacious arguments. The source code and dataset of this work are made publicly available.
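
The persuader/debater protocol can be sketched as a short loop; all callables below are stubs standing in for LLM prompts and for LOGICOM's real convincement checks:

```python
def debate(persuader, debater_judge, claim, rounds=3):
    """Run a LOGICOM-style multi-round debate: the persuader argues for
    `claim`, and after each round the debater decides whether it now
    accepts the claim. Returns True if the debater was convinced."""
    history = []
    for _ in range(rounds):
        argument = persuader(claim, history)
        history.append(argument)
        if debater_judge(claim, history):
            return True
    return False

logical = lambda c, h: f"Evidence supports {c}."
fallacious = lambda c, h: f"Everyone already agrees {c}, so you should too."  # bandwagon
# stub debater: yields to repeated pressure regardless of argument quality
judge = lambda c, h: len(h) >= 2

for name, persuader in [("logical", logical), ("fallacious", fallacious)]:
    print(name, "persuader convinced debater:", debate(persuader, judge, "claim X"))
```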

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

  • paper_url: http://arxiv.org/abs/2308.09778
  • repo_url: None
  • paper_authors: Navid Rajabi, Jana Kosecka
  • for: This study assesses the performance of large-scale vision-language models (VLMs) on various visual reasoning tasks, with a focus on understanding spatial relations.
  • methods: The study uses fine-grained compositional grounding to evaluate visual relationship reasoning, and proposes a bottom-up approach for ranking spatial clauses and evaluating model performance on the spatial relationship reasoning task.
  • results: Current vision-language models perform poorly at understanding spatial relations, with a large gap to human performance; the proposed approach reduces this gap and improves model performance.
    Abstract With the advances in large scale vision-and-language models (VLMs) it is of interest to assess their performance on various visual reasoning tasks such as counting, referring expressions and general visual question answering. The focus of this work is to study the ability of these models to understand spatial relations. Previously, this has been tackled using image-text matching (Liu, Emerson, and Collier 2022) or the visual question answering task, both showing poor performance and a large gap compared to human performance. To better understand the gap, we present fine-grained compositional grounding of spatial relationships and propose a bottom-up approach for ranking spatial clauses and evaluating the performance of the spatial relationship reasoning task. We propose to combine the evidence from grounding noun phrases corresponding to objects and their locations to compute the final rank of the spatial clause. We demonstrate the approach on representative vision-language models (Tan and Bansal 2019; Gupta et al. 2022; Kamath et al. 2021) and compare and highlight their abilities to reason about spatial relationships.
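
The bottom-up clause ranking can be illustrated with toy geometry: grounding confidences for the two noun phrases are combined with geometric evidence for the stated relation. The box format and scoring rules here are assumptions for illustration, not the paper's exact formulation:

```python
def spatial_score(box_a, box_b, relation):
    """Score a spatial relation between two boxes given as (x1, y1, x2, y2),
    using box centers; y grows downward, as in image coordinates."""
    cx = lambda b: (b[0] + b[2]) / 2
    cy = lambda b: (b[1] + b[3]) / 2
    if relation == "left of":
        return 1.0 if cx(box_a) < cx(box_b) else 0.0
    if relation == "above":
        return 1.0 if cy(box_a) < cy(box_b) else 0.0
    raise ValueError(f"unsupported relation: {relation}")

def clause_score(conf_a, conf_b, box_a, box_b, relation):
    # combine noun-phrase grounding confidences with geometric evidence
    return conf_a * conf_b * spatial_score(box_a, box_b, relation)

# "the cat (grounded at 0.9) is left of the dog (grounded at 0.8)"
print(clause_score(0.9, 0.8, (10, 40, 50, 90), (70, 35, 120, 95), "left of"))
```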

YORC: Yoruba Reading Comprehension dataset

  • paper_url: http://arxiv.org/abs/2308.09768
  • repo_url: None
  • paper_authors: Anuoluwapo Aremu, Jesujoba O. Alabi, David Ifeoluwa Adelani
  • for: This paper creates a new multiple-choice Yoruba reading comprehension dataset based on Yoruba high-school reading comprehension examinations.
  • methods: The paper performs cross-lingual transfer from the existing English RACE dataset and obtains baseline results with a pre-trained encoder-only model; it also prompts large language models (LLMs) such as GPT-4.
  • results: The paper reports baseline results as well as results from prompting LLMs.
    Abstract In this paper, we create YORC: a new multi-choice Yoruba Reading Comprehension dataset based on Yoruba high-school reading comprehension examinations. We provide baseline results by performing cross-lingual transfer from the existing English RACE dataset using a pre-trained encoder-only model. Additionally, we provide results from prompting large language models (LLMs) like GPT-4.
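
Cross-lingual transfer with an encoder-only model reduces, at inference time, to scoring each option against the passage and question; the hashed encoder and the Yoruba strings below are illustrative stand-ins:

```python
import numpy as np

def encode(text, dim=48):
    """Hashed bag-of-words stand-in for a multilingual encoder-only model
    fine-tuned on English RACE (the cross-lingual transfer setup)."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def answer_mc(passage, question, options):
    """Pick the option whose encoding best matches passage + question.
    Real encoder-only MC models instead score each (passage, question,
    option) triple with a classification head; this is only a sketch."""
    query = encode(passage + " " + question)
    scores = [float(encode(opt) @ query) for opt in options]
    return int(np.argmax(scores))

passage = "Ade ra akara ni oja."  # illustrative Yoruba sentence
print(answer_mc(passage, "Kini Ade ra?", ["akara", "bata", "iwe"]))
```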

OCR Language Models with Custom Vocabularies

  • paper_url: http://arxiv.org/abs/2308.09671
  • repo_url: None
  • paper_authors: Peter Garst, Reeve Ingle, Yasuhisa Fujii
  • for: Improving recognition accuracy on documents from specialized domains.
  • methods: Generating a domain-specific word-based language model, attaching it at run time to a general language model, and decoding with a modified CTC beam search decoder.
  • results: A reduced word error rate on documents from specialized domains.
    Abstract Language models are useful adjuncts to optical models for producing accurate optical character recognition (OCR) results. One factor which limits the power of language models in this context is the existence of many specialized domains with language statistics very different from those implied by a general language model - think of checks, medical prescriptions, and many other specialized document classes. This paper introduces an algorithm for efficiently generating and attaching a domain specific word based language model at run time to a general language model in an OCR system. In order to best use this model the paper also introduces a modified CTC beam search decoder which effectively allows hypotheses to remain in contention based on possible future completion of vocabulary words. The result is a substantial reduction in word error rate in recognizing material from specialized domains.
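
The key decoder modification (keeping hypotheses alive when a partial word could still complete to a vocabulary word) can be sketched with a prefix set; the vocabulary and the keep/prune rule here are illustrative:

```python
def build_prefixes(vocabulary):
    """Precompute every prefix of every domain word so partial words can be
    scored optimistically during decoding."""
    prefixes = set()
    for word in vocabulary:
        for i in range(1, len(word) + 1):
            prefixes.add(word[:i])
    return prefixes

def keep_hypothesis(text, vocabulary, prefixes):
    """Decide whether a beam hypothesis stays in contention, in the spirit
    of the paper's modified CTC beam search: a trailing partial word
    survives if it could still complete to a vocabulary word."""
    last_word = text.split()[-1] if text.split() else ""
    return last_word in vocabulary or last_word in prefixes

vocab = {"amoxicillin", "ibuprofen"}
prefixes = build_prefixes(vocab)
for hyp in ["take amoxi", "take amoxicillin", "take amoz"]:
    print(f"{hyp!r}: keep={keep_hypothesis(hyp, vocab, prefixes)}")
```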

Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment

  • paper_url: http://arxiv.org/abs/2308.09662
  • repo_url: https://github.com/declare-lab/red-instruct
  • paper_authors: Rishabh Bhardwaj, Soujanya Poria
  • for: This paper aims to address the risk of large language models (LLMs) producing harmful outputs and to develop a safety evaluation benchmark for LLMs.
  • methods: The paper proposes a new safety evaluation benchmark called RED-EVAL, which uses a red-teaming approach to test the susceptibility of LLMs to harmful prompts. The authors also propose a method called RED-INSTRUCT for aligning LLMs with safe and helpful responses.
  • results: The paper shows that even widely deployed LLMs are susceptible to harmful prompts, with more than 65% and 73% of harmful queries eliciting unethical responses from GPT-4 and ChatGPT, respectively. The authors also demonstrate the consistency of RED-EVAL across 8 open-source LLMs in generating harmful responses in more than 86% of the red-teaming attempts. Finally, the authors show that their proposed safety alignment method (RED-INSTRUCT) can improve the safety of LLMs while preserving their utility.
    Abstract Large language models (LLMs) have taken the world by storm with their massive multi-tasking capabilities simply by optimizing over a next-word prediction objective. With the emergence of their properties and encoded knowledge, the risk of LLMs producing harmful outputs increases, making them unfit for scalable deployment for the public. In this work, we propose a new safety evaluation benchmark RED-EVAL that carries out red-teaming. We show that even widely deployed models are susceptible to the Chain of Utterances-based (CoU) prompting, jailbreaking closed-source LLM-based systems such as GPT-4 and ChatGPT to unethically respond to more than 65% and 73% of harmful queries. We also demonstrate the consistency of the RED-EVAL across 8 open-source LLMs in generating harmful responses in more than 86% of the red-teaming attempts. Next, we propose RED-INSTRUCT--An approach for the safety alignment of LLMs. It constitutes two phases: 1) HARMFULQA data collection: Leveraging CoU prompting, we collect a dataset that consists of 1.9K harmful questions covering a wide range of topics, 9.5K safe and 7.3K harmful conversations from ChatGPT; 2) SAFE-ALIGN: We demonstrate how the conversational dataset can be used for the safety alignment of LLMs by minimizing the negative log-likelihood over helpful responses and penalizing harmful responses via gradient ascent over the sample loss. Our model STARLING, a fine-tuned Vicuna-7B, is observed to be more safely aligned when evaluated on RED-EVAL and HHH benchmarks while preserving the utility of the baseline models (TruthfulQA, MMLU, and BBH).
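
The SAFE-ALIGN objective described above can be sketched as a two-term loss; the per-token log-probabilities and the penalty weight below are stand-ins for a real LLM forward pass and a tuned hyperparameter:

```python
import numpy as np

def safe_align_loss(helpful_logprobs, harmful_logprobs, penalty=0.1):
    """Combine the two SAFE-ALIGN signals described in the abstract:
    minimize negative log-likelihood on helpful responses and push the
    loss *up* on harmful ones (gradient ascent on their sample loss).
    The penalty weight is an assumed hyperparameter."""
    nll_helpful = -np.mean(helpful_logprobs)   # standard fine-tuning term
    nll_harmful = -np.mean(harmful_logprobs)   # we want this term to grow
    return nll_helpful - penalty * nll_harmful

helpful = np.log([0.6, 0.5, 0.7])   # model already likes the safe answer
harmful = np.log([0.4, 0.3, 0.5])   # probability mass we want to reduce
print(f"training loss: {safe_align_loss(helpful, harmful):.3f}")
```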