cs.CL - 2023-08-06

Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models

  • paper_url: http://arxiv.org/abs/2308.03151
  • repo_url: https://github.com/aaronma2020/Food500-Cap
  • paper_authors: Zheng Ma, Mianzhi Pan, Wenhan Wu, Kanzhi Cheng, Jianbing Zhang, Shujian Huang, Jiajun Chen
  • for: This paper investigates the capabilities of popular vision-language models (VLMs) in a specific domain, the food domain.
  • methods: Uses diverse probing methods, including evaluation under a zero-shot setting, to detect the limitations of VLMs (see the sketch below).
  • results: Experiments show that popular VLMs underperform in the food domain and exhibit bias in their ability to handle food items from different geographic regions.
    Abstract Vision-language models (VLMs) have shown impressive performance in substantial downstream multi-modal tasks. However, only comparing the fine-tuned performance on downstream tasks leads to the poor interpretability of VLMs, which is adverse to their future improvement. Several prior works have identified this issue and used various probing methods under a zero-shot setting to detect VLMs' limitations, but they all examine VLMs using general datasets instead of specialized ones. In practical applications, VLMs are usually applied to specific scenarios, such as e-commerce and news fields, so the generalization of VLMs in specific domains should be given more attention. In this paper, we comprehensively investigate the capabilities of popular VLMs in a specific field, the food domain. To this end, we build a food caption dataset, Food-500 Cap, which contains 24,700 food images with 494 categories. Each image is accompanied by a detailed caption, including fine-grained attributes of food, such as the ingredient, shape, and color. We also provide a culinary culture taxonomy that classifies each food category based on its geographic origin in order to better analyze the performance differences of VLM in different regions. Experiments on our proposed datasets demonstrate that popular VLMs underperform in the food domain compared with their performance in the general domain. Furthermore, our research reveals severe bias in VLMs' ability to handle food items from different geographic regions. We adopt diverse probing methods and evaluate nine VLMs belonging to different architectures to verify the aforementioned observations. We hope that our study will bring researchers' attention to VLM's limitations when applying them to the domain of food or culinary cultures, and spur further investigations to address this issue.
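    The zero-shot probing the summary refers to reduces, in its simplest form, to image-caption matching with a contrastive VLM. Below is a minimal sketch using CLIP through Hugging Face transformers; the checkpoint, image file, and captions are illustrative assumptions, not the paper's exact setup.

```python
# Minimal zero-shot image-caption matching with CLIP (one of several VLM
# families the paper evaluates). Checkpoint, file name, and captions are
# assumptions, not the paper's setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dim_sum.jpg")  # hypothetical Food-500 Cap image
captions = [
    "steamed shrimp dumplings with a thin, translucent wrapper",  # fine-grained attributes
    "a bowl of tomato soup garnished with basil",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity scores
print(logits.softmax(dim=-1))  # probability that each caption matches the image
```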

Towards Multiple References Era – Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation

  • paper_url: http://arxiv.org/abs/2308.03131
  • repo_url: https://github.com/sefazeng/llm-ref
  • paper_authors: Xianfeng Zeng, Yijin Liu, Fandong Meng, Jie Zhou
  • for: Improve the correlation between matching-based evaluation metrics and human evaluations, especially in comparison with neural-based metrics such as BLEURT.
  • methods: Use multiple references to enhance the consistency between matching-based metrics and human evaluations (see the sketch below).
  • results: On the WMT Metrics benchmark, multi-reference F200spBLEU surpasses single-reference F200spBLEU by a 7.2% accuracy improvement and exceeds the neural-based BERTScore by 3.9%. The data leakage issue in LLMs can also be mitigated to a large extent by the multi-reference metric.
    Abstract N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks. However, recent studies have revealed a weak correlation between these matching-based metrics and human evaluations, especially when compared with neural-based metrics like BLEURT. In this paper, we conjecture that the performance bottleneck in matching-based metrics may be caused by the limited diversity of references. To address this issue, we propose to utilize \textit{multiple references} to enhance the consistency between these metrics and human evaluations. Within the WMT Metrics benchmarks, we observe that the multi-references F200spBLEU surpasses the conventional single-reference one by an accuracy improvement of 7.2\%. Remarkably, it also exceeds the neural-based BERTscore by an accuracy enhancement of 3.9\%. Moreover, we observe that the data leakage issue in large language models (LLMs) can be mitigated to a large extent by our multi-reference metric. We release the code and data at \url{https://github.com/SefaZeng/LLM-Ref}
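    For context, multi-reference scoring is natively supported by sacrebleu: each inner list below is one reference set aligned with the hypotheses, and adding sets is what the paper calls the multi-reference setting. A minimal sketch; the "flores200" tokenizer (approximating spBLEU) is assumed to be available in recent sacrebleu releases, while older ones ship "spm"/"flores101" instead.

```python
# Single- vs. multi-reference BLEU with sacrebleu. The sentences are toy
# placeholders; "flores200" tokenization is an assumption about your version.
import sacrebleu

hypotheses = ["The cat sat on the mat."]
references = [
    ["The cat sat on the mat."],        # reference set 1 (one entry per hypothesis)
    ["A cat was sitting on the mat."],  # reference set 2
]

single = sacrebleu.corpus_bleu(hypotheses, references[:1], tokenize="flores200")
multi = sacrebleu.corpus_bleu(hypotheses, references, tokenize="flores200")
print(single.score, multi.score)  # the multi-reference score uses both sets
```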

“Kurosawa”: A Script Writer’s Assistant

  • paper_url: http://arxiv.org/abs/2308.03122
  • repo_url: None
  • paper_authors: Prerak Gandhi, Vishal Pramanik, Pushpak Bhattacharyya
  • for: This paper presents an AI-based script-writing workbench for automatic plot generation and script generation.
  • methods: The workbench fine-tunes GPT-3 on manually annotated plot and scene datasets (see the data-format sketch below).
  • results: After evaluation, the automatically generated plots and scenes were used by scriptwriters at ErosNow, a large and famous media platform.
    Abstract Storytelling is the lifeline of the entertainment industry -- movies, TV shows, and stand-up comedies, all need stories. A good and gripping script is the lifeline of storytelling and demands creativity and resource investment. Good scriptwriters are rare to find and often work under severe time pressure. Consequently, entertainment media are actively looking for automation. In this paper, we present an AI-based script-writing workbench called KUROSAWA which addresses the tasks of plot generation and script generation. Plot generation aims to generate a coherent and creative plot (600-800 words) given a prompt (15-40 words). Script generation, on the other hand, generates a scene (200-500 words) in a screenplay format from a brief description (15-40 words). Kurosawa needs data to train. We use a 4-act structure of storytelling to annotate the plot dataset manually. We create a dataset of 1000 manually annotated plots and their corresponding prompts/storylines and a gold-standard dataset of 1000 scenes with four main elements -- scene headings, action lines, dialogues, and character names -- tagged individually. We fine-tune GPT-3 with the above datasets to generate plots and scenes. These plots and scenes are first evaluated and then used by the scriptwriters of a large and famous media platform ErosNow. We release the annotated datasets and the models trained on these datasets as a working benchmark for automatic movie plot and script generation.
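    As a point of reference, the legacy GPT-3 fine-tuning workflow consumed JSONL records with prompt and completion fields. The sketch below shows hypothetical plot-generation data in that format; the separator, storyline, and 4-act layout are invented placeholders, not the paper's annotations.

```python
# Hypothetical prompt/completion JSONL for legacy GPT-3 fine-tuning.
# The contents are placeholders, not the Kurosawa datasets.
import json

examples = [
    {
        "prompt": "Storyline: A retired detective is pulled back in for one last case.\n\n->",
        "completion": " ACT 1: ... ACT 2: ... ACT 3: ... ACT 4: ...\nEND",
    },
]
with open("plots.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Legacy OpenAI CLI (as it existed for GPT-3 fine-tuning):
#   openai api fine_tunes.create -t plots.jsonl -m davinci
```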

PromptSum: Parameter-Efficient Controllable Abstractive Summarization

  • paper_url: http://arxiv.org/abs/2308.03117
  • repo_url: None
  • paper_authors: Mathieu Ravaut, Hailin Chen, Ruochen Zhao, Chengwei Qin, Shafiq Joty, Nancy Chen
  • for: Achieve strong summarization performance and controllability while remaining parameter- and data-efficient.
  • methods: Combine prompt tuning (PT) with a multi-task objective and discrete entity prompts (see the sketch below).
  • results: Achieves competitive ROUGE results on popular abstractive summarization benchmarks with strong controllability through entities, while tuning several orders of magnitude fewer parameters.
    Abstract Prompt tuning (PT), a parameter-efficient technique that only tunes the additional prompt embeddings while keeping the backbone pre-trained language model (PLM) frozen, has shown promising results in language understanding tasks, especially in low-resource scenarios. However, effective prompt design methods suitable for generation tasks such as summarization are still lacking. At the same time, summarization guided through instructions (discrete prompts) can achieve a desirable double objective of high quality and controllability in summary generation. Towards a goal of strong summarization performance under the triple conditions of parameter-efficiency, data-efficiency, and controllability, we introduce PromptSum, a method combining PT with a multi-task objective and discrete entity prompts for abstractive summarization. Our model achieves competitive ROUGE results on popular abstractive summarization benchmarks coupled with a strong level of controllability through entities, all while only tuning several orders of magnitude less parameters.
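    The PT recipe itself (freeze the PLM, train only soft prompt embeddings) is easy to sketch with the PEFT library. The following is a generic sketch under assumptions, not the authors' implementation: the T5 backbone, prompt length, and bracketed entity-prompt format are all placeholders.

```python
# Generic prompt-tuning sketch with PEFT: the backbone stays frozen and only
# a small matrix of soft prompt embeddings is trained. The entity prompt in
# the input string illustrates controllability; its format is an assumption.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PromptTuningConfig, TaskType, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("t5-base")
backbone = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

config = PromptTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=100,  # soft prompt length (assumed)
)
model = get_peft_model(backbone, config)
model.print_trainable_parameters()  # orders of magnitude fewer than the PLM

source = ("summarize: [ENTITIES] Obama | Hawaii [TEXT] "
          "Barack Obama was born in Honolulu, Hawaii, in 1961 ...")
batch = tokenizer(source, return_tensors="pt")
out = model.generate(**batch, max_new_tokens=60)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```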

Improving Domain-Specific Retrieval by NLI Fine-Tuning

  • paper_url: http://arxiv.org/abs/2308.03103
  • repo_url: None
  • paper_authors: Roman Dušek, Aleksander Wawer, Christopher Galias, Lidia Wojciechowska
  • for: investigate the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking.
  • methods: employ both monolingual and multilingual sentence encoders fine-tuned by a supervised method utilizing contrastive loss and NLI data (see the sketch below).
  • results: NLI fine-tuning increases the performance of the models in both tasks and both languages, with the potential to improve mono- and multilingual models.
    Abstract The aim of this article is to investigate the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking. We demonstrate this for both English and Polish languages, using data from one of the largest Polish e-commerce sites and selected open-domain datasets. We employ both monolingual and multilingual sentence encoders fine-tuned by a supervised method utilizing contrastive loss and NLI data. Our results point to the fact that NLI fine-tuning increases the performance of the models in both tasks and both languages, with the potential to improve mono- and multilingual models. Finally, we investigate uniformity and alignment of the embeddings to explain the effect of NLI-based fine-tuning for an out-of-domain use-case.
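    One standard way to realize "contrastive loss + NLI data" is sentence-transformers with MultipleNegativesRankingLoss over (premise, entailment, contradiction) triplets, where the contradiction serves as a hard negative. A minimal sketch under those assumptions; the checkpoint and examples are placeholders, not the paper's configuration.

```python
# Contrastive NLI fine-tuning of a sentence encoder (generic recipe, not the
# authors' exact setup). Each InputExample is (anchor, positive, hard negative).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed base

train_examples = [
    InputExample(texts=[
        "A man is playing a guitar.",        # premise (anchor)
        "A person is making music.",         # entailment -> positive
        "Nobody is playing an instrument.",  # contradiction -> hard negative
    ]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch contrastive loss

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```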

LARCH: Large Language Model-based Automatic Readme Creation with Heuristics

  • paper_url: http://arxiv.org/abs/2308.03099
  • repo_url: https://github.com/hitachi-nlp/larch
  • paper_authors: Yuta Koreeda, Terufumi Morishita, Osamu Imaichi, Yasuhiro Sogawa
  • for: This paper aims to demonstrate that large language models (LLMs) can generate coherent and factually correct readmes for software development projects, and introduces LARCH (LLM-based Automatic Readme Creation with Heuristics), which leverages representative code identification with heuristics and weak supervision.
  • methods: The authors use a dataset of 100 open-source projects to train and evaluate LARCH, compare its performance with a baseline that does not rely on representative code identification, and assess the quality of the generated readmes with human and automated evaluations (see the sketch below).
  • results: LARCH generates coherent and factually correct readmes in the majority of cases, outperforming the baseline in terms of readability, accuracy, and completeness. A demo video showcasing LARCH's capabilities is available at https://youtu.be/ZUKkh5ED-O4.
    Abstract Writing a readme is a crucial aspect of software development as it plays a vital role in managing and reusing program code. Though it is a pain point for many developers, automatically creating one remains a challenge even with the recent advancements in large language models (LLMs), because it requires generating an abstract description from thousands of lines of code. In this demo paper, we show that LLMs are capable of generating a coherent and factually correct readmes if we can identify a code fragment that is representative of the repository. Building upon this finding, we developed LARCH (LLM-based Automatic Readme Creation with Heuristics) which leverages representative code identification with heuristics and weak supervision. Through human and automated evaluations, we illustrate that LARCH can generate coherent and factually correct readmes in the majority of cases, outperforming a baseline that does not rely on representative code identification. We have made LARCH open-source and provided a cross-platform Visual Studio Code interface and command-line interface, accessible at https://github.com/hitachi-nlp/larch. A demo video showcasing LARCH's capabilities is available at https://youtu.be/ZUKkh5ED-O4.
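    The released tool implements its own heuristics; purely as an illustration of the core idea (pick a representative fragment, then prompt an LLM), here is a hypothetical sketch with a crude stand-in heuristic.

```python
# Hypothetical sketch of the representative-code idea; the heuristic below is
# a crude stand-in, not LARCH's actual identification method.
from pathlib import Path

def representative_file(repo: Path) -> Path:
    """Prefer common entry points; otherwise fall back to the largest .py file."""
    for name in ("main.py", "cli.py", "app.py"):
        if (repo / name).exists():
            return repo / name
    return max(repo.rglob("*.py"), key=lambda p: p.stat().st_size)

code = representative_file(Path("my_repo")).read_text()
prompt = ("Write a README with overview, installation, and usage sections "
          f"for the project whose representative code is:\n\n{code[:4000]}")
# `prompt` would then be sent to an LLM of your choice to draft the readme.
```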

System-Initiated Transitions from Chit-Chat to Task-Oriented Dialogues with Transition Info Extractor and Transition Sentence Generator

  • paper_url: http://arxiv.org/abs/2308.03098
  • repo_url: None
  • paper_authors: Ye Liu, Stefan Ultes, Wolfgang Minker, Wolfgang Maier
  • for: investigate how a unified dialogue model can take the initiative during the dialogue mode transition from chit-chat to task-oriented in a coherent and cooperative manner.
  • methods: built a transition information extractor (TIE) and a transition sentence generator (TSG) through efficient Adapter tuning and transition prompt learning (see the CRF sketch below).
  • results: achieved promising performance regarding the proactive transitions and improved the TIE model by utilizing Conditional Random Fields (CRF). The TSG can flexibly generate transition sentences while maintaining the unified capabilities of normal chit-chat and task-oriented response generation.
    Abstract In this work, we study dialogue scenarios that start from chit-chat but eventually switch to task-related services, and investigate how a unified dialogue model, which can engage in both chit-chat and task-oriented dialogues, takes the initiative during the dialogue mode transition from chit-chat to task-oriented in a coherent and cooperative manner. We firstly build a {transition info extractor} (TIE) that keeps track of the preceding chit-chat interaction and detects the potential user intention to switch to a task-oriented service. Meanwhile, in the unified model, a {transition sentence generator} (TSG) is extended through efficient Adapter tuning and transition prompt learning. When the TIE successfully finds task-related information from the preceding chit-chat, such as a transition domain, then the TSG is activated automatically in the unified model to initiate this transition by generating a transition sentence under the guidance of transition information extracted by TIE. The experimental results show promising performance regarding the proactive transitions. We achieve an additional large improvement on TIE model by utilizing Conditional Random Fields (CRF). The TSG can flexibly generate transition sentences while maintaining the unified capabilities of normal chit-chat and task-oriented response generation.
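    The CRF improvement mentioned in the results maps onto a standard token-tagging head: encoder emissions scored by a CRF layer. A sketch under assumptions (BERT encoder, the pytorch-crf package); it mirrors the general recipe, not the paper's exact TIE architecture.

```python
# CRF tagging head for transition-info extraction (generic recipe; assumes
# the pytorch-crf package). Training returns the CRF negative log-likelihood;
# inference returns the best tag path per sentence.
import torch
from torch import nn
from torchcrf import CRF
from transformers import AutoModel

class TransitionInfoExtractor(nn.Module):
    def __init__(self, num_tags: int, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.proj = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        emissions = self.proj(hidden)      # per-token tag scores
        mask = attention_mask.bool()
        if tags is not None:               # training
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference
```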

TARJAMAT: Evaluation of Bard and ChatGPT on Machine Translation of Ten Arabic Varieties

  • paper_url: http://arxiv.org/abs/2308.03051
  • repo_url: None
  • paper_authors: Karima Kadaoui, Samar M. Magdy, Abdul Waheed, Md Tawkat Islam Khondaker, Ahmed Oumar El-Shangiti, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed
  • for: This study assesses the machine translation proficiency of large language models such as Google Bard and OpenAI ChatGPT across ten varieties of Arabic.
  • methods: The evaluation covers the models' machine translation of Classical Arabic, Modern Standard Arabic, and several dialectal variants, and includes a human-centric study of how well Bard follows human instructions during translation tasks.
  • results: The LLMs struggle with certain Arabic dialects, particularly those with minimal public data such as the Algerian and Mauritanian dialects, while performing satisfactorily on more prevalent dialects, though occasionally trailing commercial systems like Google Translate. Bard shows only a limited ability to follow human instructions in translation contexts; overall, current LLMs remain insufficiently inclusive of the linguistic and cultural particularities of diverse communities.
    Abstract Large language models (LLMs) finetuned to follow human instructions have recently emerged as a breakthrough in AI. Models such as Google Bard and OpenAI ChatGPT, for example, are surprisingly powerful tools for question answering, code debugging, and dialogue generation. Despite the purported multilingual proficiency of these models, their linguistic inclusivity remains insufficiently explored. Considering this constraint, we present a thorough assessment of Bard and ChatGPT (encompassing both GPT-3.5 and GPT-4) regarding their machine translation proficiencies across ten varieties of Arabic. Our evaluation covers diverse Arabic varieties such as Classical Arabic, Modern Standard Arabic, and several nuanced dialectal variants. Furthermore, we undertake a human-centric study to scrutinize the efficacy of the most recent model, Bard, in following human instructions during translation tasks. Our exhaustive analysis indicates that LLMs may encounter challenges with certain Arabic dialects, particularly those for which minimal public data exists, such as Algerian and Mauritanian dialects. However, they exhibit satisfactory performance with more prevalent dialects, albeit occasionally trailing behind established commercial systems like Google Translate. Additionally, our analysis reveals a circumscribed capability of Bard in aligning with human instructions in translation contexts. Collectively, our findings underscore that prevailing LLMs remain far from inclusive, with only limited ability to cater for the linguistic and cultural intricacies of diverse communities.

3D-EX: A Unified Dataset of Definitions and Dictionary Examples

  • paper_url: http://arxiv.org/abs/2308.03043
  • repo_url: https://github.com/f-almeman/3d-ex
  • paper_authors: Fatemah Almeman, Hadi Sheikhi, Luis Espinosa-Anke
  • for: This paper provides a centralized knowledge repository that combines well-known English lexical resources for use in NLP tasks.
  • methods: Unifies the resources as <term, definition, example> triples and provides a unified evaluation framework with carefully pre-computed train/validation/test splits to prevent memorization (see the sketch below).
  • results: Experimental results suggest that the dataset can be effectively leveraged in downstream NLP tasks.
    Abstract Definitions are a fundamental building block in lexicography, linguistics and computational semantics. In NLP, they have been used for retrofitting word embeddings or augmenting contextual representations in language models. However, lexical resources containing definitions exhibit a wide range of properties, which has implications in the behaviour of models trained and evaluated on them. In this paper, we introduce 3D-EX, a dataset that aims to fill this gap by combining well-known English resources into one centralized knowledge repository in the form of <term, definition, example> triples. 3D-EX is a unified evaluation framework with carefully pre-computed train/validation/test splits to prevent memorization. We report experimental results that suggest that this dataset could be effectively leveraged in downstream NLP tasks. Code and data are available at https://github.com/F-Almeman/3D-EX.
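    To make the triple format concrete, here is a hypothetical sketch of the kind of record 3D-EX unifies; the field names and the split constraint in the closing comment are assumptions, so check the repository for the actual schema.

```python
# Illustrative shape of a unified <term, definition, example> record; field
# names are assumptions, not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class Triple:
    term: str
    definition: str
    example: str
    source: str  # the lexical resource the entry came from

record = Triple(
    term="serendipity",
    definition="the occurrence of events by chance in a happy or beneficial way",
    example="A fortunate stroke of serendipity brought the two of them together.",
    source="wordnet",  # hypothetical source label
)
# The pre-computed train/validation/test splits are designed to prevent
# memorization, e.g. by keeping a term inside a single split (assumed detail).
```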

Multi-Source (Pre-)Training for Cross-Domain Measurement, Unit and Context Extraction

  • paper_url: http://arxiv.org/abs/2308.02951
  • repo_url: https://github.com/liy140/multidomain-measextract-corpus
  • paper_authors: Yueling Li, Sebastian Martschat, Simone Paolo Ponzetto
  • for: This work develops a cross-domain approach for automated measurement, unit, and context extraction based on pre-trained language models.
  • methods: Constructs a multi-source, multi-domain corpus, trains an end-to-end extraction pipeline on it, and applies multi-source task-adaptive pre-training and fine-tuning to benchmark the model's cross-domain generalization (see the sketch below); a task-specific error analysis derives insights for future work.
  • results: Multi-source training yields the best overall results, while single-source training yields the best results for the respective individual domain. The setup succeeds at extracting quantity values and units, but further research is needed to improve the extraction of contextual entities. The cross-domain corpus is released online.
    Abstract We present a cross-domain approach for automated measurement and context extraction based on pre-trained language models. We construct a multi-source, multi-domain corpus and train an end-to-end extraction pipeline. We then apply multi-source task-adaptive pre-training and fine-tuning to benchmark the cross-domain generalization capability of our model. Further, we conceptualize and apply a task-specific error analysis and derive insights for future work. Our results suggest that multi-source training leads to the best overall results, while single-source training yields the best results for the respective individual domain. While our setup is successful at extracting quantity values and units, more research is needed to improve the extraction of contextual entities. We make the cross-domain corpus used in this work available online.
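    Task-adaptive pre-training is typically realized as continued masked-LM training on the target corpora before task fine-tuning. A generic Hugging Face sketch under assumptions (encoder checkpoint and per-domain text files are placeholders), not the authors' configuration.

```python
# Generic task-adaptive (continued) MLM pre-training on a multi-source corpus;
# afterwards the encoder would be fine-tuned on the extraction task.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# One text file per source domain (hypothetical file names).
corpus = load_dataset("text", data_files={"train": ["domain_a.txt", "domain_b.txt"]})
tokenized = corpus.map(lambda b: tokenizer(b["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tapt", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```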

Towards Consistency Filtering-Free Unsupervised Learning for Dense Retrieval

  • paper_url: http://arxiv.org/abs/2308.02926
  • repo_url: https://github.com/Haoxiang-WasedaU/Towards-Consistency-Filtering-Free-Unsupervised-Learning-for-Dense-Retrieval
  • paper_authors: Haoxiang Shi, Sumio Fujita, Tetsuya Sakai
  • for: This study addresses the domain-transfer challenge in modern neural information retrieval (IR).
  • methods: Replaces the computationally expensive consistency filter with direct pseudo-labeling, pseudo relevance feedback, or unsupervised keyword generation to achieve filtering-free unsupervised dense retrieval (see the sketch below).
  • results: TextRank-based pseudo relevance feedback outperforms the other methods on average, and the filtering-free paradigm continuously improves training and inference efficiency while maintaining retrieval performance.
    Abstract Domain transfer is a prevalent challenge in modern neural Information Retrieval (IR). To overcome this problem, previous research has utilized domain-specific manual annotations and synthetic data produced by consistency filtering to finetune a general ranker and produce a domain-specific ranker. However, training such consistency filters are computationally expensive, which significantly reduces the model efficiency. In addition, consistency filtering often struggles to identify retrieval intentions and recognize query and corpus distributions in a target domain. In this study, we evaluate a more efficient solution: replacing the consistency filter with either direct pseudo-labeling, pseudo-relevance feedback, or unsupervised keyword generation methods for achieving consistent filtering-free unsupervised dense retrieval. Our extensive experimental evaluations demonstrate that, on average, TextRank-based pseudo relevance feedback outperforms other methods. Furthermore, we analyzed the training and inference efficiency of the proposed paradigm. The results indicate that filtering-free unsupervised learning can continuously improve training and inference efficiency while maintaining retrieval performance. In some cases, it can even improve performance based on particular datasets.
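    The filtering-free idea can be sketched end to end with off-the-shelf pieces: TextRank keywords as pseudo-queries and BM25 top hits as pseudo-positives, kept without any consistency filter. The libraries (summa, rank_bm25) and the pairing scheme are stand-in assumptions, not the paper's pipeline.

```python
# Filtering-free pseudo-labeling sketch: TextRank keywords form a pseudo-query
# for each document; BM25's top hit becomes the pseudo-positive, with no
# consistency filter applied. Libraries and pairing are stand-in assumptions.
from rank_bm25 import BM25Okapi
from summa import keywords

corpus = [
    "Dense retrieval maps queries and documents into a shared vector space. "
    "Relevance is then measured by vector similarity between the two.",
    "BM25 remains a strong lexical baseline for ad-hoc document retrieval. "
    "It scores documents by term frequency and inverse document frequency.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

training_pairs = []
for doc in corpus:
    query = keywords.keywords(doc, words=3).replace("\n", " ")  # TextRank keywords
    top_doc = bm25.get_top_n(query.lower().split(), corpus, n=1)[0]
    training_pairs.append((query, top_doc))  # kept as-is: no consistency filtering

print(training_pairs)  # (pseudo-query, pseudo-positive) pairs for dense training
```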