cs.CL - 2023-08-02

Careful Whisper – leveraging advances in automatic speech recognition for robust and interpretable aphasia subtype classification

  • paper_url: http://arxiv.org/abs/2308.01327
  • repo_url: None
  • paper_authors: Laurin Wagner, Mario Zusag, Theresa Bloder
  • for: This work aims to provide a fully automated method for identifying speech anomalies from voice recordings, to aid in the assessment of speech impairments.
  • methods: The method combines Connectionist Temporal Classification (CTC) with encoder-decoder automatic speech recognition models to generate rich acoustic and clean transcripts, then applies several natural language processing techniques to extract features from these transcripts and build prototypes of healthy speech.
  • results: Basic distance measures from these prototypes feed standard machine learning classifiers, which separate recordings of people with aphasia from a healthy control group with human-level accuracy and distinguish the most common aphasia types with 90% accuracy.
    Abstract This paper presents a fully automated approach for identifying speech anomalies from voice recordings to aid in the assessment of speech impairments. By combining Connectionist Temporal Classification (CTC) and encoder-decoder-based automatic speech recognition models, we generate rich acoustic and clean transcripts. We then apply several natural language processing methods to extract features from these transcripts to produce prototypes of healthy speech. Basic distance measures from these prototypes serve as input features for standard machine learning classifiers, yielding human-level accuracy for the distinction between recordings of people with aphasia and a healthy control group. Furthermore, the most frequently occurring aphasia types can be distinguished with 90% accuracy. The pipeline is directly applicable to other diseases and languages, showing promise for robustly extracting diagnostic speech biomarkers.
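
A minimal sketch of the prototype-distance classification stage described in the abstract, assuming transcript feature vectors have already been extracted; the feature dimensions, toy data, and choice of distances are illustrative, not the paper's code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def prototype(features: np.ndarray) -> np.ndarray:
    """Mean feature vector over healthy-control transcripts."""
    return features.mean(axis=0)

def distance_features(features: np.ndarray, proto: np.ndarray) -> np.ndarray:
    """Basic distances to the healthy prototype, used as classifier inputs."""
    l2 = np.linalg.norm(features - proto, axis=1, keepdims=True)
    cos = (features @ proto) / (
        np.linalg.norm(features, axis=1) * np.linalg.norm(proto) + 1e-9
    )
    return np.hstack([l2, cos[:, None]])

# Toy data: 100 transcripts x 20 linguistic features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

proto = prototype(X[y == 0])  # healthy speakers as label 0
clf = LogisticRegression().fit(distance_features(X, proto), y)
```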

Grounded Image Text Matching with Mismatched Relation Reasoning

  • paper_url: http://arxiv.org/abs/2308.01236
  • repo_url: None
  • paper_authors: Yu Wu, Yana Wei, Haozhe Wang, Yongfei Liu, Sibei Yang, Xuming He
  • for: This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task for evaluating the relation understanding of transformer-based pre-trained models.
  • methods: The GITM-MR task requires a model to first determine whether an expression describes an image, then localize the referred objects or ground the mismatched parts of the text. A benchmark is provided for evaluating pre-trained models, focusing on the challenging settings of limited data and out-of-distribution sentence lengths.
  • results: The evaluation shows that pre-trained models lack data efficiency and length generalization in these settings. To address this, the authors propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which adds relation-aware reasoning via bi-directional message propagation guided by language structure, and performs strongly in both length generalization and data efficiency.
    Abstract This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. GITM-MR requires a model to first determine if an expression describes an image, then localize referred objects or ground the mismatched parts of the text. We provide a benchmark for evaluating pre-trained models on this task, with a focus on the challenging settings of limited data and out-of-distribution sentence lengths. Our evaluation demonstrates that pre-trained models lack data efficiency and length generalization ability. To address this, we propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates relation-aware reasoning via bi-directional message propagation guided by language structure. RCRN can be interpreted as a modular program and delivers strong performance in both length generalization and data efficiency.
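
A hedged sketch of bi-directional message propagation guided by language structure: node states are updated along dependency arcs in both directions. The update rule, random weights, and toy parse are illustrative; RCRN itself is more elaborate:

```python
import numpy as np

def propagate(h: np.ndarray, edges: list[tuple[int, int]], steps: int = 2):
    """h: (num_words, dim) node states; edges: dependency arcs (head, dep)."""
    d = h.shape[1]
    W_fwd = np.random.default_rng(0).normal(scale=0.1, size=(d, d))
    W_bwd = np.random.default_rng(1).normal(scale=0.1, size=(d, d))
    for _ in range(steps):
        msg = np.zeros_like(h)
        for head, dep in edges:
            msg[dep] += h[head] @ W_fwd   # head -> dependent message
            msg[head] += h[dep] @ W_bwd   # dependent -> head message
        h = np.tanh(h + msg)              # residual update
    return h

h = np.random.default_rng(2).normal(size=(5, 16))  # 5-word expression
deps = [(1, 0), (1, 3), (3, 2), (3, 4)]            # toy parse arcs
h_out = propagate(h, deps)
```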

Global Hierarchical Neural Networks using Hierarchical Softmax

  • paper_url: http://arxiv.org/abs/2308.01210
  • repo_url: https://github.com/jschuurmans/hsoftmax
  • paper_authors: Jetze Schuurmans, Flavius Frasincar
  • for: This paper presents a global hierarchical classification framework based on hierarchical softmax, applicable to any classification task with a natural hierarchy among classes.
  • methods: The paper uses hierarchical softmax to build a global hierarchical classifier and evaluates it on four text classification datasets.
  • results: On all four datasets, hierarchical softmax improved macro-F1 and macro-recall over the flat softmax baseline; on three of the four it also achieved higher micro-accuracy and macro-precision.
    Abstract This paper presents a framework in which hierarchical softmax is used to create a global hierarchical classifier. The approach is applicable for any classification task where there is a natural hierarchy among classes. We show empirical results on four text classification datasets. In all datasets the hierarchical softmax improved on the regular softmax used in a flat classifier in terms of macro-F1 and macro-recall. In three out of four datasets hierarchical softmax achieved a higher micro-accuracy and macro-precision.
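
A minimal sketch of hierarchical softmax over a two-level class tree: the probability of a leaf class is the product of the softmax probabilities along its root-to-leaf path. The tree layout and sizes here are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy hierarchy: 2 parent groups, each with 3 leaf classes.
children = {0: [0, 1, 2], 1: [3, 4, 5]}

def hierarchical_probs(x, W_parent, W_child):
    """P(leaf) = P(parent | x) * P(leaf | parent, x)."""
    p_parent = softmax(W_parent @ x)              # (2,)
    probs = np.zeros(6)
    for parent, leaves in children.items():
        p_leaf = softmax(W_child[parent] @ x)     # (3,)
        for i, leaf in enumerate(leaves):
            probs[leaf] = p_parent[parent] * p_leaf[i]
    return probs                                  # sums to 1

rng = np.random.default_rng(0)
x = rng.normal(size=8)
probs = hierarchical_probs(x, rng.normal(size=(2, 8)), rng.normal(size=(2, 3, 8)))
assert np.isclose(probs.sum(), 1.0)
```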

ADS-Cap: A Framework for Accurate and Diverse Stylized Captioning with Unpaired Stylistic Corpora

  • paper_url: http://arxiv.org/abs/2308.01143
  • repo_url: https://github.com/njucckevin/ads-cap
  • paper_authors: Kanzhi Cheng, Zheng Ma, Shi Zong, Jianbing Zhang, Xinyu Dai, Jiajun Chen
  • for: This paper proposes a novel framework for generating Accurate and Diverse Stylized Captions (ADS-Cap).
  • methods: ADS-Cap uses a contrastive learning module to align image and text features, and a conditional variational auto-encoder to automatically memorize diverse stylistic patterns in latent space. A simple yet effective recheck module further boosts style accuracy.
  • results: Experiments on two widely used stylized image captioning datasets show that ADS-Cap strikes an outstanding balance among consistency with the image, style accuracy, and diversity.
    Abstract Generating visually grounded image captions with specific linguistic styles using unpaired stylistic corpora is a challenging task, especially since we expect stylized captions with a wide variety of stylistic patterns. In this paper, we propose a novel framework to generate Accurate and Diverse Stylized Captions (ADS-Cap). Our ADS-Cap first uses a contrastive learning module to align the image and text features, which unifies paired factual and unpaired stylistic corpora during the training process. A conditional variational auto-encoder is then used to automatically memorize diverse stylistic patterns in latent space and enhance diversity through sampling. We also design a simple but effective recheck module to boost style accuracy by filtering style-specific captions. Experimental results on two widely used stylized image captioning datasets show that regarding consistency with the image, style accuracy and diversity, ADS-Cap achieves outstanding performances compared to various baselines. We finally conduct extensive analyses to understand the effectiveness of our method. Our code is available at https://github.com/njucckevin/ADS-Cap.
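
A sketch of the recheck idea: sample several candidates from the decoder, then keep only the captions a style classifier accepts. `sample_caption` and `style_score` are hypothetical stand-ins for the paper's CVAE decoder and style classifier:

```python
import random

def sample_caption(image, style: str) -> str:
    # Placeholder: a CVAE decoder would sample a latent style code here.
    return random.choice([
        "a dog runs on the beach, happy as can be",
        "a dog runs on the beach",
    ])

def style_score(caption: str, style: str) -> float:
    # Placeholder style classifier: returns P(caption matches style).
    return 0.9 if "happy" in caption else 0.1

def recheck_generate(image, style: str, n: int = 10, threshold: float = 0.5):
    candidates = [sample_caption(image, style) for _ in range(n)]
    return [c for c in candidates if style_score(c, style) >= threshold]

print(recheck_generate(image=None, style="romantic"))
```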

Beyond Generic: Enhancing Image Captioning with Real-World Knowledge using Vision-Language Pre-Training Model

  • paper_url: http://arxiv.org/abs/2308.01126
  • repo_url: https://github.com/njucckevin/knowcap
  • paper_authors: Kanzhi Cheng, Wenpo Song, Zheng Ma, Wenhao Zhu, Zixuan Zhu, Jianbing Zhang
  • for: This paper aims to improve the ability of current captioning approaches to generate descriptions that incorporate real-world knowledge, such as named entities and contextual information.
  • methods: The proposed method, called Knowledge-guided Replay (K-Replay), consists of two parts: a knowledge prediction task on automatically collected replay exemplars to continuously awaken the VLP model’s memory about knowledge, and a knowledge distillation constraint to improve the faithfulness of generated descriptions.
  • results: The approach effectively incorporates knowledge into descriptions, outperforming a strong VLP baseline by 20.9 points (78.7->99.6) in CIDEr score and 20.5 percentage points (34.0%->54.5%) in knowledge recognition accuracy.
    Abstract Current captioning approaches tend to generate correct but "generic" descriptions that lack real-world knowledge, e.g., named entities and contextual information. Considering that Vision-Language Pre-Training (VLP) models master massive such knowledge from large-scale web-harvested data, it is promising to utilize the generalizability of VLP models to incorporate knowledge into image descriptions. However, using VLP models faces challenges: zero-shot inference suffers from knowledge hallucination that leads to low-quality descriptions, but the generic bias in downstream task fine-tuning hinders the VLP model from expressing knowledge. To address these concerns, we propose a simple yet effective method called Knowledge-guided Replay (K-Replay), which enables the retention of pre-training knowledge during fine-tuning. Our approach consists of two parts: (1) a knowledge prediction task on automatically collected replay exemplars to continuously awaken the VLP model's memory about knowledge, thus preventing the model from collapsing into the generic pattern; (2) a knowledge distillation constraint to improve the faithfulness of generated descriptions hence alleviating the knowledge hallucination. To evaluate knowledge-enhanced descriptions, we construct a novel captioning benchmark KnowCap, containing knowledge of landmarks, famous brands, special foods and movie characters. Experimental results show that our approach effectively incorporates knowledge into descriptions, outperforming strong VLP baseline by 20.9 points (78.7->99.6) in CIDEr score and 20.5 percentage points (34.0%->54.5%) in knowledge recognition accuracy. Our code and data is available at https://github.com/njucckevin/KnowCap.
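
A schematic of a K-Replay-style fine-tuning objective: the captioning loss on the downstream batch, plus a knowledge-prediction loss on replay exemplars and a distillation term against the frozen pre-trained model. The loss weights and use of token-level cross-entropy/KL are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def k_replay_loss(student_logits, labels, replay_logits, replay_labels,
                  teacher_logits, lambda_know=1.0, lambda_distill=0.5):
    """All logits are (batch, seq, vocab); labels are token ids (batch, seq)."""
    # Standard fine-tuning loss on the downstream captioning batch.
    loss_caption = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten())
    # Knowledge-prediction loss on replay exemplars keeps knowledge "awake".
    loss_know = F.cross_entropy(replay_logits.flatten(0, 1), replay_labels.flatten())
    # Distillation against the frozen VLP teacher curbs knowledge hallucination.
    loss_distill = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return loss_caption + lambda_know * loss_know + lambda_distill * loss_distill

B, S, V = 2, 5, 100
loss = k_replay_loss(torch.randn(B, S, V), torch.randint(0, V, (B, S)),
                     torch.randn(B, S, V), torch.randint(0, V, (B, S)),
                     teacher_logits=torch.randn(B, S, V))
```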

MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching

  • paper_url: http://arxiv.org/abs/2308.01927
  • repo_url: https://github.com/zju-daily/multiem
  • paper_authors: Xiaocan Zeng, Pengfei Wang, Yuren Mao, Lu Chen, Xiaoze Liu, Yunjun Gao
  • for: This paper targets unsupervised entity matching (EM) across multiple tables, a common need in real-world data management systems.
  • methods: It proposes MultiEM, a new unsupervised multi-table entity matching method built as a parallelizable pipeline of enhanced entity representation, table-wise hierarchical merging, and density-based pruning.
  • results: Extensive experiments on six real-world benchmark datasets demonstrate the superiority of MultiEM in both effectiveness and efficiency.
    Abstract Entity Matching (EM), which aims to identify all entity pairs referring to the same real-world entity from relational tables, is one of the most important tasks in real-world data management systems. Due to the labeling process of EM being extremely labor-intensive, unsupervised EM is more applicable than supervised EM in practical scenarios. Traditional unsupervised EM assumes that all entities come from two tables; however, it is more common to match entities from multiple tables in practical applications, that is, multi-table entity matching (multi-table EM). Unfortunately, effective and efficient unsupervised multi-table EM remains under-explored. To fill this gap, this paper formally studies the problem of unsupervised multi-table entity matching and proposes an effective and efficient solution, termed as MultiEM. MultiEM is a parallelable pipeline of enhanced entity representation, table-wise hierarchical merging, and density-based pruning. Extensive experimental results on six real-world benchmark datasets demonstrate the superiority of MultiEM in terms of effectiveness and efficiency.
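
A sketch of the density-based pruning step: after merging, drop candidates that fall in low-density regions of the embedding space. The kNN density proxy and quantile threshold are illustrative choices, not the paper's exact rule:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def density_prune(embeddings: np.ndarray, k: int = 5, quantile: float = 0.2):
    """Keep entities whose mean distance to their k neighbours is small (dense)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    dists, _ = nn.kneighbors(embeddings)
    density = -dists[:, 1:].mean(axis=1)      # exclude the self-distance column
    cutoff = np.quantile(density, quantile)
    return np.where(density >= cutoff)[0]     # indices that survive pruning

emb = np.random.default_rng(0).normal(size=(200, 32))
kept = density_prune(emb)
```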

Leveraging Few-Shot Data Augmentation and Waterfall Prompting for Response Generation

  • paper_url: http://arxiv.org/abs/2308.01080
  • repo_url: None
  • paper_authors: Lea Krause, Selene Báez Santamaría, Michiel van der Meer, Urja Khurana
  • for: This work studies task-oriented conversational modelling using subjective knowledge, with a particular emphasis on response generation.
  • methods: The methodology is shaped by an extensive data analysis of key factors in the provided dataset, such as response length, sentiment, and dialogue acts. Few-shot learning augments the data with newly generated subjective knowledge items, and three approaches are presented for DSTC11: (1) task-specific model exploration, (2) incorporation of the most frequent question into all generated responses, and (3) a waterfall prompting technique combining GPT-3 and ChatGPT.
  • results: The experiments indicate that the waterfall prompting technique improves dialogue model performance and supports efficient dialogue modelling across the studied settings.
    Abstract This paper discusses our approaches for task-oriented conversational modelling using subjective knowledge, with a particular emphasis on response generation. Our methodology was shaped by an extensive data analysis that evaluated key factors such as response length, sentiment, and dialogue acts present in the provided dataset. We used few-shot learning to augment the data with newly generated subjective knowledge items and present three approaches for DSTC11: (1) task-specific model exploration, (2) incorporation of the most frequent question into all generated responses, and (3) a waterfall prompting technique using a combination of both GPT-3 and ChatGPT.
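
A hedged sketch of the waterfall idea: one model drafts a response, a second refines it against the dialogue context. `call_gpt3` and `call_chatgpt` are hypothetical wrappers left unimplemented (fill in your completion and chat APIs), and the prompts are invented for illustration:

```python
def call_gpt3(prompt: str) -> str:
    raise NotImplementedError("wrap your completion API here")

def call_chatgpt(prompt: str) -> str:
    raise NotImplementedError("wrap your chat API here")

def waterfall_respond(context: str, knowledge: list[str]) -> str:
    # Stage 1: draft a response grounded in the subjective knowledge items.
    draft = call_gpt3(
        f"Dialogue so far:\n{context}\n"
        "Relevant subjective knowledge:\n- " + "\n- ".join(knowledge) +
        "\nDraft a helpful response:"
    )
    # Stage 2: refine the draft with a second model.
    return call_chatgpt(
        "Refine this draft so it answers the user's last turn, stays "
        f"faithful to the knowledge, and is concise:\n{draft}"
    )
```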

Industrial Memories: Exploring the Findings of Government Inquiries with Neural Word Embedding and Machine Learning

  • paper_url: http://arxiv.org/abs/2308.02556
  • repo_url: None
  • paper_authors: Susan Leavy, Emilie Pine, Mark T Keane
  • for: To support the exploration of large volumes of text detailing the findings of government inquiries.
  • methods: Uses word embeddings, text classification, and visualisation to build an interactive web-based platform that helps users explore the text and uncover new historical insights.
  • results: By transforming the findings of the Irish government's inquiry into industrial schools, the authors built an interactive web-based platform that enables exploration of the text and surfaces new historical insights.
    Abstract We present a text mining system to support the exploration of large volumes of text detailing the findings of government inquiries. Despite their historical significance and potential societal impact, key findings of inquiries are often hidden within lengthy documents and remain inaccessible to the general public. We transform the findings of the Irish government's inquiry into industrial schools and through the use of word embedding, text classification and visualisation, present an interactive web-based platform that enables the exploration of the text to uncover new historical insights.
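
An illustrative sketch of the word-embedding exploration step: train embeddings on inquiry text, then query for terms used in similar contexts. The tiny corpus below is a stand-in for the tokenized inquiry documents:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "school", "records", "describe", "daily", "routines"],
    ["witness", "testimony", "describes", "conditions", "in", "the", "school"],
]  # in practice: tokenized sentences from the inquiry documents

model = Word2Vec(sentences=sentences, vector_size=50, window=5, min_count=1)
print(model.wv.most_similar("school", topn=5))  # terms in similar contexts
```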

SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis

  • paper_url: http://arxiv.org/abs/2308.01018
  • repo_url: None
  • paper_authors: Ramanan Sivaguru, Vasista Sai Lodagala, S Umesh
  • for: To improve the quality of FastSpeech2-synthesized speech.
  • methods: Leverages representations from Self-Supervised Learning (SSL) models to enhance the quality of FastSpeech2-synthesized speech.
  • results: Both objective and subjective evaluation measures show improved synthesized speech quality over the baseline FastSpeech2.
    Abstract While FastSpeech2 aims to integrate aspects of speech such as pitch, energy, and duration as conditional inputs, it still leaves scope for richer representations. As a part of this work, we leverage representations from various Self-Supervised Learning (SSL) models to enhance the quality of the synthesized speech. In particular, we pass the FastSpeech2 encoder's length-regulated outputs through a series of encoder layers with the objective of reconstructing the SSL representations. In the SALTTS-parallel implementation, the representations from this second encoder are used for an auxiliary reconstruction loss with the SSL features. The SALTTS-cascade implementation, however, passes these representations through the decoder in addition to having the reconstruction loss. The richness of speech characteristics from the SSL features reflects in the output speech quality, with the objective and subjective evaluation measures of the proposed approach outperforming the baseline FastSpeech2.
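
A sketch of the SALTTS-parallel objective: pass the length-regulated FastSpeech2 encoder outputs through a few extra encoder layers and add an auxiliary reconstruction loss against SSL features. The layer sizes and the choice of L1 loss are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SSLReconstructionHead(nn.Module):
    def __init__(self, d_model=256, d_ssl=768, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(d_model, d_ssl)

    def forward(self, length_regulated):       # (B, T, d_model)
        return self.proj(self.encoder(length_regulated))

head = SSLReconstructionHead()
enc_out = torch.randn(2, 120, 256)             # FastSpeech2 encoder outputs
ssl_target = torch.randn(2, 120, 768)          # e.g. frame-aligned SSL features
loss_aux = nn.functional.l1_loss(head(enc_out), ssl_target)
# total loss = FastSpeech2 losses + alpha * loss_aux (alpha: tuned weight)
```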

DiactTOD: Learning Generalizable Latent Dialogue Acts for Controllable Task-Oriented Dialogue Systems

  • paper_url: http://arxiv.org/abs/2308.00878
  • repo_url: None
  • paper_authors: Qingyang Wu, James Gung, Raphael Shu, Yi Zhang
  • for: This paper aims to improve response generation quality in task-oriented dialogue systems through dialogue act annotations.
  • methods: It presents a novel end-to-end latent dialogue act model (DiactTOD) that represents dialogue acts in a latent space and uses these latent representations to predict and control dialogue acts for controllable response generation, in a zero-shot fashion.
  • results: The approach achieves state-of-the-art performance across a wide range of experimental settings on the MultiWOZ dataset, including zero-shot, few-shot, and full-data fine-tuning in both end-to-end and policy optimization configurations.
    Abstract Dialogue act annotations are important to improve response generation quality in task-oriented dialogue systems. However, it can be challenging to use dialogue acts to control response generation in a generalizable way because different datasets and tasks may have incompatible annotations. While alternative methods that utilize latent action spaces or reinforcement learning do not require explicit annotations, they may lack interpretability or face difficulties defining task-specific rewards. In this work, we present a novel end-to-end latent dialogue act model (DiactTOD) that represents dialogue acts in a latent space. DiactTOD, when pre-trained on a large corpus, is able to predict and control dialogue acts to generate controllable responses using these latent representations in a zero-shot fashion. Our approach demonstrates state-of-the-art performance across a wide range of experimental settings on the MultiWOZ dataset, including zero-shot, few-shot, and full data fine-tuning with both end-to-end and policy optimization configurations.
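
A schematic of the latent dialogue-act idea: encode the context into a latent act, then condition generation on it so the act can be swapped to control the output. Module shapes and the additive conditioning are illustrative; the paper pre-trains this end to end on a large corpus:

```python
import torch
import torch.nn as nn

class LatentActModel(nn.Module):
    def __init__(self, d_model=256, n_acts=32):
        super().__init__()
        self.act_logits = nn.Linear(d_model, n_acts)    # predict a latent act
        self.act_embed = nn.Embedding(n_acts, d_model)  # act -> conditioning vector

    def forward(self, context_repr, forced_act=None):   # context_repr: (B, d)
        logits = self.act_logits(context_repr)
        act = logits.argmax(-1) if forced_act is None else forced_act
        return context_repr + self.act_embed(act)       # fed to the decoder

model = LatentActModel()
ctx = torch.randn(4, 256)
conditioned = model(ctx)                                # model-predicted acts
controlled = model(ctx, forced_act=torch.full((4,), 7)) # externally chosen act
```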

Proceedings Modalities in substructural logics: Applications at the interfaces of logic, language and computation

  • paper_url: http://arxiv.org/abs/2308.03679
  • repo_url: None
  • paper_authors: Michael Moortgat, Mehrnoosh Sadrzadeh
  • for: This volume concerns substructural logics, which call into question the implicit structural rules taken for granted in classical logic, and the use of modalities to control the management of logical resources.
  • methods: Modalities in the substructural setting provide the tools to control and fine-tune logical resource management, with applications to natural language syntax and semantics, among other areas.
  • results: The workshop highlights applications in areas of interest to the ESSLLI community, in particular logical approaches to natural language syntax and semantics and the dynamics of reasoning.
    Abstract By calling into question the implicit structural rules that are taken for granted in classical logic, substructural logics have brought to the fore new forms of reasoning with applications in many interdisciplinary areas of interest. Modalities, in the substructural setting, provide the tools to control and finetune the logical resource management. The focus of the workshop is on applications in the areas of interest to the ESSLLI community, in particular logical approaches to natural language syntax and semantics and the dynamics of reasoning. The workshop is held with the support of the Horizon 2020 MSCA-Rise project MOSAIC .

Aspect based sentimental analysis for travellers’ reviews

  • paper_url: http://arxiv.org/abs/2308.02548
  • repo_url: None
  • paper_authors: Mohammed Saad M Alaydaa, Jun Li, Karl Jinkins
  • for: This study aims to give airport management a more fine-grained way to evaluate service quality, so as to better understand travellers' needs and opinions.
  • methods: It applies aspect-based sentiment analysis to traveller reviews of Dubai and Doha airports collected from Google Maps, producing more detailed assessments of service quality.
  • results: The results show that aspect-based sentiment analysis yields more fine-grained assessments, helping airport management understand travellers better and pinpoint the airport services most in need of improvement.
    Abstract Airport service quality evaluation is commonly found on social media, including Google Maps. This is valuable for airport management seeking to enhance the quality of the services provided. However, prior studies either give a general review of the topics discussed by travellers or attach a sentiment value to the entire review without specifically identifying the airport service behind that value. Accordingly, this work proposes using aspect-based sentiment analysis to provide a more detailed analysis of travellers' reviews. The work applies aspect-based sentiment analysis to data collected from Google Maps about the Dubai and Doha airports. The results provide tangible reasons to use aspect-based sentiment analysis to better understand travellers and spot airport services in need of improvement.
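
A minimal sketch of aspect-based sentiment analysis: score each (review, aspect) pair instead of the whole review. The paper does not specify its model; here an off-the-shelf NLI-style zero-shot classifier stands in, and the aspects and review are invented:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

review = "Security was fast but the food court was overpriced and dirty."
aspects = ["security", "food", "cleanliness"]

for aspect in aspects:
    result = classifier(review,
                        candidate_labels=[f"{aspect} is good", f"{aspect} is bad"])
    # The top-ranked label gives a per-aspect polarity for this review.
    print(aspect, result["labels"][0], round(result["scores"][0], 2))
```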

GRDD: A Dataset for Greek Dialectal NLP

  • paper_url: http://arxiv.org/abs/2308.00802
  • repo_url: https://github.com/stergioscha/greek_dialect_corpus
  • paper_authors: Stergios Chatzikyriakidis, Chatrine Qwaider, Ilias Kolokousis, Christina Koula, Dimitris Papadakis, Efthymia Sakellariou
  • for: This paper is written for the purpose of creating a large-scale dataset for the computational study of Modern Greek dialects, and to perform dialect identification using machine learning (ML) algorithms and deep learning (DL) architectures.
  • methods: The paper uses a dataset of raw text data from four Modern Greek dialects (Cretan, Pontic, Northern Greek, and Cypriot Greek) to perform dialect identification. The authors experiment with traditional ML algorithms and simple DL architectures to achieve good performance on the task.
  • results: The results of the paper show very good performance on the task of dialect identification, with the top performing algorithms achieving high accuracy. However, error analysis reveals that insufficient dataset cleaning is a major source of errors.
    Abstract In this paper, we present a dataset for the computational study of a number of Modern Greek dialects. It consists of raw text data from four dialects of Modern Greek: Cretan, Pontic, Northern Greek and Cypriot Greek. The dataset is of considerable size, albeit imbalanced, and presents the first attempt to create large-scale dialectal resources of this type for Modern Greek dialects. We then use the dataset to perform dialect identification. We experiment with traditional ML algorithms, as well as simple DL architectures. The results show very good performance on the task, potentially revealing that the dialects in question have distinct enough characteristics allowing even simple ML models to perform well on the task. Error analysis is performed for the top-performing algorithms, showing that in a number of cases the errors are due to insufficient dataset cleaning.
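
A sketch of a traditional-ML baseline for dialect identification: character n-gram TF-IDF features with a linear classifier. The toy strings below are placeholders for GRDD text, and the hyperparameters are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["...Cretan text...", "...Pontic text...",
         "...Cypriot text...", "...Northern Greek text..."]  # raw dialectal sentences
labels = ["cretan", "pontic", "cypriot", "northern"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)
print(model.predict(["...unseen sentence..."]))
```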

Self-Supervised Contrastive BERT Fine-tuning for Fusion-based Reviewed-Item Retrieval

  • paper_url: http://arxiv.org/abs/2308.00762
  • repo_url: https://github.com/d3mlab/rir_data
  • paper_authors: Mohammad Mahdi Abdollah Pour, Parsa Farinneya, Armin Toroghi, Anton Korikov, Ali Pesaranghader, Touqir Sajed, Manasa Bharadwaj, Borislav Mavrin, Scott Sanner
  • for: This paper addresses Reviewed-Item Retrieval (RIR), where neural Information Retrieval (IR) methods must match complex natural language queries to items by aggregating (or fusing) query-review scores into item-level scores for ranking.
  • methods: In the absence of labeled RIR datasets, it extends neural IR to RIR via self-supervised contrastive learning of BERT embeddings for queries and reviews. Positive samples are drawn from reviews of the same item and/or with the same rating; hard positives are the least similar reviews from the same anchor item, and hard negatives are the most similar reviews from different items. Anchor sub-sampling and meta-data augmentation are also explored.
  • results: Late Fusion contrastive learning for neural RIR outperforms all other contrastive IR configurations, neural IR, and sparse retrieval baselines, demonstrating the value of exploiting the two-level item-review structure and of preserving the nuance of individual review content via Late Fusion.
    Abstract As natural language interfaces enable users to express increasingly complex natural language queries, there is a parallel explosion of user review content that can allow users to better find items such as restaurants, books, or movies that match these expressive queries. While Neural Information Retrieval (IR) methods have provided state-of-the-art results for matching queries to documents, they have not been extended to the task of Reviewed-Item Retrieval (RIR), where query-review scores must be aggregated (or fused) into item-level scores for ranking. In the absence of labeled RIR datasets, we extend Neural IR methodology to RIR by leveraging self-supervised methods for contrastive learning of BERT embeddings for both queries and reviews. Specifically, contrastive learning requires a choice of positive and negative samples, where the unique two-level structure of our item-review data combined with meta-data affords us a rich structure for the selection of these samples. For contrastive learning in a Late Fusion scenario, we investigate the use of positive review samples from the same item and/or with the same rating, selection of hard positive samples by choosing the least similar reviews from the same anchor item, and selection of hard negative samples by choosing the most similar reviews from different items. We also explore anchor sub-sampling and augmenting with meta-data. For a more end-to-end Early Fusion approach, we introduce contrastive item embedding learning to fuse reviews into single item embeddings. Experimental results show that Late Fusion contrastive learning for Neural RIR outperforms all other contrastive IR configurations, Neural IR, and sparse retrieval baselines, thus demonstrating the power of exploiting the two-level structure in Neural RIR approaches as well as the importance of preserving the nuance of individual review content via Late Fusion methods.
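
A sketch of Late Fusion for Reviewed-Item Retrieval: score the query against each review embedding, then fuse per-review scores into one item score. Cosine scoring and mean fusion are illustrative; the paper studies several fusion choices:

```python
import numpy as np

def item_scores(query_emb: np.ndarray,
                reviews_by_item: dict[str, np.ndarray]) -> dict[str, float]:
    scores = {}
    for item, review_embs in reviews_by_item.items():  # (n_reviews, d)
        sims = review_embs @ query_emb / (
            np.linalg.norm(review_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9
        )
        scores[item] = float(sims.mean())              # the late-fusion step
    return scores

rng = np.random.default_rng(0)
q = rng.normal(size=64)                                # e.g. a BERT query embedding
items = {"restaurant_a": rng.normal(size=(8, 64)),
         "restaurant_b": rng.normal(size=(5, 64))}
ranked = item_scores(q, items)
print(max(ranked, key=ranked.get))                     # top-ranked item
```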

The Bias Amplification Paradox in Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2308.00755
  • repo_url: https://github.com/preethiseshadri518/bias-amplification-paradox
  • paper_authors: Preethi Seshadri, Sameer Singh, Yanai Elazar
  • for: This paper studies bias amplification in the text-to-image domain by comparing gender ratios in training versus generated images.
  • methods: Using Stable Diffusion, the authors find that the model appears to amplify the gender-occupation biases present in its training data (LAION).
  • results: The amplification can largely be attributed to distribution shifts between training captions and generation prompts; for instance, training captions often contain explicit gender information while the prompts do not, which shifts the distribution and affects bias measures. Once these distributional differences are accounted for, amplification decreases considerably. The findings illustrate the challenges of comparing biases between models and their training data, and highlight confounding factors behind bias amplification.
    Abstract Bias amplification is a phenomenon in which models increase imbalances present in the training data. In this paper, we study bias amplification in the text-to-image domain using Stable Diffusion by comparing gender ratios in training vs. generated images. We find that the model appears to amplify gender-occupation biases found in the training data (LAION). However, we discover that amplification can largely be attributed to discrepancies between training captions and model prompts. For example, an inherent difference is that captions from the training data often contain explicit gender information while the prompts we use do not, which leads to a distribution shift and consequently impacts bias measures. Once we account for various distributional differences between texts used for training and generation, we observe that amplification decreases considerably. Our findings illustrate the challenges of comparing biases in models and the data they are trained on, and highlight confounding factors that contribute to bias amplification.
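
The core measurement, sketched: compare the female ratio for an occupation in training captions against generated images. The counts below are made up to show the arithmetic, not the paper's results:

```python
def amplification(train_female: int, train_total: int,
                  gen_female: int, gen_total: int) -> float:
    train_ratio = train_female / train_total
    gen_ratio = gen_female / gen_total
    return gen_ratio - train_ratio        # > 0 means the imbalance grew

# e.g. an occupation 70% female in training captions, 95% in generated images
print(amplification(70, 100, 95, 100))    # 0.25 -> amplified
```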

CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code

  • paper_url: http://arxiv.org/abs/2308.00683
  • repo_url: None
  • paper_authors: Nadezhda Chirkova, Sergey Troshin
  • for: This work investigates how different subtokenization options affect large language models pretrained on source code, aiming to identify the most effective and length-efficient subtokenizations, taking code specifics into account.
  • methods: Building on large-scale language model pretraining for source code, the study proposes a subtokenization that reduces average sequence length by 17% with no drop in downstream performance.
  • results: A carefully chosen subtokenization may improve quality by 0.5-2%, possibly at the cost of some length increase.
    Abstract Recent works have widely adopted large language model pretraining for source code, suggested source code-specific pretraining objectives and investigated the applicability of various Transformer-based language model architectures for source code. This work investigates another important aspect of such models, namely the effect of different subtokenization options, and aims at identifying most effective and length-efficient subtokenizations, taking into account code specifics. We propose subtokenziation that reduces average length by 17% without downstream performance drop, and show that a carefully chosen subtokenization may improve quality by 0.5-2%, possibly with some length increase.
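
A sketch of the length-efficiency comparison: measure the average tokenized length of code snippets under different subtokenizers. The tokenizer names form a hypothetical comparison pair; the paper trains its own BPE variants:

```python
from transformers import AutoTokenizer

snippets = ["def add(a, b):\n    return a + b",
            "for i in range(10): print(i)"]

def avg_len(tokenizer_name: str) -> float:
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    return sum(len(tok.tokenize(s)) for s in snippets) / len(snippets)

for name in ["gpt2", "microsoft/codebert-base"]:  # illustrative pair of tokenizers
    print(name, avg_len(name))                    # lower is more length-efficient
```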