cs.CL - 2023-10-21

Structural generalization in COGS: Supertagging is (almost) all you need

  • paper_url: http://arxiv.org/abs/2310.14124
  • repo_url: https://github.com/alban-petit/semantic-supertag-parser
  • paper_authors: Alban Petit, Caio Corro, François Yvon
  • for: Improving the ability of neural networks to generalize to out-of-distribution examples, in particular compositional generalization in semantic parsing.
  • methods: Extends a graph-based semantic parsing framework in several ways: a supertagging step with valency constraints expressed as an integer linear program, a reduction of the graph prediction problem to maximum matching, and an incremental early-stopping training strategy to prevent overfitting.
  • results: Experiments show that the approach significantly improves results on the COGS dataset, especially on examples that require structural generalization.
    Abstract In many Natural Language Processing applications, neural networks have been found to fail to generalize on out-of-distribution examples. In particular, several recent semantic parsing datasets have put forward important limitations of neural networks in cases where compositional generalization is required. In this work, we extend a neural graph-based semantic parsing framework in several ways to alleviate this issue. Notably, we propose: (1) the introduction of a supertagging step with valency constraints, expressed as an integer linear program; (2) a reduction of the graph prediction problem to the maximum matching problem; (3) the design of an incremental early-stopping training strategy to prevent overfitting. Experimentally, our approach significantly improves results on examples that require structural generalization in the COGS dataset, a known challenging benchmark for compositional generalization. Overall, our results confirm that structural constraints are important for generalization in semantic parsing.
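A minimal sketch of the maximum-matching idea named in the abstract (contribution 2): once a scorer assigns compatibility scores between predicate argument slots and candidate nodes, graph prediction can be solved as a maximum-weight bipartite matching. The slots, candidates, and score matrix below are made-up illustrations, not the paper's model, and the supertagging ILP is not reproduced.

```python
# Illustrative sketch: argument attachment as maximum-weight bipartite matching.
# The score matrix is made up; in the actual system it comes from a neural scorer.
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: predicate argument slots, columns: candidate argument nodes.
slots = ["agent(eat)", "theme(eat)"]
candidates = ["cat", "cake", "table"]
scores = np.array([
    [4.0, 0.5, 0.2],   # "cat" is the most plausible agent
    [0.1, 3.5, 1.0],   # "cake" is the most plausible theme
])

# Maximize total score == minimize negated scores.
row_idx, col_idx = linear_sum_assignment(-scores)
for r, c in zip(row_idx, col_idx):
    print(f"{slots[r]} -> {candidates[c]} (score={scores[r, c]:.1f})")
```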

Finite-context Indexing of Restricted Output Space for NLP Models Facing Noisy Input

  • paper_url: http://arxiv.org/abs/2310.14110
  • repo_url: https://github.com/mnhng/firo
  • paper_authors: Minh Nguyen, Nancy F. Chen
  • for: Boosting NLP model performance on noisy inputs without sacrificing performance on clean inputs.
  • methods: FiRo uses finite-context aggregation to obtain contextual embeddings, which are then used to infer each token's noise-free form within a restricted output space (a small cluster of probable candidates).
  • results: NLP models equipped with FiRo outperform baselines on six classification tasks and one sequence labeling task at various degrees of noise.
    Abstract NLP models excel on tasks with clean inputs, but are less accurate with noisy inputs. In particular, character-level noise such as human-written typos and adversarially-engineered realistic-looking misspellings often appears in text and can easily trip up NLP models. Prior solutions to address character-level noise often alter the content of the inputs (low fidelity), thus inadvertently lowering model accuracy on clean inputs. We proposed FiRo, an approach to boost NLP model performance on noisy inputs without sacrificing performance on clean inputs. FiRo sanitizes the input text while preserving its fidelity by inferring the noise-free form for each token in the input. FiRo uses finite-context aggregation to obtain contextual embeddings which is then used to find the noise-free form within a restricted output space. The output space is restricted to a small cluster of probable candidates in order to predict the noise-free tokens more accurately. Although the clusters are small, FiRo's effective vocabulary (union of all clusters) can be scaled up to better preserve the input content. Experimental results show NLP models that use FiRo outperforming baselines on six classification tasks and one sequence labeling task at various degrees of noise.
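As a rough illustration of restricting the output space to a small candidate cluster, one can score each candidate against a contextual embedding of the noisy token and pick the best match. Everything below (the cluster, the random embeddings, the scoring) is a toy stand-in; FiRo's actual finite-context aggregation is not reproduced.

```python
# Toy sketch of prediction within a restricted output space: the noise-free form
# of a noisy token is chosen only among a small cluster of probable candidates.
import numpy as np

rng = np.random.default_rng(0)
vocab_clusters = {
    "langauge": ["language", "languages", "luggage"],  # hypothetical cluster
}
embed = {w: rng.normal(size=16) for ws in vocab_clusters.values() for w in ws}

def denoise(noisy_token: str, context_embedding: np.ndarray) -> str:
    """Return the highest-scoring candidate from the token's restricted cluster."""
    candidates = vocab_clusters.get(noisy_token, [noisy_token])
    scores = {c: float(context_embedding @ embed.get(c, np.zeros(16))) for c in candidates}
    return max(scores, key=scores.get)

print(denoise("langauge", rng.normal(size=16)))
```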

Leveraging Knowledge Graphs for Orphan Entity Allocation in Resume Processing

  • paper_url: http://arxiv.org/abs/2310.14093
  • repo_url: None
  • paper_authors: Aagam Bakliwal, Shubham Manish Gandhi, Yashodhara Haribhakta
  • for: automate and enhance the efficiency of the job screening process
  • methods: association mining, concept extraction, external knowledge linking, named entity recognition, and knowledge graph construction
  • results: successful bucketing of orphan entities within resumes, more effective candidate-job matching, and improved resume screening process accuracy.
    Abstract Significant challenges are posed in talent acquisition and recruitment by processing and analyzing unstructured data, particularly resumes. This research presents a novel approach for orphan entity allocation in resume processing using knowledge graphs. Techniques of association mining, concept extraction, external knowledge linking, named entity recognition, and knowledge graph construction are integrated into our pipeline. By leveraging these techniques, the aim is to automate and enhance the efficiency of the job screening process by successfully bucketing orphan entities within resumes. This allows for more effective matching between candidates and job positions, streamlining the resume screening process, and enhancing the accuracy of candidate-job matching. The approach's exceptional effectiveness and resilience are highlighted through extensive experimentation and evaluation, ensuring that alternative measures can be relied upon for seamless processing and orphan entity allocation in case of any component failure. The capabilities of knowledge graphs in generating valuable insights through intelligent information extraction and representation, specifically in the domain of categorizing orphan entities, are highlighted by the results of our research.

MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation

  • paper_url: http://arxiv.org/abs/2310.14088
  • repo_url: None
  • paper_authors: Zexue He, Yu Wang, An Yan, Yao Liu, Eric Y. Chang, Amilcare Gentili, Julian McAuley, Chun-Nan Hsu
  • for: To provide a multi-level, multi-task, and multi-domain medical benchmark that facilitates the development of language models for healthcare.
  • methods: The benchmark draws on data from several healthcare systems, spanning 35 human body regions and 8 examination modalities, with 22,779 collected sentences and 21,228 reports; expert annotations are provided at multiple levels, enabling granular use of the data and supporting a wide range of tasks.
  • results: A systematic evaluation of 10 generic and domain-specific language models (from domain-adapted healthcare baselines to general-purpose large language models such as ChatGPT) under zero-shot and fine-tuning settings shows varying effectiveness across tasks and highlights the importance of instruction tuning for few-shot use of large language models.
    Abstract Curated datasets for healthcare are often limited due to the need of human annotations from experts. In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. MedEval is comprehensive and consists of data from several healthcare systems and spans 35 human body regions from 8 examination modalities. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering a granular potential usage of the data and supporting a wide range of tasks. Moreover, we systematically evaluated 10 generic and domain-specific language models under zero-shot and finetuning settings, from domain-adapted baselines in healthcare to general-purposed state-of-the-art large language models (e.g., ChatGPT). Our evaluations reveal varying effectiveness of the two categories of language models across different tasks, from which we notice the importance of instruction tuning for few-shot usage of large language models. Our investigation paves the way toward benchmarking language models for healthcare and provides valuable insights into the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future advancements.

Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain

  • paper_url: http://arxiv.org/abs/2310.14053
  • repo_url: https://github.com/marcusm117/IdentityChain
  • paper_authors: Marcus J. Min, Yangruibo Ding, Luca Buratti, Saurabh Pujar, Gail Kaiser, Suman Jana, Baishakhi Ray
  • for: Evaluating the trustworthiness of Code Large Language Models (Code LLMs).
  • methods: Proposes IdentityChain, a framework that effectively and efficiently evaluates a model's self-consistency and general accuracy at the same time.
  • results: A study of eleven Code LLMs shows that they fail to preserve self-consistency, and IdentityChain can also serve as a debugging tool that exposes three major weaknesses of current models.
    Abstract Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the general accuracy of Code LLMs on individual tasks has been extensively evaluated, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which effectively and efficiently evaluates the self-consistency and general accuracy of a model at the same time. We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from general accuracy. Furthermore, we show that IdentityChain can be used as a model debugging tool to expose weaknesses of Code LLMs by demonstrating three major weaknesses that we identify in current models using IdentityChain. Our code is available at https://github.com/marcusm117/IdentityChain.
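A minimal sketch of one self-consistency round in the spirit of IdentityChain: code is summarized into a natural-language specification, code is regenerated from that specification, and behaviour is compared on a few inputs. The `llm` function below is a hard-coded placeholder for the model calls, and the equivalence check is a simple behavioural test rather than the paper's exact procedure.

```python
# Sketch of a code -> spec -> code self-consistency check with a placeholder LLM.
def llm(prompt: str) -> str:
    # Placeholder: in practice this would query a Code LLM.
    if prompt.startswith("Describe"):
        return "Return the square of the integer x."
    return "def f(x):\n    return x * x\n"

original = "def f(x):\n    return x ** 2\n"
spec = llm("Describe what this function does:\n" + original)
regenerated = llm("Write a Python function for this spec:\n" + spec)

ns_a, ns_b = {}, {}
exec(original, ns_a)
exec(regenerated, ns_b)
# Behavioural equivalence check on a few inputs stands in for semantic equivalence.
consistent = all(ns_a["f"](x) == ns_b["f"](x) for x in range(-3, 4))
print("self-consistent:", consistent)
```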

Code-Switching with Word Senses for Pretraining in Neural Machine Translation

  • paper_url: http://arxiv.org/abs/2310.14050
  • repo_url: None
  • paper_authors: Vivek Iyer, Edoardo Barba, Alexandra Birch, Jeff Z. Pan, Roberto Navigli
  • for: Addressing lexical ambiguity in Neural Machine Translation (NMT) by injecting word sense information into the pretraining stage.
  • methods: Proposes Word Sense Pretraining for Neural Machine Translation (WSP-NMT), an end-to-end approach that pretrains multilingual NMT models using word sense-specific information from Knowledge Bases.
  • results: Experiments show significant improvements in overall translation quality, robustness across challenging data and resource-scarce scenarios, and fine-grained accuracy gains on the DiBiMT disambiguation benchmark.
    Abstract Lexical ambiguity is a significant and pervasive challenge in Neural Machine Translation (NMT), with many state-of-the-art (SOTA) NMT systems struggling to handle polysemous words (Campolungo et al., 2022). The same holds for the NMT pretraining paradigm of denoising synthetic "code-switched" text (Pan et al., 2021; Iyer et al., 2023), where word senses are ignored in the noising stage -- leading to harmful sense biases in the pretraining data that are subsequently inherited by the resulting models. In this work, we introduce Word Sense Pretraining for Neural Machine Translation (WSP-NMT) - an end-to-end approach for pretraining multilingual NMT models leveraging word sense-specific information from Knowledge Bases. Our experiments show significant improvements in overall translation quality. Then, we show the robustness of our approach to scale to various challenging data and resource-scarce scenarios and, finally, report fine-grained accuracy improvements on the DiBiMT disambiguation benchmark. Our studies yield interesting and novel insights into the merits and challenges of integrating word sense information and structured knowledge in multilingual pretraining for NMT.
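To make the idea of sense-aware "code-switched" noising concrete, here is a toy sketch in which a word is replaced by a translation chosen for its disambiguated sense rather than an arbitrary one. The tiny sense lexicon and the one-line disambiguation heuristic are invented stand-ins for the knowledge-base lookup and WSD step described in the abstract.

```python
# Toy sketch of sense-aware code-switching for pretraining data.
import random

sense_lexicon = {
    ("bank", "finance"): "banque",   # French translation of the money sense
    ("bank", "river"): "rive",       # French translation of the riverside sense
}

def toy_wsd(word: str, sentence: str) -> str:
    """Crude disambiguation stand-in based on co-occurring words."""
    return "river" if "water" in sentence or "fish" in sentence else "finance"

def sense_code_switch(sentence: str, p: float = 0.5) -> str:
    out = []
    for w in sentence.split():
        sense = toy_wsd(w, sentence)
        translation = sense_lexicon.get((w, sense))
        out.append(translation if translation and random.random() < p else w)
    return " ".join(out)

random.seed(0)
print(sense_code_switch("they fish near the bank of the river"))
```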

MeaeQ: Mount Model Extraction Attacks with Efficient Queries

  • paper_url: http://arxiv.org/abs/2310.14047
  • repo_url: https://github.com/c-w-d/meaeq
  • paper_authors: Chengwei Dai, Minxuan Lv, Kun Li, Wei Zhou
  • for: Addressing model extraction attacks in natural language processing (NLP) and proposing a method for stealing victim models with low query costs.
  • methods: A zero-shot sequence inference classifier, combined with API service information, filters task-relevant data from a public text corpus; a clustering-based data reduction technique then selects representative data as queries for the attack.
  • results: Extensive experiments on four benchmark datasets show higher functional similarity to the victim model than baselines while requiring fewer queries.
    Abstract We study model extraction attacks in natural language processing (NLP) where attackers aim to steal victim models by repeatedly querying the open Application Programming Interfaces (APIs). Recent works focus on limited-query budget settings and adopt random sampling or active learning-based sampling strategies on publicly available, unannotated data sources. However, these methods often result in selected queries that lack task relevance and data diversity, leading to limited success in achieving satisfactory results with low query costs. In this paper, we propose MeaeQ (Model extraction attack with efficient Queries), a straightforward yet effective method to address these issues. Specifically, we initially utilize a zero-shot sequence inference classifier, combined with API service information, to filter task-relevant data from a public text corpus instead of a problem domain-specific dataset. Furthermore, we employ a clustering-based data reduction technique to obtain representative data as queries for the attack. Extensive experiments conducted on four benchmark datasets demonstrate that MeaeQ achieves higher functional similarity to the victim model than baselines while requiring fewer queries. Our code is available at https://github.com/C-W-D/MeaeQ.
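The clustering-based data reduction step can be illustrated with a short sketch: embed the task-relevant sentences, cluster them, and keep the sentence closest to each centroid as an attack query. TF-IDF vectors and KMeans below are generic stand-ins, not necessarily the representations or clustering used in MeaeQ.

```python
# Sketch of clustering-based query reduction: one representative sentence per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the movie was wonderful", "a truly great film", "i loved this picture",
    "the plot was boring", "a dull and tedious movie", "i fell asleep halfway",
]
X = TfidfVectorizer().fit_transform(corpus).toarray()

k = 2  # query budget
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
queries = []
for c in range(k):
    members = np.where(km.labels_ == c)[0]
    dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    queries.append(corpus[members[np.argmin(dists)]])
print(queries)  # representative queries to send to the victim API
```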

Tree Prompting: Efficient Task Adaptation without Fine-Tuning

  • paper_url: http://arxiv.org/abs/2310.14034
  • repo_url: https://github.com/csinva/treeprompt
  • paper_authors: John X. Morris, Chandan Singh, Alexander M. Rush, Jianfeng Gao, Yuntian Deng
  • for: Making prompting of smaller language models (LMs) competitive with gradient-based fine-tuning for new tasks.
  • methods: Tree Prompting builds a decision tree of prompts, linking multiple LM calls together; at inference time, each LM call is determined by efficiently routing the outcome of the previous call through the tree.
  • results: Experiments on classification datasets show that Tree Prompting improves accuracy over competing methods and is competitive with fine-tuning; variants of the method also allow inspection of a model's decision-making process.
    Abstract Prompting language models (LMs) is the main interface for applying them to new tasks. However, for smaller LMs, prompting provides low accuracy compared to gradient-based finetuning. Tree Prompting is an approach to prompting which builds a decision tree of prompts, linking multiple LM calls together to solve a task. At inference time, each call to the LM is determined by efficiently routing the outcome of the previous call using the tree. Experiments on classification datasets show that Tree Prompting improves accuracy over competing methods and is competitive with fine-tuning. We also show that variants of Tree Prompting allow inspection of a model's decision-making process.
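A minimal sketch of routing through a decision tree of prompts: the binary outcome of one LM call decides which prompt is issued next, and leaves carry labels. The `call_lm` keyword matcher and the hand-built tree are placeholders; the actual method learns the tree and prompts from data.

```python
# Sketch of inference-time routing through a decision tree of prompts.
def call_lm(prompt: str, text: str) -> bool:
    # Stand-in for an LM call returning a yes/no verbalizer outcome.
    keywords = {"Does the text mention food?": ("pizza", "cake"),
                "Is the sentiment positive?": ("great", "love"),
                "Is it about service?": ("waiter", "staff")}
    return any(k in text for k in keywords[prompt])

tree = {
    "prompt": "Does the text mention food?",
    True:  {"prompt": "Is the sentiment positive?", True: "pos-food", False: "neg-food"},
    False: {"prompt": "Is it about service?", True: "service", False: "other"},
}

def predict(text: str, node=tree) -> str:
    while isinstance(node, dict):
        node = node[call_lm(node["prompt"], text)]
    return node

print(predict("the pizza was great"))   # -> pos-food
print(predict("the waiter was rude"))   # -> service
```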

Analysing State-Backed Propaganda Websites: a New Dataset and Linguistic Study

  • paper_url: http://arxiv.org/abs/2310.14032
  • repo_url: https://github.com/gatenlp/wordpress-site-extractor
  • paper_authors: Freddy Heppell, Kalina Bontcheva, Carolina Scarton
  • for: Analysing two previously unstudied state-backed disinformation websites, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish.
  • methods: Describes the content acquisition methodology, performs cross-site unsupervised topic clustering on the resulting multilingual dataset, and carries out linguistic and temporal analysis of the web page translations and topics over time.
  • results: The study investigates articles with false publication dates and publicly releases a new dataset of 14,053 articles, annotated with each language version and additional metadata such as links and images; the main contribution for the NLP community is this dataset, which enables studies of disinformation networks and the training of NLP tools for disinformation detection.
    Abstract This paper analyses two hitherto unstudied sites sharing state-backed disinformation, Reliable Recent News (rrn.world) and WarOnFakes (waronfakes.com), which publish content in Arabic, Chinese, English, French, German, and Spanish. We describe our content acquisition methodology and perform cross-site unsupervised topic clustering on the resulting multilingual dataset. We also perform linguistic and temporal analysis of the web page translations and topics over time, and investigate articles with false publication dates. We make publicly available this new dataset of 14,053 articles, annotated with each language version, and additional metadata such as links and images. The main contribution of this paper for the NLP community is in the novel dataset which enables studies of disinformation networks, and the training of NLP tools for disinformation detection.

LLM-Prop: Predicting Physical And Electronic Properties Of Crystalline Solids From Their Text Descriptions

  • paper_url: http://arxiv.org/abs/2310.14029
  • repo_url: https://github.com/vertaix/llm-prop
  • paper_authors: Andre Niyongabo Rubungo, Craig Arnold, Barry P. Rand, Adji Bousso Dieng
  • for: Proposing a large language model (LLM)-based method for predicting crystal properties from text descriptions of crystal structures.
  • methods: LLM-Prop leverages the general-purpose learning capabilities of LLMs to predict the physical and electronic properties of crystals from their text descriptions; the authors also release TextEdge, a benchmark dataset pairing crystal text descriptions with properties.
  • results: LLM-Prop outperforms the state-of-the-art GNN-based crystal property predictor by about 4% on band gap prediction, 3% on classifying whether the band gap is direct or indirect, and 66% on unit cell volume prediction, and also outperforms a fine-tuned MatBERT despite having 3 times fewer parameters.
    Abstract The prediction of crystal properties plays a crucial role in the crystal design process. Current methods for predicting crystal properties focus on modeling crystal structures using graph neural networks (GNNs). Although GNNs are powerful, accurately modeling the complex interactions between atoms and molecules within a crystal remains a challenge. Surprisingly, predicting crystal properties from crystal text descriptions is understudied, despite the rich information and expressiveness that text data offer. One of the main reasons is the lack of publicly available data for this task. In this paper, we develop and make public a benchmark dataset (called TextEdge) that contains text descriptions of crystal structures with their properties. We then propose LLM-Prop, a method that leverages the general-purpose learning capabilities of large language models (LLMs) to predict the physical and electronic properties of crystals from their text descriptions. LLM-Prop outperforms the current state-of-the-art GNN-based crystal property predictor by about 4% in predicting band gap, 3% in classifying whether the band gap is direct or indirect, and 66% in predicting unit cell volume. LLM-Prop also outperforms a finetuned MatBERT, a domain-specific pre-trained BERT model, despite having 3 times fewer parameters. Our empirical results may highlight the current inability of GNNs to capture information pertaining to space group symmetry and Wyckoff sites for accurate crystal property prediction.

GASCOM: Graph-based Attentive Semantic Context Modeling for Online Conversation Understanding

  • paper_url: http://arxiv.org/abs/2310.14028
  • repo_url: None
  • paper_authors: Vibhor Agarwal, Yu Chen, Nishanth Sastry
  • for: Improving performance on online conversation understanding.
  • methods: Proposes GASCOM, a graph-based attentive semantic context modeling framework with two novel algorithms that use both the graph structure of the conversation and the semantic content of individual posts to retrieve relevant context nodes, plus a token-level multi-head graph attention mechanism for fine-grained conversation context modeling.
  • results: The framework significantly outperforms state-of-the-art methods, improving macro-F1 scores by 4.5% for polarity prediction and by 5% for hate speech detection; the context weights also enhance interpretability.
    Abstract Online conversation understanding is an important yet challenging NLP problem which has many useful applications (e.g., hate speech detection). However, online conversations typically unfold over a series of posts and replies to those posts, forming a tree structure within which individual posts may refer to semantic context from higher up the tree. Such semantic cross-referencing makes it difficult to understand a single post by itself; yet considering the entire conversation tree is not only difficult to scale but can also be misleading as a single conversation may have several distinct threads or points, not all of which are relevant to the post being considered. In this paper, we propose a Graph-based Attentive Semantic COntext Modeling (GASCOM) framework for online conversation understanding. Specifically, we design two novel algorithms that utilise both the graph structure of the online conversation as well as the semantic information from individual posts for retrieving relevant context nodes from the whole conversation. We further design a token-level multi-head graph attention mechanism to pay different attentions to different tokens from different selected context utterances for fine-grained conversation context modeling. Using this semantic conversational context, we re-examine two well-studied problems: polarity prediction and hate speech detection. Our proposed framework significantly outperforms state-of-the-art methods on both tasks, improving macro-F1 scores by 4.5% for polarity prediction and by 5% for hate speech detection. The GASCOM context weights also enhance interpretability.

Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation

  • paper_url: http://arxiv.org/abs/2310.14025
  • repo_url: https://github.com/anastasiakrith/multimodal-retrieval-for-vwsd
  • paper_authors: Anastasia Kritharoula, Maria Lymperaiou, Giorgos Stamou
  • for: Visual Word Sense Disambiguation (VWSD): retrieving the image, among a set of candidates, that best represents the meaning of an ambiguous word in a given context.
  • methods: Explores a varying set of approaches, including the latest transformer-based methods for multimodal retrieval, Large Language Models (LLMs) as knowledge bases to enrich the given phrases and resolve ambiguity, unimodal reformulations (text-to-text and image-to-image retrieval, and question answering), Chain-of-Thought (CoT) prompting for explainable answer generation, and a learn-to-rank (LTR) model that combines the different modules.
  • results: Extensive experiments demonstrate competitive ranking results on VWSD and yield valuable insights for future directions.
    Abstract Visual Word Sense Disambiguation (VWSD) is a novel challenging task with the goal of retrieving an image among a set of candidates, which better represents the meaning of an ambiguous word within a given context. In this paper, we make a substantial step towards unveiling this interesting task by applying a varying set of approaches. Since VWSD is primarily a text-image retrieval task, we explore the latest transformer-based methods for multimodal retrieval. Additionally, we utilize Large Language Models (LLMs) as knowledge bases to enhance the given phrases and resolve ambiguity related to the target word. We also study VWSD as a unimodal problem by converting to text-to-text and image-to-image retrieval, as well as question-answering (QA), to fully explore the capabilities of relevant models. To tap into the implicit knowledge of LLMs, we experiment with Chain-of-Thought (CoT) prompting to guide explainable answer generation. On top of all, we train a learn to rank (LTR) model in order to combine our different modules, achieving competitive ranking results. Extensive experiments on VWSD demonstrate valuable insights to effectively drive future directions.
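A rough sketch of the text-to-image retrieval backbone for VWSD using an off-the-shelf CLIP model from HuggingFace transformers: score one (possibly LLM-enriched) phrase against the candidate images and rank them. The checkpoint name, phrase, and image paths below are illustrative assumptions; the phrase enrichment and learn-to-rank stages are not reproduced.

```python
# Sketch of ranking candidate images for an ambiguous phrase with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

phrase = "andromeda tree"  # ambiguous target word plus its context word
image_paths = ["cand_0.jpg", "cand_1.jpg", "cand_2.jpg"]  # placeholder candidates
images = [Image.open(p) for p in image_paths]

inputs = processor(text=[phrase], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_text  # shape: (1, num_images)
ranking = logits[0].argsort(descending=True).tolist()
print([image_paths[i] for i in ranking])  # best-matching candidate first
```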

Toward Stronger Textual Attack Detectors

  • paper_url: http://arxiv.org/abs/2310.14001
  • repo_url: https://github.com/pierrecolombo/adversarialattacksnlp
  • paper_authors: Pierre Colombo, Marine Picot, Nathan Noiry, Guillaume Staerman, Pablo Piantanida
  • for: Defending deep NLP systems against textual adversarial attacks.
  • methods: Introduces LAROUSSE, a new framework for detecting textual adversarial attacks, and STAKEOUT, a new benchmark composed of nine popular attack methods, three datasets, and two pre-trained models.
  • results: LAROUSSE is unsupervised, hyperparameter-free, and non-differentiable (protecting it against gradient-based methods), and extensive experiments show that it outperforms previous detection methods while revealing interesting factors behind detection-rate variations.
    Abstract The landscape of available textual adversarial attacks keeps growing, posing severe threats and raising concerns regarding the deep NLP system's integrity. However, the crucial problem of defending against malicious attacks has only drawn the attention of the NLP community. The latter is nonetheless instrumental in developing robust and trustworthy systems. This paper makes two important contributions in this line of search: (i) we introduce LAROUSSE, a new framework to detect textual adversarial attacks and (ii) we introduce STAKEOUT, a new benchmark composed of nine popular attack methods, three datasets, and two pre-trained models. LAROUSSE is ready-to-use in production as it is unsupervised, hyperparameter-free, and non-differentiable, protecting it against gradient-based methods. Our new benchmark STAKEOUT allows for a robust evaluation framework: we conduct extensive numerical experiments which demonstrate that LAROUSSE outperforms previous methods, and which allows to identify interesting factors of detection rate variations.

Transductive Learning for Textual Few-Shot Classification in API-based Embedding Models

  • paper_url: http://arxiv.org/abs/2310.13998
  • repo_url: None
  • paper_authors: Pierre Colombo, Victor Pellegrain, Malik Boudiaf, Victor Storchan, Myriam Tami, Ismail Ben Ayed, Celine Hudelot, Pablo Piantanida
  • for: This paper focuses on the practical applications of natural language processing, specifically few-shot classification, and addresses the issue of proprietary and closed APIs.
  • methods: The paper proposes a transductive inference learning paradigm that utilizes unlabeled data, along with a new parameter-free transductive regularizer based on the Fisher-Rao loss.
  • results: The paper presents experimental results using eight backbone models and an episodic evaluation over 1,000 episodes, which demonstrate the superiority of transductive inference over the standard inductive setting.
    Abstract Proprietary and closed APIs are becoming increasingly common to process natural language, and are impacting the practical applications of natural language processing, including few-shot classification. Few-shot classification involves training a model to perform a new classification task with a handful of labeled data. This paper presents three contributions. First, we introduce a scenario where the embedding of a pre-trained model is served through a gated API with compute-cost and data-privacy constraints. Second, we propose a transductive inference, a learning paradigm that has been overlooked by the NLP community. Transductive inference, unlike traditional inductive learning, leverages the statistics of unlabeled data. We also introduce a new parameter-free transductive regularizer based on the Fisher-Rao loss, which can be used on top of the gated API embeddings. This method fully utilizes unlabeled data, does not share any label with the third-party API provider and could serve as a baseline for future research. Third, we propose an improved experimental setting and compile a benchmark of eight datasets involving multiclass classification in four different languages, with up to 151 classes. We evaluate our methods using eight backbone models, along with an episodic evaluation over 1,000 episodes, which demonstrate the superiority of transductive inference over the standard inductive setting.
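A generic sketch of transductive few-shot classification on frozen (e.g., API-served) embeddings: a linear head is trained on the labelled support set while a confidence regularizer is computed on the unlabelled query set. The simple entropy penalty below is a stand-in for illustration only, not the parameter-free Fisher-Rao regularizer proposed in the paper.

```python
# Sketch of transductive inference: the unlabelled query set shapes the classifier.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, n_classes = 32, 3
support_x, support_y = torch.randn(6, dim), torch.tensor([0, 0, 1, 1, 2, 2])
query_x = torch.randn(30, dim)  # unlabelled queries, used transductively

head = torch.nn.Linear(dim, n_classes)
opt = torch.optim.Adam(head.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    ce = F.cross_entropy(head(support_x), support_y)              # inductive term
    probs = F.softmax(head(query_x), dim=-1)
    ent = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()   # transductive term
    (ce + 0.5 * ent).backward()
    opt.step()
print(head(query_x).argmax(-1))  # predictions for the query set
```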

  • paper_url: http://arxiv.org/abs/2310.13996
  • repo_url: None
  • paper_authors: Mohammad Hossein Khojasteh, Najmeh Torabian, Ali Farjami, Saeid Hosseini, Behrouz Minaei-Bidgoli
  • for: Addressing the incompleteness problem of knowledge graphs (KG) by proposing a new neuro-symbolic model, FaSt-FLiP, that improves both the performance and the explainability of link prediction.
  • methods: Inspired by two aspects of human cognition ("commonsense reasoning" and "thinking, fast and slow"), the model combines a logical and a neural model; a semi-supervised method converts rules into sentences, which are assessed with an NLI (Natural Language Inference) model to remove incorrect rules, and an Inference Engine module unifies the answers of the two models.
  • results: Experiments show superior performance on link prediction metrics and the generation of more reliable explanations, including the detection and removal of erroneous rules produced by the logical model.
    Abstract Link prediction is an important task in addressing the incompleteness problem of knowledge graphs (KG). Previous link prediction models suffer from issues related to either performance or explanatory capability. Furthermore, models that are capable of generating explanations, often struggle with erroneous paths or reasoning leading to the correct answer. To address these challenges, we introduce a novel Neural-Symbolic model named FaSt-FLiP (stands for Fast and Slow Thinking with Filtered rules for Link Prediction task), inspired by two distinct aspects of human cognition: "commonsense reasoning" and "thinking, fast and slow." Our objective is to combine a logical and neural model for enhanced link prediction. To tackle the challenge of dealing with incorrect paths or rules generated by the logical model, we propose a semi-supervised method to convert rules into sentences. These sentences are then subjected to assessment and removal of incorrect rules using an NLI (Natural Language Inference) model. Our approach to combining logical and neural models involves first obtaining answers from both the logical and neural models. These answers are subsequently unified using an Inference Engine module, which has been realized through both algorithmic implementation and a novel neural model architecture. To validate the efficacy of our model, we conducted a series of experiments. The results demonstrate the superior performance of our model in both link prediction metrics and the generation of more reliable explanations.
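The rule-filtering idea can be sketched briefly: a rule's conclusion is verbalized as a sentence and scored against supporting facts with an off-the-shelf NLI model, keeping only rules whose conclusions are entailed. The checkpoint (`roberta-large-mnli`), the verbalization template, and the 0.5 threshold are illustrative choices, not necessarily those used in the paper.

```python
# Sketch of filtering verbalized rules with a public NLI model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
nli = AutoModelForSequenceClassification.from_pretrained(name)

premise = "Alice was born in Paris. Paris is located in France."
rule_sentence = "Alice was born in France."       # conclusion of a sound rule
bad_rule_sentence = "Alice was born in Germany."  # conclusion of a faulty rule

def entailment_prob(premise: str, hypothesis: str) -> float:
    inputs = tok(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(-1)[0]
    label_id = {v.lower(): k for k, v in nli.config.id2label.items()}["entailment"]
    return float(probs[label_id])

for hyp in (rule_sentence, bad_rule_sentence):
    keep = entailment_prob(premise, hyp) > 0.5
    print(f"{hyp!r}: keep={keep}")
```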

A Novel Information-Theoretic Objective to Disentangle Representations for Fair Classification

  • paper_url: http://arxiv.org/abs/2310.13990
  • repo_url: None
  • paper_authors: Pierre Colombo, Nathan Noiry, Guillaume Staerman, Pablo Piantanida
  • for: The paper aims to learn abstract representations of reality from the observation of multiple contextual situations, specifically disentangled representations that are low-dimensional and independent of sensitive attributes such as gender or age.
  • methods: The paper proposes a novel family of regularizers called CLINIC, which minimizes the mutual information between the latent representation and the sensitive attribute conditional to the target. This approach is parameter-free and easier to train than previous techniques.
  • results: The paper demonstrates that the proposed CLINIC losses offer a better disentanglement/accuracy trade-off than previous techniques and generalize better than training with cross-entropy loss alone, provided that the disentanglement task is not too constraining.
    Abstract One of the pursued objectives of deep learning is to provide tools that learn abstract representations of reality from the observation of multiple contextual situations. More precisely, one wishes to extract disentangled representations which are (i) low dimensional and (ii) whose components are independent and correspond to concepts capturing the essence of the objects under consideration (Locatello et al., 2019b). One step towards this ambitious project consists in learning disentangled representations with respect to a predefined (sensitive) attribute, e.g., the gender or age of the writer. Perhaps one of the main application for such disentangled representations is fair classification. Existing methods extract the last layer of a neural network trained with a loss that is composed of a cross-entropy objective and a disentanglement regularizer. In this work, we adopt an information-theoretic view of this problem which motivates a novel family of regularizers that minimizes the mutual information between the latent representation and the sensitive attribute conditional to the target. The resulting set of losses, called CLINIC, is parameter free and thus, it is easier and faster to train. CLINIC losses are studied through extensive numerical experiments by training over 2k neural networks. We demonstrate that our methods offer a better disentanglement/accuracy trade-off than previous techniques, and generalize better than training with cross-entropy loss solely provided that the disentanglement task is not too constraining.
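To ground the objective in code, here is a toy sketch of the general idea: train a classifier while penalizing dependence between the latent representation and a sensitive attribute conditionally on the target. The group-mean matching penalty used below is a crude, invented stand-in for illustration, not the CLINIC conditional mutual-information regularizer.

```python
# Toy sketch: cross-entropy plus a conditional (per-target-class) dependence penalty.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(256, 20)
y = (x[:, 0] > 0).long()  # target label
s = (x[:, 1] > 0).long()  # sensitive attribute (e.g. a demographic flag)

enc = torch.nn.Sequential(torch.nn.Linear(20, 16), torch.nn.ReLU())
clf = torch.nn.Linear(16, 2)
opt = torch.optim.Adam(list(enc.parameters()) + list(clf.parameters()), lr=1e-2)

for _ in range(200):
    opt.zero_grad()
    z = enc(x)
    loss = F.cross_entropy(clf(z), y)
    # Within each target class, pull the latent means of both sensitive groups together.
    for c in (0, 1):
        zc0, zc1 = z[(y == c) & (s == 0)], z[(y == c) & (s == 1)]
        if len(zc0) and len(zc1):
            loss = loss + 0.1 * (zc0.mean(0) - zc1.mean(0)).pow(2).sum()
    loss.backward()
    opt.step()
print("final loss:", float(loss))
```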

GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4

  • paper_url: http://arxiv.org/abs/2310.13988
  • repo_url: None
  • paper_authors: Tom Kocmi, Christian Federmann
  • for: Detecting translation quality error spans, in the quality estimation setting without human reference translations.
  • methods: Uses GPT models with a fixed three-shot prompting technique, querying GPT-4 to mark error quality spans; the prompts are language-agnostic, avoiding manual prompt preparation for new languages.
  • results: Preliminary results indicate state-of-the-art accuracy for system ranking, but the authors advise caution when using the metric in academic works to demonstrate improvements over other methods, due to its dependence on the proprietary, black-box GPT model.
    Abstract This paper introduces GEMBA-MQM, a GPT-based evaluation metric designed to detect translation quality errors, specifically for the quality estimation setting without the need for human reference translations. Based on the power of large language models (LLM), GEMBA-MQM employs a fixed three-shot prompting technique, querying the GPT-4 model to mark error quality spans. Compared to previous works, our method has language-agnostic prompts, thus avoiding the need for manual prompt preparation for new languages. While preliminary results indicate that GEMBA-MQM achieves state-of-the-art accuracy for system ranking, we advise caution when using it in academic works to demonstrate improvements over other methods due to its dependence on the proprietary, black-box GPT model.
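A small sketch of what a fixed few-shot, MQM-style error-span prompt for an LLM judge can look like. The prompt wording, the single in-context example, and the model choice are illustrative, not the exact GEMBA-MQM prompt; running the call requires an OpenAI API key.

```python
# Sketch of prompting a GPT model to mark MQM-style error spans.
from openai import OpenAI

FEW_SHOT = """Source (en): The cat sat on the mat.
Translation (de): Die Katze sass auf dem Hund.
Errors: major/accuracy: "dem Hund" (mistranslation of "the mat")
"""

def mqm_errors(source: str, translation: str, model: str = "gpt-4") -> str:
    prompt = (
        "You are an MQM annotator. List translation error spans with severity "
        "and category, or 'no errors'.\n\n" + FEW_SHOT +
        f"\nSource (en): {source}\nTranslation (de): {translation}\nErrors:"
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return resp.choices[0].message.content

# Example call (requires OPENAI_API_KEY to be set):
# print(mqm_errors("I like tea.", "Ich mag Kaffee."))
```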

HateRephrase: Zero- and Few-Shot Reduction of Hate Intensity in Online Posts using Large Language Models

  • paper_url: http://arxiv.org/abs/2310.13985
  • repo_url: None
  • paper_authors: Vibhor Agarwal, Yu Chen, Nishanth Sastry
  • for: Proposing a new, simple, and effective approach: suggesting a rephrasing of potentially hateful content even before the post is made.
  • methods: Uses Large Language Models (LLMs) and compares prompting strategies based on the task description, a hate definition, few-shot demonstrations, and chain-of-thoughts, with experiments on open-source LLMs (LLaMA-1, LLaMA-2 chat, Vicuna) as well as OpenAI's GPT-3.5.
  • results: LLMs prompted with few-shot demonstrations perform best at generating acceptable rephrasings whose meaning stays close to the original text; GPT-3.5 outperforms the baseline and open-source models for all prompt types, and human evaluation shows that GPT-3.5's rephrasings even outperform the human-generated ground-truth rephrasings in the dataset.
    Abstract Hate speech has become pervasive in today's digital age. Although there has been considerable research to detect hate speech or generate counter speech to combat hateful views, these approaches still cannot completely eliminate the potential harmful societal consequences of hate speech -- hate speech, even when detected, can often not be taken down or is often not taken down enough; and hate speech unfortunately spreads quickly, often much faster than any generated counter speech. This paper investigates a relatively new yet simple and effective approach of suggesting a rephrasing of potential hate speech content even before the post is made. We show that Large Language Models (LLMs) perform well on this task, outperforming state-of-the-art baselines such as BART-Detox. We develop 4 different prompts based on task description, hate definition, few-shot demonstrations and chain-of-thoughts for comprehensive experiments and conduct experiments on open-source LLMs such as LLaMA-1, LLaMA-2 chat, Vicuna as well as OpenAI's GPT-3.5. We propose various evaluation metrics to measure the efficacy of the generated text and ensure the generated text has reduced hate intensity without drastically changing the semantic meaning of the original text. We find that LLMs with a few-shot demonstrations prompt work the best in generating acceptable hate-rephrased text with semantic meaning similar to the original text. Overall, we find that GPT-3.5 outperforms the baseline and open-source models for all the different kinds of prompts. We also perform human evaluations and interestingly, find that the rephrasings generated by GPT-3.5 outperform even the human-generated ground-truth rephrasings in the dataset. We also conduct detailed ablation studies to investigate why LLMs work satisfactorily on this task and conduct a failure analysis to understand the gaps.

Automatic Pronunciation Assessment – A Review

  • paper_url: http://arxiv.org/abs/2310.13974
  • repo_url: None
  • paper_authors: Yassine El Kheir, Ahmed Ali, Shammur Absar Chowdhury
  • for: Reviewing pronunciation assessment methods and their application in computer-aided pronunciation training (CAPT).
  • methods: Surveys methods employed in pronunciation assessment at both the phonemic and prosodic levels.
  • results: The review categorizes the main challenges in prominent research trends, highlights existing limitations and available resources, and discusses remaining challenges and possible directions for future work.
    Abstract Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic. We categorize the main challenges observed in prominent research trends, and highlight existing limitations, and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work.

AITA Generating Moral Judgements of the Crowd with Reasoning

  • paper_url: http://arxiv.org/abs/2310.18336
  • repo_url: None
  • paper_authors: Osama Bsher, Ameer Sabri
  • for: Generating comments with moral reasoning for stories with moral dilemmas, using the AITA subreddit as a dataset.
  • methods: Leverage the vast amount of data on the forum and use state-of-the-art seq2seq text generation models to generate coherent comments that align with the norms and values of the AITA community.
  • results: Evaluate the ability of these models to make moral judgments similarly to humans and to produce concise comments providing clear moral stances and advice for the poster.
    Abstract Morality is a fundamental aspect of human behavior and ethics, influencing how we interact with each other and the world around us. When faced with a moral dilemma, a person's ability to make clear moral judgments can be clouded. Due to many factors such as personal biases, emotions and situational factors people can find it difficult to decide their best course of action. The AmITheAsshole (AITA) subreddit is a forum on the social media platform Reddit that helps people get clarity and objectivity on their predicaments. In the forum people post anecdotes about moral dilemmas they are facing in their lives, seeking validation for their actions or advice on how to navigate the situation from the community. The morality of the actions in each post is classified based on the collective opinion of the community into mainly two labels, "Not The Asshole" (NTA) and "You Are The Asshole" (YTA). This project aims to generate comments with moral reasoning for stories with moral dilemmas using the AITA subreddit as a dataset. While past literature has explored the classification of posts into labels (Alhassan et al., 2022), the generation of comments remains a novel and challenging task. It involves understanding the complex social and ethical considerations in each situation. To address this challenge, we will leverage the vast amount of data on the forum with the goal of generating coherent comments that align with the norms and values of the AITA community. In this endeavor, we aim to evaluate state-of-the-art seq2seq text generation models for their ability to make moral judgments similarly to humans, ultimately producing concise comments providing clear moral stances and advice for the poster.

Linguistically Motivated Sign Language Segmentation

  • paper_url: http://arxiv.org/abs/2310.13960
  • repo_url: https://github.com/sign-language-processing/transcription
  • paper_authors: Amit Moryossef, Zifan Jiang, Mathias Müller, Sarah Ebling, Yoav Goldberg
  • for: Proposing a novel approach to sign language segmentation, a crucial task that enables downstream tasks such as sign recognition, transcription, and machine translation.
  • methods: Jointly models segmentation into individual signs and into phrases, motivated by linguistic cues observed in sign language corpora; replaces the predominant IO tagging scheme with BIO tagging to account for continuous signing, explores optical flow features to capture prosody, and analyses hand shapes and 3D hand normalization.
  • results: BIO tagging is necessary to model sign boundaries; explicitly encoding prosody via optical flow improves segmentation in shallow models but contributes little in deeper ones; careful tuning of the decoding algorithm further improves segmentation quality, and the final models generalize to out-of-domain video in a different signed language even in a zero-shot setting, with optical flow and 3D hand normalization enhancing robustness.
    Abstract Sign language segmentation is a crucial task in sign language processing systems. It enables downstream tasks such as sign recognition, transcription, and machine translation. In this work, we consider two kinds of segmentation: segmentation into individual signs and segmentation into phrases, larger units comprising several signs. We propose a novel approach to jointly model these two tasks. Our method is motivated by linguistic cues observed in sign language corpora. We replace the predominant IO tagging scheme with BIO tagging to account for continuous signing. Given that prosody plays a significant role in phrase boundaries, we explore the use of optical flow features. We also provide an extensive analysis of hand shapes and 3D hand normalization. We find that introducing BIO tagging is necessary to model sign boundaries. Explicitly encoding prosody by optical flow improves segmentation in shallow models, but its contribution is negligible in deeper models. Careful tuning of the decoding algorithm atop the models further improves the segmentation quality. We demonstrate that our final models generalize to out-of-domain video content in a different signed language, even under a zero-shot setting. We observe that including optical flow and 3D hand normalization enhances the robustness of the model in this context.
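A tiny illustration of why BIO frame tags can recover boundaries between adjacent signs in continuous signing while IO tags cannot: with IO tags, two back-to-back signs collapse into one span, whereas a B tag marks the start of the second sign. The frame sequences and decoder below are a minimal sketch, not the paper's model.

```python
# IO vs BIO frame tagging for continuous signing.
def segments(tags):
    """Return (start, end) frame spans from a per-frame tag sequence."""
    spans, start = [], None
    for i, t in enumerate(tags + ["O"]):
        if t == "B" or t == "O":
            if start is not None:
                spans.append((start, i - 1))
                start = None
        if t == "B" or (t == "I" and start is None):
            start = i
    return spans

frames_io  = ["O", "I", "I", "I", "I", "O"]   # two adjacent signs, IO tags
frames_bio = ["O", "B", "I", "B", "I", "O"]   # same frames, BIO tags
print(segments(frames_io))   # [(1, 4)]          -> the two signs are merged
print(segments(frames_bio))  # [(1, 2), (3, 4)]  -> boundary recovered
```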

Values, Ethics, Morals? On the Use of Moral Concepts in NLP Research

  • paper_url: http://arxiv.org/abs/2310.13915
  • repo_url: None
  • paper_authors: Karina Vida, Judith Simon, Anne Lauscher
  • for: Examining ethical aspects of NLP research, in particular how the notion of morality is used when evaluating language models.
  • methods: Provides an overview of important ethical concepts from philosophy and systematically surveys the existing literature on moral NLP with respect to its philosophical foundation, terminology, and data basis.
  • results: A survey of 92 papers shows that most papers neither provide a clear definition of the terms they use nor adhere to definitions from philosophy; the authors give three recommendations for future research in the field.
    Abstract With language technology increasingly affecting individuals' lives, many recent works have investigated the ethical aspects of NLP. Among other topics, researchers focused on the notion of morality, investigating, for example, which moral judgements language models make. However, there has been little to no discussion of the terminology and the theories underpinning those efforts and their implications. This lack is highly problematic, as it hides the works' underlying assumptions and hinders a thorough and targeted scientific debate of morality in NLP. In this work, we address this research gap by (a) providing an overview of some important ethical concepts stemming from philosophy and (b) systematically surveying the existing literature on moral NLP w.r.t. their philosophical foundation, terminology, and data basis. For instance, we analyse what ethical theory an approach is based on, how this decision is justified, and what implications it entails. Our findings surveying 92 papers show that, for instance, most papers neither provide a clear definition of the terms they use nor adhere to definitions from philosophy. Finally, (c) we give three recommendations for future research in the field. We hope our work will lead to a more informed, careful, and sound discussion of morality in language technology.

RTSUM: Relation Triple-based Interpretable Summarization with Multi-level Salience Visualization

  • paper_url: http://arxiv.org/abs/2310.13895
  • repo_url: https://github.com/sjyyj/sjyyj
  • paper_authors: Seonglae Cho, Yonggi Cho, HoonJae Lee, Myungha Jang, Jinyoung Yeo, Dongha Lee
  • for: Presenting an unsupervised summarization framework that uses relation triples as the basic unit of summarization.
  • methods: Given an input document, the method first selects salient relation triples via multi-level salience scoring and then generates a concise summary from the selected triples with a text-to-text language model.
  • results: On top of RTSUM, the authors develop a web demo of an interpretable summarization tool that provides fine-grained interpretations alongside the output summary; with customization options, it visualizes salience for textual units at three levels (sentences, relation triples, and phrases), and the code is publicly available.
    Abstract In this paper, we present RTSUM, an unsupervised summarization framework that utilizes relation triples as the basic unit for summarization. Given an input document, RTSUM first selects salient relation triples via multi-level salience scoring and then generates a concise summary from the selected relation triples by using a text-to-text language model. On the basis of RTSUM, we also develop a web demo for an interpretable summarizing tool, providing fine-grained interpretations with the output summary. With support for customization options, our tool visualizes the salience for textual units at three distinct levels: sentences, relation triples, and phrases. The codes,are publicly available.

RECAP: Towards Precise Radiology Report Generation via Dynamic Disease Progression Reasoning

  • paper_url: http://arxiv.org/abs/2310.13864
  • repo_url: https://github.com/wjhou/recap
  • paper_authors: Wenjun Hou, Yi Cheng, Kaishuai Xu, Wenjie Li, Jiang Liu
  • for: Alleviating radiologists' workloads by automating radiology report generation.
  • methods: Dynamic disease progression reasoning over two consecutive radiographs combined with the patient's historical records, using a disease progression graph and a dynamic progression reasoning mechanism to select the attributes of each observation and progression.
  • results: Generates precise and accurate radiology reports, with effectiveness demonstrated on two publicly available datasets.
    Abstract Automating radiology report generation can significantly alleviate radiologists' workloads. Previous research has primarily focused on realizing highly concise observations while neglecting the precise attributes that determine the severity of diseases (e.g., small pleural effusion). Since incorrect attributes will lead to imprecise radiology reports, strengthening the generation process with precise attribute modeling becomes necessary. Additionally, the temporal information contained in the historical records, which is crucial in evaluating a patient's current condition (e.g., heart size is unchanged), has also been largely disregarded. To address these issues, we propose RECAP, which generates precise and accurate radiology reports via dynamic disease progression reasoning. Specifically, RECAP first predicts the observations and progressions (i.e., spatiotemporal information) given two consecutive radiographs. It then combines the historical records, spatiotemporal information, and radiographs for report generation, where a disease progression graph and dynamic progression reasoning mechanism are devised to accurately select the attributes of each observation and progression. Extensive experiments on two publicly available datasets demonstrate the effectiveness of our model.