cs.CL - 2023-11-22

Surpassing GPT-4 Medical Coding with a Two-Stage Approach

  • paper_url: http://arxiv.org/abs/2311.13735
  • repo_url: None
  • paper_authors: Zhichao Yang, Sanjit Singh Batra, Joel Stremmel, Eran Halperin
  • for: Clinical applications such as clinical decision support and trial recommendations.
  • methods: A two-stage approach: an LLM first generates evidence proposals, then an LSTM-based verification stage learns from both the LLM's high recall and human experts' high precision via a custom loss function (see the sketch below).
  • results: Achieves state-of-the-art results in medical coding accuracy, accuracy on rare codes, and sentence-level evidence identification, without training on human-annotated evidence.
    Abstract Recent advances in large language models (LLMs) show potential for clinical applications, such as clinical decision support and trial recommendations. However, the GPT-4 LLM predicts an excessive number of ICD codes for medical coding tasks, leading to high recall but low precision. To tackle this challenge, we introduce LLM-codex, a two-stage approach to predict ICD codes that first generates evidence proposals using an LLM and then employs an LSTM-based verification stage. The LSTM learns from both the LLM's high recall and human expert's high precision, using a custom loss function. Our model is the only approach that simultaneously achieves state-of-the-art results in medical coding accuracy, accuracy on rare codes, and sentence-level evidence identification to support coding decisions without training on human-annotated evidence according to experiments on the MIMIC dataset.
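
To make the verification stage concrete, here is a minimal, hypothetical sketch of an LSTM verifier over (ICD code, evidence sentence) pairs. The module names, sizes, and pooling are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical verification stage: a BiLSTM scores (ICD code, evidence
# sentence) pairs proposed by the LLM in stage one.
import torch
import torch.nn as nn

class EvidenceVerifier(nn.Module):
    def __init__(self, vocab_size, n_codes, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.code_embed = nn.Embedding(n_codes, 2 * hidden)

    def forward(self, token_ids, code_ids):
        # token_ids: (batch, seq_len) evidence sentence; code_ids: (batch,)
        h, _ = self.lstm(self.embed(token_ids))   # (batch, seq, 2*hidden)
        sent = h.mean(dim=1)                      # pooled sentence vector
        code = self.code_embed(code_ids)          # (batch, 2*hidden)
        return (sent * code).sum(-1)              # verification logit

model = EvidenceVerifier(vocab_size=30000, n_codes=8000)
logits = model(torch.randint(0, 30000, (4, 32)), torch.randint(0, 8000, (4,)))
print(logits.shape)  # torch.Size([4])
# Training would use BCEWithLogitsLoss on labels mixing LLM proposals
# (high recall) with billed codes (high precision), per the paper's idea.
```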

Comparison of pipeline, sequence-to-sequence, and GPT models for end-to-end relation extraction: experiments with the rare disease use-case

  • paper_url: http://arxiv.org/abs/2311.13729
  • repo_url: https://github.com/shashank140195/raredis
  • paper_authors: Shashank Gupta, Xuguang Ai, Ramakanth Kavuluru
  • for: Comparing three prevailing paradigms for end-to-end relation extraction (E2ERE), using a complex dataset focused on rare diseases.
  • methods: Three approaches are evaluated: NER ⇒ RE pipeline models, joint sequence-to-sequence models, and generative pre-trained transformer (GPT) models (a pipeline skeleton is sketched below).
  • results: Pipeline models remain the best, with sequence-to-sequence models not far behind; GPT models with eight times as many parameters underperform both and lose to pipeline models by over 10 F1 points. The pipeline advantage also holds on a second E2ERE dataset.
    Abstract End-to-end relation extraction (E2ERE) is an important and realistic application of natural language processing (NLP) in biomedicine. In this paper, we aim to compare three prevailing paradigms for E2ERE using a complex dataset focused on rare diseases involving discontinuous and nested entities. We use the RareDis information extraction dataset to evaluate three competing approaches (for E2ERE): NER $\rightarrow$ RE pipelines, joint sequence to sequence models, and generative pre-trained transformer (GPT) models. We use comparable state-of-the-art models and best practices for each of these approaches and conduct error analyses to assess their failure modes. Our findings reveal that pipeline models are still the best, while sequence-to-sequence models are not far behind; GPT models with eight times as many parameters are worse than even sequence-to-sequence models and lose to pipeline models by over 10 F1 points. Partial matches and discontinuous entities caused many NER errors contributing to lower overall E2E performances. We also verify these findings on a second E2ERE dataset for chemical-protein interactions. Although generative LM-based methods are more suitable for zero-shot settings, when training data is available, our results show that it is better to work with more conventional models trained and tailored for E2ERE. More innovative methods are needed to marry the best of the both worlds from smaller encoder-decoder pipeline models and the larger GPT models to improve E2ERE. As of now, we see that well designed pipeline models offer substantial performance gains at a lower cost and carbon footprint for E2ERE. Our contribution is also the first to conduct E2ERE for the RareDis dataset.
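
As a concrete picture of the pipeline paradigm the paper finds strongest, here is a minimal, hypothetical skeleton: stage one tags entity mentions, stage two classifies relations over candidate pairs. `ner` and `relation_clf` are stand-ins for trained models, not the authors' systems.

```python
# Hypothetical NER -> RE pipeline skeleton for E2ERE.
from itertools import combinations

def e2ere_pipeline(text, ner, relation_clf):
    entities = ner(text)                         # stage 1: find mentions
    triples = []
    for e1, e2 in combinations(entities, 2):     # stage 2: classify pairs
        rel = relation_clf(text, e1, e2)
        if rel != "no_relation":
            triples.append((e1, rel, e2))
    return triples

# Toy stubs for illustration only.
ner = lambda t: ["cystinosis", "kidney damage"]
clf = lambda t, a, b: "causes" if a == "cystinosis" else "no_relation"
print(e2ere_pipeline("Cystinosis can cause kidney damage.", ner, clf))
# [('cystinosis', 'causes', 'kidney damage')]
```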

Dynamic Analysis Method for Hidden Dangers in Substation Based on Knowledge Graph

  • paper_url: http://arxiv.org/abs/2311.13708
  • repo_url: None
  • paper_authors: Weiwei Li, Xing Liu, Wei Wang, Lu Chen, Sizhe Li, Hui Fan
  • for: Identifying and understanding hidden dangers in substations from unstructured text data.
  • methods: A dynamic analysis method: information about hidden dangers is first extracted from unstructured text, then handled by a flexible, distributed search engine built on Elastic-Search. A hidden Markov model is trained on the data in the engine, and the Viterbi algorithm decodes the hidden state sequences to segment and label entities related to hidden dangers (a decoder sketch follows the abstract); finally, a Neo4j graph database dynamically builds a knowledge map.
  • results: Effectiveness is demonstrated through an example analysis using hidden-danger data from a specific substation.
    Abstract To address the challenge of identifying and understanding hidden dangers in substations from unstructured text data, a novel dynamic analysis method is proposed. This approach begins by analyzing and extracting data from the unstructured text related to hidden dangers. It then leverages a flexible, distributed data search engine built on Elastic-Search to handle this information. Following this, the hidden Markov model is employed to train the data within the engine. The Viterbi algorithm is integrated to decipher the hidden state sequences, facilitating the segmentation and labeling of entities related to hidden dangers. The final step involves using the Neo4j graph database to dynamically create a knowledge map that visualizes hidden dangers in the substation. This method's effectiveness is demonstrated through an example analysis using data from a specific substation's hidden dangers.
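
Since the entity labeling step hinges on Viterbi decoding of the HMM's hidden states, here is a minimal, self-contained decoder; the matrices are toy values, not the trained substation model.

```python
# Viterbi decoding of the most likely hidden-state path, as used for
# segmenting and labeling hidden-danger entities.
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """obs: observation indices; returns the most likely state path."""
    T, n_states = len(obs), len(start_p)
    delta = np.zeros((T, n_states))              # best log-prob so far
    psi = np.zeros((T, n_states), dtype=int)     # backpointers
    delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(trans_p)  # (prev, cur)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(emit_p[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                # backtrack
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy 2-state (O, B-danger) model over a 3-symbol vocabulary.
print(viterbi([0, 2, 1],
              start_p=np.array([0.7, 0.3]),
              trans_p=np.array([[0.8, 0.2], [0.4, 0.6]]),
              emit_p=np.array([[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]])))
```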

Efficient Transformer Knowledge Distillation: A Performance Review

  • paper_url: http://arxiv.org/abs/2311.13657
  • repo_url: None
  • paper_authors: Nathan Brown, Ashton Williamson, Tahj Anderson, Logan Lawrence
  • for: Evaluating model compression via knowledge distillation on efficient attention transformers, to preserve performance while reducing computational cost.
  • methods: Knowledge distillation of state-of-the-art efficient attention architectures, with a cost-performance analysis against their full-attention counterparts (a standard distillation loss is sketched below).
  • results: Distilled efficient attention models preserve up to 98.6% of original performance on short-context tasks, up to 94.6% on long-context QA, and up to 98.8% on long-context NER, while reducing inference times by up to 57.8%. The work also introduces GONERD, a new long-context Named Entity Recognition dataset for training and testing NER models on long sequences.
    Abstract As pretrained transformer language models continue to achieve state-of-the-art performance, the Natural Language Processing community has pushed for advances in model compression and efficient attention mechanisms to address high computational requirements and limited input sequence length. Despite these separate efforts, no investigation has been done into the intersection of these two fields. In this work, we provide an evaluation of model compression via knowledge distillation on efficient attention transformers. We provide cost-performance trade-offs for the compression of state-of-the-art efficient attention architectures and the gains made in performance in comparison to their full attention counterparts. Furthermore, we introduce a new long-context Named Entity Recognition dataset, GONERD, to train and test the performance of NER models on long sequences. We find that distilled efficient attention transformers can preserve a significant amount of original model performance, preserving up to 98.6% across short-context tasks (GLUE, SQUAD, CoNLL-2003), up to 94.6% across long-context Question-and-Answering tasks (HotpotQA, TriviaQA), and up to 98.8% on long-context Named Entity Recognition (GONERD), while decreasing inference times by up to 57.8%. We find that, for most models on most tasks, performing knowledge distillation is an effective method to yield high-performing efficient attention models with low costs.
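
For reference, a minimal sketch of the standard distillation loss (soft teacher targets plus hard labels) that such compression pipelines typically build on; the paper's exact recipe may differ.

```python
# Standard knowledge distillation loss: KL to temperature-softened
# teacher logits, blended with cross-entropy on gold labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # keep gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10),
                         torch.randint(0, 10, (8,)))
print(loss.item())
```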

Language Model Inversion

  • paper_url: http://arxiv.org/abs/2311.13647
  • repo_url: https://github.com/jxmorris12/vec2text
  • paper_authors: John X. Morris, Wenting Zhao, Justin T. Chiu, Vitaly Shmatikov, Alexander M. Rush
  • for: Recovering hidden prompt tokens from a language model's next-token distribution.
  • methods: Poses language model inversion as a problem in its own right, showing that next-token probabilities carry a surprising amount of information about the preceding text, and proposes a search-based method to recover the probability vector, and from it the unknown prompt, given only the model's current distribution output.
  • results: On Llama-2 7b, the inversion method reconstructs prompts with a BLEU of 59 and a token-level F1 of 78, and recovers 27% of prompts exactly.
    Abstract Language models produce a distribution over the next token; can we use this information to recover the prompt tokens? We consider the problem of language model inversion and show that next-token probabilities contain a surprising amount of information about the preceding text. Often we can recover the text in cases where it is hidden from the user, motivating a method for recovering unknown prompts given only the model's current distribution output. We consider a variety of model access scenarios, and show how even without predictions for every token in the vocabulary we can recover the probability vector through search. On Llama-2 7b, our inversion method reconstructs prompts with a BLEU of $59$ and token-level F1 of $78$ and recovers $27\%$ of prompts exactly. Code for reproducing all experiments is available at http://github.com/jxmorris12/vec2text.

PaSS: Parallel Speculative Sampling

  • paper_url: http://arxiv.org/abs/2311.13581
  • repo_url: None
  • paper_authors: Giovanni Monea, Armand Joulin, Edouard Grave
  • for: Speeding up generation from large language models.
  • methods: Parallel decoding drafts multiple tokens from a single model, requiring neither extra computational cost nor a second draft model (a generic draft-and-verify loop is sketched below).
  • results: Up to a $30\%$ speed-up while adding only $O(d_{emb})$ additional parameters.
    Abstract Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and thus reading the full set of parameters from memory. This memory access forms the primary bottleneck for generation and it worsens as the model size increases. Moreover, executing a forward pass for multiple tokens in parallel often takes nearly the same time as it does for just one token. These two observations lead to the development of speculative sampling, where a second smaller model is used to draft a few tokens, that are then validated or rejected using a single forward pass of the large model. Unfortunately, this method requires two models that share the same tokenizer and thus limits its adoption. As an alternative, we propose to use parallel decoding as a way to draft multiple tokens from a single model with no computational cost, nor the need for a second model. Our approach only requires an additional input token that marks the words that will be generated simultaneously. We show promising performance (up to $30\%$ speed-up) while requiring only as few as $O(d_{emb})$ additional parameters.
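
To illustrate the draft-and-verify pattern behind speculative decoding (PaSS drafts with the same model via extra look-ahead input tokens instead of a second model), here is a minimal greedy sketch; `draft_fn` and `verify_fn` are hypothetical stand-ins.

```python
# Greedy draft-then-verify step: accept drafted tokens up to the first
# disagreement with the verifier's single forward pass.
def speculative_step(prefix, draft_fn, verify_fn, k=4):
    drafted = draft_fn(prefix, k)            # k cheap draft tokens
    checked = verify_fn(prefix, drafted)     # one full-model forward pass
    accepted = []
    for d, v in zip(drafted, checked):
        if d != v:                           # first disagreement:
            accepted.append(v)               # keep the verified token, stop
            break
        accepted.append(d)
    return prefix + accepted

# Toy functions: the "model" always continues 1, 2, 3, ...
draft = lambda p, k: [p[-1] + i + 1 for i in range(k)]
verify = lambda p, d: [p[-1] + i + 1 for i in range(len(d))]
print(speculative_step([0], draft, verify))  # [0, 1, 2, 3, 4]
```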

Efficient Deep Speech Understanding at the Edge

  • paper_url: http://arxiv.org/abs/2311.17065
  • repo_url: https://github.com/Aryia-Behroziuan/neurons
  • paper_authors: Rongxiang Wang, Felix Lin
  • for: Improving speech understanding (SU) performance on resource-limited edge devices, including inputs that exceed the on-device model's capacity.
  • methods: Three innovations: (1) late contextualization, which runs the model's attentive encoder in parallel with input ingestion; (2) pilot decoding, which alleviates temporal load imbalances; (3) autoregression offramps, which make offloading decisions based on partial output sequences.
  • results: On platforms with 6-8 Arm cores, the prototype achieves state-of-the-art (SOTA) accuracy, reduces end-to-end latency by 2x, and halves offloading requirements.
    Abstract Contemporary Speech Understanding (SU) involves a sophisticated pipeline: capturing real-time voice input, the pipeline encompasses a deep neural network with an encoder-decoder architecture enhanced by beam search. This network periodically assesses attention and Connectionist Temporal Classification (CTC) scores in its autoregressive output. This paper aims to enhance SU performance on edge devices with limited resources. It pursues two intertwined goals: accelerating on-device execution and efficiently handling inputs that surpass the on-device model's capacity. While these objectives are well-established, we introduce innovative solutions that specifically address SU's distinctive challenges: 1. Late contextualization: Enables the parallel execution of a model's attentive encoder during input ingestion. 2. Pilot decoding: Alleviates temporal load imbalances. 3. Autoregression offramps: Facilitate offloading decisions based on partial output sequences. Our techniques seamlessly integrate with existing SU models, pipelines, and frameworks, allowing for independent or combined application. Together, they constitute a hybrid solution for edge SU, exemplified by our prototype, XYZ. Evaluated on platforms equipped with 6-8 Arm cores, our system achieves State-of-the-Art (SOTA) accuracy, reducing end-to-end latency by 2x and halving offloading requirements.

Current Topological and Machine Learning Applications for Bias Detection in Text

  • paper_url: http://arxiv.org/abs/2311.13495
  • repo_url: None
  • paper_authors: Colleen Farrelly, Yashbir Singh, Quincy A. Hathaway, Gunnar Carlsson, Ashok Choudhary, Rahul Paul, Gianfranco Doretto, Yassine Himeur, Shadi Atalls, Wathiq Mansoor
  • for: Investigating how large language model embeddings and geometric models of biased text affect bias-modeling accuracy.
  • methods: Analyzes textual bias with the RedditBias dataset, comparing four transformer models, including BERT and RoBERTa variants. Post-embedding, t-SNE provides two-dimensional visualization, and KNN classifiers differentiate bias types (see the sketch below).
  • results: BERT, particularly mini BERT, excels at bias classification, while multilingual models lag. The authors recommend refining monolingual models and exploring domain-specific biases.
    Abstract Institutional bias can impact patient outcomes, educational attainment, and legal system navigation. Written records often reflect bias, and once bias is identified; it is possible to refer individuals for training to reduce bias. Many machine learning tools exist to explore text data and create predictive models that can search written records to identify real-time bias. However, few previous studies investigate large language model embeddings and geometric models of biased text data to understand geometry's impact on bias modeling accuracy. To overcome this issue, this study utilizes the RedditBias database to analyze textual biases. Four transformer models, including BERT and RoBERTa variants, were explored. Post-embedding, t-SNE allowed two-dimensional visualization of data. KNN classifiers differentiated bias types, with lower k-values proving more effective. Findings suggest BERT, particularly mini BERT, excels in bias classification, while multilingual models lag. The recommendation emphasizes refining monolingual models and exploring domain-specific biases.
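
A minimal sketch of the evaluation recipe described above: embed text with a BERT-family model, project with t-SNE for inspection, and separate bias types with a low-k KNN classifier. The data here is a synthetic stand-in, not RedditBias.

```python
# t-SNE visualization plus KNN classification over (stand-in) embeddings.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))          # stand-in for BERT embeddings
y = rng.integers(0, 2, size=200)         # stand-in bias-type labels

X2d = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(X2d.shape)                         # (200, 2), ready for plotting

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)  # low k proved more effective
print(knn.fit(Xtr, ytr).score(Xte, yte))
```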

Machine Translation to Control Formality Features in the Target Language

  • paper_url: http://arxiv.org/abs/2311.13475
  • repo_url: None
  • paper_authors: Harshita Tyagi, Prashasta Jung, Hyowon Lee
  • for: Addressing the loss of formality information when translating from English, which does not mark formality, into languages that do, using Hindi as the example.
  • methods: Automated annotation techniques enlarge the training data, and transformer models, proven effective across NLP tasks, are trained in a formality-controlled setting; formality accuracy (ACC) is measured by comparing predicted masked tokens against the ground truth (see the sketch below).
  • results: The formality-controlled model better respects formality in the target language, yielding a more versatile translation strategy for diverse language communication needs and scenarios.
    Abstract Formality plays a significant role in language communication, especially in low-resource languages such as Hindi, Japanese and Korean. These languages utilise formal and informal expressions to convey messages based on social contexts and relationships. When a language translation technique is used to translate from a source language that does not pertain the formality (e.g. English) to a target language that does, there is a missing information on formality that could be a challenge in producing an accurate outcome. This research explores how this issue should be resolved when machine learning methods are used to translate from English to languages with formality, using Hindi as the example data. This was done by training a bilingual model in a formality-controlled setting and comparing its performance with a pre-trained multilingual model in a similar setting. Since there are not a lot of training data with ground truth, automated annotation techniques were employed to increase the data size. The primary modeling approach involved leveraging transformer models, which have demonstrated effectiveness in various natural language processing tasks. We evaluate the official formality accuracy(ACC) by comparing the predicted masked tokens with the ground truth. This metric provides a quantitative measure of how well the translations align with the desired outputs. Our study showcases a versatile translation strategy that considers the nuances of formality in the target language, catering to diverse language communication needs and scenarios.
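
A minimal sketch of the formality accuracy (ACC) metric as described: compare predicted masked tokens against ground-truth formality markers. The Hindi tokens below are illustrative only.

```python
# Token-level formality accuracy: fraction of masked positions where
# the predicted token matches the ground truth.
def formality_acc(predicted_tokens, gold_tokens):
    assert len(predicted_tokens) == len(gold_tokens)
    hits = sum(p == g for p, g in zip(predicted_tokens, gold_tokens))
    return hits / len(gold_tokens)

print(formality_acc(["aap", "hain"], ["aap", "ho"]))  # 0.5
```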

Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-based Retrofitting

  • paper_url: http://arxiv.org/abs/2311.13314
  • repo_url: None
  • paper_authors: Xinyan Guan, Yanjiang Liu, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun
  • for: Mitigating the hallucination of large language models (LLMs) during the reasoning process.
  • methods: Knowledge Graph-based Retrofitting (KGR), a framework that incorporates LLMs with knowledge graphs (KGs) to retrofit the initial draft responses of LLMs based on factual knowledge stored in KGs (see the sketch below).
  • results: Significant improvement in the performance of LLMs on factual question answering benchmarks, especially for complex reasoning processes, demonstrating the effectiveness of KGR in mitigating hallucination and enhancing the reliability of LLMs.
    Abstract Incorporating factual knowledge in knowledge graph is regarded as a promising approach for mitigating the hallucination of large language models (LLMs). Existing methods usually only use the user's input to query the knowledge graph, thus failing to address the factual hallucination generated by LLMs during its reasoning process. To address this problem, this paper proposes Knowledge Graph-based Retrofitting (KGR), a new framework that incorporates LLMs with KGs to mitigate factual hallucination during the reasoning process by retrofitting the initial draft responses of LLMs based on the factual knowledge stored in KGs. Specifically, KGR leverages LLMs to extract, select, validate, and retrofit factual statements within the model-generated responses, which enables an autonomous knowledge verifying and refining procedure without any additional manual efforts. Experiments show that KGR can significantly improve the performance of LLMs on factual QA benchmarks especially when involving complex reasoning processes, which demonstrates the necessity and effectiveness of KGR in mitigating hallucination and enhancing the reliability of LLMs.
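
A minimal sketch of the retrofit loop (extract claims, validate against the KG, rewrite); `llm_*` and `kg_lookup` are hypothetical stand-ins for the paper's LLM prompts and KG queries.

```python
# Extract claims from a draft, check each against KG facts, and rewrite
# the draft if any claim is contradicted.
def kgr(draft, llm_extract, kg_lookup, llm_judge, llm_rewrite):
    corrections = []
    for claim in llm_extract(draft):               # extract factual claims
        facts = kg_lookup(claim)                   # query the knowledge graph
        if facts and not llm_judge(claim, facts):  # validate against facts
            corrections.append((claim, facts))
    return llm_rewrite(draft, corrections) if corrections else draft

# Toy stubs that flag and fix one wrong claim.
draft = "Paris is the capital of Germany."
fixed = kgr(
    draft,
    llm_extract=lambda d: [d],
    kg_lookup=lambda c: {"Paris": "capital_of France"},
    llm_judge=lambda c, f: "France" in c,
    llm_rewrite=lambda d, cs: "Paris is the capital of France.",
)
print(fixed)
```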

Rethinking Radiology Report Generation via Causal Reasoning and Counterfactual Augmentation

  • paper_url: http://arxiv.org/abs/2311.13307
  • repo_url: None
  • paper_authors: Xiao Song, Jiafan Liu, Yun Li, Wenbin Lei, Ruxin Wang
  • for: Making radiology report generation (RRG) more precise and reliable by removing the spurious confounder of disease co-occurrence.
  • methods: A counterfactual augmentation strategy with two sub-methods, Counterfactual Sample Synthesis and Counterfactual Report Reconstruction, which break the two aspects of spurious effects: Joint Vision Coupling and Conditional Sentence Coherence Coupling.
  • results: Experimental results and further analyses on two widely used datasets justify the reasoning and show that the method reduces errors in RRG, improving report accuracy and reliability.
    Abstract Radiology Report Generation (RRG) draws attention as an interaction between vision and language fields. Previous works inherited the ideology of vision-to-language generation tasks,aiming to generate paragraphs with high consistency as reports. However, one unique characteristic of RRG, the independence between diseases, was neglected, leading to the injection of the spurious confounder, i.e., the disease co-occurrence. Unfortunately, this confounder confuses the process of report generation worse because of the biased RRG data distribution. In this paper, to rethink this issue thoroughly, we reason about its causes and effects from a novel perspective of statistics and causality, where the Joint Vision Coupling and the Conditional Sentence Coherence Coupling are two aspects prone to implicitly decrease the accuracy of reports. Then, a counterfactual augmentation strategy that contains the Counterfactual Sample Synthesis and the Counterfactual Report Reconstruction sub-methods is proposed to break these two aspects of spurious effects. Experimental results and further analyses on two widely used datasets justify our reasoning and proposed methods.

Intention and Context Elicitation with Large Language Models in the Legal Aid Intake Process

  • paper_url: http://arxiv.org/abs/2311.13281
  • repo_url: None
  • paper_authors: Nick Goodson, Rongfei Lu
  • for: Streamlining the legal intake process to reduce workload and costs for legal aid organizations, making legal assistance accessible to a broader audience.
  • methods: Uses large language models (LLMs) and chatbots to automate legal intake, addressing current LLMs' tendency to overconfidently deliver an immediate 'best guess' without grasping the client's actual intentions or the specifics of their legal situation.
  • results: A proof-of-concept elicits and infers clients' underlying intentions and specific legal circumstances through free-form, language-based interaction; proposed future work would use supervised fine-tuning or offline reinforcement learning to incorporate intention and context elicitation in chatbots without explicit prompting.
    Abstract Large Language Models (LLMs) and chatbots show significant promise in streamlining the legal intake process. This advancement can greatly reduce the workload and costs for legal aid organizations, improving availability while making legal assistance more accessible to a broader audience. However, a key challenge with current LLMs is their tendency to overconfidently deliver an immediate 'best guess' to a client's question based on the output distribution learned over the training data. This approach often overlooks the client's actual intentions or the specifics of their legal situation. As a result, clients may not realize the importance of providing essential additional context or expressing their underlying intentions, which are crucial for their legal cases. Traditionally, logic based decision trees have been used to automate intake for specific access to justice issues, such as immigration and eviction. But those solutions lack scalability. We demonstrate a proof-of-concept using LLMs to elicit and infer clients' underlying intentions and specific legal circumstances through free-form, language-based interactions. We also propose future research directions to use supervised fine-tuning or offline reinforcement learning to automatically incorporate intention and context elicitation in chatbots without explicit prompting.

Enhancing Summarization Performance through Transformer-Based Prompt Engineering in Automated Medical Reporting

  • paper_url: http://arxiv.org/abs/2311.13274
  • repo_url: None
  • paper_authors: Daphne van Zandvoort, Laura Wiersema, Tom Huibers, Sandra van Dulmen, Sjaak Brinkkemper
  • for: Improving the quality and relevance of automated medical reports, to reduce the time burden on healthcare professionals.
  • methods: Combines two distinct prompting strategies, shot prompting and pattern prompting, to improve automated medical reporting (a prompt-construction sketch follows the abstract).
  • results: Two-shot prompting combined with scope and domain context outperforms the other methods, achieving the highest ROUGE score and human-evaluation rating against the general practitioner's reference set; however, the automated reports are roughly twice as long as the human references.
    Abstract Customized medical prompts enable Large Language Models (LLM) to effectively address medical dialogue summarization. The process of medical reporting is often time-consuming for healthcare professionals. Implementing medical dialogue summarization techniques presents a viable solution to alleviate this time constraint by generating automated medical reports. The effectiveness of LLMs in this process is significantly influenced by the formulation of the prompt, which plays a crucial role in determining the quality and relevance of the generated reports. In this research, we used a combination of two distinct prompting strategies, known as shot prompting and pattern prompting to enhance the performance of automated medical reporting. The evaluation of the automated medical reports is carried out using the ROUGE score and a human evaluation with the help of an expert panel. The two-shot prompting approach in combination with scope and domain context outperforms other methods and achieves the highest score when compared to the human reference set by a general practitioner. However, the automated reports are approximately twice as long as the human references, due to the addition of both redundant and relevant statements that are added to the report.
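
A minimal sketch of the two-shot plus scope/domain-context prompt pattern; the wording and examples are illustrative, not the authors' templates.

```python
# Build a two-shot prompt with a scope/domain header for report generation.
def build_prompt(examples, transcript):
    header = ("You are a medical scribe. Summarize the GP consultation "
              "transcript into a structured medical report.\n\n")
    shots = "\n\n".join(
        f"Transcript:\n{t}\nReport:\n{r}" for t, r in examples
    )
    return f"{header}{shots}\n\nTranscript:\n{transcript}\nReport:\n"

examples = [
    ("Patient reports ear pain for two days...", "Otitis media, left ear..."),
    ("Patient reports sore throat and fever...", "Acute pharyngitis..."),
]
print(build_prompt(examples, "Patient reports fever and ear discharge..."))
```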

Comparative Experimentation of Accuracy Metrics in Automated Medical Reporting: The Case of Otitis Consultations

  • paper_url: http://arxiv.org/abs/2311.13273
  • repo_url: None
  • paper_authors: Wouter Faber, Renske Eline Bootsma, Tom Huibers, Sandra van Dulmen, Sjaak Brinkkemper
  • for: Reducing the administrative burden on healthcare professionals by automatically generating medical reports with generative AI.
  • methods: A comparative experiment of 10 accuracy metrics on AI-generated medical reports against the corresponding GP reports for otitis consultations, correlating the number of missing, incorrect, and additional statements with metric scores; a Composite Accuracy Score is introduced to compare metrics with a single number.
  • results: Based on the correlation study and the Composite Accuracy Score, ROUGE-L and Word Mover's Distance are the preferred metrics, which is not in line with previous work (a ROUGE-L scoring sketch follows the abstract). These findings help establish the accuracy of AI-generated medical reports, aiding systems that generate reports for GPs.
    Abstract Generative Artificial Intelligence (AI) can be used to automatically generate medical reports based on transcripts of medical consultations. The aim is to reduce the administrative burden that healthcare professionals face. The accuracy of the generated reports needs to be established to ensure their correctness and usefulness. There are several metrics for measuring the accuracy of AI generated reports, but little work has been done towards the application of these metrics in medical reporting. A comparative experimentation of 10 accuracy metrics has been performed on AI generated medical reports against their corresponding General Practitioner's (GP) medical reports concerning Otitis consultations. The number of missing, incorrect, and additional statements of the generated reports have been correlated with the metric scores. In addition, we introduce and define a Composite Accuracy Score which produces a single score for comparing the metrics within the field of automated medical reporting. Findings show that based on the correlation study and the Composite Accuracy Score, the ROUGE-L and Word Mover's Distance metrics are the preferred metrics, which is not in line with previous work. These findings help determine the accuracy of an AI generated medical report, which aids the development of systems that generate medical reports for GPs to reduce the administrative burden.
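
A minimal sketch of scoring a generated report against a GP reference with ROUGE-L, one of the two preferred metrics (requires `pip install rouge-score`); the sentences are illustrative.

```python
# ROUGE-L F-measure between a reference report and a generated one.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "Patient presents with acute otitis media in the left ear."
generated = "The patient has acute otitis media affecting the left ear."
scores = scorer.score(reference, generated)
print(scores["rougeL"].fmeasure)
```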

ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation

  • paper_url: http://arxiv.org/abs/2311.13258
  • repo_url: https://github.com/yangyi-chen/vi-struct
  • paper_authors: Yangyi Chen, Xingyao Wang, Manling Li, Derek Hoiem, Heng Ji
  • for: Improving visual structural knowledge extraction, particularly relations between objects.
  • methods: Two novel designs: first, the inherent structure of programming language is used to represent visual structural information explicitly and consistently at multiple granularities (concepts, relations, events); second, curriculum-based learning guides the model from fundamental visual concepts to intricate event structures, on the intuition that lower-level knowledge aids complex structure understanding.
  • results: ViStruct performs strongly on visual structure prediction tasks, demonstrating its effectiveness in improving the understanding of visual structures.
    Abstract State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction, such as relations between objects. In this work, we present ViStruct, a training framework to learn VLMs for effective visual structural knowledge extraction. Two novel designs are incorporated. First, we propose to leverage the inherent structure of programming language to depict visual structural information. This approach enables explicit and consistent representation of visual structural information of multiple granularities, such as concepts, relations, and events, in a well-organized structured format. Second, we introduce curriculum-based learning for VLMs to progressively comprehend visual structures, from fundamental visual concepts to intricate event structures. Our intuition is that lower-level knowledge may contribute to complex visual structure understanding. Furthermore, we compile and release a collection of datasets tailored for visual structural knowledge extraction. We adopt a weakly-supervised approach to directly generate visual event structures from captions for ViStruct training, capitalizing on abundant image-caption pairs from the web. In experiments, we evaluate ViStruct on visual structure prediction tasks, demonstrating its effectiveness in improving the understanding of visual structures. The code is public at \url{https://github.com/Yangyi-Chen/vi-struct}.

Automatic Instruction Optimization for Open-source LLM Instruction Tuning

  • paper_url: http://arxiv.org/abs/2311.13246
  • repo_url: https://github.com/lunyiliu/coachlm
  • paper_authors: Yilun Liu, Shimin Tao, Xiaofeng Zhao, Ming Zhu, Wenbing Ma, Junhao Zhu, Chang Su, Yutai Hou, Miao Zhang, Min Zhang, Hongxia Ma, Li Zhang, Hao Yang, Yanfei Jiang
  • for: Improving language models' (LLMs') ability to respond to human instructions.
  • methods: Rather than discarding low-quality samples from automatically generated instruction datasets, CoachLM enhances dataset quality through automatic revision of samples; it is trained on samples revised by human experts.
  • results: CoachLM raises the proportion of high-quality samples in the dataset from 17.7% to 78.9% and improves the instruction-following capability of the instruction-tuned LLM by an average of 29.9%, surpassing larger LLMs with nearly twice the parameters; deployed in a data management system at Huawei, it improved cleaning efficiency on 40k real-world instruction pairs by up to 20%.
    Abstract Instruction tuning is crucial for enabling Language Learning Models (LLMs) in responding to human instructions. The quality of instruction pairs used for tuning greatly affects the performance of LLMs. However, the manual creation of high-quality instruction datasets is costly, leading to the adoption of automatic generation of instruction pairs by LLMs as a popular alternative in the training of open-source LLMs. To ensure the high quality of LLM-generated instruction datasets, several approaches have been proposed. Nevertheless, existing methods either compromise dataset integrity by filtering a large proportion of samples, or are unsuitable for industrial applications. In this paper, instead of discarding low-quality samples, we propose CoachLM, a novel approach to enhance the quality of instruction datasets through automatic revisions on samples in the dataset. CoachLM is trained from the samples revised by human experts and significantly increases the proportion of high-quality samples in the dataset from 17.7% to 78.9%. The effectiveness of CoachLM is further assessed on various real-world instruction test sets. The results show that CoachLM improves the instruction-following capabilities of the instruction-tuned LLM by an average of 29.9%, which even surpasses larger LLMs with nearly twice the number of parameters. Furthermore, CoachLM is successfully deployed in a data management system for LLMs at Huawei, resulting in an efficiency improvement of up to 20% in the cleaning of 40k real-world instruction pairs. We release the training data and code of CoachLM (https://github.com/lunyiliu/CoachLM).

AutoKG: Efficient Automated Knowledge Graph Generation for Language Models

  • paper_url: http://arxiv.org/abs/2311.14740
  • repo_url: https://github.com/wispcarey/autokg
  • paper_authors: Bohan Chen, Andrea L. Bertozzi
  • for: Improving how large language models (LLMs) are linked to knowledge bases, making LLM outputs more relevant and insightful.
  • methods: AutoKG, a lightweight, efficient approach for automated knowledge graph construction: an LLM extracts keywords from the text blocks of a knowledge base, and graph Laplace learning evaluates the relationship weight between each pair of keywords (the graph Laplacian behind this step is sketched below); a hybrid search then combines vector similarity with graph-based associations to enrich LLM responses.
  • results: Compared to semantic similarity search, AutoKG captures complex relational dynamics better and offers a more comprehensive, interconnected knowledge retrieval mechanism.
    Abstract Traditional methods of linking large language models (LLMs) to knowledge bases via the semantic similarity search often fall short of capturing complex relational dynamics. To address these limitations, we introduce AutoKG, a lightweight and efficient approach for automated knowledge graph (KG) construction. For a given knowledge base consisting of text blocks, AutoKG first extracts keywords using a LLM and then evaluates the relationship weight between each pair of keywords using graph Laplace learning. We employ a hybrid search scheme combining vector similarity and graph-based associations to enrich LLM responses. Preliminary experiments demonstrate that AutoKG offers a more comprehensive and interconnected knowledge retrieval mechanism compared to the semantic similarity search, thereby enhancing the capabilities of LLMs in generating more insightful and relevant outputs.
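
To ground the graph Laplace learning step, here is a minimal sketch that builds a keyword graph from embedding similarities and forms its graph Laplacian, the basic object such methods operate on; the weighting scheme is an illustrative assumption.

```python
# Unnormalized graph Laplacian L = D - W over a keyword similarity graph.
import numpy as np

def keyword_laplacian(embeddings):
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    W = np.clip(E @ E.T, 0.0, None)   # nonnegative similarity weights
    np.fill_diagonal(W, 0.0)          # no self-loops
    D = np.diag(W.sum(axis=1))        # degree matrix
    return D - W

rng = np.random.default_rng(0)
L = keyword_laplacian(rng.normal(size=(6, 32)))
print(np.allclose(L.sum(axis=1), 0))  # Laplacian rows sum to zero: True
```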

On the Calibration of Large Language Models and Alignment

  • paper_url: http://arxiv.org/abs/2311.13240
  • repo_url: None
  • paper_authors: Chiwei Zhu, Benfeng Xu, Quan Wang, Yongdong Zhang, Zhendong Mao
  • for: A systematic analysis of model calibration throughout the alignment process of large language models.
  • methods: Examines calibration across the entire construction process, including pretraining and alignment training, studying how training settings such as parameter scale and training data affect calibration at each stage.
  • results: A comprehensive evaluation of calibration on three key aspects: generation, factuality, and understanding, shedding light on whether popular LLMs are well calibrated and how the training process influences calibration (an ECE sketch follows the abstract).
    Abstract As large language models attract increasing attention and find widespread application, concurrent challenges of reliability also arise at the same time. Confidence calibration, an effective analysis method for gauging the reliability of deep models, serves as a crucial tool for assessing and improving their reliability. However, such investigation has been comparatively underexplored. In this work, we conduct a systematic examination of the calibration of aligned language models throughout the entire construction process, including pretraining and alignment training. At each stage, we investigate how different training settings, such as parameter scales and training data, affect model calibration. To thoroughly assess model calibration, we evaluate models on three most concerned aspects: generation, factuality and understanding. Our work sheds light on whether popular LLMs are well-calibrated and how the training process influences model calibration.
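
For reference, a minimal expected calibration error (ECE) computation, the standard quantity behind calibration analyses like this one; the confidences below are toy values.

```python
# ECE: bin predictions by confidence and average the |accuracy - confidence|
# gap, weighted by bin size.
import numpy as np

def ece(confidences, correct, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = len(confidences)
    err = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            err += mask.sum() / total * gap
    return err

conf = np.array([0.9, 0.8, 0.6, 0.95])
acc = np.array([1, 1, 0, 0])          # toy correctness indicators
print(ece(conf, acc))
```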

AS-LLM: When Algorithm Selection Meets Large Language Model

  • paper_url: http://arxiv.org/abs/2311.13184
  • repo_url: None
  • paper_authors: Xingyu Wu, Yan Zhong, Jibin Wu, Kay Chen Tan
  • for: Addressing algorithm selection in automated machine learning (AutoML): identifying the most suitable algorithm for a given problem before execution.
  • methods: Integrates algorithm representations into the selection process: distinct modules extract representations of problems and of algorithms, with the algorithm side leveraging pre-trained LLMs' code-comprehension abilities; the most suitable algorithm is then determined by computing matching degrees between embeddings (see the sketch below).
  • results: Experiments validate the model's effectiveness and showcase the performance of different embedded pre-trained LLMs, suggesting the framework could serve as a baseline task for evaluating the code-representation capabilities of LLMs.
    Abstract Algorithm selection aims to identify the most suitable algorithm for solving a specific problem before execution, which has become a critical process of the AutoML. Current mainstream algorithm selection techniques rely heavily on feature representations of various problems and employ the performance of each algorithm as supervised information. However, there is a significant research gap concerning the consideration of algorithm features. This gap is primarily attributed to the inherent complexity of algorithms, making it particularly challenging to find a universally effective feature extraction method that is applicable across a diverse range of algorithms. Unfortunately, neglecting this aspect undoubtedly impacts the accuracy of algorithm selection and indirectly necessitates an increased volume of problem data for training purposes. This paper takes a significant stride towards addressing this gap by proposing an approach that integrates algorithm representation into the algorithm selection process. Specifically, our proposed model employs distinct modules to extract representations of both problems and algorithms, where the algorithm representation leverages the capabilities of pre-trained LLMs in the realm of code comprehension. Following the extraction of embedding vectors for both algorithms and problems, the most suitable algorithm is determined through calculations of matching degrees. Our experiments not only validate the effectiveness of the proposed model but also showcase the performance of different embedded pre-trained LLMs, which suggests that the proposed algorithm selection framework holds the potential to serve as a baseline task for evaluating the code representation capabilities of LLMs.
    摘要 This paper takes a significant stride towards addressing this gap by proposing an approach that integrates algorithm representation into the algorithm selection process. Specifically, our proposed model employs distinct modules to extract representations of both problems and algorithms, where the algorithm representation leverages the capabilities of pre-trained LLMs in the realm of code comprehension. Following the extraction of embedding vectors for both algorithms and problems, the most suitable algorithm is determined through calculations of matching degrees.Our experiments not only validate the effectiveness of the proposed model but also showcase the performance of different embedded pre-trained LLMs, which suggests that the proposed algorithm selection framework holds the potential to serve as a baseline task for evaluating the code representation capabilities of LLMs.Translated into Simplified Chinese:算法选择目标是在执行之前确定最适合的算法来解决特定问题,这已成为自动化机器学习(AutoML)的关键过程。当前主流的算法选择技术都是基于问题特征的表示,并利用每个算法的性能作为监督信息。然而,忽略算法特征的研究空白仍然存在,这主要归结于算法的自然复杂性,使得找到一种适用于多种算法的通用有效特征提取方法变得极其困难。忽略这一点会导致算法选择精度下降,并间接需要更多的问题数据进行训练。本文提出了一种缓解这个研究空白的方法,即通过将算法和问题的表示integrated到算法选择过程中。具体来说,我们的提议模型包括了分解问题和算法的特征提取模块,其中算法表示Module leverages预训练的LLMs在代码理解领域的能力。接下来,我们从问题和算法中提取出嵌入向量,然后通过计算匹配度来确定最适合的算法。我们的实验不仅证明了我们的方法的有效性,还展示了不同预训练LLMs的表现,这表明我们的算法选择框架具有评估代码表示能力的基准任务的潜力。

Towards Better Parameter-Efficient Fine-Tuning for Large Language Models: A Position Paper

  • paper_url: http://arxiv.org/abs/2311.13126
  • repo_url: None
  • paper_authors: Chengyu Wang, Junbing Yan, Wei Zhang, Jun Huang
  • for: A position paper on parameter-efficient fine-tuning (PEFT) for large language models (LLMs), motivated by the practicality and scalability limits of LLMs in real-world applications.
  • methods: Surveys the current state of PEFT architectures and learning settings, and discusses combining PEFT with model compression techniques and extending PEFT to multi-modal LLMs.
  • results: Argues that PEFT is a promising direction with open challenges that must be addressed: novel efficient PEFT architectures, PEFT for different learning settings, PEFT combined with model compression, and PEFT for multi-modal LLMs.
    Abstract This paper delves into the pressing need in Parameter-Efficient Fine-Tuning (PEFT) for Large Language Models (LLMs). While LLMs possess remarkable capabilities, their extensive parameter requirements and associated computational demands hinder their practicality and scalability for real-world applications. Our position paper highlights current states and the necessity of further studying into the topic, and recognizes significant challenges and open issues that must be addressed to fully harness the powerful abilities of LLMs. These challenges encompass novel efficient PEFT architectures, PEFT for different learning settings, PEFT combined with model compression techniques, and the exploration of PEFT for multi-modal LLMs. By presenting this position paper, we aim to stimulate further research and foster discussions surrounding more efficient and accessible PEFT for LLMs.

White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

  • paper_url: http://arxiv.org/abs/2311.13110
  • repo_url: None
  • paper_authors: Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D. Haeffele, Yi Ma
  • for: Learning representations that compress and transform the data distribution toward a low-dimensional Gaussian mixture supported on incoherent subspaces.
  • methods: Optimizes a principled measure called sparse rate reduction via alternating optimization, yielding CRATE, a family of white-box, transformer-like deep architectures that are mathematically fully interpretable: the multi-head self-attention operator implements an approximate gradient descent step on the coding rate of the features, and the subsequent MLP sparsifies them (the objective is written out schematically below).
  • results: Despite their simplicity, CRATE networks learn to compress and sparsify representations of large-scale real-world image and text datasets, achieving performance close to highly engineered transformer-based models such as ViT, MAE, DINO, BERT, and GPT2.
    Abstract In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression. Code is available at: https://ma-lab-berkeley.github.io/CRATE .
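
For orientation, here is a schematic of the sparse rate reduction objective described in the abstract; the exact constants and subspace parameterization follow the authors' rate-reduction line of work and are simplified here.

```latex
% Schematic sparse rate reduction: maximize the coding-rate gain of the
% representation Z against subspaces U_{[K]} while encouraging sparsity.
\max_{f}\; \Delta R(Z) - \lambda \lVert Z \rVert_0,
\qquad \Delta R(Z) = R(Z) - R^{c}\!\left(Z;\, U_{[K]}\right),
\qquad R(Z) = \tfrac{1}{2}\log\det\!\left(I + \tfrac{d}{n\epsilon^{2}}\, Z Z^{\top}\right).
```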

Perceptual Structure in the Absence of Grounding for LLMs: The Impact of Abstractedness and Subjectivity in Color Language

  • paper_url: http://arxiv.org/abs/2311.13105
  • repo_url: None
  • paper_authors: Pablo Loyola, Edison Marrese-Taylor, Andres Hoyos-Idobro
  • for: Investigating the problem of grounding in language understanding, using color perception and color language as a test bed.
  • methods: Collects a large-scale source of colors and their descriptions, containing almost 1 million examples, and empirically compares two kinds of alignment: inter-space, by learning a mapping between embedding space and color space (a probe sketch follows the abstract), and intra-space, by prompting comparatives between color descriptions.
  • results: Color space alignment holds for monolexemic, highly pragmatic color descriptions, but drops considerably for examples with elements of real linguistic usage such as subjectivity and abstractedness, suggesting that grounding may be required in such cases.
    Abstract The need for grounding in language understanding is an active research topic. Previous work has suggested that color perception and color language appear as a suitable test bed to empirically study the problem, given its cognitive significance and showing that there is considerable alignment between a defined color space and the feature space defined by a language model. To further study this issue, we collect a large scale source of colors and their descriptions, containing almost a 1 million examples , and perform an empirical analysis to compare two kinds of alignments: (i) inter-space, by learning a mapping between embedding space and color space, and (ii) intra-space, by means of prompting comparatives between color descriptions. Our results show that while color space alignment holds for monolexemic, highly pragmatic color descriptions, this alignment drops considerably in the presence of examples that exhibit elements of real linguistic usage such as subjectivity and abstractedness, suggesting that grounding may be required in such cases.
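
A minimal sketch of the inter-space alignment probe: fit a linear map from embedding space to a 3-D color space and measure held-out fit. The data below is a synthetic stand-in for real (embedding, color) pairs.

```python
# Ridge-regression probe from text embeddings to a 3-D color space.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
E = rng.normal(size=(500, 384))           # description embeddings (toy)
C = E[:, :3] @ rng.normal(size=(3, 3))    # toy 3-D color targets

Etr, Ete, Ctr, Cte = train_test_split(E, C, random_state=0)
probe = Ridge(alpha=1.0).fit(Etr, Ctr)
print(probe.score(Ete, Cte))              # R^2 of the alignment probe
```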

Detecting out-of-distribution text using topological features of transformer-based language models

  • paper_url: http://arxiv.org/abs/2311.13102
  • repo_url: https://github.com/andrespollano/neural_nets-tda
  • paper_authors: Andres Pollano, Anupam Chaudhuri, Anj Simmons
  • for: Detecting out-of-distribution (OOD) text samples for transformer-based language models.
  • methods: Applies Topological Data Analysis (TDA) to the attention maps of a transformer language model (a persistence sketch follows the abstract), and compares against a traditional OOD approach based on BERT CLS embeddings.
  • results: On BERT, the TDA approach outperforms the CLS-embedding baseline at distinguishing in-distribution data (HuffPost politics and entertainment articles) from far out-of-domain samples (IMDB reviews), but its effectiveness deteriorates on near out-of-domain (CNN/Dailymail) and same-domain (HuffPost business articles) datasets.
    Abstract We attempt to detect out-of-distribution (OOD) text samples though applying Topological Data Analysis (TDA) to attention maps in transformer-based language models. We evaluate our proposed TDA-based approach for out-of-distribution detection on BERT, a transformer-based language model, and compare the to a more traditional OOD approach based on BERT CLS embeddings. We found that our TDA approach outperforms the CLS embedding approach at distinguishing in-distribution data (politics and entertainment news articles from HuffPost) from far out-of-domain samples (IMDB reviews), but its effectiveness deteriorates with near out-of-domain (CNN/Dailymail) or same-domain (business news articles from HuffPost) datasets.
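
A minimal sketch of extracting topological features from an attention map: treat a symmetrized attention matrix as a distance matrix and compute persistence diagrams (requires `pip install ripser`); the conversion from attention to distance is an illustrative assumption.

```python
# Persistence diagrams (H0, H1) from a stand-in attention head.
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
A = rng.random((16, 16))            # stand-in for one attention head
A = (A + A.T) / 2                   # symmetrize
D = 1.0 - A / A.max()               # high attention -> small distance
np.fill_diagonal(D, 0.0)

diagrams = ripser(D, distance_matrix=True, maxdim=1)["dgms"]
print([d.shape for d in diagrams])  # H0 and H1 persistence pairs
```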