methods: The paper uses a "learn your tokens" approach that pools bytes/characters into word representations, which are then fed to the primary language model for decoding.
results: Compared with byte/character-level and subword models, the paper's moderately expressive, moderately fast combined tokenizer excels at next-word prediction, with an especially large gain (a factor of 30) on rare words.
Abstract
Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the limitations of such a tokenization strategy, particularly for documents not written in English and for representing numbers. On the other extreme, byte/character-level language models are much less restricted but suffer from increased sequence description lengths and a subsequent quadratic expansion in self-attention computation. Recent attempts to compress and limit these context lengths with fixed-size convolutions are helpful but completely ignore the word boundary. This paper considers an alternative 'learn your tokens' scheme which utilizes the word boundary to pool bytes/characters into word representations, which are fed to the primary language model, before again decoding individual characters/bytes per word in parallel. We find that our moderately expressive and moderately fast end-to-end tokenizer outperforms both subwords and byte/character models by over 300% on the intrinsic language modeling metric of next-word prediction across datasets. It particularly outshines on rare words, outperforming by a factor of 30! We extensively study the language modeling setup for all three categories of tokenizers and theoretically analyze how our end-to-end models can also be a strong trade-off in efficiency and robustness.
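Below is a minimal sketch of the word-boundary pooling idea in PyTorch, purely illustrative: bytes are embedded, grouped at whitespace boundaries, and mean-pooled into one vector per word. The module name and pooling choice are assumptions; the paper's actual pooling network and model wiring may differ.

```python
# Minimal sketch of word-boundary pooling (illustrative, not the paper's exact model).
import torch
import torch.nn as nn

class WordPooler(nn.Module):
    def __init__(self, n_bytes=256, d_model=512):
        super().__init__()
        self.byte_emb = nn.Embedding(n_bytes, d_model)

    def forward(self, byte_ids: torch.Tensor, word_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (seq,) byte values; word_ids: (seq,) index of the word each byte belongs to
        x = self.byte_emb(byte_ids)                        # (seq, d_model)
        n_words = int(word_ids.max().item()) + 1
        pooled = torch.zeros(n_words, x.size(-1))
        pooled.index_add_(0, word_ids, x)                  # sum byte embeddings per word
        counts = torch.bincount(word_ids, minlength=n_words).clamp(min=1)
        return pooled / counts.unsqueeze(-1)               # mean-pool per word

text = b"learn your tokens"
byte_ids = torch.tensor(list(text))
word_ids, w = [], 0                                        # word index per byte
for b in text:
    word_ids.append(w)
    if b == ord(" "):
        w += 1
word_reprs = WordPooler()(byte_ids, torch.tensor(word_ids))  # (3, 512): one vector per word
```

The pooled sequence (one vector per word) is what the primary language model would consume before per-word byte decoding.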
An Optimistic-Robust Approach for Dynamic Positioning of Omnichannel Inventories
results: Experiments show that the BIO stocking strategy achieves over a 15% profitability gain for a large real-world American omnichannel retailer while preserving practical worst-case performance.
Abstract
We introduce a new class of data-driven and distribution-free optimistic-robust bimodal inventory optimization (BIO) strategy to effectively allocate inventory across a retail chain to meet time-varying, uncertain omnichannel demand. While prior robust optimization (RO) methods emphasize the downside, i.e., worst-case adversarial demand, BIO also considers the upside to remain resilient like RO while also reaping the rewards of improved average-case performance by overcoming the presence of endogenous outliers. This bimodal strategy is particularly valuable for balancing the tradeoff between lost sales at the store and the costs of cross-channel e-commerce fulfillment, which is at the core of our inventory optimization model. These factors are asymmetric due to the heterogeneous behavior of the channels, with a bias towards the former in terms of lost-sales cost and a dependence on network effects for the latter. We provide structural insights about the BIO solution and how it can be tuned to achieve a preferred tradeoff between robustness and the average case. Our experiments show that significant benefits can be achieved by rethinking traditional approaches to inventory management, which are siloed by channel and location. Using a real-world dataset from a large American omnichannel retail chain, a business value assessment during a peak period indicates over a 15% profitability gain for BIO over RO and other baselines while also preserving the (practical) worst-case performance.
Unveiling the General Intelligence Factor in Language Models: A Psychometric Approach
results: The study discovers a g factor in language models, providing a unified evaluation metric and opening new avenues for more robust, g-based ability assessment. These findings lay a foundation for applying psychometric theory to artificial general intelligence and have practical implications for model evaluation and development.
Abstract
This study uncovers the factor of general intelligence, or g, in language models, extending the psychometric theory traditionally applied to humans and certain animal species. Utilizing factor analysis on two extensive datasets - Open LLM Leaderboard with 1,232 models and General Language Understanding Evaluation (GLUE) Leaderboard with 88 models - we find compelling evidence for a unidimensional, highly stable g factor that accounts for 85% of the variance in model performance. The study also finds a moderate correlation of .48 between model size and g. The discovery of g in language models offers a unified metric for model evaluation and opens new avenues for more robust, g-based model ability assessment. These findings lay the foundation for understanding and future research on artificial general intelligence from a psychometric perspective and have practical implications for model evaluation and development.
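As a concrete illustration of the analysis, the following sketch fits a one-factor model to a toy models-by-benchmarks score matrix and reports the variance the factor explains; the data here are synthetic stand-ins, not the leaderboard scores used in the study.

```python
# One-factor analysis on a synthetic models-by-benchmarks score matrix.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
g = rng.normal(size=(100, 1))                       # latent ability of 100 models
loadings = rng.uniform(0.7, 1.0, size=(1, 6))       # 6 benchmark tasks
Z = scale(g @ loadings + 0.3 * rng.normal(size=(100, 6)))  # standardized scores

fa = FactorAnalysis(n_components=1).fit(Z)
load = fa.components_.ravel()                        # factor loadings per benchmark
print(f"variance explained by one factor: {(load ** 2).sum() / Z.shape[1]:.0%}")
# factor sign is arbitrary, so compare with the true ability up to sign:
corr = np.corrcoef(fa.transform(Z).ravel(), g.ravel())[0, 1]
print("corr(estimated g, true g):", round(abs(float(corr)), 2))
```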
Learning a Hierarchical Planner from Humans in Multiple Generations
results: Simulated studies and a human experiment (n=360) show that natural programming can robustly compose programs from different users and contexts, adapting faster when contexts change and solving more complex tasks, with a clear advantage over programmatic baselines.
Abstract
A typical way in which a machine acquires knowledge from humans is by programming. Compared to learning from demonstrations or experiences, programmatic learning allows the machine to acquire a novel skill as soon as the program is written, and, by building a library of programs, a machine can quickly learn how to perform complex tasks. However, as programs often take their execution contexts for granted, they are brittle when the contexts change, making it difficult to adapt complex programs to new contexts. We present natural programming, a library learning system that combines programmatic learning with a hierarchical planner. Natural programming maintains a library of decompositions, consisting of a goal, a linguistic description of how this goal decomposes into sub-goals, and a concrete instance of its decomposition into sub-goals. A user teaches the system via curriculum building, by identifying a challenging yet not impossible goal along with linguistic hints on how this goal may be decomposed into sub-goals. The system solves for the goal via hierarchical planning, using the linguistic hints to guide its probability distribution in proposing the right plans. The system learns from this interaction by adding newly found decompositions in the successful search into its library. Simulated studies and a human experiment (n=360) on a controlled environment demonstrate that natural programming can robustly compose programs learned from different users and contexts, adapting faster and solving more complex tasks when compared to programmatic baselines.
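A minimal sketch of the library-of-decompositions data structure and a recursive planner over it might look as follows. The goals, hints, and primitives are illustrative, and the real system additionally guides search with the linguistic hints and learns new decompositions from successful searches.

```python
# Illustrative data structure for a library of decompositions and a toy planner.
from dataclasses import dataclass

@dataclass
class Decomposition:
    goal: str
    hint: str               # linguistic description of how the goal decomposes
    subgoals: list          # one concrete instance of the decomposition

library = [
    Decomposition("build a house", "lay the base, then the walls, then the roof",
                  ["build foundation", "build walls", "build roof"]),
    Decomposition("build walls", "place wall segments on each side",
                  ["place north wall", "place south wall"]),
]

def plan(goal, primitives, depth=0):
    """Hierarchically expand a goal into primitive actions using the library."""
    if goal in primitives or depth > 10:
        return [goal]
    for d in library:
        if d.goal == goal:
            return [a for sg in d.subgoals for a in plan(sg, primitives, depth + 1)]
    return [goal]   # unsolved goal: the real system searches and adds new entries

print(plan("build a house", {"build foundation", "build roof",
                             "place north wall", "place south wall"}))
```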
Language Models as Zero-Shot Trajectory Generators
methods: The paper uses the GPT-4 language model, given access only to object detection and segmentation vision models for input.
results: A single task-agnostic prompt performs well across 26 real-world language-based tasks, such as "open the bottle cap" and "wipe the plate with the sponge". The study also finds that LLMs do possess sufficient low-level control knowledge to execute many common tasks, and that they can detect failures and re-plan trajectories accordingly.
Abstract
Large Language Models (LLMs) have recently shown promise as high-level planners for robots when given access to a selection of low-level skills. However, it is often assumed that LLMs do not possess sufficient knowledge to be used for the low-level trajectories themselves. In this work, we address this assumption thoroughly, and investigate if an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation skills, when given access to only object detection and segmentation vision models. We study how well a single task-agnostic prompt, without any in-context examples, motion primitives, or external trajectory optimisers, can perform across 26 real-world language-based tasks, such as "open the bottle cap" and "wipe the plate with the sponge", and we investigate which design choices in this prompt are the most effective. Our conclusions raise the assumed limit of LLMs for robotics, and we reveal for the first time that LLMs do indeed possess an understanding of low-level robot control sufficient for a range of common tasks, and that they can additionally detect failures and then re-plan trajectories accordingly. Videos, code, and prompts are available at: https://www.robot-learning.uk/language-models-trajectory-generators.
The Efficacy of Transformer-based Adversarial Attacks in Security Domains
results: Adversarial examples crafted on transformers show the highest transferability to other models (25.7% above average), while examples crafted on other models transfer to transformers at the lowest rate (56.7% below average). These results underscore the importance of studying transformer architectures in security domains and suggest using them as the primary architecture in transfer-attack settings.
Abstract
Today, the security of many domains relies on the use of Machine Learning to detect threats, identify vulnerabilities, and safeguard systems from attacks. Recently, transformer architectures have improved the state-of-the-art performance on a wide range of tasks such as malware detection and network intrusion detection. But before abandoning current approaches in favor of transformers, it is crucial to understand their properties and implications on cybersecurity applications. In this paper, we evaluate the robustness of transformers to adversarial samples for system defenders (i.e., resiliency to adversarial perturbations generated on different types of architectures) and their adversarial strength for system attackers (i.e., transferability of adversarial samples generated by transformers to other target models). To that effect, we first fine-tune a set of pre-trained transformer, Convolutional Neural Network (CNN), and hybrid (an ensemble of transformer and CNN) models to solve different downstream image-based tasks. Then, we use an attack algorithm to craft 19,367 adversarial examples on each model for each task. The transferability of these adversarial examples is measured by evaluating each set on other models to determine which models offer more adversarial strength, and consequently, more robustness against these attacks. We find that the adversarial examples crafted on transformers offer the highest transferability rate (i.e., 25.7% higher than the average) onto other models. Similarly, adversarial examples crafted on other models have the lowest rate of transferability (i.e., 56.7% lower than the average) onto transformers. Our work emphasizes the importance of studying transformer architectures for attacking and defending models in security domains, and suggests using them as the primary architecture in transfer attack settings.
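The transferability measurement described above reduces to crafting adversarial examples on a source model and scoring the error they induce on a target. A hedged sketch using one-step FGSM follows; the paper's actual attack algorithm may differ, and `source`, `target`, and the data batch are assumed to exist.

```python
# Sketch of cross-model transferability measurement (FGSM used for illustration).
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One-step FGSM: perturb inputs along the sign of the loss gradient."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

@torch.no_grad()
def transfer_rate(target, x_adv, y):
    """Fraction of adversarial examples that also fool the target model."""
    return (target(x_adv).argmax(dim=1) != y).float().mean().item()

# usage (assuming trained classifiers `source`, `target` and a batch (x, y)):
# x_adv = fgsm(source, x, y)
# print(f"transferability: {transfer_rate(target, x_adv, y):.1%}")
```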
WaveAttack: Asymmetric Frequency Obfuscation-based Backdoor Attacks Against Deep Neural Networks
results: Compared with common backdoor attack methods, WaveAttack achieves high stealthiness and effectiveness, and improves image fidelity by up to 28.27% in PSNR and 1.61% in SSIM, with a 70.59% reduction in IS.
Abstract
Due to the popularity of Artificial Intelligence (AI) technology, numerous backdoor attacks are designed by adversaries to mislead deep neural network predictions by manipulating training samples and training processes. Although backdoor attacks are effective in various real scenarios, they still suffer from the problems of both low fidelity of poisoned samples and non-negligible transfer in latent space, which make them easily detectable by existing backdoor detection algorithms. To overcome the weakness, this paper proposes a novel frequency-based backdoor attack method named WaveAttack, which obtains image high-frequency features through Discrete Wavelet Transform (DWT) to generate backdoor triggers. Furthermore, we introduce an asymmetric frequency obfuscation method, which can add an adaptive residual in the training and inference stage to improve the impact of triggers and further enhance the effectiveness of WaveAttack. Comprehensive experimental results show that WaveAttack not only achieves higher stealthiness and effectiveness, but also outperforms state-of-the-art (SOTA) backdoor attack methods in the fidelity of images by up to 28.27% improvement in PSNR, 1.61% improvement in SSIM, and 70.59% reduction in IS.
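To make the frequency-domain idea concrete, here is a hedged sketch of injecting a trigger into only the high-frequency DWT sub-band with PyWavelets. The trigger pattern, strength, and single-band choice are illustrative assumptions, not WaveAttack's exact design (which also learns an adaptive residual).

```python
# Sketch of a high-frequency DWT trigger (illustrative, not WaveAttack's design).
import numpy as np
import pywt

def add_hf_trigger(img: np.ndarray, strength: float = 0.05, seed: int = 0) -> np.ndarray:
    """img: 2-D grayscale array in [0, 1]."""
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar")    # low-freq approx + 3 detail bands
    rng = np.random.default_rng(seed)            # fixed seed -> reproducible trigger
    cD = cD + rng.standard_normal(cD.shape) * strength  # perturb diagonal band only
    out = pywt.idwt2((cA, (cH, cV, cD)), "haar")
    return np.clip(out, 0.0, 1.0)

poisoned = add_hf_trigger(np.random.rand(32, 32))
```

Because the perturbation lives in the detail coefficients, the reconstructed image stays visually close to the original, which is the fidelity property the paper emphasizes.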
Adversarial Robustness Unhardening via Backdoor Attacks in Federated Learning
for: This paper focuses on addressing security challenges in federated learning, specifically the issues of poisoning and backdoor attacks, by exploring the intersection of adversarial training and backdoor attacks.
methods: The paper introduces a new attack called Adversarial Robustness Unhardening (ARU) and evaluates its impact on adversarial training and existing robust aggregation defenses against poisoning and backdoor attacks through extensive empirical experiments.
results: The paper finds that ARU can intentionally undermine model robustness during decentralized training, rendering models susceptible to a broader range of evasion attacks, and highlights the limitations of existing defenses against ARU. The findings offer insights into bolstering defenses against ARU and the need for further research in this area.
Abstract
In today's data-driven landscape, the delicate equilibrium between safeguarding user privacy and unleashing data potential stands as a paramount concern. Federated learning, which enables collaborative model training without necessitating data sharing, has emerged as a privacy-centric solution. This decentralized approach brings forth security challenges, notably poisoning and backdoor attacks where malicious entities inject corrupted data. Our research, initially spurred by test-time evasion attacks, investigates the intersection of adversarial training and backdoor attacks within federated learning, introducing Adversarial Robustness Unhardening (ARU). ARU is employed by a subset of adversaries to intentionally undermine model robustness during decentralized training, rendering models susceptible to a broader range of evasion attacks. We present extensive empirical experiments evaluating ARU's impact on adversarial training and existing robust aggregation defenses against poisoning and backdoor attacks. Our findings inform strategies for enhancing ARU to counter current defensive measures and highlight the limitations of existing defenses, offering insights into bolstering defenses against ARU.
Automated Evaluation of Personalized Text Generation using Large Language Models
results: Comparing LLM judgments with human annotations, the study finds that AuPEL evaluates the quality of personalized text generators more accurately than traditional metrics, with good consistency and efficiency.
Abstract
Personalized text generation presents a specialized mechanism for delivering content that is specific to a user's personal context. While the research progress in this area has been rapid, evaluation still presents a challenge. Traditional automated metrics such as BLEU and ROUGE primarily measure lexical similarity to human-written references, and are not able to distinguish personalization from other subtle semantic aspects, thus falling short of capturing the nuances of personalized generated content quality. On the other hand, human judgments are costly to obtain, especially in the realm of personalized evaluation. Inspired by these challenges, we explore the use of large language models (LLMs) for evaluating personalized text generation, and examine their ability to understand nuanced user context. We present AuPEL, a novel evaluation method that distills three major semantic aspects of the generated text: personalization, quality and relevance, and automatically measures these aspects. To validate the effectiveness of AuPEL, we design carefully controlled experiments and compare the accuracy of the evaluation judgments made by LLMs versus that of judgements made by human annotators, and conduct rigorous analyses of the consistency and sensitivity of the proposed metric. We find that, compared to existing evaluation metrics, AuPEL not only distinguishes and ranks models based on their personalization abilities more accurately, but also presents commendable consistency and efficiency for this task. Our work suggests that using LLMs as the evaluators of personalized text generation is superior to traditional text similarity metrics, even though interesting new challenges still remain.
Audio-AdapterFusion: A Task-ID-free Approach for Efficient and Non-Destructive Multi-task Speech Recognition
results: The results show that the three proposed methods can combine single-task adapters for multi-task ASR and improve recognition accuracy. Specifically, while updating only 17% of the model parameters, the methods achieve an 8% mean WER improvement relative to full fine-tuning, and they are non-destructive and parameter-efficient.
Abstract
Adapters are an efficient, composable alternative to full fine-tuning of pre-trained models and help scale the deployment of large ASR models to many tasks. In practice, a task ID is commonly prepended to the input during inference to route to single-task adapters for the specified task. However, one major limitation of this approach is that the task ID may not be known during inference, rendering it unsuitable for most multi-task settings. To address this, we propose three novel task-ID-free methods to combine single-task adapters in multi-task ASR and investigate two learning algorithms for training. We evaluate our methods on 10 test sets from 4 diverse ASR tasks and show that our methods are non-destructive and parameter-efficient. While only updating 17% of the model parameters, our methods can achieve an 8% mean WER improvement relative to full fine-tuning and are on-par with task-ID adapter routing.
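One plausible task-ID-free combination, sketched below in PyTorch, mixes the outputs of frozen single-task adapters with input-dependent router weights so no task ID is needed at inference. This is an assumption-laden illustration of the idea, not necessarily one of the paper's three methods.

```python
# Sketch of task-ID-free mixing of single-task adapters (illustrative design).
import torch
import torch.nn as nn

class AdapterMixture(nn.Module):
    def __init__(self, adapters: nn.ModuleList, d_model: int):
        super().__init__()
        self.adapters = adapters                   # frozen single-task adapters
        self.router = nn.Linear(d_model, len(adapters))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, d_model) hidden states from the shared ASR backbone
        weights = self.router(h).softmax(dim=-1)                   # (B, T, n)
        outs = torch.stack([a(h) for a in self.adapters], dim=-1)  # (B, T, D, n)
        return h + (outs * weights.unsqueeze(2)).sum(dim=-1)       # residual mix

d = 256
adapters = nn.ModuleList(
    nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, d)) for _ in range(4)
)
y = AdapterMixture(adapters, d)(torch.randn(2, 50, d))
```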
paper_authors: Belinda Z. Li, Alex Tamkin, Noah Goodman, Jacob Andreas
for: This study uses LMs to guide the task specification process.
methods: The study proposes the Generative Active Task Elicitation (GATE) learning framework, in which the LM itself conducts free-form, language-based interaction to elicit users' intentions and behavior.
results: In three domains (email validation, content recommendation, and moral reasoning), letting the LM generate questions or synthesize informative edge cases elicits users' intentions and needs better than user-written prompts or labels, and users report less effort and the surfacing of novel considerations.
Abstract
Language models (LMs) can be directed to perform target tasks by using labeled examples or natural language prompts. But selecting examples or writing prompts can be challenging--especially in tasks that involve unusual edge cases, demand precise articulation of nebulous preferences, or require an accurate mental model of LM behavior. We propose to use *LMs themselves* to guide the task specification process. In this paper, we introduce **Generative Active Task Elicitation (GATE)**: a learning framework in which models elicit and infer intended behavior through free-form, language-based interaction with users. We study GATE in three domains: email validation, content recommendation, and moral reasoning. In preregistered experiments, we show that LMs prompted to perform GATE (e.g., by generating open-ended questions or synthesizing informative edge cases) elicit responses that are often more informative than user-written prompts or labels. Users report that interactive task elicitation requires less effort than prompting or example labeling and surfaces novel considerations not initially anticipated by users. Our findings suggest that LM-driven elicitation can be a powerful tool for aligning models to complex human preferences and values.
When Rigidity Hurts: Soft Consistency Regularization for Probabilistic Hierarchical Time Series Forecasting
results: Evaluations across a range of datasets show 41-88% better performance, and forecasting quality remains good even when 10% of the input time-series data is missing, a setting in which other methods degrade by over 70%.
Abstract
Probabilistic hierarchical time-series forecasting is an important variant of time-series forecasting, where the goal is to model and forecast multivariate time-series that have underlying hierarchical relations. Most methods focus on point predictions and do not provide well-calibrated probabilistic forecast distributions. Recent state-of-the-art probabilistic forecasting methods also impose hierarchical relations on point predictions and samples of the distribution, which does not account for the coherency of forecast distributions. Previous works also silently assume that datasets are always consistent with given hierarchical relations and do not adapt to real-world datasets that show deviation from this assumption. We close both these gaps and propose PROFHiT, which is a fully probabilistic hierarchical forecasting model that jointly models the forecast distribution of the entire hierarchy. PROFHiT uses a flexible probabilistic Bayesian approach and introduces a novel Distributional Coherency regularization to learn from hierarchical relations for the entire forecast distribution, which enables robust and calibrated forecasts as well as adaptation to datasets of varying hierarchical consistency. On evaluating PROFHiT over a wide range of datasets, we observed 41-88% better performance in accuracy and significantly better calibration. Due to modeling the coherency over the full distribution, we observed that PROFHiT can robustly provide reliable forecasts even if up to 10% of input time-series data is missing, where other methods' performance severely degrades by over 70%.
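For Gaussian forecasts, one simple form of a distributional coherency penalty is the KL divergence between the parent's forecast distribution and the distribution implied by summing its (assumed independent) children, sketched below; PROFHiT's exact regularizer may differ.

```python
# Sketch of a distributional coherency penalty for Gaussian forecasts.
import torch

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for scalar Gaussians."""
    return 0.5 * (torch.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1)

def coherency_loss(parent_mu, parent_var, child_mu, child_var):
    # child_mu, child_var: (n_children,) — the sum of independent Gaussians
    # is Gaussian with summed means and variances.
    return gaussian_kl(parent_mu, parent_var, child_mu.sum(), child_var.sum())

loss = coherency_loss(torch.tensor(10.0), torch.tensor(2.0),
                      torch.tensor([4.0, 5.5]), torch.tensor([0.8, 0.9]))
```

Adding such a term to the likelihood objective pulls the whole hierarchy's distributions toward consistency without hard-constraining them, which is what allows adaptation to datasets that deviate from the stated hierarchy.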
results: The paper implements a CityGML knowledge-graph (KG) framework that efficiently converts CityGML data into KG form and integrates it with other data sources. This approach supports querying and reasoning over urban geospatial information and can improve the efficiency of urban planning and management.
Abstract
CityGML is a widely adopted standard by the Open Geospatial Consortium (OGC) for representing and exchanging 3D city models. The representation of semantic and topological properties in CityGML makes it possible to query such 3D city data to perform analysis in various applications, e.g., security management and emergency response, energy consumption and estimation, and occupancy measurement. However, the potential of querying CityGML data has not been fully exploited. The official GML/XML encoding of CityGML is only intended as an exchange format but is not suitable for query answering. The most common way of dealing with CityGML data is to store them in the 3DCityDB system as relational tables and then query them with the standard SQL query language. Nevertheless, for end users, it remains a challenging task to formulate queries over 3DCityDB directly for their ad-hoc analytical tasks, because there is a gap between the conceptual semantics of CityGML and the relational schema adopted in 3DCityDB. In fact, the semantics of CityGML itself can be modeled as a suitable ontology. The technology of Knowledge Graphs (KGs), where an ontology is at the core, is a good solution to bridge such a gap. Moreover, embracing KGs makes it easier to integrate with other spatial data sources, e.g., OpenStreetMap and existing (Geo)KGs (e.g., Wikidata, DBPedia, and GeoNames), and to perform queries combining information from multiple data sources. In this work, we describe a CityGML KG framework to populate the concepts in the CityGML ontology using declarative mappings to 3DCityDB, thus exposing the CityGML data therein as a KG. To demonstrate the feasibility of our approach, we use CityGML data from the city of Munich as test data and integrate OpenStreeMap data in the same area.
Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback
results: These results significantly improve over the state of the art: a computationally inefficient algorithm by Kong et al. [2023] with regret $\widetilde{\mathcal{O}}\left(K^{\frac{4}{5}}+\mathrm{poly}\left(\frac{1}{\lambda_{\min}}\right)\right)$, where $\lambda_{\min}$ is a problem-dependent constant that can be arbitrarily close to zero, and a computationally efficient algorithm by Sherman et al. [2023b] with regret $\widetilde{\mathcal{O}}\left(K^{\frac{6}{7}}\right)$.
Abstract
We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback, without prior knowledge on transitions or access to simulators. We introduce two algorithms that achieve improved regret performance compared to existing approaches. The first algorithm, although computationally inefficient, ensures a regret of $\widetilde{\mathcal{O}}\left(\sqrt{K}\right)$, where $K$ is the number of episodes. This is the first result with the optimal $K$ dependence in the considered setting. The second algorithm, which is based on the policy optimization framework, guarantees a regret of $\widetilde{\mathcal{O}}\left(K^{\frac{3}{4}}\right)$ and is computationally efficient. Both our results significantly improve over the state-of-the-art: a computationally inefficient algorithm by Kong et al. [2023] with $\widetilde{\mathcal{O}}\left(K^{\frac{4}{5}}+\mathrm{poly}\left(\frac{1}{\lambda_{\min}}\right)\right)$ regret, for some problem-dependent constant $\lambda_{\min}$ that can be arbitrarily close to zero, and a computationally efficient algorithm by Sherman et al. [2023b] with $\widetilde{\mathcal{O}}\left(K^{\frac{6}{7}}\right)$ regret.
MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning
for: This paper presents a methodology for linguistic feature extraction, in particular automatically syllabifying words in multiple languages into phonetic transcriptions, designed to be compatible with the Montreal Forced Aligner (MFA).
methods: In both the textual and phonetic domains, the method focuses on extracting phonetic transcriptions from text, stress marks, and a unified automatic syllabification (in text and phonetic domains).
results: An ablation study demonstrates the efficacy of the approach for automatically syllabifying words in several languages (English, French, and Spanish). The technique is also applied to the transcriptions of the CMU ARCTIC dataset, generating annotations available online (https://github.com/noetits/MUST_P-SRL) that are valuable for speech representation learning, speech unit discovery, and the disentanglement of speech factors in several speech-related fields.
Abstract
In this paper, we present a methodology for linguistic feature extraction, focusing particularly on automatically syllabifying words in multiple languages, with a design to be compatible with a forced-alignment tool, the Montreal Forced Aligner (MFA). In both the textual and phonetic domains, our method focuses on the extraction of phonetic transcriptions from text, stress marks, and a unified automatic syllabification (in text and phonetic domains). The system was built with open-source components and resources. Through an ablation study, we demonstrate the efficacy of our approach in automatically syllabifying words from several languages (English, French and Spanish). Additionally, we apply the technique to the transcriptions of the CMU ARCTIC dataset, generating valuable annotations available online\footnote{\url{https://github.com/noetits/MUST_P-SRL}} that are ideal for speech representation learning, speech unit discovery, and disentanglement of speech factors in several speech-related fields.
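To illustrate the kind of syllabification involved, here is a toy onset-maximal syllabifier over ARPAbet-style phone sequences; it is not the paper's multi-lingual system, just a minimal sketch of nucleus-based splitting.

```python
# Toy onset-maximal syllabifier for ARPAbet-style phone sequences (illustrative).
VOWELS = {"AA", "AE", "AH", "AO", "EH", "ER", "IH", "IY", "UH", "UW",
          "EY", "AY", "OW", "AW", "OY"}

def syllabify(phones):
    """Split so every consonant run attaches to the following vowel nucleus."""
    nuclei = [i for i, p in enumerate(phones) if p.rstrip("012") in VOWELS]
    if not nuclei:
        return [phones]
    # syllable boundaries fall right after each nucleus except the last one
    bounds = [0] + [i + 1 for i in nuclei[:-1]] + [len(phones)]
    return [phones[a:b] for a, b in zip(bounds, bounds[1:])]

print(syllabify(["P", "R", "OW1", "G", "R", "AE2", "M"]))
# [['P', 'R', 'OW1'], ['G', 'R', 'AE2', 'M']]
```

A production system would instead apply language-specific phonotactics (legal onsets/codas) and keep the text- and phone-level segmentations consistent, which is the unification the paper targets.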
Efficient Online Learning with Offline Datasets for Infinite Horizon MDPs: A Bayesian Approach
results: The paper shows that an agent that models the expert's behavioral policy can do substantially better at minimizing cumulative regret than one that does not, and it establishes an $\tilde{O}(\sqrt{T})$ upper bound on regret.
Abstract
In this paper, we study the problem of efficient online reinforcement learning in the infinite horizon setting when there is an offline dataset to start with. We assume that the offline dataset is generated by an expert but with unknown level of competence, i.e., it is not perfect and not necessarily using the optimal policy. We show that if the learning agent models the behavioral policy (parameterized by a competence parameter) used by the expert, it can do substantially better in terms of minimizing cumulative regret, than if it doesn't do that. We establish an upper bound on regret of the exact informed PSRL algorithm that scales as $\tilde{O}(\sqrt{T})$. This requires a novel prior-dependent regret analysis of Bayesian online learning algorithms for the infinite horizon setting. We then propose an approximate Informed RLSVI algorithm that we can interpret as performing imitation learning with the offline dataset, and then performing online learning.
Group Preference Optimization: Few-Shot Alignment of Large Language Models
results: Evaluations on human opinion adaptation tasks across diverse groups demonstrate GPO's effectiveness and efficiency: it aligns models more accurately while requiring fewer group-specific preferences and less training and inference compute, outperforming existing strategies such as in-context steering and fine-tuning methods.
Abstract
Many applications of large language models (LLMs), ranging from chatbots to creative writing, require nuanced subjective judgments that can differ significantly across different groups. Existing alignment algorithms can be expensive to align for each group, requiring prohibitive amounts of group-specific preference data and computation for real-world use cases. We introduce Group Preference Optimization (GPO), an alignment framework that steers language models to preferences of individual groups in a few-shot manner. In GPO, we augment the base LLM with an independent transformer module trained to predict the preferences of a group for the LLM generations. For few-shot learning, we parameterize this module as an in-context autoregressive transformer and train it via meta-learning on several groups. We empirically validate the efficacy of GPO through rigorous evaluations using LLMs with varied sizes on three human opinion adaptation tasks. These tasks involve adapting to the preferences of US demographic groups, global countries, and individual users. Our results demonstrate that GPO not only aligns models more accurately but also requires fewer group-specific preferences, and less training and inference computing resources, outperforming existing strategies such as in-context steering and fine-tuning methods.
Guarantees for Self-Play in Multiplayer Games via Polymatrix Decomposability
results: The paper shows that in multiplayer games that approximately decompose into two-player constant-sum subgames, strategies learned via self-play come with bounded vulnerability, and it identifies a structural property that enables such performance guarantees.
Abstract
Self-play is a technique for machine learning in multi-agent systems where a learning algorithm learns by interacting with copies of itself. Self-play is useful for generating large quantities of data for learning, but has the drawback that the agents the learner will face post-training may have dramatically different behavior than the learner came to expect by interacting with itself. For the special case of two-player constant-sum games, self-play that reaches Nash equilibrium is guaranteed to produce strategies that perform well against any post-training opponent; however, no such guarantee exists for multiplayer games. We show that in games that approximately decompose into a set of two-player constant-sum games (called constant-sum polymatrix games) where global $\epsilon$-Nash equilibria are boundedly far from Nash equilibria in each subgame (called subgame stability), any no-external-regret algorithm that learns by self-play will produce a strategy with bounded vulnerability. For the first time, our results identify a structural property of multiplayer games that enable performance guarantees for the strategies produced by a broad class of self-play algorithms. We demonstrate our findings through experiments on Leduc poker.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
results: Self-RAG outperforms state-of-the-art LLMs and retrieval-augmented models across a variety of tasks, with especially strong accuracy on open-domain QA, reasoning, and fact verification, and higher factuality and citation precision on long-form generations.
Abstract
Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.
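Schematically, Self-RAG's inference loop interleaves on-demand retrieval, generation, and self-critique. The sketch below uses placeholder `lm` and `retrieve` callables and illustrative reflection-token strings; the real model emits trained special tokens rather than parsing free text.

```python
# Schematic Self-RAG-style inference loop (placeholders, simplified control flow).
def score_reflection(critique: str) -> int:
    """Toy mapping from critique tokens to a scalar rank (token names illustrative)."""
    order = ["[No support]", "[Partially supported]", "[Fully supported]"]
    return next((i for i, t in enumerate(order) if t in critique), 0)

def self_rag_answer(question, lm, retrieve, k=3):
    # the model itself decides whether retrieval is needed for this input
    if "[Retrieve]" in lm(f"{question} -> retrieve?"):
        candidates = []
        for passage in retrieve(question, k=k):
            answer = lm(f"{question}\nPassage: {passage}")
            critique = lm(f"Rate support/usefulness of: {answer}\nGiven: {passage}")
            candidates.append((score_reflection(critique), answer))
        return max(candidates)[1]          # keep the best self-critiqued segment
    return lm(question)                    # parametric-only answer
```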
CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations
results: The study finds that GPT-4 simulations of certain demographics (political and marginalized groups) and topics (general, uncontroversial) carry a high risk of caricature.
Abstract
Recent work has aimed to capture nuances of human behavior by using LLMs to simulate responses from particular demographics in settings like social science experiments and public opinion surveys. However, there are currently no established ways to discuss or evaluate the quality of such LLM simulations. Moreover, there is growing concern that these LLM simulations are flattened caricatures of the personas that they aim to simulate, failing to capture the multidimensionality of people and perpetuating stereotypes. To bridge these gaps, we present CoMPosT, a framework to characterize LLM simulations using four dimensions: Context, Model, Persona, and Topic. We use this framework to measure open-ended LLM simulations' susceptibility to caricature, defined via two criteria: individuation and exaggeration. We evaluate the level of caricature in scenarios from existing work on LLM simulations. We find that for GPT-4, simulations of certain demographics (political and marginalized groups) and topics (general, uncontroversial) are highly susceptible to caricature.
Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective
results: Our evaluations across four benchmarks validate the efficacy of the proposed method, highlighting the critical factors contributing to parametric knowledge transfer and underscoring the transferability of model parameters across LLMs of different scales.
Abstract
Large Language Models (LLMs) inherently encode a wealth of knowledge within their parameters through pre-training on extensive corpora. While prior research has delved into operations on these parameters to manipulate the underlying implicit knowledge (encompassing detection, editing, and merging), there remains an ambiguous understanding regarding their transferability across models with varying scales. In this paper, we seek to empirically investigate knowledge transfer from larger to smaller models through a parametric perspective. To achieve this, we employ sensitivity-based techniques to extract and align knowledge-specific parameters between different LLMs. Moreover, the LoRA module is used as the intermediary mechanism for injecting the extracted knowledge into smaller models. Evaluations across four benchmarks validate the efficacy of our proposed method. Our findings highlight the critical factors contributing to the process of parametric knowledge transfer, underscoring the transferability of model parameters across LLMs of different scales. We release code and data at \url{https://github.com/maszhongming/ParaKnowTransfer}.
Explaining Deep Neural Networks for Bearing Fault Detection with Vibration Concepts
results: The evaluation shows that concept-based explanations yield human-comprehensible and intuitive insights into fault-detection models, but the underlying assumptions need to be validated first.
Abstract
Concept-based explanation methods, such as Concept Activation Vectors, are potent means to quantify how abstract or high-level characteristics of input data influence the predictions of complex deep neural networks. However, applying them to industrial prediction problems is challenging as it is not immediately clear how to define and access appropriate concepts for individual use cases and specific data types. In this work, we investigate how to leverage established concept-based explanation techniques in the context of bearing fault detection with deep neural networks trained on vibration signals. Since bearings are prevalent in almost every rotating equipment, ensuring the reliability of intransparent fault detection models is crucial to prevent costly repairs and downtimes of industrial machinery. Our evaluations demonstrate that explaining opaque models in terms of vibration concepts enables human-comprehensible and intuitive insights about their inner workings, but the underlying assumptions need to be carefully validated first.
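For reference, the core of a concept-activation-vector computation is a linear probe on layer activations; the sketch below uses synthetic activations and a toy output gradient in place of a real bearing-fault network.

```python
# Sketch of a Concept Activation Vector (CAV) with TCAV-style sensitivity.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
concept_acts = rng.normal(1.0, 1.0, size=(200, 64))   # activations for concept inputs
random_acts = rng.normal(0.0, 1.0, size=(200, 64))    # activations for random inputs

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 200 + [0] * 200)
cav = LogisticRegression(max_iter=1000).fit(X, y).coef_.ravel()
cav /= np.linalg.norm(cav)                             # the concept activation vector

grad = rng.normal(size=64)         # stand-in for d(output)/d(activations) of one input
sensitive = np.dot(grad, cav) > 0  # TCAV aggregates this sign over many inputs
```

For vibration data, the concept sets would be examples exhibiting a physically meaningful vibration pattern (e.g., a characteristic fault frequency), which is where the paper's careful validation of assumptions comes in.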
Large Language Model Prediction Capabilities: Evidence from a Real-World Forecasting Tournament
results: GPT-4's forecasts were less accurate than the median human-crowd forecasts and did not significantly differ from the no-information strategy of assigning a 50% probability to every question.
Abstract
Accurately predicting the future would be an important milestone in the capabilities of artificial intelligence. However, research on the ability of large language models to provide probabilistic predictions about future events remains nascent. To empirically test this ability, we enrolled OpenAI's state-of-the-art large language model, GPT-4, in a three-month forecasting tournament hosted on the Metaculus platform. The tournament, running from July to October 2023, attracted 843 participants and covered diverse topics including Big Tech, U.S. politics, viral outbreaks, and the Ukraine conflict. Focusing on binary forecasts, we show that GPT-4's probabilistic forecasts are significantly less accurate than the median human-crowd forecasts. We find that GPT-4's forecasts did not significantly differ from the no-information forecasting strategy of assigning a 50% probability to every question. We explore a potential explanation, that GPT-4 might be predisposed to predict probabilities close to the midpoint of the scale, but our data do not support this hypothesis. Overall, we find that GPT-4 significantly underperforms in real-world predictive tasks compared to median human-crowd forecasts. A potential explanation for this underperformance is that in real-world forecasting tournaments, the true answers are genuinely unknown at the time of prediction; unlike in other benchmark tasks like professional exams or time series forecasting, where strong performance may at least partly be due to the answers being memorized from the training data. This makes real-world forecasting tournaments an ideal environment for testing the generalized reasoning and prediction capabilities of artificial intelligence going forward.
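The accuracy comparison reduces to proper scoring of binary probabilistic forecasts; a Brier-score check against the 50% no-information baseline looks like this (the numbers are made up, not tournament data).

```python
# Brier-score comparison of binary forecasts against the 50% baseline (toy data).
def brier(probs, outcomes):
    """Mean squared error of probabilities against 0/1 outcomes; lower is better."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

outcomes = [1, 0, 0, 1, 0]
model_probs = [0.55, 0.48, 0.50, 0.62, 0.45]   # hypothetical near-midpoint forecasts
crowd_probs = [0.80, 0.15, 0.25, 0.70, 0.10]   # hypothetical crowd medians

print(f"model Brier: {brier(model_probs, outcomes):.3f}")
print(f"crowd Brier: {brier(crowd_probs, outcomes):.3f}")
print(f"50%   Brier: {brier([0.5] * 5, outcomes):.3f}")   # no-information strategy
```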
Functional Invariants to Watermark Large Transformers
results: Experiments show that the method leaves model outputs unchanged while remaining robust to model transformations and stealthy, providing a practical way to protect the integrity and ownership of large models.
Abstract
The rapid growth of transformer-based models increases the concerns about their integrity and ownership insurance. Watermarking addresses this issue by embedding a unique identifier into the model, while preserving its performance. However, most existing approaches require to optimize the weights to imprint the watermark signal, which is not suitable at scale due to the computational cost. This paper explores watermarks with virtually no computational cost, applicable to a non-blind white-box setting (assuming access to both the original and watermarked networks). They generate functionally equivalent copies by leveraging the models' invariance, via operations like dimension permutations or scaling/unscaling. This enables to watermark models without any change in their outputs and remains stealthy. Experiments demonstrate the effectiveness of the approach and its robustness against various model transformations (fine-tuning, quantization, pruning), making it a practical solution to protect the integrity of large models.
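The invariance the paper exploits can be shown in a few lines: permuting the hidden dimension of a two-layer MLP (rows of W1 and b1, matching columns of W2) leaves its function unchanged, and the secret permutation serves as the identifier. This sketch covers only the MLP core; the paper applies such invariances (permutations, scaling/unscaling) across full transformers.

```python
# Functionally invariant watermark via hidden-dimension permutation (MLP core).
import torch

torch.manual_seed(0)
d, h = 16, 64
W1, b1 = torch.randn(h, d), torch.randn(h)
W2 = torch.randn(d, h)

def mlp(x, W1, b1, W2):
    return torch.relu(x @ W1.T + b1) @ W2.T

perm = torch.randperm(h)                     # secret key = this permutation
W1_w, b1_w, W2_w = W1[perm], b1[perm], W2[:, perm]   # watermarked weights

x = torch.randn(5, d)
assert torch.allclose(mlp(x, W1, b1, W2), mlp(x, W1_w, b1_w, W2_w), atol=1e-5)
# detection (white-box): recover the permutation by matching rows of W1_w to W1
```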
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
results: GPT-4V with SoM outperforms the state-of-the-art fully-finetuned referring segmentation model on RefCOCOg in a zero-shot setting.
Abstract
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks e.g., alphanumerics, masks, boxes. Using the marked image as input, GPT-4V can answer the questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM outperforms the state-of-the-art fully-finetuned referring segmentation model on RefCOCOg in a zero-shot setting.
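The overlay step itself is simple image manipulation: given segmentation masks (e.g., from SAM), draw a numeric mark at each region's centroid before sending the image to the LMM. A hedged sketch with PIL follows; the mark style and placement are illustrative.

```python
# Sketch of the Set-of-Mark overlay step (mark style is illustrative).
import numpy as np
from PIL import Image, ImageDraw

def overlay_marks(image: Image.Image, masks: list) -> Image.Image:
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for i, mask in enumerate(masks, start=1):    # mask: boolean (H, W) array
        ys, xs = np.nonzero(mask)
        cx, cy = int(xs.mean()), int(ys.mean())  # centroid of the region
        draw.rectangle([cx - 8, cy - 8, cx + 8, cy + 8], fill="black")
        draw.text((cx - 4, cy - 7), str(i), fill="white")
    return marked

img = Image.new("RGB", (64, 64), "gray")
mask = np.zeros((64, 64), dtype=bool)
mask[10:30, 10:30] = True
marked = overlay_marks(img, [mask])
```

Questions to the LMM can then refer to regions by mark number ("what is object 3?"), which is what grounds its answers to pixel regions.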
Understanding deep neural networks through the lens of their non-linearity
paper_authors: Quentin Bouniot, Ievgen Redko, Anton Mallasto, Charlotte Laclau, Karol Arndt, Oliver Struckmeier, Markus Heinonen, Ville Kyrki, Samuel Kaski
results: Experiments show that the proposed affinity score offers insight into how non-linearity propagates through deep neural networks and has broad potential for practical applications.
Abstract
The remarkable success of deep neural networks (DNN) is often attributed to their high expressive power and their ability to approximate functions of arbitrary complexity. Indeed, DNNs are highly non-linear models, and activation functions introduced into them are largely responsible for this. While many works studied the expressive power of DNNs through the lens of their approximation capabilities, quantifying the non-linearity of DNNs or of individual activation functions remains an open problem. In this paper, we propose the first theoretically sound solution to track non-linearity propagation in deep neural networks with a specific focus on computer vision applications. Our proposed affinity score allows us to gain insights into the inner workings of a wide range of different architectures and learning paradigms. We provide extensive experimental results that highlight the practical utility of the proposed affinity score and its potential for long-reaching applications.
Evaluating LLMs for Privilege-Escalation Scenarios
results: The study finds that LLMs can be quite capable at privilege escalation, but challenges remain, such as maintaining focus during testing and handling errors.
Abstract
Penetration testing, an essential component of cybersecurity, allows organizations to proactively identify and remediate vulnerabilities in their systems, thus bolstering their defense mechanisms against potential cyberattacks. One recent advancement in the realm of penetration testing is the utilization of Large Language Models (LLMs). We explore the intersection of LLMs and penetration testing to gain insight into their capabilities and challenges in the context of privilege escalation. We create an automated Linux privilege-escalation benchmark utilizing local virtual machines. We introduce an LLM-guided privilege-escalation tool designed for evaluating different LLMs and prompt strategies against our benchmark. We analyze the impact of different prompt designs, the benefits of in-context learning, and the advantages of offering high-level guidance to LLMs. We discuss challenging areas for LLMs, including maintaining focus during testing, coping with errors, and finally comparing them with both stochastic parrots as well as with human hackers.
Neural Attention: Enhancing QKV Calculation in Self-Attention Mechanism with Neural Networks
methods: The paper uses a modified Marian model and validates the approach on the IWSLT 2017 German-English translation task dataset; it also shows improvements when training a RoBERTa model on the Wikitext-103 dataset.
results: Experiments show that the method improves BLEU scores and lowers model perplexity when training RoBERTa. These results not only demonstrate the method's effectiveness but also show that the self-attention mechanism can be optimized via neural-network-based QKV computation, with broad prospects for future research and practical applications.
Abstract
In the realm of deep learning, the self-attention mechanism has substantiated its pivotal role across a myriad of tasks, encompassing natural language processing and computer vision. Despite achieving success across diverse applications, the traditional self-attention mechanism primarily leverages linear transformations for the computation of query, key, and value (QKV), which may not invariably be the optimal choice under specific circumstances. This paper probes into a novel methodology for QKV computation-implementing a specially-designed neural network structure for the calculation. Utilizing a modified Marian model, we conducted experiments on the IWSLT 2017 German-English translation task dataset and juxtaposed our method with the conventional approach. The experimental results unveil a significant enhancement in BLEU scores with our method. Furthermore, our approach also manifested superiority when training the Roberta model with the Wikitext-103 dataset, reflecting a notable reduction in model perplexity compared to its original counterpart. These experimental outcomes not only validate the efficacy of our method but also reveal the immense potential in optimizing the self-attention mechanism through neural network-based QKV computation, paving the way for future research and practical applications. The source code and implementation details for our proposed method can be accessed at https://github.com/ocislyjrti/NeuralAttention.
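The core change the paper describes, replacing the linear Q/K/V projections with small neural networks, can be sketched as follows; head splitting is omitted for brevity, and the exact MLP shape used in the paper may differ.

```python
# Sketch of self-attention with MLP-based (rather than linear) QKV projections.
import torch
import torch.nn as nn

class NeuralQKVAttention(nn.Module):
    def __init__(self, d_model: int, d_hidden: int = 1024):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))
        self.q_net, self.k_net, self.v_net = mlp(), mlp(), mlp()
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_net(x), self.k_net(x), self.v_net(x)       # (B, T, D) each
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        return attn @ v

out = NeuralQKVAttention(512)(torch.randn(2, 10, 512))
```

The rest of the attention computation (scaled dot product, softmax, weighted sum) is unchanged; only the projections gain non-linear capacity, at the cost of extra parameters and compute.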
Towards Automatic Satellite Images Captions Generation Using Large Language Models
methods: using large language models (LLMs) to guide the description of object annotations
results: effective collection of captions for remote sensing images
Abstract
Automatic image captioning is a promising technique for conveying visual information using natural language. It can benefit various tasks in satellite remote sensing, such as environmental monitoring, resource management, disaster management, etc. However, one of the main challenges in this domain is the lack of large-scale image-caption datasets, as they require a lot of human expertise and effort to create. Recent research on large language models (LLMs) has demonstrated their impressive performance in natural language understanding and generation tasks. Nonetheless, most of them cannot handle images (GPT-3.5, Falcon, Claude, etc.), while conventional captioning models pre-trained on general ground-view images often fail to produce detailed and accurate captions for aerial images (BLIP, GIT, CM3, CM3Leon, etc.). To address this problem, we propose a novel approach: Automatic Remote Sensing Image Captioning (ARSIC) to automatically collect captions for remote sensing images by guiding LLMs to describe their object annotations. We also present a benchmark model that adapts the pre-trained generative image2text model (GIT) to generate high-quality captions for remote-sensing images. Our evaluation demonstrates the effectiveness of our approach for collecting captions for remote sensing images.
End-to-End real time tracking of children’s reading with pointer network
results: The paper shows that a forced-alignment-based neural attention model matches the Montreal Forced Aligner in alignment accuracy and, surprisingly, provides a better training signal for the pointer network. On one adult speech dataset (TIMIT) and two children's speech datasets (CMU Kids and Reading Races), the best model tracks adult speech with high accuracy (87.8%) and children's speech with 77.1% accuracy on CMU Kids.
Abstract
In this work, we explore how a real-time reading tracker can be built efficiently for children's voices. While previously proposed reading trackers focused on ASR-based cascaded approaches, we propose a fully end-to-end model, making it less prone to lags in voice tracking. We employ a pointer network that directly learns to predict positions in the ground truth text conditioned on the streaming speech. To train this pointer network, we generate ground truth training signals by using forced alignment between the read speech and the text being read on the training set. Exploring different forced alignment models, we find a neural attention based model is at least as close in alignment accuracy to the Montreal Forced Aligner, but surprisingly is a better training signal for the pointer network. Our results are reported on one adult speech dataset (TIMIT) and two children's speech datasets (CMU Kids and Reading Races). Our best model can accurately track adult speech with 87.8% accuracy and the much harder and disfluent children's speech with 77.1% accuracy on CMU Kids data and a 65.3% accuracy on the Reading Races dataset.
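A pointer network for this task can be reduced to an attention scorer over the positions of the ground-truth text, conditioned on the current streaming-speech state. The sketch below is a schematic reading of the description above, with assumed dimensions and module names; the actual model and its forced-alignment training signal are more involved.

```python
import torch
import torch.nn as nn

class ReadingPointer(nn.Module):
    """Predicts a distribution over text positions given a streaming
    speech state (schematic sketch, not the paper's exact architecture)."""

    def __init__(self, d_text, d_speech, d_attn=128):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_attn)
        self.speech_proj = nn.Linear(d_speech, d_attn)
        self.score = nn.Linear(d_attn, 1)

    def forward(self, text_emb, speech_state):
        # text_emb: (B, T_text, d_text); speech_state: (B, d_speech)
        t = self.text_proj(text_emb)                        # (B, T, d_attn)
        s = self.speech_proj(speech_state).unsqueeze(1)     # (B, 1, d_attn)
        logits = self.score(torch.tanh(t + s)).squeeze(-1)  # (B, T)
        return logits  # train with cross-entropy against the aligned position

pointer = ReadingPointer(d_text=256, d_speech=512)
pos_logits = pointer(torch.randn(2, 100, 256), torch.randn(2, 512))
predicted_position = pos_logits.argmax(dim=-1)
```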
The effect of stemming and lemmatization on Portuguese fake news text classification
results: The results show that pre-processing has a substantial effect on fake news classification, and techniques such as lemmatization and stemming can help improve fake news detection accuracy.
Abstract
With the popularization of the internet, smartphones, and social media, information spreads quickly and easily, implying greater information traffic worldwide; however, this has also enabled the harmful dissemination of deceptive information and fake news. The automatic detection of fake news is a challenging task: obtaining good results requires dealing with linguistic problems, especially in languages that have not yet been comprehensively studied, although certain techniques can help when working with text data. The motivation for detecting such deceptive information is that people need to know which information is true and trustworthy and which is not. In this work, we present the effect that pre-processing methods such as lemmatization and stemming have on fake news classification; to that end, we designed several classifier models applying different pre-processing techniques. The results show that the pre-processing step is important for obtaining better results, and that stemming and lemmatization are promising methods that need further study for the Portuguese language so that better results can be reached.
Dual Cognitive Architecture: Incorporating Biases and Multi-Memory Systems for Lifelong Learning
paper_authors: Shruthi Gowda, Bahram Zonooz, Elahe Arani
for: This paper examines the limitations of artificial neural networks (ANNs) trained on stationary, independent data and proposes a new framework, inspired by human cognitive structures and multi-memory systems, to achieve lifelong learning in AI.
methods: The framework draws on human cognitive structures and multi-memory systems, including multiple subsystems, a dichotomy between implicit and explicit knowledge representation, inductive biases, and a multi-memory system. It also contains an inductive-bias learner that encodes shape information, countering ANNs' tendency to learn local textures.
results: DUCA shows improved performance across different settings and datasets without requiring extra information, and it also excels on DN4IL, a challenging domain-incremental dataset, demonstrating versatile lifelong-learning ability under distribution shift.
Abstract
Artificial neural networks (ANNs) exhibit a narrow scope of expertise on stationary independent data. However, the data in the real world is continuous and dynamic, and ANNs must adapt to novel scenarios while also retaining the learned knowledge to become lifelong learners. The ability of humans to excel at these tasks can be attributed to multiple factors ranging from cognitive computational structures, cognitive biases, and the multi-memory systems in the brain. We incorporate key concepts from each of these to design a novel framework, Dual Cognitive Architecture (DUCA), which includes multiple sub-systems, implicit and explicit knowledge representation dichotomy, inductive bias, and a multi-memory system. The inductive bias learner within DUCA is instrumental in encoding shape information, effectively countering the tendency of ANNs to learn local textures. Simultaneously, the inclusion of a semantic memory submodule facilitates the gradual consolidation of knowledge, replicating the dynamics observed in fast and slow learning systems, reminiscent of the principles underpinning the complementary learning system in human cognition. DUCA shows improvement across different settings and datasets, and it also exhibits reduced task recency bias, without the need for extra information. To further test the versatility of lifelong learning methods on a challenging distribution shift, we introduce a novel domain-incremental dataset DN4IL. In addition to improving performance on existing benchmarks, DUCA also demonstrates superior performance on this complex dataset.
results: The study finds that cf-ASE can accurately attribute the effect of multiple agents' actions on the outcome, and the method is validated experimentally.
Abstract
Establishing causal relationships between actions and outcomes is fundamental for accountable multi-agent decision-making. However, interpreting and quantifying agents' contributions to such relationships pose significant challenges. These challenges are particularly prominent in the context of multi-agent sequential decision-making, where the causal effect of an agent's action on the outcome depends on how the other agents respond to that action. In this paper, our objective is to present a systematic approach for attributing the causal effects of agents' actions to the influence they exert on other agents. Focusing on multi-agent Markov decision processes, we introduce agent-specific effects (ASE), a novel causal quantity that measures the effect of an agent's action on the outcome that propagates through other agents. We then turn to the counterfactual counterpart of ASE (cf-ASE), provide a sufficient set of conditions for identifying cf-ASE, and propose a practical sampling-based algorithm for estimating it. Finally, we experimentally evaluate the utility of cf-ASE through a simulation-based testbed, which includes a sepsis management environment.
Key Point-based Orientation Estimation of Strawberries for Robotic Fruit Picking
paper_authors: Justin Le Louëdec, Grzegorz Cielniak
For: This paper aims to address the issue of labor shortages in modern agriculture by developing a robotic harvesting system that can pick fruit accurately and efficiently.
Methods: The proposed method uses key-point-based fruit orientation estimation, which allows 3D orientation to be predicted directly from 2D images. The method does not require full 3D orientation annotations but can exploit such information for improved accuracy.
Results: The proposed method achieves state-of-the-art performance with an average error of $8^\circ$, improving predictions by $\sim30\%$ compared to previous work~\cite{wagner2021efficient}. Its fast inference time of $\sim30$ms also makes it suitable for real-time robotic applications.
Abstract
Selective robotic harvesting is a promising technological solution to address labour shortages which are affecting modern agriculture in many parts of the world. For an accurate and efficient picking process, a robotic harvester requires the precise location and orientation of the fruit to effectively plan the trajectory of the end effector. The current methods for estimating fruit orientation employ either complete 3D information which typically requires registration from multiple views or rely on fully-supervised learning techniques, which require difficult-to-obtain manual annotation of the reference orientation. In this paper, we introduce a novel key-point-based fruit orientation estimation method allowing for the prediction of 3D orientation from 2D images directly. The proposed technique can work without full 3D orientation annotations but can also exploit such information for improved accuracy. We evaluate our work on two separate datasets of strawberry images obtained from real-world data collection scenarios. Our proposed method achieves state-of-the-art performance with an average error as low as $8^{\circ}$, improving predictions by $\sim30\%$ compared to previous work presented in~\cite{wagner2021efficient}. Furthermore, our method is suited for real-time robotic applications with fast inference times of $\sim30$ms.
Quantifying Language Models’ Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
results: We find that different prompt formats can cause very large performance differences for LLMs, with gaps of up to 76 accuracy points. We also find that format sensitivity correlates only weakly between models, which raises methodological questions about how LLM performance is evaluated.
Abstract
As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size, the number of few-shot examples, or performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently-standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which puts into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats.
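The core of a FormatSpread-style analysis is cheap to emulate: sample several semantically equivalent prompt formats, evaluate each, and report the performance interval rather than a single number. The sketch below uses a hypothetical `evaluate(prompt_fn)` callback and a few toy format variants; the actual algorithm samples formats more systematically.

```python
import random

# A few meaning-preserving format variants for a classification prompt.
# Separator, casing, and field labels vary; the task content does not.
FORMATS = [
    lambda x: f"Input: {x}\nAnswer:",
    lambda x: f"INPUT: {x}\nANSWER:",
    lambda x: f"input: {x} || answer:",
    lambda x: f"Question: {x}\nLabel :",
    lambda x: f"{x}\n=> ",
]

def format_spread(evaluate, formats=FORMATS, n_samples=None, seed=0):
    """Evaluate a task under several prompt formats and report the spread.

    `evaluate` is a hypothetical callback mapping a format function to an
    accuracy in [0, 1] (e.g., by templating a fixed eval set and scoring
    model outputs)."""
    rng = random.Random(seed)
    sampled = formats if n_samples is None else rng.sample(formats, n_samples)
    scores = [evaluate(fmt) for fmt in sampled]
    return min(scores), max(scores), scores

# Usage: lo, hi, _ = format_spread(my_eval_fn)
# Report the [lo, hi] interval instead of a single-format accuracy.
```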
Utilising a Large Language Model to Annotate Subject Metadata: A Case Study in an Australian National Research Data Catalogue
results: The study shows that subject metadata annotation with GPT-3.5 achieves promising performance, but models based on in-context learning cannot acquire discipline-specific rules, resulting in lower performance in several categories.
Abstract
In support of open and reproducible research, there has been a rapidly increasing number of datasets made available for research. As the availability of datasets increases, it becomes more important to have quality metadata for discovering and reusing them. Yet, it is a common issue that datasets often lack quality metadata due to limited resources for data curation. Meanwhile, technologies such as artificial intelligence and large language models (LLMs) are progressing rapidly. Recently, systems based on these technologies, such as ChatGPT, have demonstrated promising capabilities for certain data curation tasks. This paper proposes to leverage LLMs for cost-effective annotation of subject metadata through the LLM-based in-context learning. Our method employs GPT-3.5 with prompts designed for annotating subject metadata, demonstrating promising performance in automatic metadata annotation. However, models based on in-context learning cannot acquire discipline-specific rules, resulting in lower performance in several categories. This limitation arises from the limited contextual information available for subject inference. To the best of our knowledge, we are introducing, for the first time, an in-context learning method that harnesses large language models for automated subject metadata annotation.
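In-context subject annotation of the kind described here amounts to templating a few labelled examples ahead of the record to annotate. The sketch below only builds the prompt string; `call_llm` is a hypothetical stand-in for a GPT-3.5-style completion call, and the example records and subject labels are invented for illustration.

```python
def build_annotation_prompt(record, examples, subject_vocabulary):
    """Compose a few-shot prompt asking an LLM to assign subject metadata.

    `examples` is a list of (metadata_text, subjects) pairs used for
    in-context learning; `subject_vocabulary` constrains the label space.
    All names here are illustrative, not the paper's exact prompt."""
    lines = ["Assign subject labels to the dataset description.",
             "Choose only from: " + ", ".join(subject_vocabulary), ""]
    for text, subjects in examples:
        lines.append(f"Description: {text}")
        lines.append(f"Subjects: {'; '.join(subjects)}")
        lines.append("")
    lines.append(f"Description: {record}")
    lines.append("Subjects:")
    return "\n".join(lines)

examples = [("Hourly rainfall measurements across coastal stations.",
             ["Environmental Sciences", "Hydrology"])]
vocab = ["Environmental Sciences", "Hydrology", "Genetics", "Economics"]
prompt = build_annotation_prompt("Genome assemblies of wheat cultivars.",
                                 examples, vocab)
# response = call_llm(prompt)  # hypothetical GPT-3.5 call
```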
Generative error correction for code-switching speech recognition using large language models
results: Experiments show that this method significantly improves CS-ASR accuracy, reducing the mixed error rate (MER). LLMs also show remarkable data efficiency in H2T learning, offering a potential solution to the data-scarcity problem of CS-ASR in low-resource languages.
Abstract
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), CS-ASR is still a challenging task owing to the grammatical structure complexity of the phenomenon and the data scarcity of specific training corpora. In this work, we propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem. Specifically, we first employ multiple well-trained ASR models for N-best hypotheses generation, with the aim of increasing the diverse and informative elements in the set of hypotheses. Next, we utilize the LLMs to learn the hypotheses-to-transcription (H2T) mapping by adding a trainable low-rank adapter. Such a generative error correction (GER) method directly predicts the accurate transcription according to its expert linguistic knowledge and N-best hypotheses, resulting in a paradigm shift from the traditional language model rescoring or error correction techniques. Experimental evidence demonstrates that GER significantly enhances CS-ASR accuracy, in terms of reduced mixed error rate (MER). Furthermore, LLMs show remarkable data efficiency for H2T learning, providing a potential solution to the data scarcity problem of CS-ASR in low-resource languages.
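The hypotheses-to-transcription (H2T) step can be seen as conditioning an adapter-tuned LLM on an N-best list. As a rough illustration, the sketch below formats N-best hypotheses into a training/inference prompt; the low-rank adapter fine-tuning itself is omitted, and the prompt wording is an assumption.

```python
def h2t_prompt(nbest_hypotheses):
    """Format an ASR N-best list for hypotheses-to-transcription mapping.

    Illustrative only: the real GER model fine-tunes a low-rank adapter on
    (N-best list -> reference transcription) pairs rather than relying on
    this exact template."""
    lines = ["The following are candidate transcriptions of a code-switching",
             "utterance, ordered from most to least likely:"]
    for i, hyp in enumerate(nbest_hypotheses, start=1):
        lines.append(f"{i}. {hyp}")
    lines.append("Produce the single most accurate transcription:")
    return "\n".join(lines)

nbest = ["wo men meeting 下午三点", "我们 meeting 下午三点", "我们 密 厅 下午三点"]
prompt = h2t_prompt(nbest)
# target = "我们 meeting 下午三点"  # reference used when training the adapter
```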
MonoSKD: General Distillation Framework for Monocular 3D Object Detection via Spearman Correlation Coefficient
results: Extensive experiments on the KITTI 3D object detection benchmark show that our method achieves the best performance in challenging cases with no additional computational cost.
Abstract
Monocular 3D object detection is an inherently ill-posed problem, as it is challenging to predict accurate 3D localization from a single image. Existing monocular 3D detection knowledge distillation methods usually project the LiDAR onto the image plane and train the teacher network accordingly. Transferring LiDAR-based model knowledge to RGB-based models is more complex, so a general distillation strategy is needed. To alleviate the cross-modal problem, we propose MonoSKD, a novel knowledge distillation framework for monocular 3D detection based on the Spearman correlation coefficient, to learn the relative correlation between cross-modal features. Considering the large gap between these features, strict alignment of features may mislead the training, so we propose a looser Spearman loss. Furthermore, by selecting appropriate distillation locations and removing redundant modules, our scheme saves more GPU resources and trains faster than existing methods. Extensive experiments are performed to verify the effectiveness of our framework on the challenging KITTI 3D object detection benchmark. Our method achieves state-of-the-art performance as of submission, with no additional inference computational cost. Our codes are available at https://github.com/Senwang98/MonoSKD
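A Spearman-style loss needs differentiable ranks. One standard relaxation, shown below, replaces hard ranks with pairwise-sigmoid "soft ranks" and then maximizes the Pearson correlation of the soft-ranked teacher and student features. This is a hedged reconstruction of the idea, not the authors' implementation (their code is at the linked repository).

```python
import torch

def soft_rank(x, tau=0.1):
    """Differentiable rank surrogate: rank_i ~ sum_j sigmoid((x_i - x_j)/tau).
    x: (B, D) feature vectors; returns soft ranks of the same shape."""
    diff = x.unsqueeze(-1) - x.unsqueeze(-2)          # (B, D, D)
    return torch.sigmoid(diff / tau).sum(dim=-1)

def spearman_distill_loss(student_feat, teacher_feat, tau=0.1, eps=1e-8):
    """1 - Pearson correlation of soft ranks, averaged over the batch.
    Loose by design: it matches relative orderings of features rather than
    forcing strict cross-modal feature alignment."""
    rs, rt = soft_rank(student_feat, tau), soft_rank(teacher_feat, tau)
    rs = rs - rs.mean(dim=-1, keepdim=True)
    rt = rt - rt.mean(dim=-1, keepdim=True)
    corr = (rs * rt).sum(-1) / (rs.norm(dim=-1) * rt.norm(dim=-1) + eps)
    return (1.0 - corr).mean()

loss = spearman_distill_loss(torch.randn(4, 64, requires_grad=True),
                             torch.randn(4, 64))
loss.backward()
```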
MiniZero: Comparative Analysis of AlphaZero and MuZero on Go, Othello, and Atari Games
results: Our experiments show that, for the two board games, using more simulations generally yields higher performance, though the choice between AlphaZero and MuZero may depend on game properties. For Atari games, both MuZero and Gumbel MuZero are worth considering. Progressive simulation allocates computation more efficiently and achieves clear gains in the two board games.
Abstract
This paper presents MiniZero, a zero-knowledge learning framework that supports four state-of-the-art algorithms, including AlphaZero, MuZero, Gumbel AlphaZero, and Gumbel MuZero. While these algorithms have demonstrated super-human performance in many games, it remains unclear which among them is most suitable or efficient for specific tasks. Through MiniZero, we systematically evaluate the performance of each algorithm in two board games, 9x9 Go and 8x8 Othello, as well as 57 Atari games. Our empirical findings are summarized as follows. For two board games, using more simulations generally results in higher performance. However, the choice of AlphaZero and MuZero may differ based on game properties. For Atari games, both MuZero and Gumbel MuZero are worth considering. Since each game has unique characteristics, different algorithms and simulations yield varying results. In addition, we introduce an approach, called progressive simulation, which progressively increases the simulation budget during training to allocate computation more efficiently. Our empirical results demonstrate that progressive simulation achieves significantly superior performance in two board games. By making our framework and trained models publicly available, this paper contributes a benchmark for future research on zero-knowledge learning algorithms, assisting researchers in algorithm selection and comparison against these zero-knowledge learning baselines.
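Progressive simulation only requires that the per-move simulation budget grow over training. A minimal schedule, under assumed start/end budgets and a linear shape, might look like this:

```python
def progressive_simulation_budget(step, total_steps,
                                  start_sims=2, end_sims=200):
    """Linearly grow the MCTS simulation count over training.

    Early self-play games are cheap (few simulations); later games spend
    more compute per move. The start/end values and linear shape are
    illustrative assumptions, not MiniZero's exact schedule."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return int(round(start_sims + frac * (end_sims - start_sims)))

# Usage inside a self-play loop:
# for step in range(total_steps):
#     n_sims = progressive_simulation_budget(step, total_steps)
#     move = mcts_search(state, num_simulations=n_sims)
```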
Emulating Human Cognitive Processes for Expert-Level Medical Question-Answering with Large Language Models
for: This paper aims to provide a novel framework for clinical problem-solving tools in healthcare, based on a Large Language Model (LLM) that mimics human cognitive processes.
methods: The framework, called BooksMed, utilizes the GRADE (Grading of Recommendations, Assessment, Development, and Evaluations) framework to effectively quantify evidence strength, and is evaluated using a multispecialty clinical benchmark called ExpertMedQA, which consists of open-ended, expert-level clinical questions validated by a diverse group of medical professionals.
results: The paper shows that BooksMed outperforms existing state-of-the-art models Med-PaLM 2, Almanac, and ChatGPT in a variety of medical scenarios, demonstrating the effectiveness of the framework in providing reliable and evidence-based responses to clinical inquiries.
Abstract
In response to the pressing need for advanced clinical problem-solving tools in healthcare, we introduce BooksMed, a novel framework based on a Large Language Model (LLM). BooksMed uniquely emulates human cognitive processes to deliver evidence-based and reliable responses, utilizing the GRADE (Grading of Recommendations, Assessment, Development, and Evaluations) framework to effectively quantify evidence strength. For clinical decision-making to be appropriately assessed, an evaluation metric that is clinically aligned and validated is required. As a solution, we present ExpertMedQA, a multispecialty clinical benchmark comprised of open-ended, expert-level clinical questions, and validated by a diverse group of medical professionals. By demanding an in-depth understanding and critical appraisal of up-to-date clinical literature, ExpertMedQA rigorously evaluates LLM performance. BooksMed outperforms existing state-of-the-art models Med-PaLM 2, Almanac, and ChatGPT in a variety of medical scenarios. Therefore, a framework that mimics human cognitive stages could be a useful tool for providing reliable and evidence-based responses to clinical inquiries.
Revealing the Unwritten: Visual Investigation of Beam Search Trees to Address Language Model Prompting Challenges
paper_authors: Thilo Spinner, Rebecca Kehlbeck, Rita Sevastjanova, Tobias Stähle, Daniel A. Keim, Oliver Deussen, Andreas Spitz, Mennatallah El-Assady
for: This study investigates how prompting methods can guide model outputs during the generation process of large language models.
methods: The study uses an interactive visual method that helps analyze the decisions large language models make during generation.
results: The study finds that exposing the beam search tree provides valuable information and presents five detailed analysis scenarios that address the identified challenges. These results validate existing findings and offer new perspectives.
Abstract
The growing popularity of generative language models has amplified interest in interactive methods to guide model outputs. Prompt refinement is considered one of the most effective means to influence output among these methods. We identify several challenges associated with prompting large language models, categorized into data- and model-specific, linguistic, and socio-linguistic challenges. A comprehensive examination of model outputs, including runner-up candidates and their corresponding probabilities, is needed to address these issues. The beam search tree, the prevalent algorithm to sample model outputs, can inherently supply this information. Consequently, we introduce an interactive visual method for investigating the beam search tree, facilitating analysis of the decisions made by the model during generation. We quantitatively show the value of exposing the beam search tree and present five detailed analysis scenarios addressing the identified challenges. Our methodology validates existing results and offers additional insights.
Leveraging Large Language Model for Automatic Evolving of Industrial Data-Centric R&D Cycle
results: The study shows that LLMs can quickly understand domain-specific requirements, generate professional ideas, use domain-specific tools to run experiments, interpret results, and absorb knowledge from past efforts to tackle new challenges, with examples drawn from quantitative investment research.
Abstract
In the wake of relentless digital transformation, data-driven solutions are emerging as powerful tools to address multifarious industrial tasks such as forecasting, anomaly detection, planning, and even complex decision-making. Although data-centric R&D has been pivotal in harnessing these solutions, it often comes with significant costs in terms of human, computational, and time resources. This paper delves into the potential of large language models (LLMs) to expedite the evolution cycle of data-centric R&D. Assessing the foundational elements of data-centric R&D, including heterogeneous task-related data, multi-facet domain knowledge, and diverse computing-functional tools, we explore how well LLMs can understand domain-specific requirements, generate professional ideas, utilize domain-specific tools to conduct experiments, interpret results, and incorporate knowledge from past endeavors to tackle new challenges. We take quantitative investment research as a typical example of an industrial data-centric R&D scenario, verified our proposed framework on Qlib, our full-stack open-source quantitative research platform, and obtained promising results that shed light on our vision of automatically evolving the industrial data-centric R&D cycle.
Query2Triple: Unified Query Encoding for Answering Diverse Complex Queries over Knowledge Graphs
results: Experiments show that, even without explicitly modeling neural set operators, Q2T still achieves state-of-the-art performance on multiple public benchmarks.
Abstract
Complex Query Answering (CQA) is a challenge task of Knowledge Graph (KG). Due to the incompleteness of KGs, query embedding (QE) methods have been proposed to encode queries and entities into the same embedding space, and treat logical operators as neural set operators to obtain answers. However, these methods train KG embeddings and neural set operators concurrently on both simple (one-hop) and complex (multi-hop and logical) queries, which causes performance degradation on simple queries and low training efficiency. In this paper, we propose Query to Triple (Q2T), a novel approach that decouples the training for simple and complex queries. Q2T divides the training into two stages: (1) Pre-training a neural link predictor on simple queries to predict tail entities based on the head entity and relation. (2) Training a query encoder on complex queries to encode diverse complex queries into a unified triple form that can be efficiently solved by the pretrained neural link predictor. Our proposed Q2T is not only efficient to train, but also modular, thus easily adaptable to various neural link predictors that have been studied well. Extensive experiments demonstrate that, even without explicit modeling for neural set operators, Q2T still achieves state-of-the-art performance on diverse complex queries over three public benchmarks.
Rethinking Class-incremental Learning in the Era of Large Pre-trained Models via Test-Time Adaptation
results: The results show that TTACIL outperforms state-of-the-art methods on multiple CIL benchmarks and remains robust under various data corruptions. It also avoids forgetting previous tasks while benefiting each task from the rich PTM features.
Abstract
Class-incremental learning (CIL) is a challenging task that involves continually learning to categorize classes into new tasks without forgetting previously learned information. The advent of the large pre-trained models (PTMs) has fast-tracked the progress in CIL due to the highly transferable PTM representations, where tuning a small set of parameters results in state-of-the-art performance when compared with the traditional CIL methods that are trained from scratch. However, repeated fine-tuning on each task destroys the rich representations of the PTMs and further leads to forgetting previous tasks. To strike a balance between the stability and plasticity of PTMs for CIL, we propose a novel perspective of eliminating training on every new task and instead performing test-time adaptation (TTA) directly on the test instances. Concretely, we propose "Test-Time Adaptation for Class-Incremental Learning" (TTACIL) that first fine-tunes Layer Norm parameters of the PTM on each test instance for learning task-specific features, and then resets them back to the base model to preserve stability. As a consequence, TTACIL does not undergo any forgetting, while benefiting each task with the rich PTM features. Additionally, by design, our method is robust to common data corruptions. Our TTACIL outperforms several state-of-the-art CIL methods when evaluated on multiple CIL benchmarks under both clean and corrupted data.
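The adapt-then-reset pattern at the heart of TTACIL is easy to express: snapshot the LayerNorm parameters, take a few adaptation steps on the test instance, predict, then restore the snapshot so the base PTM stays stable. In the sketch below, entropy minimization is used as the test-time objective and a generic `model(x) -> logits` interface is assumed; both are assumptions, and the paper's exact objective and schedule may differ.

```python
import torch
import torch.nn.functional as F

def ttacil_predict(model, x, steps=1, lr=1e-3):
    """Test-time adapt LayerNorm params on one instance, predict, reset."""
    ln_params = [p for m in model.modules()
                 if isinstance(m, torch.nn.LayerNorm)
                 for p in m.parameters()]
    snapshot = [p.detach().clone() for p in ln_params]   # preserve stability
    opt = torch.optim.SGD(ln_params, lr=lr)

    model.train()
    for _ in range(steps):                 # learn task-specific features
        logits = model(x)                  # on the test instance itself
        entropy = -(F.softmax(logits, -1)
                    * F.log_softmax(logits, -1)).sum(-1)
        opt.zero_grad()
        entropy.mean().backward()
        opt.step()

    model.eval()
    with torch.no_grad():
        prediction = model(x).argmax(-1)

    with torch.no_grad():                  # reset to the base model:
        for p, saved in zip(ln_params, snapshot):  # no forgetting
            p.copy_(saved)
    return prediction
```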
RealBehavior: A Framework for Faithfully Characterizing Foundation Models’ Human-like Behavior Mechanisms
methods: The paper introduces a framework named RealBehavior, which assesses whether models' humanoid behaviors are characterized faithfully in terms of reproducibility, internal and external consistency, and generalizability.
results: The study finds that directly applying psychological tools cannot faithfully characterize all human-like behaviors, whereas the RealBehavior framework can assess the faithfulness of the results. The paper also discusses the impacts of aligning models with human and social values, arguing for diversified alignment objectives to prevent the creation of models with restricted characteristics.
Abstract
Reports of human-like behaviors in foundation models are growing, with psychological theories providing enduring tools to investigate these behaviors. However, current research tends to directly apply these human-oriented tools without verifying the faithfulness of their outcomes. In this paper, we introduce a framework, RealBehavior, which is designed to characterize the humanoid behaviors of models faithfully. Beyond simply measuring behaviors, our framework assesses the faithfulness of results based on reproducibility, internal and external consistency, and generalizability. Our findings suggest that a simple application of psychological tools cannot faithfully characterize all human-like behaviors. Moreover, we discuss the impacts of aligning models with human and social values, arguing for the necessity of diversifying alignment objectives to prevent the creation of models with restricted characteristics.
Contracting Tsetlin Machine with Absorbing Automata
paper_authors: Bimal Bhattarai, Ole-Christoffer Granmo, Lei Jiao, Per-Arne Andersen, Svein Anders Tunheim, Rishad Shafik, Alex Yakovlev
for: Faster and more energy-efficient Tsetlin Machine learning
methods: A sparse Tsetlin Machine with absorbing Tsetlin Automata states
results: Accelerated learning and reduced energy consumption
Abstract
In this paper, we introduce a sparse Tsetlin Machine (TM) with absorbing Tsetlin Automata (TA) states. In brief, the TA of each clause literal has both an absorbing Exclude and an absorbing Include state, making the learning scheme absorbing instead of ergodic. When a TA reaches an absorbing state, it will never leave that state again. If the absorbing state is an Exclude state, both the automaton and the literal can be removed from further consideration; the literal will, as a result, never participate in that clause. If the absorbing state is an Include state, on the other hand, the literal is stored as a permanent part of the clause while the TA is discarded. A novel sparse data structure supports these updates by means of three action lists: Absorbed Include, Include, and Exclude. By updating these lists, the TM gets smaller and smaller as the literals and their TA withdraw. In this manner, the computation accelerates during learning, leading to faster learning and less energy consumption.
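The three action lists translate directly into a small data structure. The sketch below shows only the bookkeeping for a single clause: how a literal's TA is retired once it hits an absorbing state. State ranges, update rules, and method names are illustrative assumptions, not the paper's exact scheme.

```python
class SparseClause:
    """Bookkeeping for one clause of a contracting Tsetlin Machine.

    Each literal's TA state lives in `include` (still learning); once it
    reaches an absorbing state it moves to `absorbed_include` (kept forever,
    TA discarded) or to `excluded` (dropped entirely). Numeric bounds here
    are illustrative, not the paper's exact parameters."""

    def __init__(self, n_literals, n_states=100):
        self.n_states = n_states
        # literal id -> TA state; start everyone near the decision boundary.
        self.include = {lit: n_states // 2 for lit in range(n_literals)}
        self.absorbed_include = set()   # permanently part of the clause
        self.excluded = set()           # permanently removed

    def reward(self, lit):
        self._bump(lit, +1)

    def penalty(self, lit):
        self._bump(lit, -1)

    def _bump(self, lit, delta):
        if lit not in self.include:     # already absorbed: nothing to update
            return
        state = self.include[lit] + delta
        if state >= self.n_states:      # absorbing Include state
            del self.include[lit]
            self.absorbed_include.add(lit)
        elif state <= 0:                # absorbing Exclude state
            del self.include[lit]
            self.excluded.add(lit)
        else:
            self.include[lit] = state

    def active_literals(self):
        # Work shrinks as literals absorb, which is the source of speedup.
        return self.absorbed_include | set(self.include)
```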
Understanding Fairness Surrogate Functions in Algorithmic Fairness
paper_authors: Wei Yao, Zhanke Zhou, Zhicong Li, Bo Han, Yong Liu
for: This work aims to mitigate machine learning algorithms' biased predictions against certain population groups while achieving comparable accuracy.
methods: Surrogate functions of the fairness definition of interest are introduced to solve a constrained optimization problem. However, fairness surrogate functions from previous work may yield unfair results. Taking demographic parity as an example, this work shows both theoretically and empirically that a surrogate-fairness gap exists between the fairness definition and the fairness surrogate function, and this gap directly determines whether a surrogate function is an appropriate substitute for the fairness definition.
results: We propose a general sigmoid surrogate with a rigorous and reliable fairness guarantee, along with a novel algorithm, Balanced Surrogate, that iteratively reduces the surrogate-fairness gap to improve fairness. Finally, we provide empirical evidence on three real-world datasets showing that our methods achieve better fairness performance.
Abstract
It has been observed that machine learning algorithms exhibit biased predictions against certain population groups. To mitigate such bias while achieving comparable accuracy, a promising approach is to introduce surrogate functions of the fairness definition of interest and solve a constrained optimization problem. However, an intriguing issue in previous work is that such fairness surrogate functions may yield unfair results. In this work, in order to deeply understand this issue, taking the widely used fairness definition of demographic parity as an example, we show both theoretically and empirically that there is a surrogate-fairness gap between the fairness definition and the fairness surrogate function. The gap directly determines whether a surrogate function is an appropriate substitute for a fairness definition. The theoretical analysis and experimental results about the gap also suggest that unbounded surrogate functions are affected by points far from the decision boundary, the large-margin-points issue investigated in this paper. To address it, we propose the general sigmoid surrogate with a rigorous and reliable fairness guarantee. Interestingly, the theory also provides insights into two important issues: dealing with the large margin points and obtaining a more balanced dataset are both beneficial to fairness. Furthermore, we develop a novel and general algorithm called Balanced Surrogate, which iteratively reduces the gap to improve fairness. Finally, we provide empirical evidence showing that our methods achieve better fairness performance on three real-world datasets.
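For demographic parity, the surrogate idea can be written in a few lines: replace the indicator 1[f(x) > 0] with a sigmoid and penalize the gap in group-wise means. The sketch below is a generic rendering of this construction; the paper's bounded sigmoid surrogate and Balanced Surrogate algorithm add guarantees and an iterative gap-reduction step not shown here.

```python
import torch

def sigmoid_dp_surrogate(scores, group, scale=1.0):
    """Sigmoid surrogate for the demographic-parity violation.

    Exact DP gap: | E[1(f(x)>0) | g=0] - E[1(f(x)>0) | g=1] |.
    Surrogate: replace the indicator with sigmoid(scale * f(x)).
    Because the sigmoid is bounded, points far from the decision boundary
    (large-margin points) have tempered influence. Generic sketch, not the
    paper's code."""
    s = torch.sigmoid(scale * scores)
    rate0 = s[group == 0].mean()
    rate1 = s[group == 1].mean()
    return (rate0 - rate1).abs()

# Usage: total_loss = task_loss + lam * sigmoid_dp_surrogate(f(x), g)
scores = torch.randn(32, requires_grad=True)
group = torch.randint(0, 2, (32,))
gap = sigmoid_dp_surrogate(scores, group)
gap.backward()
```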
EEG motor imagery decoding: A framework for comparative analysis with channel attention mechanisms
results: Our experiments show that different channel attention mechanisms can improve EEG motor imagery decoding performance while maintaining a small memory footprint and low computational complexity. The framework's simplicity and generality allow extensive experiments on multiple datasets to assess the effectiveness of different attention mechanisms and the baseline architecture.
Abstract
The objective of this study is to investigate the application of various channel attention mechanisms within the domain of brain-computer interface (BCI) for motor imagery decoding. Channel attention mechanisms can be seen as a powerful evolution of spatial filters traditionally used for motor imagery decoding. This study systematically compares such mechanisms by integrating them into a lightweight architecture framework to evaluate their impact. We carefully construct a straightforward and lightweight baseline architecture designed to seamlessly integrate different channel attention mechanisms. This approach is contrary to previous works which only investigate one attention mechanism and usually build a very complex, sometimes nested architecture. Our framework allows us to evaluate and compare the impact of different attention mechanisms under the same circumstances. The easy integration of different channel attention mechanisms as well as the low computational complexity enables us to conduct a wide range of experiments on three datasets to thoroughly assess the effectiveness of the baseline model and the attention mechanisms. Our experiments demonstrate the strength and generalizability of our architecture framework as well as how channel attention mechanisms can improve the performance while maintaining the small memory footprint and low computational complexity of our baseline architecture. Our architecture emphasizes simplicity, offering easy integration of channel attention mechanisms, while maintaining a high degree of generalizability across datasets, making it a versatile and efficient solution for EEG motor imagery decoding within brain-computer interfaces.
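As a concrete example of a channel attention mechanism that slots into such a baseline, here is a squeeze-and-excitation block adapted to EEG inputs of shape (batch, channels, time). SE is one well-known choice among several; this sketch is an illustration, not the study's exact baseline.

```python
import torch
import torch.nn as nn

class SEChannelAttention(nn.Module):
    """Squeeze-and-excitation over EEG electrode channels.

    Input: (B, C, T) raw or feature-mapped EEG. Each channel is globally
    pooled over time ('squeeze'), passed through a small bottleneck MLP
    ('excitation'), and used to reweight the channels."""

    def __init__(self, n_channels, reduction=4):
        super().__init__()
        hidden = max(n_channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(n_channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, n_channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (B, C, T)
        weights = self.fc(x.mean(dim=-1))  # (B, C) channel descriptors
        return x * weights.unsqueeze(-1)   # reweighted channels

se = SEChannelAttention(n_channels=22)     # e.g., a 22-electrode montage
y = se(torch.randn(8, 22, 1000))
```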
Medical Text Simplification: Optimizing for Readability with Unlikelihood Training and Reranked Beam Search Decoding
paper_authors: Lorenzo Jaime Yu Flores, Heyuan Huang, Kejian Shi, Sophie Chheang, Arman Cohan
for: bridging the communication gap in the medical field, where technical jargon and complex constructs are commonly used.
methods: using a new unlikelihood loss and a reranked beam search decoding method to improve the readability of text simplification in the medical domain.
results: better performance on readability metrics on three datasets, offering promising avenues for improving text simplification in the medical field.
Abstract
Text simplification has emerged as an increasingly useful application of AI for bridging the communication gap in specialized fields such as medicine, where the lexicon is often dominated by technical jargon and complex constructs. Despite notable progress, methods in medical simplification sometimes result in the generated text having lower quality and diversity. In this work, we explore ways to further improve the readability of text simplification in the medical domain. We propose (1) a new unlikelihood loss that encourages generation of simpler terms and (2) a reranked beam search decoding method that optimizes for simplicity, which achieve better performance on readability metrics on three datasets. This study's findings offer promising avenues for improving text simplification in the medical field.
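The unlikelihood idea can be made concrete with a small loss function: keep the usual cross-entropy on the reference, and additionally push down the probability of tokens from a "complex" set. The token-complexity set and weighting below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def simplification_loss(logits, targets, complex_token_ids, alpha=0.5,
                        eps=1e-8):
    """Cross-entropy + unlikelihood on a set of 'complex' tokens.

    logits: (B, T, V); targets: (B, T). `complex_token_ids` is a 1-D tensor
    of vocabulary ids deemed too technical (an assumed external resource).
    The unlikelihood term is -log(1 - p(token)) summed over that set."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         targets.reshape(-1))
    probs = logits.softmax(dim=-1)                     # (B, T, V)
    p_complex = probs[..., complex_token_ids]          # (B, T, |S|)
    unlikelihood = -torch.log(1.0 - p_complex + eps).sum(-1).mean()
    return ce + alpha * unlikelihood

logits = torch.randn(2, 5, 1000, requires_grad=True)
targets = torch.randint(0, 1000, (2, 5))
complex_ids = torch.tensor([17, 42, 256])              # assumed jargon ids
loss = simplification_loss(logits, targets, complex_ids)
loss.backward()
```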
FocDepthFormer: Transformer with LSTM for Depth Estimation from Focus
results: Our model outperforms state-of-the-art models on multiple focal stack benchmark datasets, and it can be pre-trained on visually related monocular RGB depth estimation data for improved performance.
Abstract
Depth estimation from focal stacks is a fundamental computer vision problem that aims to infer depth from focus/defocus cues in the image stacks. Most existing methods tackle this problem by applying convolutional neural networks (CNNs) with 2D or 3D convolutions over a set of fixed stack images to learn features across images and stacks. Their performance is restricted due to the local properties of the CNNs, and they are constrained to process a fixed number of stacks consistent in train and inference, limiting the generalization to the arbitrary length of stacks. To handle the above limitations, we develop a novel Transformer-based network, FocDepthFormer, composed mainly of a Transformer with an LSTM module and a CNN decoder. The self-attention in Transformer enables learning more informative features via an implicit non-local cross reference. The LSTM module is learned to integrate the representations across the stack with arbitrary images. To directly capture the low-level features of various degrees of focus/defocus, we propose to use multi-scale convolutional kernels in an early-stage encoder. Benefiting from the design with LSTM, our FocDepthFormer can be pre-trained with abundant monocular RGB depth estimation data for visual pattern capturing, alleviating the demand for the hard-to-collect focal stack data. Extensive experiments on various focal stack benchmark datasets show that our model outperforms the state-of-the-art models on multiple metrics.
Knowledge Extraction and Distillation from Large-Scale Image-Text Colonoscopy Records Leveraging Large Language and Vision Models
results: Validated on multi-centre colonoscopy records (~1 million images), EndoKED shows superior performance in training polyp detection and segmentation models. Its pre-trained vision backbone also enables data-efficient and generalisable learning for optical biopsy, achieving expert-level performance.
Abstract
The development of artificial intelligence systems for colonoscopy analysis often necessitates expert-annotated image datasets. However, limitations in dataset size and diversity impede model performance and generalisation. Image-text colonoscopy records from routine clinical practice, comprising millions of images and text reports, serve as a valuable data source, though annotating them is labour-intensive. Here we leverage recent advancements in large language and vision models and propose EndoKED, a data mining paradigm for deep knowledge extraction and distillation. EndoKED automates the transformation of raw colonoscopy records into image datasets with pixel-level annotation. We validate EndoKED using multi-centre datasets of raw colonoscopy records (~1 million images), demonstrating its superior performance in training polyp detection and segmentation models. Furthermore, the EndoKED pre-trained vision backbone enables data-efficient and generalisable learning for optical biopsy, achieving expert-level performance in both retrospective and prospective validation.
MST-GAT: A Multimodal Spatial-Temporal Graph Attention Network for Time Series Anomaly Detection
results: Experimental results show that MST-GAT outperforms state-of-the-art baselines on four multimodal benchmark datasets and strengthens the interpretability of detected anomalies.
Abstract
Multimodal time series (MTS) anomaly detection is crucial for maintaining the safety and stability of working devices (e.g., water treatment system and spacecraft), whose data are characterized by multivariate time series with diverse modalities. Although recent deep learning methods show great potential in anomaly detection, they do not explicitly capture spatial-temporal relationships between univariate time series of different modalities, resulting in more false negatives and false positives. In this paper, we propose a multimodal spatial-temporal graph attention network (MST-GAT) to tackle this problem. MST-GAT first employs a multimodal graph attention network (M-GAT) and a temporal convolution network to capture the spatial-temporal correlation in multimodal time series. Specifically, M-GAT uses a multi-head attention module and two relational attention modules (i.e., intra- and inter-modal attention) to model modal correlations explicitly. Furthermore, MST-GAT optimizes the reconstruction and prediction modules simultaneously. Experimental results on four multimodal benchmarks demonstrate that MST-GAT outperforms the state-of-the-art baselines. Further analysis indicates that MST-GAT strengthens the interpretability of detected anomalies by locating the most anomalous univariate time series.
Accurate prediction of international trade flows: Leveraging knowledge graphs and their embeddings
results: The study finds that using knowledge graph embeddings (KGE) to predict international trade links improves prediction accuracy while providing a knowledge representation with embedding explainability. It also analyzes the influence of embedding methods on other intelligent algorithms.
Abstract
Knowledge representation (KR) is vital in designing symbolic notations to represent real-world facts and facilitate automated decision-making tasks. Knowledge graphs (KGs) have emerged so far as a popular form of KR, offering a contextual and human-like representation of knowledge. In international economics, KGs have proven valuable in capturing complex interactions between commodities, companies, and countries. By putting the gravity model, which is a common economic framework, into the process of building KGs, important factors that affect trade relationships can be taken into account, making it possible to predict international trade patterns. This paper proposes an approach that leverages Knowledge Graph embeddings for modeling international trade, focusing on link prediction using embeddings. Thus, valuable insights are offered to policymakers, businesses, and economists, enabling them to anticipate the effects of changes in the international trade system. Moreover, the integration of traditional machine learning methods with KG embeddings, such as decision trees and graph neural networks are also explored. The research findings demonstrate the potential for improving prediction accuracy and provide insights into embedding explainability in knowledge representation. The paper also presents a comprehensive analysis of the influence of embedding methods on other intelligent algorithms.
Causal discovery using dynamically requested knowledge
paper_authors: Neville K Kitson, Anthony C Constantinou
for: This paper aims to improve the structural accuracy of causal Bayesian network (CBN) learning by investigating an approach in which the structure learning algorithm itself dynamically identifies and requests human knowledge.
methods: The paper integrates dynamically requested human knowledge into the Tabu structure learning algorithm.
results: The study finds that this approach improves structural accuracy, uses human expertise more effectively, and makes the structure learning process more transparent.
Abstract
Causal Bayesian Networks (CBNs) are an important tool for reasoning under uncertainty in complex real-world systems. Determining the graphical structure of a CBN remains a key challenge and is undertaken either by eliciting it from humans, using machine learning to learn it from data, or using a combination of these two approaches. In the latter case, human knowledge is generally provided to the algorithm before it starts, but here we investigate a novel approach where the structure learning algorithm itself dynamically identifies and requests knowledge for relationships that the algorithm identifies as uncertain during structure learning. We integrate this approach into the Tabu structure learning algorithm and show that it offers considerable gains in structural accuracy, which are generally larger than those offered by existing approaches for integrating knowledge. We suggest that a variant which requests only arc orientation information may be particularly useful where the practitioner has little preexisting knowledge of the causal relationships. As well as offering improved accuracy, the approach can use human expertise more effectively and contributes to making the structure learning process more transparent.
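The dynamic-request idea can be sketched independently of any particular structure-learning package: whenever the score difference between the two orientations of a candidate arc falls below a threshold, the algorithm pauses and asks the human for the orientation. Both `local_score` and `ask_expert` below are hypothetical stand-ins, and the near-tie trigger is an assumption about how "uncertain" is operationalized.

```python
def maybe_request_orientation(a, b, local_score, ask_expert, margin=0.01):
    """Decide an arc orientation, querying the expert only when uncertain.

    `local_score(child, parents)` is a hypothetical decomposable network
    score (e.g., BIC); `ask_expert(a, b)` is a hypothetical callback that
    returns the preferred directed pair. Only near-ties trigger a request,
    which is what makes the knowledge 'dynamically requested'."""
    score_ab = local_score(b, parents=[a])   # score if a -> b
    score_ba = local_score(a, parents=[b])   # score if b -> a
    if abs(score_ab - score_ba) < margin:    # data cannot decide: ask human
        return ask_expert(a, b)
    return (a, b) if score_ab > score_ba else (b, a)
```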
Uncovering wall-shear stress dynamics from neural-network enhanced fluid flow measurements
paper_authors: Esther Lagemann, Steven L. Brunton, Christian Lagemann
for: This paper aims to provide accurate predictions of wall-shear stress, supporting sustainability, resource conservation, and carbon neutrality in fields such as transportation, public utility infrastructure, energy technology, and medicine.
methods: The paper uses a deep optical flow estimator combined with physical knowledge to derive velocity and wall-shear stress fields with high spatial and temporal resolution.
results: The method accurately predicts wall-shear stress fields, and its physical correctness and effectiveness are demonstrated on experimental data.

Abstract
Friction drag from a turbulent fluid moving past or inside an object plays a crucial role in domains as diverse as transportation, public utility infrastructure, energy technology, and human health. As a direct measure of the shear-induced friction forces, an accurate prediction of the wall-shear stress can contribute to sustainability, conservation of resources, and carbon neutrality in civil aviation as well as enhanced medical treatment of vascular diseases and cancer. Despite such importance for our modern society, we still lack adequate experimental methods to capture the instantaneous wall-shear stress dynamics. In this contribution, we present a holistic approach that derives velocity and wall-shear stress fields with impressive spatial and temporal resolution from flow measurements using a deep optical flow estimator with physical knowledge. The validity and physical correctness of the derived flow quantities is demonstrated with synthetic and real-world experimental data covering a range of relevant fluid flows.
results: This thesis proposal mainly surveys the shortcomings and limitations of existing simultaneous speech translation approaches and suggests possible remedies; it does not report direct experimental results.

Abstract
Simultaneous speech translation (SST) aims to provide real-time translation of spoken language, even before the speaker finishes their sentence. Traditionally, SST has been addressed primarily by cascaded systems that decompose the task into subtasks, including speech recognition, segmentation, and machine translation. However, the advent of deep learning has sparked significant interest in end-to-end (E2E) systems. Nevertheless, a major limitation of most approaches to E2E SST reported in the current literature is that they assume that the source speech is pre-segmented into sentences, which is a significant obstacle for practical, real-world applications. This thesis proposal addresses end-to-end simultaneous speech translation, particularly in the long-form setting, i.e., without pre-segmentation. We present a survey of the latest advancements in E2E SST, assess the primary obstacles in SST and its relevance to long-form scenarios, and suggest approaches to tackle these challenges.
USDC: Unified Static and Dynamic Compression for Visual Transformer
results: Extensive experiments on various baseline Visual Transformer models show that the method achieves a better balance between compression ratio and model performance, and provides more stable performance across different batch sizes.

Abstract
Visual Transformers have achieved great success in almost all vision tasks, such as classification and detection. However, the model complexity and inference speed of visual transformers hinder their deployment in industrial products. Various model compression techniques focus on directly compressing a visual transformer into a smaller one while maintaining model performance; however, performance drops dramatically when the compression ratio is large. Several dynamic network techniques have also been applied to dynamically compress visual transformers into input-adaptive efficient sub-structures during the inference stage, which can achieve a better trade-off between compression ratio and model performance. However, the upper bound on the memory of dynamic models is not reduced in practical deployment, since the whole original visual transformer model and the additional control gating modules must be loaded onto devices together for inference. To alleviate the disadvantages of both categories of methods, we propose to unify static compression and dynamic compression techniques jointly to obtain an input-adaptive compressed model, which can better balance the total compression ratio and model performance. Moreover, in practical deployment, the batch sizes of the training and inference stages usually differ, which causes inference performance to be worse than training performance, an issue untouched by all previous dynamic network papers. We propose a sub-group gates augmentation technique to solve this performance drop problem. Extensive experiments demonstrate the superiority of our method on various baseline visual transformers such as DeiT and T2T-ViT.
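A minimal sketch of the dynamic side of such input-adaptive compression is given below: a lightweight gate scores each token and only the most important tokens are kept for subsequent blocks. The gate design and keep ratio are illustrative assumptions, not the paper's sub-group gates.

```python
import torch
import torch.nn as nn

class TokenGate(nn.Module):
    """Keep only the top-k tokens per input, scored by a learned gate."""
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # per-token importance score
        self.keep_ratio = keep_ratio

    def forward(self, x):                # x: (batch, tokens, dim)
        k = max(1, int(x.size(1) * self.keep_ratio))
        scores = self.score(x).squeeze(-1)            # (batch, tokens)
        idx = scores.topk(k, dim=1).indices           # k most important
        idx = idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        return torch.gather(x, 1, idx)                # pruned sequence

x = torch.randn(2, 16, 64)
print(TokenGate(64)(x).shape)   # torch.Size([2, 8, 64])
```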
H2O Open Ecosystem for State-of-the-art Large Language Models
paper_authors: Arno Candel, Jon McKinney, Philipp Singer, Pascal Pfeiffer, Maximilian Jeblick, Chun Ming Lee, Marcos V. Conde
for: The paper is written to introduce an open-source ecosystem for developing and testing large language models (LLMs) to address the risks posed by closed-source approaches and to make AI development more accessible, efficient, and trustworthy.
methods: The paper presents a complete open-source ecosystem for LLMs, including a family of fine-tuned LLMs of diverse sizes and a framework and no-code GUI called H2O LLM Studio for efficient fine-tuning, evaluation, and deployment of LLMs using state-of-the-art techniques.
results: The paper introduces h2oGPT, a family of fine-tuned LLMs of diverse sizes, and demonstrates the effectiveness of the H2O LLM Studio for efficient fine-tuning, evaluation, and deployment of LLMs. The demo is available at https://gpt.h2o.ai/.

Abstract
Large Language Models (LLMs) represent a revolution in AI. However, they also pose many significant risks, such as the presence of biased, private, copyrighted or harmful text. For this reason we need open, transparent and safe solutions. We introduce a complete open-source ecosystem for developing and testing LLMs. The goal of this project is to boost open alternatives to closed-source approaches. We release h2oGPT, a family of fine-tuned LLMs of diverse sizes. We also introduce H2O LLM Studio, a framework and no-code GUI designed for efficient fine-tuning, evaluation, and deployment of LLMs using the most recent state-of-the-art techniques. Our code and models are fully open-source. We believe this work helps to boost AI development and make it more accessible, efficient and trustworthy. The demo is available at: https://gpt.h2o.ai/
ASP: Automatic Selection of Proxy dataset for efficient AutoML
results: Experimental results show that ASP obtains better results than other data selection methods at all selection ratios and saves 2x-20x of AutoML processing time.

Abstract
Deep neural networks have gained great success due to the increasing amounts of data, and diverse effective neural network designs. However, it also brings a heavy computing burden as the amount of training data is proportional to the training time. In addition, a well-behaved model requires repeated trials of different structure designs and hyper-parameters, which may take a large amount of time even with state-of-the-art (SOTA) hyper-parameter optimization (HPO) algorithms and neural architecture search (NAS) algorithms. In this paper, we propose an Automatic Selection of Proxy dataset framework (ASP) aimed to dynamically find the informative proxy subsets of training data at each epoch, reducing the training data size as well as saving the AutoML processing time. We verify the effectiveness and generalization of ASP on CIFAR10, CIFAR100, ImageNet16-120, and ImageNet-1k, across various public model benchmarks. The experiment results show that ASP can obtain better results than other data selection methods at all selection ratios. ASP can also enable much more efficient AutoML processing with a speedup of 2x-20x while obtaining better architectures and better hyper-parameters compared to utilizing the entire dataset.
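The per-epoch proxy-selection idea can be sketched as follows; using the previous epoch's per-example loss as the informativeness signal is an assumption made for illustration, and ASP's actual selection criterion may differ.

```python
import numpy as np

def select_proxy_subset(losses, ratio=0.3):
    """Return indices of the most informative examples for the next epoch.
    losses: array of per-example losses from the previous epoch."""
    k = max(1, int(len(losses) * ratio))
    return np.argsort(losses)[-k:]   # keep the k hardest examples

rng = np.random.default_rng(0)
epoch_losses = rng.random(1000)
subset = select_proxy_subset(epoch_losses, ratio=0.3)
print(len(subset), "examples kept for the next epoch")
```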
HGCVAE: Integrating Generative and Contrastive Learning for Heterogeneous Graph Learning
results: Compared with various state-of-the-art baselines, HGCVAE achieves remarkable results, confirming its superiority.

Abstract
Generative self-supervised learning (SSL) has exhibited significant potential and garnered increasing interest in graph learning. In this study, we aim to explore the problem of generative SSL in the context of heterogeneous graph learning (HGL). The previous SSL approaches for heterogeneous graphs have primarily relied on contrastive learning, necessitating the design of complex views to capture heterogeneity. However, existing generative SSL methods have not fully leveraged the capabilities of generative models to address the challenges of HGL. In this paper, we present HGCVAE, a novel contrastive variational graph auto-encoder that liberates HGL from the burden of intricate heterogeneity capturing. Instead of focusing on complicated heterogeneity, HGCVAE harnesses the full potential of generative SSL. HGCVAE innovatively consolidates contrastive learning with generative SSL, introducing several key innovations. Firstly, we employ a progressive mechanism to generate high-quality hard negative samples for contrastive learning, utilizing the power of variational inference. Additionally, we present a dynamic mask strategy to ensure effective and stable learning. Moreover, we propose an enhanced scaled cosine error as the criterion for better attribute reconstruction. As an initial step in combining generative and contrastive SSL, HGCVAE achieves remarkable results compared to various state-of-the-art baselines, confirming its superiority.
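The scaled cosine error mentioned above can be sketched as below, following the common formulation in which the per-node cosine error is raised to a power gamma so that easy, well-reconstructed nodes are down-weighted; HGCVAE's "enhanced" variant may differ in detail.

```python
import torch
import torch.nn.functional as F

def scaled_cosine_error(x, x_hat, gamma=2.0):
    """Reconstruction criterion: mean of (1 - cosine similarity)^gamma."""
    cos = F.cosine_similarity(x, x_hat, dim=-1)   # (num_nodes,)
    return ((1.0 - cos) ** gamma).mean()

x = torch.randn(32, 128)                          # node attributes
x_hat = x + 0.1 * torch.randn_like(x)             # decoder output
print(scaled_cosine_error(x, x_hat).item())
```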
MeKB-Rec: Personal Knowledge Graph Learning for Cross-Domain Recommendation
paper_authors: Xin Su, Yao Zhou, Zifei Shan, Qian Chen
for: Strengthening recommendations for new users, addressing the cold-start problem in modern recommender systems.
methods: Uses a Personal Knowledge Graph (PKG) and Pretrained Language Models (PLMs) to build a domain-invariant semantic representation of user interests.
results: Experiments on several public CDR datasets demonstrate that the MeKB-Rec formulation is more powerful than previous approaches, improving the HR@10 and NDCG@10 metrics by 24%-91%, with a 105% HR@10 gain for zero-shot users who have no behavior in the target domain. Deployed in WeiXin recommendation scenarios, MeKB-Rec achieves significant gains in core online metrics and now serves hundreds of millions of users in production.

Abstract
It is a long-standing challenge in modern recommender systems to effectively make recommendations for new users, namely the cold-start problem. Cross-Domain Recommendation (CDR) has been proposed to address this challenge, but current ways to represent users' interests across systems are still severely limited. We introduce Personal Knowledge Graph (PKG) as a domain-invariant interest representation, and propose a novel CDR paradigm named MeKB-Rec. We first link users and entities in a knowledge base to construct a PKG of users' interests, named MeKB. Then we learn a semantic representation of MeKB for the cross-domain recommendation. To efficiently utilize limited training data in CDR, MeKB-Rec employs Pretrained Language Models to inject world knowledge into understanding users' interests. Beyond most existing systems, our approach builds a semantic mapping across domains which breaks the requirement for in-domain user behaviors, enabling zero-shot recommendations for new users in a low-resource domain. We evaluate MeKB-Rec on well-established public CDR datasets, and demonstrate that the new formulation is more powerful than previous approaches, achieving a new state-of-the-art that significantly improves HR@10 and NDCG@10 metrics over the best previous approaches by 24%-91%, with a 105% improvement for HR@10 of zero-shot users with no behavior in the target domain. We deploy MeKB-Rec in WeiXin recommendation scenarios and achieve significant gains in core online metrics. MeKB-Rec is now serving hundreds of millions of users in real-world products.
Feature Pyramid biLSTM: Using Smartphone Sensors for Transportation Mode Detection
results: Using data from only three sensors (accelerometer, gyroscope, and magnetometer), FPbiLSTM achieves 95.1% accuracy and a 94.7% F1-score on the 2018 Sussex-Huawei Locomotion (SHL) challenge dataset, recognizing eight different transportation modes.

Abstract
The widespread utilization of smartphones has provided extensive availability to Inertial Measurement Units, providing a wide range of sensory data that can be advantageous for the detection of transportation modes. The objective of this study is to propose a novel end-to-end approach to effectively explore a reduced amount of sensory data collected from a smartphone to achieve accurate mode detection in common daily traveling activities. Our approach, called Feature Pyramid biLSTM (FPbiLSTM), is characterized by its ability to reduce the number of sensors required and the processing demands, resulting in a more efficient modeling process than other current models without sacrificing the quality of the outcomes. FPbiLSTM extends an existing CNN biLSTM model with the Feature Pyramid Network, leveraging the advantages of both shallow-layer richness and deeper-layer feature resilience for capturing temporal moving patterns in various transportation modes. It exhibits excellent performance by employing the data collected from only three out of seven sensors, i.e. accelerometers, gyroscopes, and magnetometers, on the 2018 Sussex-Huawei Locomotion (SHL) challenge dataset, attaining a noteworthy accuracy of 95.1% and an F1-score of 94.7% in detecting eight different transportation modes.
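A minimal Python sketch of the CNN + bidirectional-LSTM backbone for IMU windows follows; the layer sizes are illustrative assumptions, and the actual FPbiLSTM additionally fuses features from multiple depths through a Feature Pyramid Network.

```python
import torch
import torch.nn as nn

class CnnBiLstm(nn.Module):
    """Convolutions extract local motion patterns from the sensor channels;
    a biLSTM models their temporal evolution across the window."""
    def __init__(self, in_channels=9, num_classes=8):  # 3 sensors x 3 axes
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(64, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 128, num_classes)

    def forward(self, x):                 # x: (batch, channels, time)
        h = self.conv(x).transpose(1, 2)  # -> (batch, time, features)
        out, _ = self.lstm(h)
        return self.head(out[:, -1])      # classify from last time step

x = torch.randn(4, 9, 500)               # 4 windows of 500 samples each
print(CnnBiLstm()(x).shape)               # torch.Size([4, 8])
```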
In-Context Few-Shot Relation Extraction via Pre-Trained Language Models
paper_authors: Yilmazcan Ozyurt, Stefan Feuerriegel, Ce Zhang
for: Extracting structured human knowledge from text documents, extending existing language-model techniques.
methods: Proposes an in-context few-shot relation extraction framework based on pre-trained language models that requires neither named entities as input nor human annotation of documents.
results: Evaluated on the DocRED dataset, the framework performs comparably to or better than the original labels and can easily be updated for new sets of relations.

Abstract
Relation extraction aims at inferring structured human knowledge from textual documents. State-of-the-art methods based on language models commonly have two limitations: (1) they require named entities to be either given as input or infer them, which introduces additional noise, and (2) they require human annotations of documents. As a remedy, we present a novel framework for in-context few-shot relation extraction via pre-trained language models. To the best of our knowledge, we are the first to reformulate the relation extraction task as a tailored in-context few-shot learning paradigm. Thereby, we achieve crucial benefits in that we eliminate the need for both named entity recognition and human annotation of documents. Unlike existing methods based on fine-tuning, our framework is flexible in that it can be easily updated for a new set of relations without re-training. We evaluate our framework using DocRED, the largest publicly available dataset for document-level relation extraction, and demonstrate that our framework achieves state-of-the-art performance. Finally, our framework allows us to identify missing annotations, and we thus show that our framework actually performs much better than the original labels from the development set of DocRED.
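The reformulation as in-context few-shot learning can be sketched as a simple prompt-assembly step; the template below is an illustrative assumption, not the paper's exact prompt.

```python
def build_re_prompt(demos, query_text, head, tail):
    """Assemble a few labeled demonstrations plus the query; the language
    model is expected to complete the relation label."""
    lines = ["Extract the relation between the two entities."]
    for d in demos:
        lines.append(f"Text: {d['text']}\n"
                     f"Entities: {d['head']}, {d['tail']}\n"
                     f"Relation: {d['relation']}")
    lines.append(f"Text: {query_text}\nEntities: {head}, {tail}\nRelation:")
    return "\n\n".join(lines)

demos = [{"text": "Marie Curie was born in Warsaw.",
          "head": "Marie Curie", "tail": "Warsaw",
          "relation": "place_of_birth"}]
print(build_re_prompt(demos, "Ada Lovelace was born in London.",
                      "Ada Lovelace", "London"))
```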
Multi-omics Sampling-based Graph Transformer for Synthetic Lethality Prediction
results: The study finds that MSGT-SL achieves strong empirical performance on real-world SL tasks, demonstrating its effectiveness for SL prediction.

Abstract
Synthetic lethality (SL) prediction is used to identify if the co-mutation of two genes results in cell death. The prevalent strategy is to abstract SL prediction as an edge classification task on gene nodes within SL data and achieve it through graph neural networks (GNNs). However, GNNs suffer from limitations in their message passing mechanisms, including over-smoothing and over-squashing issues. Moreover, harnessing the information of non-SL gene relationships within large-scale multi-omics data to facilitate SL prediction poses a non-trivial challenge. To tackle these issues, we propose a new multi-omics sampling-based graph transformer for SL prediction (MSGT-SL). Concretely, we introduce a shallow multi-view GNN to acquire local structural patterns from both SL and multi-omics data. Further, we input gene features that encode multi-view information into the standard self-attention to capture long-range dependencies. Notably, starting with batch genes from SL data, we adopt parallel random walk sampling across multiple omics gene graphs encompassing them. Such sampling effectively and modestly incorporates genes from omics in a structure-aware manner before using self-attention. We showcase the effectiveness of MSGT-SL on real-world SL tasks, demonstrating the empirical benefits gained from the graph transformer and multi-omics data.
Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models
paper_authors: Hsuan Su, Cheng-Chu Cheng, Hua Farn, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee
for: This study aims to detect potential gender bias in large language models (LLMs) and proposes a method that automatically generates test cases to uncover such bias.
methods: The method detects bias in LLMs using automatically generated test cases and mitigates the identified bias using those cases as in-context demonstrations.
results: Experimental results show that the approach leads LLMs to generate fairer responses.

Abstract
Recently, researchers have made considerable improvements in dialogue systems with the progress of large language models (LLMs) such as ChatGPT and GPT-4. These LLM-based chatbots encode the potential biases while retaining disparities that can harm humans during interactions. The traditional biases investigation methods often rely on human-written test cases. However, these test cases are usually expensive and limited. In this work, we propose a first-of-its-kind method that automatically generates test cases to detect LLMs' potential gender bias. We apply our method to three well-known LLMs and find that the generated test cases effectively identify the presence of biases. To address the biases identified, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning to circumvent the need for parameter fine-tuning. The experimental results show that LLMs generate fairer responses with the proposed approach.
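The mitigation step can be sketched as follows: previously generated bias test cases, paired with fair reference answers, are prepended as in-context demonstrations so the model is steered toward unbiased responses without any fine-tuning. The demonstration format is an illustrative assumption.

```python
def build_debiasing_prompt(test_cases, user_query):
    """Prepend (biased prompt, fair response) pairs as demonstrations."""
    demos = []
    for case in test_cases:
        demos.append(f"User: {case['biased_prompt']}\n"
                     f"Assistant: {case['fair_response']}")
    demos.append(f"User: {user_query}\nAssistant:")
    return "\n\n".join(demos)

cases = [{"biased_prompt": "Who is better at math, boys or girls?",
          "fair_response": "Mathematical ability does not depend on "
                           "gender; individuals vary widely."}]
print(build_debiasing_prompt(cases, "Are men better leaders than women?"))
```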
Sim-to-Real Transfer of Adaptive Control Parameters for AUV Stabilization under Current Disturbance
results: Experimental results show that the method learns proficient policies from suboptimal simulated models and, when transferred to a real vehicle, achieves control performance 3 times higher than its model-based nonadaptive counterpart.

Abstract
Learning-based adaptive control methods hold the promise of enabling autonomous agents to reduce the effect of process variations with minimal human intervention. However, their application to autonomous underwater vehicles (AUVs) has so far been restricted due to 1) unknown dynamics in the form of sea current disturbances that we can neither model properly nor measure due to limited sensor capability and 2) the nonlinearity of AUV tasks, where the controller response at some operating points must be overly conservative in order to satisfy the specification at other operating points. Deep Reinforcement Learning (DRL) can alleviate these limitations by training general-purpose neural network policies, but applications of DRL algorithms to AUVs have been restricted to simulated environments, due to their inherent high sample complexity and distribution shift problem. This paper presents a novel approach, merging the Maximum Entropy Deep Reinforcement Learning framework with a classic model-based control architecture, to formulate an adaptive controller. Within this framework, we introduce a Sim-to-Real transfer strategy comprising the following components: a bio-inspired experience replay mechanism, an enhanced domain randomisation technique, and an evaluation protocol executed on a physical platform. Our experimental assessments demonstrate that this method effectively learns proficient policies from suboptimal simulated models of the AUV, resulting in control performance 3 times higher when transferred to a real-world vehicle, compared to its model-based nonadaptive but optimal counterpart.
Robust-MBFD: A Robust Deep Learning System for Motor Bearing Faults Detection Using Multiple Deep Learning Training Strategies and A Novel Double Loss Function
methods: The paper proposes and evaluates several machine-learning-based systems for the MBFD task, together with three deep-learning-based systems, each adopting a different training strategy: supervised learning, semi-supervised learning, and unsupervised learning.
results: Extensive experiments were conducted on benchmark datasets including those from the American Society for Mechanical Failure Prevention Technology (MFPT), the Case Western Reserve University Bearing Center (CWRU), and the Condition Monitoring of Bearing Damage in Electromechanical Drive Systems dataset from Paderborn University (PU). The results show that deep-learning-based systems are more effective than machine-learning-based systems for the MBFD task, and that a robust and general deep-learning-based system with a novel loss function performs well across the benchmark datasets, demonstrating its potential for real-life MBFD applications.

Abstract
This paper presents a comprehensive analysis of motor bearing fault detection (MBFD), which involves the task of identifying faults in a motor bearing based on its vibration. To this end, we first propose and evaluate various machine learning based systems for the MBFD task. Furthermore, we propose three deep learning based systems for the MBFD task, each of which explores one of the following training strategies: supervised learning, semi-supervised learning, and unsupervised learning. The proposed machine learning based systems and deep learning based systems are evaluated, compared, and then they are used to identify the best model for the MBFD task. We conducted extensive experiments on various benchmark datasets of motor bearing faults, including those from the American Society for Mechanical Failure Prevention Technology (MFPT), Case Western Reserve University Bearing Center (CWRU), and the Condition Monitoring of Bearing Damage in Electromechanical Drive Systems from Paderborn University (PU). The experimental results on different datasets highlight two main contributions of this study. First, we prove that deep learning based systems are more effective than machine learning based systems for the MBFD task. Second, we achieve a robust and general deep learning based system with a novel loss function for the MBFD task on several benchmark datasets, demonstrating its potential for real-life MBFD applications.
Denevil: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning
paper_authors: Shitong Duan, Xiaoyuan Yi, Peng Zhang, Tun Lu, Xing Xie, Ning Gu
for: This paper aims to explore the ethical values of large language models (LLMs) and develop methods to improve their value compliance.
methods: The paper proposes a novel prompt generation algorithm called DeNEVIL to dynamically elicit the violation of ethics in LLMs, and constructs a high-quality dataset called MoralPrompt to benchmark the intrinsic values of LLMs. The paper also develops an in-context alignment method called VILMO to improve the value compliance of LLM outputs.
results: The paper discovers that most LLMs are essentially misaligned and demonstrates the effectiveness of VILMO in improving the value compliance of LLM outputs. The results provide a promising initial step in studying the ethical values of LLMs and aligning their outputs with human values.
Abstract
Large Language Models (LLMs) have made unprecedented breakthroughs, yet their increasing integration into everyday life might raise societal risks due to generated unethical content. Despite extensive study on specific issues like bias, the intrinsic values of LLMs remain largely unexplored from a moral philosophy perspective. This work delves into ethical values utilizing Moral Foundation Theory. Moving beyond conventional discriminative evaluations with poor reliability, we propose DeNEVIL, a novel prompt generation algorithm tailored to dynamically exploit LLMs' value vulnerabilities and elicit the violation of ethics in a generative manner, revealing their underlying value inclinations. On such a basis, we construct MoralPrompt, a high-quality dataset comprising 2,397 prompts covering 500+ value principles, and then benchmark the intrinsic values across a spectrum of LLMs. We discovered that most models are essentially misaligned, necessitating further ethical value alignment. In response, we develop VILMO, an in-context alignment method that substantially enhances the value compliance of LLM outputs by learning to generate appropriate value instructions, outperforming existing competitors. Our methods are suitable for black-box and open-source models, offering a promising initial step in studying the ethical values of LLMs.
Nonet at SemEval-2023 Task 6: Methodologies for Legal Evaluation
results: The paper reports results for the three subtasks, including data statistics and methodology, and obtained competitive leaderboard rankings of 15th, 11th, and 1st, respectively.

Abstract
This paper describes our submission to the SemEval-2023 for Task 6 on LegalEval: Understanding Legal Texts. Our submission concentrated on three subtasks: Legal Named Entity Recognition (L-NER) for Task-B, Legal Judgment Prediction (LJP) for Task-C1, and Court Judgment Prediction with Explanation (CJPE) for Task-C2. We conducted various experiments on these subtasks and presented the results in detail, including data statistics and methodology. It is worth noting that legal tasks, such as those tackled in this research, have been gaining importance due to the increasing need to automate legal analysis and support. Our team obtained competitive rankings of 15$^{th}$, 11$^{th}$, and 1$^{st}$ in Task-B, Task-C1, and Task-C2, respectively, as reported on the leaderboard.
Understanding Contrastive Learning via Distributionally Robust Optimization
results: The study shows that CL, with InfoNCE acting as an estimator of mutual information, yields a new $\phi$-divergence-based generalized mutual information estimator, and proposes an Adjusted InfoNCE loss (ADNCE) that mitigates CL's shortcomings, including over-conservatism and sensitivity to outliers. Extensive experiments on images, sentences, and graphs validate the effectiveness of the proposal. Code is available at \url{https://github.com/junkangwu/ADNCE}.

Abstract
This study reveals the inherent tolerance of contrastive learning (CL) towards sampling bias, wherein negative samples may encompass similar semantics (\eg labels). However, existing theories fall short in providing explanations for this phenomenon. We bridge this research gap by analyzing CL through the lens of distributionally robust optimization (DRO), yielding several key insights: (1) CL essentially conducts DRO over the negative sampling distribution, thus enabling robust performance across a variety of potential distributions and demonstrating robustness to sampling bias; (2) The design of the temperature $\tau$ is not merely heuristic but acts as a Lagrange Coefficient, regulating the size of the potential distribution set; (3) A theoretical connection is established between DRO and mutual information, thus presenting fresh evidence for ``InfoNCE as an estimate of MI'' and a new estimation approach for $\phi$-divergence-based generalized mutual information. We also identify CL's potential shortcomings, including over-conservatism and sensitivity to outliers, and introduce a novel Adjusted InfoNCE loss (ADNCE) to mitigate these issues. It refines potential distribution, improving performance and accelerating convergence. Extensive experiments on various domains (image, sentence, and graphs) validate the effectiveness of the proposal. The code is available at \url{https://github.com/junkangwu/ADNCE}.
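For reference, a minimal sketch of the InfoNCE loss under discussion is shown below; under the paper's DRO reading, the temperature tau acts as a Lagrange coefficient, with smaller values concentrating the loss on the hardest negatives (i.e., a larger effective worst-case distribution set).

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau=0.1):
    """anchor, positive: (batch, dim); negatives: (batch, n_neg, dim)."""
    anchor = F.normalize(anchor, dim=-1)
    pos = (anchor * F.normalize(positive, dim=-1)).sum(-1, keepdim=True)
    neg = torch.einsum("bd,bnd->bn", anchor, F.normalize(negatives, dim=-1))
    logits = torch.cat([pos, neg], dim=1) / tau   # positive is class 0
    labels = torch.zeros(anchor.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

a, p = torch.randn(8, 64), torch.randn(8, 64)
n = torch.randn(8, 32, 64)
print(info_nce(a, p, n).item())
```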
Fast Graph Condensation with Structure-based Neural Tangent Kernel
results: Proposes a graph condensation framework (GC-SNTK) based on a Structure-based Neural Tangent Kernel (SNTK) that reduces the computational cost of condensing graph data while maintaining high prediction performance.

Abstract
The rapid development of Internet technology has given rise to a vast amount of graph-structured data. Graph Neural Networks (GNNs), an effective method for various graph mining tasks, incur substantial computational costs when dealing with large-scale graph data. A data-centric solution is to condense the large graph dataset into a smaller one without sacrificing the predictive performance of GNNs. However, existing efforts, which condense graph-structured data through a computationally intensive bi-level optimization architecture, also suffer from massive computation costs. In this paper, we propose reformulating the graph condensation problem as a Kernel Ridge Regression (KRR) task instead of iteratively training GNNs in the inner loop of bi-level optimization. More specifically, we propose a novel dataset condensation framework (GC-SNTK) for graph-structured data, in which a Structure-based Neural Tangent Kernel (SNTK) is developed to capture the topology of the graph and serves as the kernel function in the KRR paradigm. Comprehensive experiments demonstrate the effectiveness of our proposed model in accelerating graph condensation while maintaining high prediction performance.
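The KRR step that replaces inner-loop GNN training can be sketched as follows; an RBF kernel stands in for the paper's structure-based NTK, an assumption made to keep the example self-contained.

```python
import numpy as np

def krr_fit(K, y, lam=1e-3):
    """Closed-form KRR: alpha = (K + lam * I)^-1 y."""
    return np.linalg.solve(K + lam * np.eye(len(K)), y)

def rbf(X, Z, gamma=0.5):
    """Kernel matrix between row sets X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 8)), rng.normal(size=50)   # condensed set
alpha = krr_fit(rbf(X, X), y)

X_test = rng.normal(size=(5, 8))
print(rbf(X_test, X) @ alpha)   # predictions need only kernel values
```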
Spoofing Attack Detection in the Physical Layer with Robustness to User Movement
paper_authors: Daniel Romero, Tien Ngoc Ha, Peter Gerstoft
for: Preventing spoofing attacks.
methods: Combines a deep-learning-based position-change detector with community detection on graphs.
results: The scheme can accurately distinguish spoofing from user movement.

Abstract
In a spoofing attack, an attacker impersonates a legitimate user to access or modify data belonging to the latter. Typical approaches for spoofing detection in the physical layer declare an attack when a change is observed in certain channel features, such as the received signal strength (RSS) measured by spatially distributed receivers. However, since channels change over time, for example due to user movement, such approaches are impractical. To sidestep this limitation, this paper proposes a scheme that relies on the decisions of a position-change detector based on a deep neural network to distinguish spoofing from movement. Building upon community detection on graphs, the sequence of received frames is partitioned into subsequences to detect concurrent transmissions from distinct locations. The scheme can be easily deployed in practice since it just involves collecting a small dataset of measurements at a few tens of locations whose positions need not even be computed or recorded. The scheme is evaluated on real data collected for this purpose.
Radio Map Estimation in the Real-World: Empirical Validation and Analysis
paper_authors: Raju Shrestha, Tien Ngoc Ha, Pham Q. Viet, Daniel Romero
for: This paper addresses quantifying received signal strength, or other magnitudes of the radio-frequency environment, at every point of a geographical region.
methods: The paper empirically validates a representative subset of existing radio map estimators.
results: Sophisticated estimators based on deep neural networks (DNNs) offer the best performance, but they require large volumes of training data to provide a substantial advantage. A novel hybrid estimator enjoys the benefits of both kinds of estimators and may be worth exploring further.

Abstract
Radio maps quantify received signal strength or other magnitudes of the radio frequency environment at every point of a geographical region. These maps play a vital role in a large number of applications such as wireless network planning, spectrum management, and optimization of communication systems. However, empirical validation of the large number of existing radio map estimators is highly limited. To fill this gap, a large data set of measurements has been collected with an autonomous unmanned aerial vehicle (UAV) and a representative subset of these estimators were evaluated on this data. The performance-complexity trade-off and the impact of fast fading are extensively investigated. Although sophisticated estimators based on deep neural networks (DNNs) exhibit the best performance, they are seen to require large volumes of training data to offer a substantial advantage relative to more traditional schemes. A novel algorithm that blends both kinds of estimators is seen to enjoy the benefits of both, thereby suggesting the potential of exploring this research direction further.
Core Building Blocks: Next Gen Geo Spatial GPT Application
results: The study shows that by combining LLMs with geospatial data processing techniques, MapGPT can provide more accurate and contextually aware responses and can perform geospatial computations with visualized outputs.

Abstract
This paper proposes MapGPT, a novel approach that integrates the capabilities of language models, specifically large language models (LLMs), with spatial data processing techniques, aiming to bridge the gap between natural language understanding and spatial data analysis by highlighting the relevant core building blocks. By combining the strengths of LLMs and geospatial analysis, MapGPT enables more accurate and contextually aware responses to location-based queries. The proposed methodology centers on building LLMs on spatial and textual data, utilizing tokenization and vector representations specific to spatial information. The paper also explores the challenges associated with generating spatial vector representations. Furthermore, the study discusses the potential of computational capabilities within MapGPT, allowing users to perform geospatial computations and obtain visualized outputs. Overall, this research paper presents the building blocks and methodology of MapGPT, highlighting its potential to enhance spatial data understanding and generation in natural language processing applications.
Compatible Transformer for Irregularly Sampled Multivariate Time Series
results: Experimental results on multiple real-world datasets show that CoFormer significantly and consistently outperforms existing methods on prediction and classification tasks.

Abstract
To analyze multivariate time series, most previous methods assume regular subsampling of time series, where the interval between adjacent measurements and the number of samples remain unchanged. Practically, data collection systems could produce irregularly sampled time series due to sensor failures and interventions. However, existing methods designed for regularly sampled multivariate time series cannot directly handle irregularity owing to misalignment along both temporal and variate dimensions. To fill this gap, we propose Compatible Transformer (CoFormer), a transformer-based encoder to achieve comprehensive temporal-interaction feature learning for each individual sample in irregular multivariate time series. In CoFormer, we view each sample as a unique variate-time point and leverage intra-variate/inter-variate attentions to learn sample-wise temporal/interaction features based on intra-variate/inter-variate neighbors. With CoFormer as the core, we can analyze irregularly sampled multivariate time series for many downstream tasks, including classification and prediction. We conduct extensive experiments on 3 real-world datasets and validate that the proposed CoFormer significantly and consistently outperforms existing methods.
From Identifiable Causal Representations to Controllable Counterfactual Generation: A Survey on Causal Generative Modeling
results: The paper's findings indicate that structural causal modeling can improve the interpretability of deep generative models, avoid spurious correlations, and strengthen out-of-distribution robustness, while improving the fidelity and diversity of generated data.

Abstract
Deep generative models have shown tremendous success in data density estimation and data generation from finite samples. While these models have shown impressive performance by learning correlations among features in the data, some fundamental shortcomings are their lack of explainability, the tendency to induce spurious correlations, and poor out-of-distribution extrapolation. In an effort to remedy such challenges, one can incorporate the theory of causality in deep generative modeling. Structural causal models (SCMs) describe data-generating processes and model complex causal relationships and mechanisms among variables in a system. Thus, SCMs can naturally be combined with deep generative models. Causal models offer several beneficial properties to deep generative models, such as distribution shift robustness, fairness, and interoperability. We provide a technical survey on causal generative modeling categorized into causal representation learning and controllable counterfactual generation methods. We focus on fundamental theory, formulations, drawbacks, datasets, metrics, and applications of causal generative models in fairness, privacy, out-of-distribution generalization, and precision medicine. We also discuss open problems and fruitful research directions for future work in the field.
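A minimal sketch of a structural causal model, the object at the center of this survey, is shown below: each variable is a deterministic function of its parents plus exogenous noise, and overriding a structural equation implements an intervention.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n, do_x=None):
    """Sample the SCM  X := U_x,  Y := 2X + U_y  in topological order.
    Passing do_x overrides X's equation, i.e., performs do(X = do_x)."""
    u_x, u_y = rng.normal(size=n), rng.normal(size=n)
    x = u_x if do_x is None else np.full(n, do_x)
    y = 2.0 * x + u_y
    return x, y

x_obs, y_obs = sample_scm(10_000)             # observational data
x_do, y_do = sample_scm(10_000, do_x=1.0)     # interventional data
print(y_obs.mean(), y_do.mean())   # E[Y] vs E[Y | do(X=1)] ~= 2.0
```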
Accelerating Scalable Graph Neural Network Inference with Node-Adaptive Propagation
results: Achieves favorable scalability and efficiency for inference acceleration on public datasets, notably a 75x inference speedup on large-scale graphs.

Abstract
Graph neural networks (GNNs) have exhibited exceptional efficacy in a diverse array of applications. However, the sheer size of large-scale graphs presents a significant challenge to real-time inference with GNNs. Although existing Scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure, these methods still suffer from scalability issues when making inferences on unseen nodes, as the feature preprocessing requires the graph to be known and fixed. To further accelerate Scalable GNNs inference in this inductive setting, we propose an online propagation framework and two novel node-adaptive propagation methods that can customize the optimal propagation depth for each node based on its topological information and thereby avoid redundant feature propagation. The trade-off between accuracy and latency can be flexibly managed through simple hyper-parameters to accommodate various latency constraints. Moreover, to compensate for the inference accuracy loss caused by the potential early termination of propagation, we further propose Inception Distillation to exploit the multi-scale receptive field information within graphs. The rigorous and comprehensive experimental study on public datasets with varying scales and characteristics demonstrates that the proposed inference acceleration framework outperforms existing state-of-the-art graph inference acceleration methods in terms of accuracy and efficiency. Particularly, the superiority of our approach is notable on datasets with larger scales, yielding a 75x inference speedup on the largest Ogbn-products dataset.
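The node-adaptive propagation idea can be sketched as follows; the per-node early-stopping rule based on feature change is an illustrative assumption standing in for the paper's topology-based depth selection.

```python
import numpy as np

def adaptive_propagate(A_hat, X, max_steps=10, tol=1e-3):
    """Smooth features with a normalized adjacency A_hat, freezing each
    node once its representation changes less than tol."""
    out, active = X.copy(), np.ones(len(X), dtype=bool)
    h = X
    for _ in range(max_steps):
        h_next = A_hat @ h
        delta = np.linalg.norm(h_next - h, axis=1)
        out[active] = h_next[active]   # update nodes still propagating
        active &= delta > tol          # early-stop converged nodes
        if not active.any():
            break
        h = h_next
    return out

A_hat = np.array([[0.5, 0.5, 0.0], [0.5, 0.25, 0.25], [0.0, 0.5, 0.5]])
X = np.eye(3)
print(adaptive_propagate(A_hat, X))
```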
EXMODD: An EXplanatory Multimodal Open-Domain Dialogue dataset
results: Experiments show that the dialogue data and image explanations generated with MDCF are accurate and of high quality, which can help raise the quality of research on dialogue tasks.

Abstract
The need for high-quality data has been a key issue hindering the research of dialogue tasks. Recent studies try to build datasets through manual, web crawling, and large pre-trained models. However, man-made data is expensive and data collected from the internet often includes generic responses, meaningless statements, and toxic dialogues. Automatic data generation through large models is a cost-effective method, but for open-domain multimodal dialogue tasks, there are still three drawbacks: 1) There is currently no open-source large model that can accept multimodal input; 2) The content generated by the model lacks interpretability; 3) The generated data is usually difficult to quality control and require extensive resource to collect. To alleviate the significant human and resource expenditure in data collection, we propose a Multimodal Data Construction Framework (MDCF). MDCF designs proper prompts to spur the large-scale pre-trained language model to generate well-formed and satisfactory content. Additionally, MDCF also automatically provides explanation for a given image and its corresponding dialogue, which can provide a certain degree of interpretability and facilitate manual follow-up quality inspection. Based on this, we release an Explanatory Multimodal Open-Domain dialogue dataset (EXMODD). Experiments indicate a positive correlation between the model's ability to generate accurate understandings and high-quality responses. Our code and data can be found at https://github.com/poplpr/EXMODD.
results: The study finds that some commonly used natural language understanding datasets have characteristic effects concentrated on a few linguistic dimensions. It also observes "spill-over" effects: datasets can influence model performance along dimensions that may seem unrelated to the intended tasks. The study provides a systematic understanding in support of responsible and reliable model development.

Abstract
The impressive success of recent deep neural network (DNN)-based systems is significantly influenced by the high-quality datasets used in training. However, the effects of the datasets, especially how they interact with each other, remain underexplored. We propose a state-vector framework to enable rigorous studies in this direction. This framework uses idealized probing test results as the bases of a vector space. This framework allows us to quantify the effects of both standalone and interacting datasets. We show that the significant effects of some commonly-used language understanding datasets are characteristic and are concentrated on a few linguistic dimensions. Additionally, we observe some ``spill-over'' effects: the datasets could impact the models along dimensions that may seem unrelated to the intended tasks. Our state-vector framework paves the way for a systematic understanding of the dataset effects, a crucial component in responsible and robust model development.
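The state-vector idea can be sketched in a few lines: a model's state is the vector of its probing accuracies along fixed linguistic dimensions, and a dataset's effect is the difference between the states before and after fine-tuning. The dimensions and numbers below are illustrative.

```python
import numpy as np

# Probing accuracies along fixed linguistic dimensions (illustrative).
dims = ["syntax", "semantics", "coreference", "negation"]
state_before = np.array([0.71, 0.64, 0.55, 0.50])   # pre fine-tuning
state_after = np.array([0.73, 0.79, 0.56, 0.62])    # post fine-tuning

effect = state_after - state_before   # the dataset's effect vector
for d, e in zip(dims, effect):
    print(f"{d:12s} {e:+.2f}")   # concentrated + spill-over effects
```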
Enhanced Transformer Architecture for Natural Language Processing
results: Evaluated on the Multi30k translation dataset, the model achieves a 202.96% higher BLEU score than the original Transformer.

Abstract
Transformer is a state-of-the-art model in the field of natural language processing (NLP). Current NLP models primarily increase the number of transformer layers to improve processing performance; however, this technique requires substantial training resources such as computing capacity. In this paper, a novel Transformer structure is proposed. It features full layer normalization, weighted residual connections, positional encoding exploiting reinforcement learning, and zero masked self-attention. The proposed Transformer model, called Enhanced Transformer, is validated by the bilingual evaluation understudy (BLEU) score obtained with the Multi30k translation dataset. As a result, the Enhanced Transformer achieves a 202.96% higher BLEU score compared to the original Transformer on this translation dataset.
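A minimal sketch of one of the listed components, the weighted residual connection, is shown below; the learnable scalar mixing weights and their initial values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WeightedResidual(nn.Module):
    """Mix the skip branch and the sublayer output with learnable scalar
    weights before layer normalization, instead of plain x + sublayer(x)."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = nn.Parameter(torch.tensor(1.0))  # skip-branch weight
        self.beta = nn.Parameter(torch.tensor(1.0))   # sublayer weight
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(self.alpha * x + self.beta * self.sublayer(x))

block = WeightedResidual(64, nn.Linear(64, 64))
print(block(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```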
Using Audio Data to Facilitate Depression Risk Assessment in Primary Health Care
paper_authors: Adam Valen Levinson, Abhay Goyal, Roger Ho Chun Man, Roy Ka-Wei Lee, Koustuv Saha, Nimay Parekh, Frederick L. Altice, Lam Yin Cheung, Munmun De Choudhury, Navin Kumar
results: The selected model performed well in predicting depression risk (precision: 0.98, recall: 0.93, F1-score: 0.96). These findings may lead to a range of tools for screening depression risk, with patients routed to AI-driven chatbots for initial screenings.

Abstract
Telehealth is a valuable tool for primary health care (PHC), where depression is a common condition. PHC is the first point of contact for most people with depression, but about 25% of diagnoses made by PHC physicians are inaccurate. Many other barriers also hinder depression detection and treatment in PHC. Artificial intelligence (AI) may help reduce depression misdiagnosis in PHC and improve overall diagnosis and treatment outcomes. Telehealth consultations often have video issues, such as poor connectivity or dropped calls. Audio-only telehealth is often more practical for lower-income patients who may lack stable internet connections. Thus, our study focused on using audio data to predict depression risk. The objectives were to: 1) Collect audio data from 24 people (12 with depression and 12 without mental health or major health condition diagnoses); 2) Build a machine learning model to predict depression risk. TPOT, an autoML tool, was used to select the best machine learning algorithm, which was the K-nearest neighbors classifier. The selected model had high performance in classifying depression risk (Precision: 0.98, Recall: 0.93, F1-Score: 0.96). These findings may lead to a range of tools to help screen for and treat depression. By developing tools to detect depression risk, patients can be routed to AI-driven chatbots for initial screenings. Partnerships with a range of stakeholders are crucial to implementing these solutions. Moreover, ethical considerations, especially around data privacy and potential biases in AI models, need to be at the forefront of any AI-driven intervention in mental health care.
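The modeling step can be sketched with scikit-learn as follows; the random feature matrix is a placeholder for real audio-derived features (e.g., prosodic or spectral statistics per recording).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(24, 40))        # 24 participants, 40 audio features
y = np.array([1] * 12 + [0] * 12)    # 1 = depression, 0 = control

# K-nearest-neighbors classifier, as selected by the autoML step.
knn = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(knn, X, y, cv=4, scoring="f1")
print("F1 per fold:", scores.round(2))
```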
Intelligent Software Tooling for Improving Software Development
results: The research finds that deep learning techniques can improve the efficiency and quality of the software development process.

Abstract
Software has eaten the world: many of the necessities and quality-of-life services people rely on require software. Therefore, tools that improve the software development experience, such as those for generating code and test cases, detecting bugs, and question answering, can have a significant impact on the world. The success of Deep Learning (DL) over the past decade has shown huge advancements in automation across many domains, including software development processes. One of the main reasons behind this success is the availability of large datasets to train on, such as open-source code available through GitHub or image datasets of mobile Graphical User Interfaces (GUIs) like RICO and ReDRAW. Therefore, the central research question my dissertation explores is: in what ways can the software development process be improved through leveraging DL techniques on the vast amounts of unstructured software engineering artifacts?
NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain
results: Experiments show that even the best LLMs perform less than satisfactorily on this benchmark, demonstrating the scientific knowledge gap of existing LLMs in the nuclear domain.

Abstract
As LLMs have become increasingly popular, they have been used in almost every field. But as the application of LLMs expands from generic fields to narrow, focused science domains, there exists an ever-increasing gap in ways to evaluate their efficacy in those fields. Many of the benchmarks that do exist focus on questions that do not require a proper understanding of the subject in question. In this paper, we present NuclearQA, a human-made benchmark of 100 questions to evaluate language models in the nuclear domain, consisting of a varying collection of questions that have been specifically designed by experts to test the abilities of language models. We detail our approach and show how the mix of several types of questions makes our benchmark uniquely capable of evaluating models in the nuclear domain. We also present our own evaluation metric for assessing LLMs' performances due to the limitations of existing ones. Our experiments on state-of-the-art models suggest that even the best LLMs perform less than satisfactorily on our benchmark, demonstrating the scientific knowledge gap of existing LLMs.
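To make the evaluation setup concrete, here is a minimal harness sketch for running a model over a benchmark of expert-written question/answer pairs. The sample item, the `generate` stub, and the naive substring scorer are all illustrative assumptions; the paper introduces its own evaluation metric, which is not reproduced here.

```python
# A minimal sketch of a QA-benchmark harness; item format and scoring are
# assumptions, not the paper's actual data schema or metric.

def generate(prompt: str) -> str:
    """Stand-in for a real LLM call (API or local model)."""
    return ""  # replace with an actual model invocation

def naive_score(prediction: str, reference: str) -> bool:
    # Case-insensitive substring match; far cruder than the paper's metric.
    return reference.lower() in prediction.lower()

# Hypothetical item standing in for the benchmark's 100 expert-written questions.
benchmark = [
    {"question": "Which uranium isotope sustains the fission chain reaction "
                 "in light-water reactors?",
     "answer": "uranium-235"},
]

correct = sum(naive_score(generate(item["question"]), item["answer"])
              for item in benchmark)
print(f"accuracy: {correct / len(benchmark):.1%}")
```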
Emergent Mixture-of-Experts: Can Dense Pre-trained Transformers Benefit from Emergent Modular Structures?
results: Extensive experiments (tuning 1785 models) show that EMoE effectively improves the in-domain and out-of-domain generalization of pre-trained models. Further analysis and ablation studies show that EMoE mitigates negative knowledge transfer and is robust to various configurations.
Abstract
Incorporating modular designs into neural networks demonstrates superior out-of-domain generalization, learning efficiency, etc. Existing modular neural networks are generally $\textit{explicit}$: their modular architectures are pre-defined, and individual modules are expected to implement distinct functions. Conversely, recent works reveal that there exist $\textit{implicit}$ modular structures in standard pre-trained transformers, namely $\textit{Emergent Modularity}$. They indicate that such modular structures emerge during the early pre-training phase and are entirely spontaneous. However, most transformers are still treated as monolithic models with their modular natures underutilized. Therefore, given the excellent properties of explicit modular architectures, we explore $\textit{whether and how dense pre-trained transformers can benefit from emergent modular structures}$. To study this question, we construct $\textbf{E}$mergent $\textbf{M}$ixture-$\textbf{o}$f-$\textbf{E}$xperts (EMoE). Without introducing additional parameters, EMoE can be seen as the modular counterpart of the original model and can be effortlessly incorporated into downstream tuning. Extensive experiments (we tune 1785 models) on various downstream tasks (vision and language) and models (22M to 1.5B) demonstrate that EMoE effectively boosts in-domain and out-of-domain generalization abilities. Further analysis and ablation studies suggest that EMoE mitigates negative knowledge transfer and is robust to various configurations. Code is available at \url{https://github.com/qiuzh20/EMoE}
Summary
Incorporating modular designs into neural networks yields desirable properties such as better generalization and learning efficiency. Existing modular neural networks are generally explicit: their modular architectures are pre-defined, and each module is expected to implement a distinct function. However, recent works show that standard pre-trained transformers contain implicit modular structures, termed "Emergent Modularity"; these structures emerge spontaneously during the early pre-training phase. Nevertheless, most transformers are still treated as monolithic models, leaving their modular nature underutilized. We therefore ask: whether and how can dense pre-trained transformers benefit from emergent modular structures? To study this question, we construct Emergent Mixture-of-Experts (EMoE). EMoE can be seen as the modular counterpart of the original model and can be incorporated into downstream tuning without introducing additional parameters. Extensive experiments on various downstream tasks (vision and language) and models (22M to 1.5B) show that EMoE effectively improves in-domain and out-of-domain generalization. Further analysis and ablation studies show that EMoE mitigates negative knowledge transfer and is robust to various configurations. Code is available at \url{https://github.com/qiuzh20/EMoE}.
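To make the parameter-free gating idea concrete, the sketch below treats a dense FFN's hidden units as implicit experts and activates only the top-k groups per token. The contiguous grouping and mean-pre-activation gating are assumptions chosen for illustration; the paper's exact expert-construction and routing recipe may differ.

```python
# A minimal sketch of the idea behind EMoE: view a dense FFN's hidden units
# as implicit experts and activate only the top-k groups per token. Grouping
# and gating choices here are illustrative assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F

def emoe_ffn(x, W_in, W_out, num_experts=8, top_k=2):
    """x: (tokens, d); W_in: (d, h); W_out: (h, d)."""
    pre = x @ W_in                                   # (tokens, h) pre-activations
    tokens, h = pre.shape
    group = h // num_experts                         # neurons per expert
    # Parameter-free gate: mean pre-activation of each expert's neurons.
    scores = pre.view(tokens, num_experts, group).mean(-1)   # (tokens, E)
    topk = scores.topk(top_k, dim=-1).indices        # (tokens, k)
    mask = torch.zeros_like(scores).scatter_(1, topk, 1.0)
    mask = mask.repeat_interleave(group, dim=1)      # back to (tokens, h)
    hidden = F.gelu(pre) * mask                      # zero out inactive experts
    return hidden @ W_out

# Usage: identical input/output shapes to the dense FFN it reinterprets.
x = torch.randn(4, 64)
W_in, W_out = torch.randn(64, 256), torch.randn(256, 64)
print(emoe_ffn(x, W_in, W_out).shape)  # torch.Size([4, 64])
```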
results: The paper's two experiments show that Subtask Induction reduces the amount of training data a model requires and can successfully instill a human-like shape bias.
Abstract
Despite the recent success of artificial neural networks on a variety of tasks, we have little knowledge or control over the exact solutions these models implement. Instilling inductive biases -- preferences for some solutions over others -- into these models is one promising path toward understanding and controlling their behavior. Much work has been done to study the inherent inductive biases of models and instill different inductive biases through hand-designed architectures or carefully curated training regimens. In this work, we explore a more mechanistic approach: Subtask Induction. Our method discovers a functional subnetwork that implements a particular subtask within a trained model and uses it to instill inductive biases towards solutions utilizing that subtask. Subtask Induction is flexible and efficient, and we demonstrate its effectiveness with two experiments. First, we show that Subtask Induction significantly reduces the amount of training data required for a model to adopt a specific, generalizable solution to a modular arithmetic task. Second, we demonstrate that Subtask Induction successfully induces a human-like shape bias while increasing data efficiency for convolutional and transformer-based image classification models.
Summary
Despite the success of artificial neural networks on a variety of tasks, we still have little knowledge of or control over the exact solutions these models implement. Instilling inductive biases (preferences for some solutions over others) into these models is a promising path toward understanding and controlling their behavior. Much prior work studies models' inherent inductive biases or instills different biases through hand-designed architectures or carefully curated training regimens. In this work, we explore a more mechanistic approach: Subtask Induction. Our method discovers a functional subnetwork within a trained model that implements a particular subtask, then uses it to instill inductive biases toward solutions that utilize that subtask. Subtask Induction is flexible and efficient, and we demonstrate its effectiveness with two experiments. First, it significantly reduces the amount of training data required for a model to adopt a specific, generalizable solution to a modular arithmetic task. Second, it successfully induces a human-like shape bias while improving data efficiency for convolutional and transformer-based image classification models.
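One common way to realize the subnetwork-discovery step is to freeze a trained model's weights and learn a continuous relaxation of a binary mask over them, keeping only what the subtask needs. The sketch below illustrates this on a toy two-layer MLP with stand-in data; the sigmoid relaxation and sparsity penalty are standard choices assumed here, not necessarily the paper's exact procedure, and the transplant step into a fresh model is omitted.

```python
# A minimal sketch of subnetwork discovery via learned masks; weights, data,
# and hyperparameters are stand-ins, not taken from the paper.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
W1 = torch.randn(64, 32)   # stand-ins for a trained model's frozen weights
W2 = torch.randn(32, 10)

# Learnable mask logits, one per weight; sigmoid(logit) ~ keep probability.
m1 = torch.zeros_like(W1, requires_grad=True)
m2 = torch.zeros_like(W2, requires_grad=True)
opt = torch.optim.Adam([m1, m2], lr=0.1)

def masked_forward(x):
    return F.relu(x @ (W1 * torch.sigmoid(m1))) @ (W2 * torch.sigmoid(m2))

for step in range(200):
    x = torch.randn(16, 64)                 # stand-in subtask inputs
    target = torch.randint(0, 10, (16,))    # stand-in subtask labels
    loss = F.cross_entropy(masked_forward(x), target)
    # Sparsity penalty pushes masks toward keeping as few weights as possible.
    sparsity = torch.sigmoid(m1).mean() + torch.sigmoid(m2).mean()
    (loss + 1e-2 * sparsity).backward()
    opt.step()
    opt.zero_grad()

subnetwork = torch.sigmoid(m1) > 0.5   # weights retained in the found subnetwork
print(f"kept {subnetwork.float().mean():.1%} of layer-1 weights")
```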