results: The paper achieves performance comparable to XLS-R on ML-SUPERB with less than 10% of the training data, making the approach realizable with academic compute. It further shows that a vanilla HuBERT Base model can maintain 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials.
Abstract
Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods due to the expenses and complexity required to handle many languages. This further harms the reproducibility of SSL, which is already limited to few research groups due to its resource usage. We show that more powerful techniques can actually lead to more efficient pre-training, opening SSL to more research groups. We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages. To build WavLabLM, we devise a novel multi-stage pre-training method, designed to address the language imbalance of multilingual data. WavLabLM achieves comparable performance to XLS-R on ML-SUPERB with less than 10% of the training data, making SSL realizable with academic compute. We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data, 4 GPUs, and limited trials. We open-source all code and models in ESPnet.
MAPTree: Beating “Optimal” Decision Trees with Bayesian Decision Trees
results: On 16 real-world datasets, MAPTree either outperforms baselines or demonstrates comparable performance with much smaller trees. On a synthetic dataset, MAPTree also demonstrates greater robustness to noise and better generalization than existing approaches. In addition, MAPTree recovers the maximum a posteriori tree faster than existing sampling approaches and can provide a certificate of optimality.
Abstract
Decision trees remain one of the most popular machine learning models today, largely due to their out-of-the-box performance and interpretability. In this work, we present a Bayesian approach to decision tree induction via maximum a posteriori inference of a posterior distribution over trees. We first demonstrate a connection between maximum a posteriori inference of decision trees and AND/OR search. Using this connection, we propose an AND/OR search algorithm, dubbed MAPTree, which is able to recover the maximum a posteriori tree. Lastly, we demonstrate the empirical performance of the maximum a posteriori tree both on synthetic data and in real world settings. On 16 real world datasets, MAPTree either outperforms baselines or demonstrates comparable performance but with much smaller trees. On a synthetic dataset, MAPTree also demonstrates greater robustness to noise and better generalization than existing approaches. Finally, MAPTree recovers the maximum a posteriori tree faster than existing sampling approaches and, in contrast with those algorithms, is able to provide a certificate of optimality. The code for our experiments is available at https://github.com/ThrunGroup/maptree.
The Importance of Multimodal Emotion Conditioning and Affect Consistency for Embodied Conversational Agents
paper_authors: Che-Jui Chang, Samuel S. Sohn, Sen Zhang, Rajath Jayashankar, Muhammad Usman, Mubbasir Kapadia
for: The paper aims to improve how embodied conversational agents convey emotions to humans.
methods: The paper proposes a multimodal behavior generation framework that conditions the behaviors of all modalities on a single, consistent driving affect.
results: The study finds that agents whose multimodal behaviors are consistent with the driving affect receive the highest ratings for perceived affect, while making any single modality affect-inconsistent significantly weakens that perception.
Abstract
Previous studies regarding the perception of emotions for embodied virtual agents have shown the effectiveness of using virtual characters in conveying emotions through interactions with humans. However, creating an autonomous embodied conversational agent with expressive behaviors presents two major challenges. The first challenge is the difficulty of synthesizing the conversational behaviors for each modality that are as expressive as real human behaviors. The second challenge is that the affects are modeled independently, which makes it difficult to generate multimodal responses with consistent emotions across all modalities. In this work, we propose a conceptual framework, ACTOR (Affect-Consistent mulTimodal behaviOR generation), that aims to increase the perception of affects by generating multimodal behaviors conditioned on a consistent driving affect. We have conducted a user study with 199 participants to assess how the average person judges the affects perceived from multimodal behaviors that are consistent and inconsistent with respect to a driving affect. The result shows that among all model conditions, our affect-consistent framework receives the highest Likert scores for the perception of driving affects. Our statistical analysis suggests that making a modality affect-inconsistent significantly decreases the perception of driving affects. We also observe that multimodal behaviors conditioned on consistent affects are more expressive compared to behaviors with inconsistent affects. Therefore, we conclude that multimodal emotion conditioning and affect consistency are vital to enhancing the perception of affects for embodied conversational agents.
Ruffle&Riley: Towards the Automated Induction of Conversational Tutoring Systems
results: In an online user study, compared with simpler QA chatbots and a reading activity, Ruffle&Riley users expressed higher ratings of understanding and remembering, and perceived the offered support as more helpful and the conversation as more coherent.
Abstract
Conversational tutoring systems (CTSs) offer learning experiences driven by natural language interaction. They are known to promote high levels of cognitive engagement and benefit learning outcomes, particularly in reasoning tasks. Nonetheless, the time and cost required to author CTS content is a major obstacle to widespread adoption. In this paper, we introduce a novel type of CTS that leverages the recent advances in large language models (LLMs) in two ways: First, the system induces a tutoring script automatically from a lesson text. Second, the system automates the script orchestration via two LLM-based agents (Ruffle&Riley) with the roles of a student and a professor in a learning-by-teaching format. The system allows a free-form conversation that follows the ITS-typical outer-/inner-loop structure. In an initial between-subject online user study (N = 100) comparing Ruffle&Riley to simpler QA chatbots and reading activity, we found no significant differences in post-test scores. Nonetheless, in the learning experience survey, Ruffle&Riley users expressed higher ratings of understanding and remembering and further perceived the offered support as more helpful and the conversation as coherent. Our study provides insights for a new generation of scalable CTS technologies.
STERLING: Self-Supervised Terrain Representation Learning from Unconstrained Robot Experience
results: In physical robot experiments, the researchers find that STERLING features perform on par with fully supervised approaches on the task of preference-aligned visual navigation and outperform other state-of-the-art methods with respect to preference alignment. In addition, the robot autonomously hiked a 3-mile trail with only two manual interventions, demonstrating STERLING's robustness to real-world off-road conditions.
Abstract
Terrain awareness, i.e., the ability to identify and distinguish different types of terrain, is a critical ability that robots must have to succeed at autonomous off-road navigation. Current approaches that provide robots with this awareness either rely on labeled data which is expensive to collect, engineered features and cost functions that may not generalize, or expert human demonstrations which may not be available. Towards endowing robots with terrain awareness without these limitations, we introduce Self-supervised TErrain Representation LearnING (STERLING), a novel approach for learning terrain representations that relies solely on easy-to-collect, unconstrained (e.g., non-expert), and unlabelled robot experience, with no additional constraints on data collection. STERLING employs a novel multi-modal self-supervision objective through non-contrastive representation learning to learn relevant terrain representations for terrain-aware navigation. Through physical robot experiments in off-road environments, we evaluate STERLING features on the task of preference-aligned visual navigation and find that STERLING features perform on par with fully supervised approaches and outperform other state-of-the-art methods with respect to preference alignment. Additionally, we perform a large-scale experiment of autonomously hiking a 3-mile long trail which STERLING completes successfully with only two manual interventions, demonstrating its robustness to real-world off-road conditions.
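The abstract's core ingredient is a multi-modal, non-contrastive self-supervision objective. As a rough illustration of what non-contrastive training across two modalities can look like, the sketch below applies a VICReg-style loss to paired embeddings of the same terrain patch; STERLING's actual objective, encoders, and modality set are its own and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def vicreg_style_loss(z1, z2, sim_w=25.0, var_w=25.0, cov_w=1.0):
    # invariance: embeddings of the two modalities should agree
    sim = F.mse_loss(z1, z2)
    # variance: keep each embedding dimension from collapsing
    std1 = torch.sqrt(z1.var(dim=0) + 1e-4)
    std2 = torch.sqrt(z2.var(dim=0) + 1e-4)
    var = F.relu(1.0 - std1).mean() + F.relu(1.0 - std2).mean()
    # covariance: decorrelate embedding dimensions
    n, d = z1.shape
    z1c, z2c = z1 - z1.mean(dim=0), z2 - z2.mean(dim=0)
    cov = ((z1c.T @ z1c / (n - 1)).fill_diagonal_(0).pow(2).sum() / d
           + (z2c.T @ z2c / (n - 1)).fill_diagonal_(0).pow(2).sum() / d)
    return sim_w * sim + var_w * var + cov_w * cov

z_vision = torch.randn(128, 64)    # visual patch embeddings (toy values)
z_inertial = torch.randn(128, 64)  # inertial/proprioceptive embeddings
print(vicreg_style_loss(z_vision, z_inertial).item())
```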
results: The paper proves that the method generalizes well-known maximum entropy techniques and shows that it robustly exceeds state-of-the-art performance across popular benchmarks.
Abstract
The assumption that data are independent and identically distributed underpins all machine learning. When data are collected sequentially from agent experiences this assumption does not generally hold, as in reinforcement learning. Here, we derive a method that overcomes these limitations by exploiting the statistical mechanics of ergodic processes, which we term maximum diffusion reinforcement learning. By decorrelating agent experiences, our approach provably enables agents to learn continually in single-shot deployments regardless of how they are initialized. Moreover, we prove our approach generalizes well-known maximum entropy techniques, and show that it robustly exceeds state-of-the-art performance across popular benchmarks. Our results at the nexus of physics, learning, and control pave the way towards more transparent and reliable decision-making in reinforcement learning agents, such as locomoting robots and self-driving cars.
Out of Sight, Still in Mind: Reasoning and Planning about Unobserved Objects with Video Tracking Enabled Memory Models
results: In extensive simulation and real-world experiments, the approaches perform well across different numbers of objects and distractor actions, and outperform an implicit memory baseline.
Abstract
Robots need to have a memory of previously observed, but currently occluded objects to work reliably in realistic environments. We investigate the problem of encoding object-oriented memory into a multi-object manipulation reasoning and planning framework. We propose DOOM and LOOM, which leverage transformer relational dynamics to encode the history of trajectories given partial-view point clouds and an object discovery and tracking engine. Our approaches can perform multiple challenging tasks including reasoning with occluded objects, novel objects appearance, and object reappearance. Throughout our extensive simulation and real-world experiments, we find that our approaches perform well in terms of different numbers of objects and different numbers of distractor actions. Furthermore, we show our approaches outperform an implicit memory baseline.
Efficient Low-rank Backpropagation for Vision Transformer Adaptation
results: Extensive experiments with different models (ViT and hybrid convolution-ViT models) on multiple datasets demonstrate the effectiveness of LBP-WHT. For example, when adapting an EfficientFormer-L1 model on CIFAR100, LBP-WHT achieves 10.4% higher accuracy than the state-of-the-art baseline while requiring 9 MFLOPs less computation.
Abstract
The increasing scale of vision transformers (ViT) has made the efficient fine-tuning of these large models for specific needs a significant challenge in various applications. This issue originates from the computationally demanding matrix multiplications required during the backpropagation process through linear layers in ViT. In this paper, we tackle this problem by proposing a new Low-rank BackPropagation via Walsh-Hadamard Transformation (LBP-WHT) method. Intuitively, LBP-WHT projects the gradient into a low-rank space and carries out backpropagation. This approach substantially reduces the computation needed for adapting ViT, as matrix multiplication in the low-rank space is far less resource-intensive. We conduct extensive experiments with different models (ViT, hybrid convolution-ViT model) on multiple datasets to demonstrate the effectiveness of our method. For instance, when adapting an EfficientFormer-L1 model on CIFAR100, our LBP-WHT achieves 10.4% higher accuracy than the state-of-the-art baseline, while requiring 9 MFLOPs less computation. As the first work to accelerate ViT adaptation with low-rank backpropagation, our LBP-WHT method is complementary to many prior efforts and can be combined with them for better performance.
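The heart of LBP-WHT is projecting the backpropagated gradient onto a few Walsh-Hadamard basis rows so the expensive weight-gradient matmul happens in a low-rank space. Below is a toy NumPy rendition of that idea for a single linear layer; the paper's rank selection, basis ordering, and autograd integration are not reproduced, and all names here are our own.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Walsh-Hadamard
    matrix (n must be a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def lowrank_weight_grad(grad_out, x, k):
    """Approximate dL/dW = grad_out.T @ x by projecting grad_out onto the
    first k Walsh-Hadamard basis rows before the (cheaper) matmul."""
    n = grad_out.shape[1]
    P = hadamard(n)[:k]        # (k, out) projection
    g_low = grad_out @ P.T     # (batch, k): gradient in low-rank space
    grad_w_low = g_low.T @ x   # (k, in): the cheap matmul
    return P.T @ grad_w_low    # (out, in): project back

rng = np.random.default_rng(0)
g = rng.standard_normal((32, 64))   # upstream gradient, batch x out
x = rng.standard_normal((32, 128))  # layer input, batch x in
approx, exact = lowrank_weight_grad(g, x, k=8), g.T @ x
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```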
Memory-Efficient Continual Learning Object Segmentation for Long Video
results: Experimental results show that the proposed methods improve the performance of online video object segmentation models by up to 10% and boost their robustness on long-video datasets, while maintaining comparable performance on the short-video datasets DAVIS16 and DAVIS17.
Abstract
Recent state-of-the-art semi-supervised Video Object Segmentation (VOS) methods have shown significant improvements in target object segmentation accuracy when information from preceding frames is used in undertaking segmentation on the current frame. In particular, such memory-based approaches can help a model to more effectively handle appearance changes (representation drift) or occlusions. Ideally, for maximum performance, online VOS methods would need all or most of the preceding frames (or their extracted information) to be stored in memory and be used for online learning in consecutive frames. Such a solution is not feasible for long videos, as the required memory size would grow without bound. On the other hand, these methods can fail when memory is limited and a target object experiences repeated representation drifts throughout a video. We propose two novel techniques to reduce the memory requirement of online VOS methods while improving modeling accuracy and generalization on long videos. Motivated by the success of continual learning techniques in preserving previously-learned knowledge, here we propose Gated-Regularizer Continual Learning (GRCL), which improves the performance of any online VOS subject to limited memory, and a Reconstruction-based Memory Selection Continual Learning (RMSCL) which empowers online VOS methods to efficiently benefit from stored information in memory. Experimental results show that the proposed methods improve the performance of online VOS models up to 10 %, and boosts their robustness on long-video datasets while maintaining comparable performance on short-video datasets DAVIS16 and DAVIS17.
STARC: A General Framework For Quantifying Differences Between Reward Functions
paper_authors: Joar Skalse, Lucy Farnik, Sumeet Ramesh Motwani, Erik Jenner, Adam Gleave, Alessandro Abate
For: The paper aims to provide a solution to the problem of deriving theoretical guarantees for reward learning algorithms, which is difficult due to the lack of good methods for quantifying the difference between reward functions.
Methods: The paper proposes a class of pseudometrics called STARC (STAndardised Reward Comparison) metrics, which can be used to quantify the difference between reward functions and provide both upper and lower bounds on worst-case regret.
Results: The paper shows that STARC metrics are tight, that any metric with the same properties must be bilipschitz equivalent to them, and identifies issues with reward metrics proposed by earlier works. The paper also demonstrates the empirical efficacy of STARC metrics through practical evaluation.
Abstract
In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivises undesirable behaviour. As a result, it is increasingly popular to use reward learning algorithms, which attempt to learn a reward function from data. However, the theoretical foundations of reward learning are not yet well-developed. In particular, it is typically not known when a given reward learning algorithm with high probability will learn a reward function that is safe to optimise. This means that reward learning algorithms generally must be evaluated empirically, which is expensive, and that their failure modes are difficult to predict in advance. One of the roadblocks to deriving better theoretical guarantees is the lack of good methods for quantifying the difference between reward functions. In this paper we provide a solution to this problem, in the form of a class of pseudometrics on the space of all reward functions that we call STARC (STAndardised Reward Comparison) metrics. We show that STARC metrics induce both an upper and a lower bound on worst-case regret, which implies that our metrics are tight, and that any metric with the same properties must be bilipschitz equivalent to ours. Moreover, we also identify a number of issues with reward metrics proposed by earlier works. Finally, we evaluate our metrics empirically, to demonstrate their practical efficacy. STARC metrics can be used to make both theoretical and empirical analysis of reward learning algorithms both easier and more principled.
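To make the canonicalize-normalize-compare recipe behind such metrics concrete, here is one plausible instantiation on a finite MDP: remove the best-fit potential shaping by least squares, normalize away positive scaling, then take an L2 distance. The paper defines a whole class of STARC metrics with proven properties; this sketch and its canonicalization choice are our assumptions, not the paper's definitions.

```python
import numpy as np

def canonicalize(R, gamma, n_states, n_actions):
    """Remove potential-based shaping: find the potential phi minimizing
    ||R + gamma*phi[s'] - phi[s]|| by least squares. R has shape (S, A, S)."""
    S, A = n_states, n_actions
    rows = []
    for s in range(S):
        for a in range(A):
            for sp in range(S):
                row = np.zeros(S)
                row[sp] += gamma   # gamma * phi(s')
                row[s] -= 1.0      # - phi(s)
                rows.append(row)
    M = np.stack(rows)             # maps phi -> shaping term
    phi, *_ = np.linalg.lstsq(M, -R.ravel(), rcond=None)
    return R + (M @ phi).reshape(R.shape)

def starc_distance(R1, R2, gamma, n_states, n_actions):
    c1 = canonicalize(R1, gamma, n_states, n_actions)
    c2 = canonicalize(R2, gamma, n_states, n_actions)
    c1 /= np.linalg.norm(c1) + 1e-12   # normalize away positive rescaling
    c2 /= np.linalg.norm(c2) + 1e-12
    return np.linalg.norm(c1 - c2)

S, A = 4, 2
rng = np.random.default_rng(0)
R = rng.standard_normal((S, A, S))
phi = rng.standard_normal(S)
shaped = R + 0.9 * phi[None, None, :] - phi[:, None, None]
print(starc_distance(R, 3.0 * shaped, 0.9, S, A))  # ~0: same policy ordering
```

A shaped and rescaled copy of a reward induces the same policy ordering, so its distance to the original comes out near zero, which is the behavior a reward pseudometric should have.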
results: Experimental results show that VPA improves out-of-distribution generalization by 3.3% across various models, surpassing previous test-time approaches. VPA also improves corruption robustness by 6.5% compared to strong baselines and boosts domain adaptation performance by a relative 5.2%.
Abstract
Textual prompt tuning has demonstrated significant performance improvements in adapting natural language processing models to a variety of downstream tasks by treating hand-engineered prompts as trainable parameters. Inspired by the success of textual prompting, several studies have investigated the efficacy of visual prompt tuning. In this work, we present Visual Prompt Adaptation (VPA), the first framework that generalizes visual prompting with test-time adaptation. VPA introduces a small number of learnable tokens, enabling fully test-time and storage-efficient adaptation without necessitating source-domain information. We examine our VPA design under diverse adaptation settings, encompassing single-image, batched-image, and pseudo-label adaptation. We evaluate VPA on multiple tasks, including out-of-distribution (OOD) generalization, corruption robustness, and domain adaptation. Experimental results reveal that VPA effectively enhances OOD generalization by 3.3% across various models, surpassing previous test-time approaches. Furthermore, we show that VPA improves corruption robustness by 6.5% compared to strong baselines. Finally, we demonstrate that VPA also boosts domain adaptation performance by relatively 5.2%. Our VPA also exhibits marked effectiveness in improving the robustness of zero-shot recognition for vision-language models.
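A minimal sketch of the prompt-adaptation idea follows: freeze the backbone, prepend a few learnable tokens, and update only those tokens at test time. The entropy-minimization objective and the toy encoder below are our stand-ins for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Frozen encoder plus learnable prompt tokens prepended to patch tokens.
    A sketch only; VPA's token count, insertion point, and objective may differ."""
    def __init__(self, encoder, n_prompts=8, dim=768):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)  # adapt prompts only
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, dim) * 0.02)

    def forward(self, patch_tokens):  # (B, N, dim)
        B = patch_tokens.shape[0]
        return self.encoder(torch.cat([self.prompts.expand(B, -1, -1),
                                       patch_tokens], dim=1))

def adapt(model, patch_tokens, steps=1, lr=1e-3):
    """Test-time adaptation by entropy minimization (a common TTA objective,
    used here for illustration) over the prompt tokens only."""
    opt = torch.optim.AdamW([model.prompts], lr=lr)
    for _ in range(steps):
        probs = model(patch_tokens).softmax(dim=-1)
        loss = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return model(patch_tokens)

class ToyEncoder(nn.Module):  # stand-in for a real ViT backbone
    def __init__(self, dim=768, n_classes=10):
        super().__init__()
        self.head = nn.Linear(dim, n_classes)
    def forward(self, tokens):
        return self.head(tokens.mean(dim=1))  # mean-pool tokens, classify

model = PromptedEncoder(ToyEncoder())
print(adapt(model, torch.randn(4, 196, 768)).shape)  # torch.Size([4, 10])
```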
SeMAnD: Self-Supervised Anomaly Detection in Multimodal Geospatial Datasets
paper_authors: Daria Reshetova, Swetava Ganguli, C. V. Krishnakumar Iyer, Vipul Pandey
for: The paper aims to detect geometric anomalies (e.g., in roads, buildings, and land cover) in multimodal geospatial data.
methods: The paper proposes a self-supervised anomaly detection technique, SeMAnD, which uses the RandPolyAugment data augmentation strategy and a self-supervised training objective to learn representations of multimodal data that are discriminative to anomalies.
results: Experiments show that SeMAnD effectively detects real-world anomalies, improving anomaly classification AUC by 4.8-19.7% over domain-agnostic anomaly detection strategies, with model performance growing with the number of input modalities and the diversity and strength of the data augmentations.
Abstract
We propose a Self-supervised Anomaly Detection technique, called SeMAnD, to detect geometric anomalies in Multimodal geospatial datasets. Geospatial data comprises of acquired and derived heterogeneous data modalities that we transform to semantically meaningful, image-like tensors to address the challenges of representation, alignment, and fusion of multimodal data. SeMAnD is comprised of (i) a simple data augmentation strategy, called RandPolyAugment, capable of generating diverse augmentations of vector geometries, and (ii) a self-supervised training objective with three components that incentivize learning representations of multimodal data that are discriminative to local changes in one modality which are not corroborated by the other modalities. Detecting local defects is crucial for geospatial anomaly detection where even small anomalies (e.g., shifted, incorrectly connected, malformed, or missing polygonal vector geometries like roads, buildings, landcover, etc.) are detrimental to the experience and safety of users of geospatial applications like mapping, routing, search, and recommendation systems. Our empirical study on test sets of different types of real-world geometric geospatial anomalies across 3 diverse geographical regions demonstrates that SeMAnD is able to detect real-world defects and outperforms domain-agnostic anomaly detection strategies by 4.8-19.7% as measured using anomaly classification AUC. We also show that model performance increases (i) up to 20.4% as the number of input modalities increase and (ii) up to 22.9% as the diversity and strength of training data augmentations increase.
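As an illustration of what augmenting vector geometries can mean, here is a toy polygon augmentation (random rotation, anisotropic scaling, per-vertex jitter). The actual operation set and magnitudes of RandPolyAugment are defined in the paper and not reproduced here.

```python
import numpy as np

def rand_poly_augment(verts, rng, max_rot=np.pi / 12, max_scale=0.1, jitter=0.01):
    """Toy stand-in for RandPolyAugment over an (N, 2) polygon vertex array:
    rotate, scale anisotropically, and jitter vertices around the centroid."""
    theta = rng.uniform(-max_rot, max_rot)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    scale = 1.0 + rng.uniform(-max_scale, max_scale, size=2)
    center = verts.mean(axis=0)
    out = (verts - center) @ R.T * scale + center
    return out + rng.normal(0.0, jitter, size=verts.shape)

rng = np.random.default_rng(0)
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
print(rand_poly_augment(square, rng))
```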
PlotMap: Automated Layout Design for Building Game Worlds
paper_authors: Yi Wang, Jieliang Luo, Adam Gaier, Evan Atherton, Hilmar Koch
For: The paper aims to address the challenge of designing game maps that support a desired narrative, by introducing an extra layer of plot facility layout design that is independent of the underlying map generation method.
Methods: The paper proposes using Reinforcement Learning (RL) to automatically assign concrete locations on a game map to abstract locations mentioned in a given story (plot facilities), following spatial constraints derived from the story.
Results: The paper presents a system that considers input from multiple modalities, including map images, facility locations, and story constraints expressed in natural language, to train and evaluate RL models for plot facility layout design. The system is evaluated through a group of comprehensive experiments and ablation studies, providing insights for RL-based plot facility layout design.
Abstract
World-building, the process of developing both the narrative and physical world of a game, plays a vital role in the game's experience. Critically acclaimed independent and AAA video games are praised for strong world building, with game maps that masterfully intertwine with and elevate the narrative, captivating players and leaving a lasting impression. However, designing game maps that support a desired narrative is challenging, as it requires satisfying complex constraints from various considerations. Most existing map generation methods focus on considerations about gameplay mechanics or map topography, while the need to support the story is typically neglected. As a result, extensive manual adjustment is still required to design a game world that facilitates particular stories. In this work, we approach this problem by introducing an extra layer of plot facility layout design that is independent of the underlying map generation method in a world-building pipeline. Concretely, we present a system that leverages Reinforcement Learning (RL) to automatically assign concrete locations on a game map to abstract locations mentioned in a given story (plot facilities), following spatial constraints derived from the story. A decision-making agent moves the plot facilities around, considering their relationship to the map and each other, to locations on the map that best satisfy the constraints of the story. Our system considers input from multiple modalities: map images as pixels, facility locations as real values, and story constraints expressed in natural language. We develop a method of generating datasets of facility layout tasks, create an RL environment to train and evaluate RL models, and further analyze the behaviors of the agents through a group of comprehensive experiments and ablation studies, aiming to provide insights for RL-based plot facility layout design.
ChatGPT & Mechanical Engineering: Examining performance on the FE Mechanical Engineering and Undergraduate Exams
For: This paper explores the capabilities of ChatGPT in the discipline of mechanical engineering, specifically in classroom and professional settings.
Methods: The paper uses two ChatGPT models, one free and one paid subscription, and examines their performance on junior and senior level mechanical engineering exams and on practice questions for the Fundamentals of Engineering Exam (FE) in Mechanical Engineering.
Results: The paid subscription model (GPT-4) greatly outperformed the free version (GPT-3.5), achieving 76% correct vs 51% correct, but the limitation of text-only input on both models makes neither likely to pass the FE exam. The results confirm findings in the literature with regards to types of errors and pitfalls made by ChatGPT.
Abstract
The launch of ChatGPT at the end of 2022 generated large interest into possible applications of artificial intelligence in STEM education and among STEM professions. As a result many questions surrounding the capabilities of generative AI tools inside and outside of the classroom have been raised and are starting to be explored. This study examines the capabilities of ChatGPT within the discipline of mechanical engineering. It aims to examine use cases and pitfalls of such a technology in the classroom and professional settings. ChatGPT was presented with a set of questions from junior and senior level mechanical engineering exams provided at a large private university, as well as a set of practice questions for the Fundamentals of Engineering Exam (FE) in Mechanical Engineering. The responses of two ChatGPT models, one free to use and one paid subscription, were analyzed. The paper found that the subscription model (GPT-4) greatly outperformed the free version (GPT-3.5), achieving 76% correct vs 51% correct, but the limitation of text only input on both models makes neither likely to pass the FE exam. The results confirm findings in the literature with regards to types of errors and pitfalls made by ChatGPT. It was found that due to its inconsistency and a tendency to confidently produce incorrect answers the tool is best suited for users with expert knowledge.
Learning Using Generated Privileged Information by Text-to-Image Diffusion Models
results: On four text classification datasets, the LUGPI method yields noticeable performance gains, demonstrating its potential in text classification.
Abstract
Learning Using Privileged Information is a particular type of knowledge distillation where the teacher model benefits from an additional data representation during training, called privileged information, improving the student model, which does not see the extra representation. However, privileged information is rarely available in practice. To this end, we propose a text classification framework that harnesses text-to-image diffusion models to generate artificial privileged information. The generated images and the original text samples are further used to train multimodal teacher models based on state-of-the-art transformer-based architectures. Finally, the knowledge from multimodal teachers is distilled into a text-based (unimodal) student. Hence, by employing a generative model to produce synthetic data as privileged information, we guide the training of the student model. Our framework, called Learning Using Generated Privileged Information (LUGPI), yields noticeable performance gains on four text classification data sets, demonstrating its potential in text classification without any additional cost during inference.
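The distillation step from the multimodal teacher to the text-only student can be pictured with the standard soft-label objective below; the paper's exact loss, temperature, and weighting are assumptions of ours.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic knowledge-distillation loss: soften both logit sets with
    temperature T, match them with KL divergence, and mix in the usual
    cross-entropy on the gold labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 4, requires_grad=True)  # text-only student logits
t = torch.randn(8, 4)                      # multimodal teacher logits (text + generated image)
y = torch.randint(0, 4, (8,))
distillation_loss(s, t, y).backward()
```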
User Experience Design Professionals’ Perceptions of Generative Artificial Intelligence
results: The study finds that experienced designers are confident in their skills and view GenAI's role as assistive, but junior designers may be adversely affected by skill degradation, job replacement, and creativity exhaustion.
Abstract
Among creative professionals, Generative Artificial Intelligence (GenAI) has sparked excitement over its capabilities and fear over unanticipated consequences. How does GenAI impact User Experience Design (UXD) practice, and are fears warranted? We interviewed 20 UX Designers, with diverse experience and across companies (startups to large enterprises). We probed them to characterize their practices, and sample their attitudes, concerns, and expectations. We found that experienced designers are confident in their originality, creativity, and empathic skills, and find GenAI's role as assistive. They emphasized the unique human factors of "enjoyment" and "agency", where humans remain the arbiters of "AI alignment". However, skill degradation, job replacement, and creativity exhaustion can adversely impact junior designers. We discuss implications for human-GenAI collaboration, specifically copyright and ownership, human creativity and agency, and AI literacy and access. Through the lens of responsible and participatory AI, we contribute a deeper understanding of GenAI fears and opportunities for UXD.
Collaborative Watermarking for Adversarial Speech Synthesis
results: The study shows that collaborative training increases robustness against noise and time-stretching, and listening tests show that it has little adverse effect on the perceptual quality of vocoded speech.
Abstract
Advances in neural speech synthesis have brought us technology that is not only close to human naturalness, but is also capable of instant voice cloning with little data, and is highly accessible with pre-trained models available. Naturally, the potential flood of generated content raises the need for synthetic speech detection and watermarking. Recently, considerable research effort in synthetic speech detection has been related to the Automatic Speaker Verification and Spoofing Countermeasure Challenge (ASVspoof), which focuses on passive countermeasures. This paper takes a complementary view to generated speech detection: a synthesis system should make an active effort to watermark the generated speech in a way that aids detection by another machine, but remains transparent to a human listener. We propose a collaborative training scheme for synthetic speech watermarking and show that a HiFi-GAN neural vocoder collaborating with the ASVspoof 2021 baseline countermeasure models consistently improves detection performance over conventional classifier training. Furthermore, we demonstrate how collaborative training can be paired with augmentation strategies for added robustness against noise and time-stretching. Finally, listening tests demonstrate that collaborative training has little adverse effect on perceptual quality of vocoded speech.
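The collaborative twist is that, unlike a GAN, generator and detector optimize the *same* detection loss, so the vocoder learns to make itself detectable. A toy sketch of that cooperative objective follows, with stand-in networks; the paper uses a HiFi-GAN vocoder and ASVspoof 2021 countermeasure models, and its full loss includes the usual vocoder terms omitted here.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the vocoder and the countermeasure detector.
vocoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 160))
detector = nn.Sequential(nn.Linear(160, 64), nn.ReLU(), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(list(vocoder.parameters()) + list(detector.parameters()), lr=1e-4)

mel = torch.randn(16, 80)    # acoustic conditioning features
real = torch.randn(16, 160)  # frames of natural speech

fake = vocoder(mel)
d_fake, d_real = detector(fake), detector(real)
# Cooperative, not adversarial: both networks minimize the detection loss,
# so the vocoder is rewarded for embedding cues that help flag it as synthetic.
loss = bce(d_fake, torch.ones_like(d_fake)) + bce(d_real, torch.zeros_like(d_real))
# (reconstruction/adversarial vocoder losses would be added here)
opt.zero_grad(); loss.backward(); opt.step()
```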
Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition
results: Experiments show that the LoRB architecture reduces training times by factors between 5.4 and 3.6 on LibriSpeech and internal datasets.
Abstract
We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring. Although pretrained language models (LMs) like BERT have shown superior performance in second-pass rescoring, the high computational cost of scaling up the pretraining stage and adapting the pretrained models to specific domains limit their practical use in rescoring. Here we present a method based on low-rank decomposition to train a rescoring BERT model and adapt it to new domains using only a fraction (0.08%) of the pretrained parameters. These inserted matrices are optimized through a discriminative training objective along with a correlation-based regularization loss. The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is evaluated on LibriSpeech and internal datasets with decreased training times by factors between 5.4 and 3.6.
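The low-rank adaptation itself is the standard LoRA parametrization: the pretrained weights stay frozen and only a rank-r update is trained. A minimal sketch follows (the paper's rescoring loss and correlation-based regularizer are omitted):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, the standard LoRA parametrization."""
    def __init__(self, base: nn.Linear, r=4, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.4%}")  # ~1% for r=4
```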
results: The family of methods is evaluated across various datasets, domains and tasks, with the conservative FB algorithms reaching 150% of vanilla FB performance in aggregate. The conservative FB algorithms also outperform the task-specific baseline, despite lacking access to reward labels.
Abstract
Zero-shot reinforcement learning (RL) promises to provide agents that can perform any task in an environment after an offline pre-training phase. Forward-backward (FB) representations represent remarkable progress towards this ideal, achieving 85% of the performance of task-specific agents in this setting. However, such performance is contingent on access to large and diverse datasets for pre-training, which cannot be expected for most real problems. Here, we explore how FB performance degrades when trained on small datasets that lack diversity, and mitigate it with conservatism, a well-established feature of performant offline RL algorithms. We evaluate our family of methods across various datasets, domains and tasks, reaching 150% of vanilla FB performance in aggregate. Somewhat surprisingly, conservative FB algorithms also outperform the task-specific baseline, despite lacking access to reward labels and being required to maintain policies for all tasks. Conservative FB algorithms perform no worse than FB on full datasets, and so present little downside over their predecessor. Our code is available open-source via https://enjeeneer.io/projects/conservative-world-models/.
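One way conservatism can be grafted onto FB values is sketched below with a CQL-style penalty that suppresses the values of sampled (likely out-of-distribution) actions relative to dataset actions. This is our guess at the spirit of the approach, not the paper's actual loss.

```python
import torch

def fb_value(F_emb, z):
    """FB successor value Q_z(s, a) = F(s, a, z) . z, with z = B(goal)."""
    return F_emb @ z

def conservative_fb_loss(F_data, F_rand, z):
    """CQL-style conservatism on FB values: push down values of random
    actions, push up values of dataset actions. Sketches only the
    'conservative' ingredient; the full FB training losses are omitted."""
    q_data = fb_value(F_data, z)   # (B,)
    q_rand = fb_value(F_rand, z)   # (B, n_rand)
    return q_rand.logsumexp(dim=-1).mean() - q_data.mean()

d, n_rand = 32, 10
F_data = torch.randn(64, d, requires_grad=True)          # F(s, a, z), dataset actions
F_rand = torch.randn(64, n_rand, d, requires_grad=True)  # F(s, a', z), sampled actions
conservative_fb_loss(F_data, F_rand, torch.randn(d)).backward()
```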
Revealing the Power of Spatial-Temporal Masked Autoencoders in Multivariate Time Series Forecasting
results: Experimental results show that integrating STMAE into existing spatial-temporal models can largely enhance their MTS forecasting capability.
Abstract
Multivariate time series (MTS) forecasting involves predicting future time series data based on historical observations. Existing research primarily emphasizes the development of complex spatial-temporal models that capture spatial dependencies and temporal correlations among time series variables explicitly. However, recent advances have been impeded by challenges relating to data scarcity and model robustness. To address these issues, we propose Spatial-Temporal Masked Autoencoders (STMAE), an MTS forecasting framework that leverages masked autoencoders to enhance the performance of spatial-temporal baseline models. STMAE consists of two learning stages. In the pretraining stage, an encoder-decoder architecture is employed. The encoder processes the partially visible MTS data produced by a novel dual-masking strategy, including biased random walk-based spatial masking and patch-based temporal masking. Subsequently, the decoders aim to reconstruct the masked counterparts from both spatial and temporal perspectives. The pretraining stage establishes a challenging pretext task, compelling the encoder to learn robust spatial-temporal patterns. In the fine-tuning stage, the pretrained encoder is retained, and the original decoder from existing spatial-temporal models is appended for forecasting. Extensive experiments are conducted on multiple MTS benchmarks. The promising results demonstrate that integrating STMAE into various spatial-temporal models can largely enhance their MTS forecasting capability.
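The dual-masking strategy can be pictured as two independent masks, one over time and one over the sensor graph. The toy sketch below generates a patch-based temporal mask and an (unbiased) random-walk spatial mask; the paper's biased walk and masking ratios are not reproduced.

```python
import numpy as np

def patch_temporal_mask(T, patch=12, ratio=0.25, rng=None):
    """Mask whole contiguous time patches (patch-based temporal masking)."""
    rng = rng or np.random.default_rng()
    n_patches = T // patch
    masked = rng.choice(n_patches, size=max(1, int(ratio * n_patches)), replace=False)
    mask = np.zeros(T, dtype=bool)
    for p in masked:
        mask[p * patch:(p + 1) * patch] = True
    return mask

def random_walk_spatial_mask(adj, n_steps, start, rng=None):
    """Mask the nodes visited by a random walk on the variable/sensor graph;
    our reading of the spatial masking (the paper's walk is biased)."""
    rng = rng or np.random.default_rng()
    node, visited = start, {start}
    for _ in range(n_steps):
        nbrs = np.flatnonzero(adj[node])
        if len(nbrs) == 0:
            break
        node = rng.choice(nbrs)
        visited.add(node)
    mask = np.zeros(adj.shape[0], dtype=bool)
    mask[list(visited)] = True
    return mask

rng = np.random.default_rng(0)
adj = (rng.random((8, 8)) < 0.3).astype(int)
print(patch_temporal_mask(96, rng=rng).sum(), random_walk_spatial_mask(adj, 5, 0, rng).sum())
```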
3D Reconstruction with Generalizable Neural Fields using Scene Priors
paper_authors: Yang Fu, Shalini De Mello, Xueting Li, Amey Kulkarni, Jan Kautz, Xiaolong Wang, Sifei Liu
For: The paper is written for high-fidelity 3D scene reconstruction using neural fields, with a focus on scalability and flexibility.
Methods: The paper introduces training generalizable Neural Fields incorporating scene Priors (NFPs), which map single-view RGB-D images into signed distance and radiance values. The NFP network does not require a fusion module, allowing for faster adaptation to new scenes with fewer views.
Results: The paper demonstrates state-of-the-art (SOTA) scene reconstruction performance and efficiency, as well as support for single-image novel-view synthesis, which is underexplored in neural fields.
Abstract
High-fidelity 3D scene reconstruction has been substantially advanced by recent progress in neural fields. However, most existing methods train a separate network from scratch for each individual scene. This is not scalable, inefficient, and unable to yield good results given limited views. While learning-based multi-view stereo methods alleviate this issue to some extent, their multi-view setting makes it less flexible to scale up and to broad applications. Instead, we introduce training generalizable Neural Fields incorporating scene Priors (NFPs). The NFP network maps any single-view RGB-D image into signed distance and radiance values. A complete scene can be reconstructed by merging individual frames in the volumetric space WITHOUT a fusion module, which provides better flexibility. The scene priors can be trained on large-scale datasets, allowing for fast adaptation to the reconstruction of a new scene with fewer views. NFP not only demonstrates SOTA scene reconstruction performance and efficiency, but it also supports single-image novel-view synthesis, which is underexplored in neural fields. More qualitative results are available at: https://oasisyang.github.io/neural-prior
Doduo: Learning Dense Visual Correspondence from Unsupervised Semantic-Aware Flow
results: Trained on in-the-wild video data, the approach accurately establishes per-pixel correspondence between images and remains accurate under dynamic scene changes. Code and additional visualizations are available at https://ut-austin-rpl.github.io/Doduo.
Abstract
Dense visual correspondence plays a vital role in robotic perception. This work focuses on establishing the dense correspondence between a pair of images that captures dynamic scenes undergoing substantial transformations. We introduce Doduo to learn general dense visual correspondence from in-the-wild images and videos without ground truth supervision. Given a pair of images, it estimates the dense flow field encoding the displacement of each pixel in one image to its corresponding pixel in the other image. Doduo uses flow-based warping to acquire supervisory signals for the training. Incorporating semantic priors with self-supervised flow training, Doduo produces accurate dense correspondence robust to the dynamic changes of the scenes. Trained on an in-the-wild video dataset, Doduo illustrates superior performance on point-level correspondence estimation over existing self-supervised correspondence learning baselines. We also apply Doduo to articulation estimation and zero-shot goal-conditioned manipulation, underlining its practical applications in robotics. Code and additional visualizations are available at https://ut-austin-rpl.github.io/Doduo
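Flow-based warping, the source of Doduo's supervisory signal, can be sketched as below: resample one image at locations displaced by the predicted flow, then compare against the other image (the actual training losses are omitted; the normalization convention here is an assumption).

```python
import torch
import torch.nn.functional as F

def warp_with_flow(img, flow):
    """Warp img (B, C, H, W) by a dense flow field (B, 2, H, W) giving, for
    each target pixel, the (dx, dy) offset of its source pixel."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0) + flow  # (B, 2, H, W)
    # normalize to [-1, 1]; grid_sample expects (B, H, W, 2) in (x, y) order
    grid[:, 0] = 2 * grid[:, 0] / (W - 1) - 1
    grid[:, 1] = 2 * grid[:, 1] / (H - 1) - 1
    return F.grid_sample(img, grid.permute(0, 2, 3, 1), align_corners=True)

img = torch.randn(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)  # zero flow: identity warp
print(torch.allclose(warp_with_flow(img, flow), img, atol=1e-5))
```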
Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models
results: The authors conduct a large-scale study across 11 datasets with over 40,000 prompts, using the Llama-2 family at all scales (7B, 13B, 70B). They propose SAT Probe, a method that predicts constraint satisfaction and factual errors and allows early error identification. The approach and findings demonstrate how the mechanistic understanding of factuality in LLMs can enhance reliability.
Abstract
We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text. We propose modeling factual queries as Constraint Satisfaction Problems and use this framework to investigate how the model interacts internally with factual constraints. Specifically, we discover a strong positive relation between the model's attention to constraint tokens and the factual accuracy of its responses. In our curated suite of 11 datasets with over 40,000 prompts, we study the task of predicting factual errors with the Llama-2 family across all scales (7B, 13B, 70B). We propose SAT Probe, a method probing self-attention patterns, that can predict constraint satisfaction and factual errors, and allows early error identification. The approach and findings demonstrate how using the mechanistic understanding of factuality in LLMs can enhance reliability.
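The quantity at the center of the paper, attention mass on constraint tokens, is easy to sketch: sum the attention that the generation position pays to the constraint-token positions, per layer and head. SAT Probe's actual estimator and classifier are not reproduced here.

```python
import torch

def constraint_attention(attn, constraint_pos, query_pos=-1):
    """Attention paid by the query position to the constraint tokens, per
    layer and head. attn: (L, H, T, T) cached attention maps;
    constraint_pos: indices of the constraint tokens."""
    return attn[:, :, query_pos, constraint_pos].sum(dim=-1)  # (L, H)

L, H, T = 4, 8, 16
attn = torch.rand(L, H, T, T).softmax(dim=-1)
feats = constraint_attention(attn, constraint_pos=[3, 4, 5])
print(feats.shape)  # torch.Size([4, 8]) -> features for a small error classifier
```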
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
paper_authors: Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
For: The paper aims to explore the use of large language models (LLMs) for temporally consistent long video generation, and to develop a novel framework called VideoDirectorGPT that can leverage the knowledge of LLMs for video content planning and grounded video generation.
Methods: The proposed VideoDirectorGPT framework consists of a video planner LLM (GPT-4) and a video generator (Layout2Vid), which work together to generate multi-scene videos with visual consistency across scenes. The video planner generates a "video plan" that includes scene descriptions, entity layouts, and background information, and the video generator uses this plan to generate the video content.
Results: The proposed framework substantially improves layout and movement control in both single- and multi-scene video generation, and can generate multi-scene videos with visual consistency across scenes while achieving competitive performance with state-of-the-art (SOTA) methods in open-domain single-scene text-to-video generation. Additionally, the framework can dynamically control the strength of layout guidance and can generate videos with user-provided images.
Abstract
Although recent text-to-video (T2V) generation methods have seen significant advancements, most of these works focus on producing short video clips of a single event with a single background (i.e., single-scene videos). Meanwhile, recent large language models (LLMs) have demonstrated their capability in generating layouts and programs to control downstream visual modules such as image generation models. This raises an important question: can we leverage the knowledge embedded in these LLMs for temporally consistent long video generation? In this paper, we propose VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation. Specifically, given a single text prompt, we first ask our video planner LLM (GPT-4) to expand it into a 'video plan', which involves generating the scene descriptions, the entities with their respective layouts, the background for each scene, and consistency groupings of the entities and backgrounds. Next, guided by this output from the video planner, our video generator, Layout2Vid, has explicit control over spatial layouts and can maintain temporal consistency of entities/backgrounds across scenes, while only trained with image-level annotations. Our experiments demonstrate that VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving competitive performance with SOTAs in open-domain single-scene T2V generation. We also demonstrate that our framework can dynamically control the strength for layout guidance and can also generate videos with user-provided images. We hope our framework can inspire future work on better integrating the planning ability of LLMs into consistent long video generation.
A Review on AI Algorithms for Energy Management in E-Mobility Services
paper_authors: Sen Yan, Maqsood Hussain Shah, Ji Li, Noel O’Connor, Mingming Liu
for: The study explores the potential of artificial intelligence (AI) in e-mobility systems (EMS) to address various challenges related to efficient energy management, including range anxiety, charge rate optimization, and energy storage longevity.
methods: By analyzing the existing literature, the study examines the role AI can play in EMS and proposes effective avenues for future investigations.
results: The study provides an overview of the current state of the art in AI applications for EMS and proposes effective directions for future research, aiming to contribute to sustainable and efficient e-mobility solutions and a greener, more sustainable future for transportation.
Abstract
E-mobility, or electric mobility, has emerged as a pivotal solution to address pressing environmental and sustainability concerns in the transportation sector. The depletion of fossil fuels, escalating greenhouse gas emissions, and the imperative to combat climate change underscore the significance of transitioning to electric vehicles (EVs). This paper seeks to explore the potential of artificial intelligence (AI) in addressing various challenges related to effective energy management in e-mobility systems (EMS). These challenges encompass critical factors such as range anxiety, charge rate optimization, and the longevity of energy storage in EVs. By analyzing existing literature, we delve into the role that AI can play in tackling these challenges and enabling efficient energy management in EMS. Our objectives are twofold: to provide an overview of the current state-of-the-art in this research domain and propose effective avenues for future investigations. Through this analysis, we aim to contribute to the advancement of sustainable and efficient e-mobility solutions, shaping a greener and more sustainable future for transportation.
When Prolog meets generative models: a new approach for managing knowledge and planning in robotic applications
results: Demonstrated on a realistic application example, the system achieves automated plan generation and execution, improving the efficiency and reliability of robotic systems.
Abstract
In this paper, we propose a robot oriented knowledge management system based on the use of the Prolog language. Our framework hinges on a special organisation of knowledge base that enables: 1. its efficient population from natural language texts using semi-automated procedures based on Large Language Models, 2. the bumpless generation of temporal parallel plans for multi-robot systems through a sequence of transformations, 3. the automated translation of the plan into an executable formalism (the behaviour trees). The framework is supported by a set of open source tools and is shown on a realistic application.
Class Incremental Learning via Likelihood Ratio Based Task Prediction
for: This paper targets the class incremental learning (CIL) setting of continual learning, where tasks are learned sequentially and no task identifier is provided at test time.
results: The paper shows that using a traditional OOD detector for task-id prediction is sub-optimal, because additional information available in CIL (such as replay data and the learned tasks) can be exploited to design a better and more principled task-id prediction method. The proposed TPLR (Task-id Prediction based on Likelihood Ratio) method markedly outperforms strong CIL baselines.
Abstract
Class incremental learning (CIL) is a challenging setting of continual learning, which learns a series of tasks sequentially. Each task consists of a set of unique classes. The key feature of CIL is that no task identifier (or task-id) is provided at test time for each test sample. Predicting the task-id for each test sample is a challenging problem. An emerging theoretically justified and effective approach is to train a task-specific model for each task in a shared network for all tasks based on a task-incremental learning (TIL) method to deal with forgetting. The model for each task in this approach is an out-of-distribution (OOD) detector rather than a conventional classifier. The OOD detector can perform both within-task (in-distribution (IND)) class prediction and OOD detection. The OOD detection capability is the key for task-id prediction during inference for each test sample. However, this paper argues that using a traditional OOD detector for task-id prediction is sub-optimal because additional information (e.g., the replay data and the learned tasks) available in CIL can be exploited to design a better and principled method for task-id prediction. We call the new method TPLR (Task-id Prediction based on Likelihood Ratio}). TPLR markedly outperforms strong CIL baselines.
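The likelihood-ratio idea can be illustrated with a toy sketch: estimate an in-task density and an "everything else" density for each task, and pick the task with the largest log-ratio. The kernel-density estimator and synthetic features below are stand-ins for the paper's actual models.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical per-task feature vectors, e.g. drawn from replay data.
rng = np.random.default_rng(0)
task_feats = {0: rng.normal(0, 1, (200, 1)), 1: rng.normal(4, 1, (200, 1))}

def predict_task_id(x):
    """Pick the task with the largest likelihood ratio p_k(x) / p_not_k(x)."""
    scores = {}
    for k, feats in task_feats.items():
        p_in = gaussian_kde(feats.T)(x)    # in-task density estimate
        rest = np.vstack([f for j, f in task_feats.items() if j != k])
        p_out = gaussian_kde(rest.T)(x)    # density under the other tasks
        scores[k] = np.log(p_in + 1e-12) - np.log(p_out + 1e-12)
    return max(scores, key=lambda k: scores[k].item())

print(predict_task_id(np.array([[3.8]])))  # -> 1
```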
Combining Survival Analysis and Machine Learning for Mass Cancer Risk Prediction using EHR data
results: Compared with the baseline, the proposed method shows a clear advantage on the primary metric, Average Precision (22.8% vs 15.1%), with ROC AUC 83.7% vs 84.9% and F1 17.8% vs 21.4%. In a blind retrospective out-of-time test, the method also reliably detects cancer patients (9 out of 100 selected).
Abstract
Purely medical cancer screening methods are often costly, time-consuming, and weakly applicable on a large scale. Advanced Artificial Intelligence (AI) methods greatly help cancer detection but require specific or deep medical data. These aspects affect the mass implementation of cancer screening methods. For these reasons, it is a disruptive change for healthcare to apply AI methods for mass personalized assessment of the cancer risk among patients based on the existing Electronic Health Records (EHR) volume. This paper presents a novel method for mass cancer risk prediction using EHR data. Among other methods, our one stands out by the minimum data greedy policy, requiring only a history of medical service codes and diagnoses from EHR. We formulate the problem as a binary classification. This dataset contains 175 441 de-identified patients (2 861 diagnosed with cancer). As a baseline, we implement a solution based on a recurrent neural network (RNN). We propose a method that combines machine learning and survival analysis since these approaches are less computationally heavy, can be combined into an ensemble (the Survival Ensemble), and can be reproduced in most medical institutions. We test the Survival Ensemble in some studies. Firstly, we obtain a significant difference between values of the primary metric (Average Precision) with 22.8% (ROC AUC 83.7%, F1 17.8%) for the Survival Ensemble versus 15.1% (ROC AUC 84.9%, F1 21.4%) for the Baseline. Secondly, the performance of the Survival Ensemble is also confirmed during the ablation study. Thirdly, our method exceeds age baselines by a significant margin. Fourthly, in the blind retrospective out-of-time experiment, the proposed method is reliable in cancer patient detection (9 out of 100 selected). Such results exceed the estimates of medical screenings, e.g., the best Number Needed to Screen (9 out of 1000 screenings).
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
paper_authors: Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner
for: The paper aims to develop a simple lie detector for large language models (LLMs) that does not require access to the LLM’s activations or ground-truth knowledge of the fact in question.
methods: The lie detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM’s yes/no answers into a logistic regression classifier.
results: The detector is highly accurate and generalizes well to different LLM architectures, fine-tuned lies, sycophantic lies, and real-life scenarios such as sales, indicating that LLMs have distinctive lie-related behavioral patterns that could enable general-purpose lie detection.
Abstract
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
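The detector is simple enough to sketch end to end. The follow-up questions and answer patterns below are invented for illustration, but the pipeline (binary answer vectors fed into logistic regression) mirrors the description above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds the yes/no (1/0) answers an LLM gave to the same fixed set of
# unrelated follow-up questions; labels mark whether the preceding statement
# was a lie. Values here are illustrative placeholders.
followup_answers = np.array([
    [1, 0, 1, 1, 0, 1],   # after an honest answer
    [1, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],   # after a lie
    [0, 1, 1, 0, 1, 0],
])
was_lie = np.array([0, 0, 1, 1])

detector = LogisticRegression().fit(followup_answers, was_lie)
new_answers = np.array([[0, 1, 0, 0, 1, 1]])
print(detector.predict_proba(new_answers)[0, 1])  # estimated probability of a lie
```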
Don’t throw away your value model! Making PPO even better via Value-Guided Monte-Carlo Tree Search decoding
paper_authors: Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, Asli Celikyilmaz
for: Improving the readability and appeal of generated text
methods: Combining Monte-Carlo Tree Search (MCTS) with Proximal Policy Optimization (PPO)
results: Improves the preferability and readability of generated text over standard practice
Abstract
Inference-time search algorithms such as Monte-Carlo Tree Search (MCTS) may seem unnecessary when generating natural language text based on state-of-the-art reinforcement learning such as Proximal Policy Optimization (PPO). In this paper, we demonstrate that it is possible to get extra mileage out of PPO by integrating MCTS on top. The key idea is not to throw out the value network, a byproduct of PPO training for evaluating partial output sequences, when decoding text out of the policy network. More concretely, we present a novel value-guided decoding algorithm called PPO-MCTS, which can integrate the value network from PPO to work closely with the policy network during inference-time generation. Compared to prior approaches based on MCTS for controlled text generation, the key strength of our approach is to reduce the fundamental mismatch of the scoring mechanisms of the partial outputs between training and test. Evaluation on four text generation tasks demonstrate that PPO-MCTS greatly improves the preferability of generated text compared to the standard practice of using only the PPO policy. Our results demonstrate the promise of search algorithms even on top of the aligned language models from PPO, and the under-explored benefit of the value network.
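A compact sketch of the value-guided search loop: priors come from a policy, leaf evaluations from a value function, and child selection uses a PUCT rule. The toy `policy` and `value` stubs stand in for PPO's networks; the decoding-specific details of PPO-MCTS are omitted.

```python
import math

# Minimal value-guided MCTS over token sequences (hypothetical stubs).
VOCAB = ["good", "bad", "great", "<eos>"]

def policy(seq):                      # token -> prior probability (stub)
    return {t: 1.0 / len(VOCAB) for t in VOCAB}

def value(seq):                       # scalar score of a partial sequence (stub)
    return seq.count("great") - seq.count("bad")

class Node:
    def __init__(self, seq, prior):
        self.seq, self.prior = seq, prior
        self.children, self.visits, self.value_sum = {}, 0, 0.0
    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct(parent, child, c=1.0):
    return child.q() + c * child.prior * math.sqrt(parent.visits) / (1 + child.visits)

def simulate(root, depth=3):
    node, path = root, [root]
    while node.children:                              # select down the tree
        node = max(node.children.values(), key=lambda ch: puct(node, ch))
        path.append(node)
    if len(node.seq) < depth:                         # expand with policy priors
        for tok, p in policy(node.seq).items():
            node.children[tok] = Node(node.seq + [tok], p)
    v = value(node.seq)                               # evaluate with value net
    for n in path:                                    # backpropagate
        n.visits += 1
        n.value_sum += v

root = Node([], 1.0)
for _ in range(200):
    simulate(root)
best = max(root.children.items(), key=lambda kv: kv[1].visits)
print("next token:", best[0])
```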
results: The survey presents a wide variety of benchmarks and evaluation methodologies for assessing the alignment of large language models, and analyzes the reliability and safety of these methods.
Abstract
Recent years have witnessed remarkable progress made in large language models (LLMs). Such advancements, while garnering significant attention, have concurrently elicited various concerns. The potential of these models is undeniably vast; however, they may yield texts that are imprecise, misleading, or even detrimental. Consequently, it becomes paramount to employ alignment techniques to ensure these models to exhibit behaviors consistent with human values. This survey endeavors to furnish an extensive exploration of alignment methodologies designed for LLMs, in conjunction with the extant capability research in this domain. Adopting the lens of AI alignment, we categorize the prevailing methods and emergent proposals for the alignment of LLMs into outer and inner alignment. We also probe into salient issues including the models' interpretability, and potential vulnerabilities to adversarial attacks. To assess LLM alignment, we present a wide variety of benchmarks and evaluation methodologies. After discussing the state of alignment research for LLMs, we finally cast a vision toward the future, contemplating the promising avenues of research that lie ahead. Our aspiration for this survey extends beyond merely spurring research interests in this realm. We also envision bridging the gap between the AI alignment research community and the researchers engrossed in the capability exploration of LLMs for both capable and safe LLMs.
PINF: Continuous Normalizing Flows for Physics-Constrained Deep Learning
results: Efficiently solves high-dimensional time-dependent and steady-state Fokker-Planck equations
Abstract
The normalization constraint on probability density poses a significant challenge for solving the Fokker-Planck equation. Normalizing Flow, an invertible generative model leverages the change of variables formula to ensure probability density conservation and enable the learning of complex data distributions. In this paper, we introduce Physics-Informed Normalizing Flows (PINF), a novel extension of continuous normalizing flows, incorporating diffusion through the method of characteristics. Our method, which is mesh-free and causality-free, can efficiently solve high dimensional time-dependent and steady-state Fokker-Planck equations.
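The underlying normalizing-flow bookkeeping can be sketched as follows: integrate the dynamics dz/dt = f(z, t) while tracking log-density through the instantaneous change-of-variables formula d(log p)/dt = -tr(df/dz). The toy network and Euler integrator below are illustrative assumptions, not the paper's method-of-characteristics treatment.

```python
import torch

# Toy dynamics network standing in for a learned flow.
f = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 2))

def trace_jacobian(fz, z):
    # exact trace of df/dz for low dimension; one autograd pass per coordinate
    return sum(torch.autograd.grad(fz[:, i].sum(), z, retain_graph=True)[0][:, i]
               for i in range(z.shape[1]))

def integrate(z, logp, steps=50, dt=1.0 / 50):   # simple forward-Euler sketch
    for _ in range(steps):
        z = z.requires_grad_(True)
        fz = f(z)
        logp = logp - trace_jacobian(fz, z) * dt  # probability density conservation
        z = (z + fz * dt).detach()
    return z, logp

z0 = torch.randn(8, 2)
logp0 = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2)).log_prob(z0)
z1, logp1 = integrate(z0, logp0)
print(z1.shape, logp1.shape)
```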
Unidirectional brain-computer interface: Artificial neural network encoding natural images to fMRI response in the visual cortex
paper_authors: Ruixing Liang, Xiangyu Zhang, Qiong Li, Lai Wei, Hexin Liu, Avisha Kumar, Kelley M. Kempski Leadingham, Joshua Punnoose, Leibny Paola Garcia, Amir Manbachi
for: The paper aims to use artificial intelligence to understand visual perception and to support the study of brain function and structure.
methods: The paper introduces an artificial neural network, VISION, that mimics the brain's visual processing. Using visual and contextual inputs, this multimodal model predicts the brain's functional magnetic resonance imaging (fMRI) responses.
results: VISION accurately predicts human hemodynamic fMRI responses to visual input, exceeding state-of-the-art performance by 45%. The paper further probes the trained network to reveal representational biases in different visual areas, generates experimentally testable hypotheses, and formulates an interpretable metric to associate these hypotheses with cortical functions.
Abstract
While significant advancements in artificial intelligence (AI) have catalyzed progress across various domains, its full potential in understanding visual perception remains underexplored. We propose an artificial neural network dubbed VISION, an acronym for "Visual Interface System for Imaging Output of Neural activity," to mimic the human brain and show how it can foster neuroscientific inquiries. Using visual and contextual inputs, this multimodal model predicts the brain's functional magnetic resonance imaging (fMRI) scan response to natural images. VISION successfully predicts human hemodynamic responses as fMRI voxel values to visual inputs with an accuracy exceeding state-of-the-art performance by 45%. We further probe the trained networks to reveal representational biases in different visual areas, generate experimentally testable hypotheses, and formulate an interpretable metric to associate these hypotheses with cortical functions. With both a model and evaluation metric, the cost and time burdens associated with designing and implementing functional analysis on the visual cortex could be reduced. Our work suggests that the evolution of computational models may shed light on our fundamental understanding of the visual cortex and provide a viable approach toward reliable brain-machine interfaces.
Automating question generation from educational text
paper_authors: Ayan Kumar Bhowmick, Ashish Jagmohan, Aditya Vempaty, Prasenjit Dey, Leigh Hall, Jeremy Hartman, Ravi Kokku, Hema Maheshwari
for: This study designs an automated question generation tool for formative and summative assessment in school learning and evaluation.
methods: Leveraging recent generative AI techniques, we design a modular framework that uses transformer-based language models to automatically generate multiple-choice questions (MCQs).
results: We perform an extensive quantitative and qualitative evaluation, demonstrating the trade-offs between different techniques and models.
Abstract
The use of question-based activities (QBAs) is wide-spread in education, traditionally forming an integral part of the learning and assessment process. In this paper, we design and evaluate an automated question generation tool for formative and summative assessment in schools. We present an expert survey of one hundred and four teachers, demonstrating the need for automated generation of QBAs, as a tool that can significantly reduce the workload of teachers and facilitate personalized learning experiences. Leveraging the recent advancements in generative AI, we then present a modular framework employing transformer based language models for automatic generation of multiple-choice questions (MCQs) from textual content. The presented solution, with distinct modules for question generation, correct answer prediction, and distractor formulation, enables us to evaluate different language models and generation techniques. Finally, we perform an extensive quantitative and qualitative evaluation, demonstrating trade-offs in the use of different techniques and models.
Measurement Models For Sailboats Price vs. Features And Regional Areas
results: The analysis finds that monohulled boats are generally more affordable than catamarans, and that parameters such as length, beam, displacement, and sail area directly influence price; a lower draft is also directly associated with higher listing prices. Across countries, GDP shows no direct effect on sailboat prices. Using a 50% cross-validation method, the models yield consistent results across test groups. The study provides machine-learning-enhanced sailboat price predictions, offering useful guidance to prospective buyers.
Abstract
In this study, we investigated the relationship between sailboat technical specifications and their prices, as well as regional pricing influences. Utilizing a dataset encompassing characteristics like length, beam, draft, displacement, sail area, and waterline, we applied multiple machine learning models to predict sailboat prices. The gradient descent model demonstrated superior performance, producing the lowest MSE and MAE. Our analysis revealed that monohulled boats are generally more affordable than catamarans, and that certain specifications such as length, beam, displacement, and sail area directly correlate with higher prices. Interestingly, lower draft was associated with higher listing prices. We also explored regional price determinants and found that the United States tops the list in average sailboat prices, followed by Europe, Hong Kong, and the Caribbean. Contrary to our initial hypothesis, a country's GDP showed no direct correlation with sailboat prices. Utilizing a 50% cross-validation method, our models yielded consistent results across test groups. Our research offers a machine learning-enhanced perspective on sailboat pricing, aiding prospective buyers in making informed decisions.
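A minimal sketch of the gradient-descent regression setup described above, with placeholder data: the feature list follows the paper, while the coefficients, units (prices in thousands), and split sizes are invented for illustration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

rng = np.random.default_rng(0)
# Columns stand for length, beam, draft, displacement, sail area, waterline.
X = rng.normal(size=(500, 6))
# Placeholder prices in thousands of dollars (not real data).
y = 200 + 30 * X[:, 0] - 10 * X[:, 2] + rng.normal(0, 5, 500)

# 50/50 split mirrors the paper's 50% cross-validation setup.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=2000, random_state=0))
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print("MSE:", mean_squared_error(y_te, pred), "MAE:", mean_absolute_error(y_te, pred))
```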
Investigating Deep Neural Network Architecture and Feature Extraction Designs for Sensor-based Human Activity Recognition
paper_authors: Danial Ahangarani, Mohammad Shirazi, Navid Ashraf
for: This study investigates the performance of common deep learning and machine learning approaches, as well as different training mechanisms and feature representations, for human activity recognition.
results: Experimental studies show that deep learning methods outperform traditional signal processing and machine learning approaches on human activity recognition tasks, and that different feature representations and training mechanisms affect task performance to varying degrees.
Abstract
The extensive ubiquitous availability of sensors in smart devices and the Internet of Things (IoT) has opened up the possibilities for implementing sensor-based activity recognition. As opposed to traditional sensor time-series processing and hand-engineered feature extraction, in light of deep learning's proven effectiveness across various domains, numerous deep methods have been explored to tackle the challenges in activity recognition, outperforming the traditional signal processing and traditional machine learning approaches. In this work, by performing extensive experimental studies on two human activity recognition datasets, we investigate the performance of common deep learning and machine learning approaches as well as different training mechanisms (such as contrastive learning), and various feature representations extracted from the sensor time-series data and measure their effectiveness for the human activity recognition task.
Improving Unsupervised Visual Program Inference with Code Rewriting Families
results: Using SIRI with the family of rewriters improves results on 2D and 3D CSG shape programming languages, including better reconstructions and faster convergence rates; at test time, the rewriters further improve the reconstruction performance of SIRI's predictions.
Abstract
Programs offer compactness and structure that makes them an attractive representation for visual data. We explore how code rewriting can be used to improve systems for inferring programs from visual data. We first propose Sparse Intermittent Rewrite Injection (SIRI), a framework for unsupervised bootstrapped learning. SIRI sparsely applies code rewrite operations over a dataset of training programs, injecting the improved programs back into the training set. We design a family of rewriters for visual programming domains: parameter optimization, code pruning, and code grafting. For three shape programming languages in 2D and 3D, we show that using SIRI with our family of rewriters improves performance: better reconstructions and faster convergence rates, compared with bootstrapped learning methods that do not use rewriters or use them naively. Finally, we demonstrate that our family of rewriters can be effectively used at test time to improve the output of SIRI predictions. For 2D and 3D CSG, we outperform or match the reconstruction performance of recent domain-specific neural architectures, while producing more parsimonious programs that use significantly fewer primitives.
Deep Generative Methods for Producing Forecast Trajectories in Power Systems
results: Extensive experiments were conducted on wind forecast data from the French TSO RTE, with comparisons based on ad hoc time-series generation metrics. The results show that the deep generative models capture the variability of wind production more effectively and reduce the likelihood of forecast errors.
Abstract
With the expansion of renewables in the electricity mix, power grid variability will increase, hence a need to robustify the system to guarantee its security. Therefore, Transport System Operators (TSOs) must conduct analyses to simulate the future functioning of power systems. Then, these simulations are used as inputs in decision-making processes. In this context, we investigate using deep learning models to generate energy production and load forecast trajectories. To capture the spatiotemporal correlations in these multivariate time series, we adapt autoregressive networks and normalizing flows, demonstrating their effectiveness against the current copula-based statistical approach. We conduct extensive experiments on the French TSO RTE wind forecast data and compare the different models with \textit{ad hoc} evaluation metrics for time series generation.
Recurrent Hypernetworks are Surprisingly Strong in Meta-RL
results: The study finds that combining end-to-end learning with an off-the-shelf sequential model (such as a recurrent network) can achieve strong performance, but the use of hypernetworks is the key to unlocking this potential. Surprisingly, these simple baselines actually achieve the strongest performance of all methods evaluated.
Abstract
Deep reinforcement learning (RL) is notoriously impractical to deploy due to sample inefficiency. Meta-RL directly addresses this sample inefficiency by learning to perform few-shot learning when a distribution of related tasks is available for meta-training. While many specialized meta-RL methods have been proposed, recent work suggests that end-to-end learning in conjunction with an off-the-shelf sequential model, such as a recurrent network, is a surprisingly strong baseline. However, such claims have been controversial due to limited supporting evidence, particularly in the face of prior work establishing precisely the opposite. In this paper, we conduct an empirical investigation. While we likewise find that a recurrent network can achieve strong performance, we demonstrate that the use of hypernetworks is crucial to maximizing their potential. Surprisingly, when combined with hypernetworks, the recurrent baselines that are far simpler than existing specialized methods actually achieve the strongest performance of all methods evaluated.
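A minimal sketch of the hypernetwork idea: a recurrent trunk summarizes the interaction history, and a hypernetwork maps that summary to the weights of a small policy head. All sizes and names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

OBS, HID, ACT = 8, 32, 4

rnn = nn.GRU(OBS, HID, batch_first=True)      # summarizes the history
hyper = nn.Linear(HID, HID * ACT + ACT)       # emits head weights + biases

def act(obs_history):
    _, h = rnn(obs_history)                   # h: (1, batch, HID)
    h = h.squeeze(0)
    params = hyper(h)                         # per-trajectory generated head
    W = params[:, : HID * ACT].view(-1, ACT, HID)
    b = params[:, HID * ACT :]
    logits = torch.einsum("bah,bh->ba", W, h) + b  # apply generated linear head
    return torch.distributions.Categorical(logits=logits).sample()

obs = torch.randn(2, 10, OBS)                 # batch of 2 histories, 10 steps
print(act(obs))
```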
Interactively Learning Social Media Representations Improves News Source Factuality Detection
results: On real-world events, our experiments show improved performance in detecting the factuality of news sources, even after only a few human interactions.
Abstract
The rise of social media has enabled the widespread propagation of fake news, text that is published with an intent to spread misinformation and sway beliefs. Rapidly detecting fake news, especially as new events arise, is important to prevent misinformation. While prior works have tackled this problem using supervised learning systems, automatedly modeling the complexities of the social media landscape that enables the spread of fake news is challenging. On the contrary, having humans fact check all news is not scalable. Thus, in this paper, we propose to approach this problem interactively, where humans can interact to help an automated system learn a better social media representation quality. On real world events, our experiments show performance improvements in detecting factuality of news sources, even after few human interactions.
Contrastive Continual Multi-view Clustering with Filtered Structural Fusion
paper_authors: Xinhang Wan, Jiyuan Liu, Ao Li, Xinwang Liu, En Zhu
for: The paper targets clustering of real-time, sequentially collected data views, proposing Contrastive Continual Multi-view Clustering with Filtered Structural Fusion (CCMVC-FSF) to address the catastrophic forgetting that existing methods suffer when a new view arrives.
methods: The paper uses a fixed-size data buffer that stores filtered structural information to reduce interference, and employs contrastive learning to generate a robust partition matrix. The approach is further connected to semi-supervised learning and knowledge distillation.
results: Extensive experiments show that the proposed method mitigates catastrophic forgetting and performs excellently across diverse real-world scenarios.
Abstract
Multi-view clustering thrives in applications where views are collected in advance by extracting consistent and complementary information among views. However, it overlooks scenarios where data views are collected sequentially, i.e., real-time data. Due to privacy issues or memory burden, previous views are not available with time in these situations. Some methods are proposed to handle it but are trapped in a stability-plasticity dilemma. In specific, these methods undergo a catastrophic forgetting of prior knowledge when a new view is attained. Such a catastrophic forgetting problem (CFP) would cause the consistent and complementary information hard to get and affect the clustering performance. To tackle this, we propose a novel method termed Contrastive Continual Multi-view Clustering with Filtered Structural Fusion (CCMVC-FSF). Precisely, considering that data correlations play a vital role in clustering and prior knowledge ought to guide the clustering process of a new view, we develop a data buffer with fixed size to store filtered structural information and utilize it to guide the generation of a robust partition matrix via contrastive learning. Furthermore, we theoretically connect CCMVC-FSF with semi-supervised learning and knowledge distillation. Extensive experiments exhibit the excellence of the proposed method.
Addressing preferred orientation in single-particle cryo-EM through AI-generated auxiliary particles
methods: A conditional deep generative model generates auxiliary particles, addressing the intrinsic bias in orientation estimation for the observed particles.
results: Near-atomic-resolution structure reconstruction of the hemagglutinin trimer in single-particle cryo-EM analysis, and reconstruction of the membrane protein NaX from non-tilted data containing micelle effects using the enhanced cryoPROS-MP version.
Abstract
The single-particle cryo-EM field faces the persistent challenge of preferred orientation, lacking general computational solutions. We introduce cryoPROS, an AI-based approach designed to address the above issue. By generating the auxiliary particles with a conditional deep generative model, cryoPROS addresses the intrinsic bias in orientation estimation for the observed particles. We effectively employed cryoPROS in the cryo-EM single particle analysis of the hemagglutinin trimer, showing the ability to restore the near-atomic resolution structure on non-tilt data. Moreover, the enhanced version named cryoPROS-MP significantly improves the resolution of the membrane protein NaX using the no-tilted data that contains the effects of micelles. Compared to the classical approaches, cryoPROS does not need special experimental or image acquisition techniques, providing a purely computational yet effective solution for the preferred orientation problem. Finally, we conduct extensive experiments that establish the low risk of model bias and the high robustness of cryoPROS.
Multi-Source Domain Adaptation for Object Detection with Prototype-based Mean-teacher
paper_authors: Atif Belal, Akhil Meethal, Francisco Perdigon Romero, Marco Pedersoli, Eric Granger
for: Adapting visual object detectors to operational target domains using unsupervised domain adaptation (UDA), specifically in multi-source domain adaptation (MSDA) scenarios.
methods: The proposed Prototype-based Mean-Teacher (PMT) method uses class prototypes learned with a contrastive loss to preserve domain-specific information and align categories across domains.
results: PMT outperforms state-of-the-art MSDA methods on several challenging object detection datasets, demonstrating its effectiveness in adapting visual object detectors to operational target domains.
Abstract
Adapting visual object detectors to operational target domains is a challenging task, commonly achieved using unsupervised domain adaptation (UDA) methods. When the labeled dataset is coming from multiple source domains, treating them as separate domains and performing a multi-source domain adaptation (MSDA) improves the accuracy and robustness over mixing these source domains and performing a UDA, as observed by recent studies in MSDA. Existing MSDA methods learn domain invariant and domain-specific parameters (for each source domain) for the adaptation. However, unlike single-source UDA methods, learning domain-specific parameters makes them grow significantly proportional to the number of source domains used. This paper proposes a novel MSDA method called Prototype-based Mean-Teacher (PMT), which uses class prototypes instead of domain-specific subnets to preserve domain-specific information. These prototypes are learned using a contrastive loss, aligning the same categories across domains and separating different categories far apart. Because of the use of prototypes, the parameter size of our method does not increase significantly with the number of source domains, thus reducing memory issues and possible overfitting. Empirical studies show PMT outperforms state-of-the-art MSDA methods on several challenging object detection datasets.
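The prototype-plus-contrastive-loss ingredient can be sketched in a few lines: features are pulled toward their class prototype and pushed away from the others. The mean-teacher machinery (EMA teacher, pseudo-labels) is omitted, and the temperature and sizes are assumptions.

```python
import torch
import torch.nn.functional as F

num_classes, dim = 5, 128
prototypes = torch.nn.Parameter(torch.randn(num_classes, dim))

def prototype_contrastive_loss(feats, labels, tau=0.1):
    f = F.normalize(feats, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = f @ p.T / tau          # similarity of each feature to each prototype
    return F.cross_entropy(logits, labels)

feats = torch.randn(32, dim)                      # placeholder detector features
labels = torch.randint(0, num_classes, (32,))     # (pseudo-)labels
loss = prototype_contrastive_loss(feats, labels)
loss.backward()
print(loss.item())
```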
A Democratic Platform for Engaging with Disabled Community in Generative AI Development
for: The paper aims to involve the disabled community in the design and development of generative AI systems to address bias and incorrectness in the outputs generated by these systems when used by the disabled community.
methods: The proposed platform calls for asynchronous and remote collaboration between disabled and non-disabled individuals from diverse backgrounds, using a democratic approach to decision-making.
results: The paper hopes to gain insight into the factors that contribute to bias in generative AI systems when used by the disabled community, and to identify the main algorithmic factors responsible for incorrect or irrelevant outputs.
Abstract
Artificial Intelligence (AI) systems, especially generative AI technologies are becoming more relevant in our society. Tools like ChatGPT are being used by members of the disabled community e.g., Autistic people may use it to help compose emails. The growing impact and popularity of generative AI tools have prompted us to examine their relevance within the disabled community. The design and development phases often neglect this marginalized group, leading to inaccurate predictions and unfair discrimination directed towards them. This could result from bias in data sets, algorithms, and systems at various phases of creation and implementation. This workshop paper proposes a platform to involve the disabled community while building generative AI systems. With this platform, our aim is to gain insight into the factors that contribute to bias in the outputs generated by generative AI when used by the disabled community. Furthermore, we expect to comprehend which algorithmic factors are the main contributors to the output's incorrectness or irrelevancy. The proposed platform calls on both disabled and non-disabled people from various geographical and cultural backgrounds to collaborate asynchronously and remotely in a democratic approach to decision-making.
Label Deconvolution for Node Representation Learning on Large-scale Attributed Graphs against Learning Bias
methods: The study combines pre-trained models with graph neural networks (GNNs) to encode attributes and graph structure simultaneously. Because jointly training on large-scale graphs suffers from severe scalability issues, many methods train the node encoders (NEs) and GNNs separately; however, this ignores the feature convolutions in GNNs and introduces a learning bias. The study proposes an efficient label regularization technique, Label Deconvolution (LD), to address this problem.
results: Experiments show that LD significantly outperforms state-of-the-art methods on Open Graph Benchmark datasets.
Abstract
Node representation learning on attributed graphs -- whose nodes are associated with rich attributes (e.g., texts and protein sequences) -- plays a crucial role in many important downstream tasks. To encode the attributes and graph structures simultaneously, recent studies integrate pre-trained models with graph neural networks (GNNs), where pre-trained models serve as node encoders (NEs) to encode the attributes. As jointly training large NEs and GNNs on large-scale graphs suffers from severe scalability issues, many methods propose to train NEs and GNNs separately. Consequently, they do not take feature convolutions in GNNs into consideration in the training phase of NEs, leading to a significant learning bias from that by the joint training. To address this challenge, we propose an efficient label regularization technique, namely Label Deconvolution (LD), to alleviate the learning bias by a novel and highly scalable approximation to the inverse mapping of GNNs. The inverse mapping leads to an objective function that is equivalent to that by the joint training, while it can effectively incorporate GNNs in the training phase of NEs against the learning bias. More importantly, we show that LD converges to the optimal objective function values by thejoint training under mild assumptions. Experiments demonstrate LD significantly outperforms state-of-the-art methods on Open Graph Benchmark datasets.
results: The studies find that artists and creators in AI Arts need more information and tools to understand the sustainability implications of AI, and would benefit from an explainable sustainability model that helps them better understand those implications.
Abstract
AI is becoming increasingly popular in artistic practices, but the tools for informing practitioners about the environmental impact (and other sustainability implications) of AI are adapted for other contexts than creative practices -- making the tools and sustainability implications of AI not accessible for artists and creative practitioners. In this position paper, I describe two empirical studies that aim to develop environmental sustainability reflection systems for AI Arts, and discuss and introduce Explainable Sustainability in for AI Arts.
Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation
results: The results show that different fine-tuning methods affect Stable Diffusion's performance in different ways; the study also provides a systematic evaluation framework that helps researchers understand these effects and apply them in practice.
Abstract
Text-to-image generative models have garnered immense attention for their ability to produce high-fidelity images from text prompts. Among these, Stable Diffusion distinguishes itself as a leading open-source model in this fast-growing field. However, the intricacies of fine-tuning these models pose multiple challenges from new methodology integration to systematic evaluation. Addressing these issues, this paper introduces LyCORIS (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion) [https://github.com/KohakuBlueleaf/LyCORIS], an open-source library that offers a wide selection of fine-tuning methodologies for Stable Diffusion. Furthermore, we present a thorough framework for the systematic assessment of varied fine-tuning techniques. This framework employs a diverse suite of metrics and delves into multiple facets of fine-tuning, including hyperparameter adjustments and the evaluation with different prompt types across various concept categories. Through this comprehensive approach, our work provides essential insights into the nuanced effects of fine-tuning parameters, bridging the gap between state-of-the-art research and practical application.
Supersonic: Learning to Generate Source Code Optimizations in C/C++
paper_authors: Zimin Chen, Sen Fang, Martin Monperrus
for: This paper targets minor source-code modifications for optimization.
methods: The paper presents a neural approach called Supersonic, which uses a seq2seq model to optimize C/C++ programs.
results: The experiments show that Supersonic outperforms OpenAI's GPT-3.5-Turbo and GPT-4 on competitive programming tasks, while also minimizing the extent of the change with a model that is more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4.
Abstract
Software optimization refines programs for resource efficiency while preserving functionality. Traditionally, it is a process done by developers and compilers. This paper introduces a third option, automated optimization at the source code level. We present Supersonic, a neural approach targeting minor source code modifications for optimization. Using a seq2seq model, Supersonic is trained on C/C++ program pairs ($x_{t}$, $x_{t+1}$), where $x_{t+1}$ is an optimized version of $x_{t}$, and outputs a diff. Supersonic's performance is benchmarked against OpenAI's GPT-3.5-Turbo and GPT-4 on competitive programming tasks. The experiments show that Supersonic not only outperforms both models on the code optimization task but also minimizes the extent of the change with a model more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4.
Revisiting Softmax Masking for Stability in Continual Learning
methods: The paper proposes a method based on masking the softmax function to preserve the confidence distribution during continual learning.
results: Compared with state-of-the-art methods, the approach shows higher stability and sufficiently large plasticity on class- and task-incremental learning benchmarks, performing especially well with zero or small memory.
Abstract
In continual learning, many classifiers use softmax function to learn confidence. However, numerous studies have pointed out its inability to accurately determine confidence distributions for outliers, often referred to as epistemic uncertainty. This inherent limitation also curtails the accurate decisions for selecting what to forget and keep in previously trained confidence distributions over continual learning process. To address the issue, we revisit the effects of masking softmax function. While this method is both simple and prevalent in literature, its implication for retaining confidence distribution during continual learning, also known as stability, has been under-investigated. In this paper, we revisit the impact of softmax masking, and introduce a methodology to utilize its confidence preservation effects. In class- and task-incremental learning benchmarks with and without memory replay, our approach significantly increases stability while maintaining sufficiently large plasticity. In the end, our methodology shows better overall performance than state-of-the-art methods, particularly in the use with zero or small memory. This lays a simple and effective foundation of strongly stable replay-based continual learning.
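A minimal sketch of softmax masking, assuming the common recipe of setting logits outside the current task to -inf so that old-class logits receive zero gradient and their confidence distribution stays intact:

```python
import torch

num_classes = 10
logits = torch.randn(4, num_classes, requires_grad=True)
current_task_classes = torch.tensor([6, 7, 8, 9])   # classes of the new task

mask = torch.zeros(num_classes, dtype=torch.bool)
mask[current_task_classes] = True
masked_logits = logits.masked_fill(~mask, float("-inf"))

targets = torch.tensor([6, 9, 7, 8])                # labels within the new task
loss = torch.nn.functional.cross_entropy(masked_logits, targets)
loss.backward()
print(logits.grad[:, :6].abs().sum())               # old-class logits get zero gradient
```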
Evaluating Soccer Match Prediction Models: A Deep Learning Approach and Feature Optimization for Gradient-Boosted Trees
results: Results on the validation sets show that our model achieves strong performance and stability in win/draw/loss prediction, improving on models published for the 2017 Soccer Prediction Challenge.
Abstract
Machine learning models have become increasingly popular for predicting the results of soccer matches, however, the lack of publicly-available benchmark datasets has made model evaluation challenging. The 2023 Soccer Prediction Challenge required the prediction of match results first in terms of the exact goals scored by each team, and second, in terms of the probabilities for a win, draw, and loss. The original training set of matches and features, which was provided for the competition, was augmented with additional matches that were played between 4 April and 13 April 2023, representing the period after which the training set ended, but prior to the first matches that were to be predicted (upon which the performance was evaluated). A CatBoost model was employed using pi-ratings as the features, which were initially identified as the optimal choice for calculating the win/draw/loss probabilities. Notably, deep learning models have frequently been disregarded in this particular task. Therefore, in this study, we aimed to assess the performance of a deep learning model and determine the optimal feature set for a gradient-boosted tree model. The model was trained using the most recent five years of data, and three training and validation sets were used in a hyperparameter grid search. The results from the validation sets show that our model had strong performance and stability compared to previously published models from the 2017 Soccer Prediction Challenge for win/draw/loss prediction.
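A minimal sketch of the CatBoost win/draw/loss model described above. The pi-rating features are replaced with random placeholders, and the hyperparameters are arbitrary rather than the tuned values from the grid search.

```python
import numpy as np
from catboost import CatBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                  # placeholder home/away pi-ratings
y = rng.choice(["win", "draw", "loss"], size=1000)

model = CatBoostClassifier(loss_function="MultiClass", iterations=200, verbose=False)
model.fit(X, y)

probs = model.predict_proba(X[:1])              # class order given by model.classes_
print(dict(zip(model.classes_, probs[0])))
```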
Fine-tuning and aligning question answering models for complex information extraction tasks
paper_authors: Matthias Engelbach, Dennis Klau, Felix Scheerer, Jens Drawehn, Maximilien Kintz
for: This paper proposes an approach for improving the feature extraction of German business documents, such as insurance reports or medical leaflets, using extractive question answering (QA) models.
methods: The paper uses and integrates existing German QA models, and fine-tunes them for tailored extraction tasks of complex linguistic features.
results: Fine-tuning the QA models boosts performance for these tasks, even with a small set of annotated data. Additionally, the paper discusses the relevance of scoring metrics for evaluating information extraction tasks and proposes a combined metric that mimics the assessment criteria from human experts.
Abstract
The emergence of Large Language Models (LLMs) has boosted performance and possibilities in various NLP tasks. While the usage of generative AI models like ChatGPT opens up new opportunities for several business use cases, their current tendency to hallucinate fake content strongly limits their applicability to document analysis, such as information retrieval from documents. In contrast, extractive language models like question answering (QA) or passage retrieval models guarantee query results to be found within the boundaries of an according context document, which makes them candidates for more reliable information extraction in productive environments of companies. In this work we propose an approach that uses and integrates extractive QA models for improved feature extraction of German business documents such as insurance reports or medical leaflets into a document analysis solution. We further show that fine-tuning existing German QA models boosts performance for tailored extraction tasks of complex linguistic features like damage cause explanations or descriptions of medication appearance, even with using only a small set of annotated data. Finally, we discuss the relevance of scoring metrics for evaluating information extraction tasks and deduce a combined metric from Levenshtein distance, F1-Score, Exact Match and ROUGE-L to mimic the assessment criteria from human experts.
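A hedged sketch of such a combined metric: the equal weighting is an assumption, and the ROUGE-L component is approximated with a longest-common-subsequence ratio over tokens rather than a full ROUGE implementation.

```python
from difflib import SequenceMatcher

def token_f1(pred, gold):
    p, g = pred.split(), gold.split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if not common:
        return 0.0
    prec, rec = common / len(p), common / len(g)
    return 2 * prec * rec / (prec + rec)

def combined_score(pred, gold):
    exact = float(pred.strip() == gold.strip())
    char_sim = SequenceMatcher(None, pred, gold).ratio()  # Levenshtein-like similarity
    f1 = token_f1(pred, gold)
    lcs = SequenceMatcher(None, pred.split(), gold.split()).ratio()  # ROUGE-L proxy
    return (exact + char_sim + f1 + lcs) / 4

print(combined_score("water damage due to burst pipe", "damage due to a burst pipe"))
```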
Forgetting-aware Linear Bias for Attentive Knowledge Tracing
results: The paper proposes a simple yet effective solution, Forgetting-aware Linear Bias (FoLiBi), that models learners' forgetting behavior as a linear bias. Despite its simplicity, FoLiBi plugs easily into existing attention-based knowledge tracing models and yields a consistent improvement of up to 2.58% in AUC over state-of-the-art KT models on four benchmark datasets.
Abstract
Knowledge Tracing (KT) aims to track proficiency based on a question-solving history, allowing us to offer a streamlined curriculum. Recent studies actively utilize attention-based mechanisms to capture the correlation between questions and combine it with the learner's characteristics for responses. However, our empirical study shows that existing attention-based KT models neglect the learner's forgetting behavior, especially as the interaction history becomes longer. This problem arises from the bias that overprioritizes the correlation of questions while inadvertently ignoring the impact of forgetting behavior. This paper proposes a simple-yet-effective solution, namely Forgetting-aware Linear Bias (FoLiBi), to reflect forgetting behavior as a linear bias. Despite its simplicity, FoLiBi is readily equipped with existing attentive KT models by effectively decomposing question correlations with forgetting behavior. FoLiBi plugged with several KT models yields a consistent improvement of up to 2.58% in AUC over state-of-the-art KT models on four benchmark datasets.
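A minimal sketch of a forgetting-aware linear bias on attention scores, in the spirit of FoLiBi: the further back a past interaction lies, the larger the penalty on its attention logit, decoupling question correlation from recency. The exact parameterization (a single slope here) is an assumption.

```python
import torch

def biased_attention(q, k, v, slope=0.1):
    T = q.shape[1]
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5        # (B, T, T)
    dist = torch.arange(T)[:, None] - torch.arange(T)[None, :]   # query idx - key idx
    causal = dist >= 0
    scores = scores - slope * dist.clamp(min=0)                  # older => larger penalty
    scores = scores.masked_fill(~causal, float("-inf"))          # no peeking ahead
    return torch.softmax(scores, dim=-1) @ v

B, T, D = 2, 12, 16
q = k = v = torch.randn(B, T, D)   # placeholder interaction embeddings
print(biased_attention(q, k, v).shape)   # torch.Size([2, 12, 16])
```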
Semantic Map Learning of Traffic Light to Lane Assignment based on Motion Data
results: The researchers implement and evaluate a pattern-based contribution method, propose a transformation of available motion prediction datasets, and provide a public API so that researchers can develop and evaluate their own approaches.
Abstract
Understanding which traffic light controls which lane is crucial to navigate intersections safely. Autonomous vehicles commonly rely on High Definition (HD) maps that contain information about the assignment of traffic lights to lanes. The manual provisioning of this information is tedious, expensive, and not scalable. To remedy these issues, our novel approach derives the assignments from traffic light states and the corresponding motion patterns of vehicle traffic. This works in an automated way and independently of the geometric arrangement. We show the effectiveness of basic statistical approaches for this task by implementing and evaluating a pattern-based contribution method. In addition, our novel rejection method includes accompanying safety considerations by leveraging statistical hypothesis testing. Finally, we propose a dataset transformation to re-purpose available motion prediction datasets for semantic map learning. Our publicly available API for the Lyft Level 5 dataset enables researchers to develop and evaluate their own approaches.
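The statistical core can be sketched with a contingency-table test: if a light controls a lane, motion in that lane should co-occur with its green phase, an association a chi-square test can confirm. The counts below are toy values.

```python
import numpy as np
from scipy.stats import chi2_contingency

#                 moving  stopped
counts = np.array([[180,    20],    # observations while the light is green
                   [ 15,   185]])   # observations while the light is red

chi2, p_value, dof, _ = chi2_contingency(counts)
print(f"chi2={chi2:.1f}, p={p_value:.2e}")  # small p => reject "no association"
```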
paper_authors: Francesco Immorlano, Veronika Eyring, Thomas le Monnier de Gouville, Gabriele Accarino, Donatello Elia, Giovanni Aloisio, Pierre Gentine
results: The results show that the method more precisely projects global surface temperature fields in the 21st century, providing the more accurate climate projections needed for climate adaptation and mitigation. Specifically, the study finds that the 1.5°C threshold of the Paris Agreement will be crossed in 2031 (2028-2034) for SSP2-4.5, in 2029 (2027-2031) for SSP3-7.0, and in 2028 (2025-2031) for SSP5-8.5. Similarly, the 2°C threshold will be exceeded in 2051 (2045-2059), 2044 (2040-2047), and 2042 (2038-2047), respectively.
Abstract
Accurate climate projections are required for climate adaptation and mitigation. Earth system model simulations, used to project climate change, inherently make approximations in their representation of small-scale physical processes, such as clouds, that are at the root of the uncertainties in global mean temperature's response to increased greenhouse gas concentrations. Several approaches have been developed to use historical observations to constrain future projections and reduce uncertainties in climate projections and climate feedbacks. Yet those methods cannot capture the non-linear complexity inherent in the climate system. Using a Transfer Learning approach, we show that Machine Learning, in particular Deep Neural Networks, can be used to optimally leverage and merge the knowledge gained from Earth system model simulations and historical observations to more accurately project global surface temperature fields in the 21st century. For the Shared Socioeconomic Pathways (SSPs) 2-4.5, 3-7.0 and 5-8.5, we refine regional estimates and the global projection of the average global temperature in 2081-2098 (with respect to the period 1850-1900) to 2.73°C (2.44-3.11°C), 3.92°C (3.5-4.47°C) and 4.53°C (3.69-5.5°C), respectively, compared to the unconstrained 2.7°C (1.65-3.8°C), 3.71°C (2.56-4.97°C) and 4.47°C (2.95-6.02°C). Our findings show that the 1.5°C threshold of the Paris Agreement will be crossed in 2031 (2028-2034) for SSP2-4.5, in 2029 (2027-2031) for SSP3-7.0 and in 2028 (2025-2031) for SSP5-8.5. Similarly, the 2°C threshold will be exceeded in 2051 (2045-2059), 2044 (2040-2047) and 2042 (2038-2047) respectively. Our new method provides more accurate climate projections urgently required for climate adaptation.
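As a rough illustration of the transfer-learning recipe described above, the sketch below pre-trains a small emulator on Earth system model output and then fine-tunes it on observations at a reduced learning rate. The network shape, grid size, variable names, and data loaders are all assumptions for illustration; the authors' actual architecture is not reproduced here.

```python
import torch
import torch.nn as nn

# Toy emulator: maps a forcing-scenario encoding to a gridded
# surface-temperature field (H x W). Purely illustrative sizes.
class TempFieldNet(nn.Module):
    def __init__(self, in_dim=8, grid=(36, 72)):
        super().__init__()
        self.grid = grid
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, grid[0] * grid[1]),
        )

    def forward(self, x):
        return self.net(x).view(-1, *self.grid)

def train(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for forcings, temp_field in loader:
            opt.zero_grad()
            loss_fn(model(forcings), temp_field).backward()
            opt.step()

model = TempFieldNet()
# 1) Pre-train on abundant Earth system model simulations:
#    train(model, esm_sim_loader, epochs=50, lr=1e-3)
# 2) Fine-tune on scarce historical observations at a lower learning
#    rate so the simulation-derived knowledge is retained:
#    train(model, observation_loader, epochs=10, lr=1e-4)
```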
Exploring Small Language Models with Prompt-Learning Paradigm for Efficient Domain-Specific Text Classification
results: In few-shot settings, prompt-learning enables the T5-base model to reach roughly 75% accuracy with only limited labeled data (up to 15% of the full data). The study further shows that active few-shot sampling and an ensemble strategy in the prompt-learning pipeline yield a notable performance gain, and that in zero-shot settings well-designed prompts substantially improve the model.Abstract
Domain-specific text classification faces the challenge of scarce labeled data due to the high cost of manual labeling. Prompt-learning, known for its efficiency in few-shot scenarios, is proposed as an alternative to traditional fine-tuning methods. And besides, although large language models (LLMs) have gained prominence, small language models (SLMs, with under 1B parameters) offer significant customizability, adaptability, and cost-effectiveness for domain-specific tasks, given industry constraints. In this study, we investigate the potential of SLMs combined with prompt-learning paradigm for domain-specific text classification, specifically within customer-agent interactions in retail. Our evaluations show that, in few-shot settings when prompt-based model fine-tuning is possible, T5-base, a typical SLM with 220M parameters, achieve approximately 75% accuracy with limited labeled data (up to 15% of full data), which shows great potentials of SLMs with prompt-learning. Based on this, We further validate the effectiveness of active few-shot sampling and the ensemble strategy in the prompt-learning pipeline that contribute to a remarkable performance gain. Besides, in zero-shot settings with a fixed model, we underscore a pivotal observation that, although the GPT-3.5-turbo equipped with around 154B parameters garners an accuracy of 55.16%, the power of well designed prompts becomes evident when the FLAN-T5-large, a model with a mere 0.5% of GPT-3.5-turbo's parameters, achieves an accuracy exceeding 31% with the optimized prompt, a leap from its sub-18% performance with an unoptimized one. Our findings underscore the promise of prompt-learning in classification tasks with SLMs, emphasizing the benefits of active few-shot sampling, and ensemble strategies in few-shot settings, and the importance of prompt engineering in zero-shot settings.
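A minimal sketch of prompt-based classification with T5-base is shown below, using the public t5-base checkpoint from Hugging Face transformers. The template and label set are invented stand-ins for the paper's retail customer-agent prompts; prompt-based fine-tuning would additionally train on template-filled inputs paired with verbalized labels.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Illustrative prompt template and label words; the paper's actual
# templates for retail customer-agent dialogue are not public here.
TEMPLATE = 'Classify the customer utterance: "{text}" Topic:'
LABELS = ["billing", "shipping", "returns"]

tok = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def classify(text):
    inputs = tok(TEMPLATE.format(text=text), return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=4)
    pred = tok.decode(out[0], skip_special_tokens=True).strip().lower()
    # Fall back to the first label if the free-form output is off-list.
    return pred if pred in LABELS else LABELS[0]

print(classify("My package never arrived and tracking is stuck."))
```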
Boosting In-Context Learning with Factual Knowledge
results: Experiments show that KICT substantially outperforms strong baselines on auto-regressive LLMs, improving accuracy by more than 13% on text classification and 7% on question answering tasks.Abstract
In-Context Learning (ICL) over Large language models (LLMs) aims at solving previously unseen tasks by conditioning on a few training examples, eliminating the need for parameter updates and achieving competitive performance. In this paper, we demonstrate that factual knowledge is imperative for the performance of ICL in three core facets, i.e., the inherent knowledge learned in LLMs, the factual knowledge derived from the selected in-context examples, and the knowledge biases in LLMs for output generation. To unleash the power of LLMs in few-shot learning scenarios, we introduce a novel Knowledgeable In-Context Tuning (KICT) framework to further improve the performance of ICL: 1) injecting factual knowledge to LLMs during continual self-supervised pre-training, 2) judiciously selecting the examples with high knowledge relevance, and 3) calibrating the prediction results based on prior knowledge. We evaluate the proposed approaches on auto-regressive LLMs (e.g., GPT-style models) over multiple text classification and question answering tasks. Experimental results demonstrate that KICT substantially outperforms strong baselines, and improves by more than 13% and 7% of accuracy on text classification and question answering tasks, respectively.
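KICT's second step, judiciously selecting examples with high knowledge relevance, can be approximated very simply. The sketch below scores candidate demonstrations against the query with TF-IDF cosine similarity, a deliberately crude proxy for factual-knowledge relevance, then assembles the top-k into an in-context prompt.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for "knowledge relevance": lexical similarity between the
# query and candidate demonstrations. KICT scores factual-knowledge
# relevance; TF-IDF here is only a simple, runnable proxy.
def select_demonstrations(query, candidates, k=4):
    vec = TfidfVectorizer().fit(candidates + [query])
    cand_mat = vec.transform(candidates)        # rows are L2-normalized,
    q = vec.transform([query])                  # so dot product = cosine
    scores = (cand_mat @ q.T).toarray().ravel()
    top = np.argsort(scores)[::-1][:k]
    return [candidates[i] for i in top]

demos = select_demonstrations(
    "Which element has atomic number 26?",
    ["Iron's atomic number is 26. -> iron",
     "Paris is the capital of France. -> Paris",
     "Gold has atomic number 79. -> gold"],
    k=2,
)
prompt = "\n".join(demos) + "\nWhich element has atomic number 26? ->"
print(prompt)
```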
for: This paper aims to suggest a correct program with minimal repair edits for solving introductory programming problems.
methods: The authors use a pre-trained CodeT5 model and fine-tune it on code pairs of wrong and correct programs to suggest a correct program.
results: The fine-tuned CodeT5 achieves a pass@100 of 91.95% and an average edit distance of 6.84, indicating that at least one correct program can be suggested by generating 100 candidate programs.Abstract
Programmers often struggle to identify and fix bugs in their programs. In recent years, many language models (LMs) have been proposed to fix erroneous programs and support error recovery. However, the LMs tend to generate solutions that differ from the original input programs. This leads to potential comprehension difficulties for users. In this paper, we propose an approach to suggest a correct program with minimal repair edits using CodeT5. We fine-tune a pre-trained CodeT5 on code pairs of wrong and correct programs and evaluate its performance with several baseline models. The experimental results show that the fine-tuned CodeT5 achieves a pass@100 of 91.95% and an average edit distance of the most similar correct program of 6.84, which indicates that at least one correct program can be suggested by generating 100 candidate programs. We demonstrate the effectiveness of LMs in suggesting program repair with minimal edits for solving introductory programming problems.
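The inference loop behind pass@100 and the minimal-edit criterion might look like the sketch below. It loads the public Salesforce/codet5-base checkpoint (the paper fine-tunes on wrong/correct code pairs first, so repairs from the untrained base model would be weak), samples candidate repairs, and ranks them by token-level edit distance to the input.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

def levenshtein(a, b):
    # Token-level edit distance between two programs (single-row DP).
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

tok = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

def suggest_repair(wrong_program, n=100):
    inputs = tok(wrong_program, return_tensors="pt", truncation=True)
    outs = model.generate(**inputs, do_sample=True, top_p=0.95,
                          num_return_sequences=n, max_new_tokens=256)
    candidates = [tok.decode(o, skip_special_tokens=True) for o in outs]
    # pass@100 asks whether ANY candidate is correct; among candidates,
    # prefer the one closest to the input, i.e., the minimal repair.
    return min(candidates,
               key=lambda c: levenshtein(c.split(), wrong_program.split()))
```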
Age Minimization in Massive IoT via UAV Swarm: A Multi-agent Reinforcement Learning Approach
paper_authors: Eslam Eldeeb, Mohammad Shehab, Hirley Alves
for: This paper addresses the high-dimensional problem that arises when deploying a swarm of UAVs to collect fresh information from IoT devices, with the goal of minimizing the overall age of information in the IoT network.
methods: The paper applies multi-agent deep reinforcement learning, including cooperative and partially cooperative approaches.
results: Both the cooperative and partially cooperative multi-agent deep reinforcement learning approaches outperform the centralized deep reinforcement learning approach, especially in large-scale networks.Abstract
In many massive IoT communication scenarios, the IoT devices require coverage from dynamic units that can move close to the IoT devices and reduce the uplink energy consumption. A robust solution is to deploy a large number of UAVs (UAV swarm) to provide coverage and a better line of sight (LoS) for the IoT network. However, the study of these massive IoT scenarios with a massive number of serving units leads to high dimensional problems with high complexity. In this paper, we apply multi-agent deep reinforcement learning to address the high-dimensional problem that results from deploying a swarm of UAVs to collect fresh information from IoT devices. The target is to minimize the overall age of information in the IoT network. The results reveal that both cooperative and partially cooperative multi-agent deep reinforcement learning approaches are able to outperform the high-complexity centralized deep reinforcement learning approach, which stands helpless in large-scale networks.
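The quantity being minimized is easy to make concrete. Below is a toy age-of-information tracker in which each device's age grows by one per time slot and resets when any UAV collects from it; the negated total age serves as a shared team reward, a common shaping for cooperative MARL. This is illustrative bookkeeping, not the paper's environment.

```python
# Minimal age-of-information (AoI) bookkeeping for IoT devices served
# by a UAV swarm. The MARL agents would learn which device each UAV
# visits; here the visited set is just given.
class AoITracker:
    def __init__(self, n_devices):
        self.age = [0] * n_devices

    def step(self, collected_by_uavs):
        # Every device ages by one slot; devices visited by some UAV
        # reset because fresh information was just collected.
        for d in range(len(self.age)):
            self.age[d] = 0 if d in collected_by_uavs else self.age[d] + 1
        return -sum(self.age)  # shared team reward: minimize total AoI

tracker = AoITracker(n_devices=5)
print(tracker.step(collected_by_uavs={0, 3}))  # -3: devices 1, 2, 4 aged
```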
Ego-perspective enhanced fitness training experience of AR Try to Move game
results: The work delivers an AR Try to Move game and a CNN model that quickly and accurately recognizes user gestures, helping users strengthen their upper-limb muscle system through remote training with greater effectiveness and convenience.Abstract
AR, a recently emerging technology, has been widely used in entertainment to provide users with immersive, interactive, and, sometimes, engaging experiences. The process of rehabilitation treatment and motor training is often boring, and it is well known that users' exercise efficiency at home is often lower than in a rehabilitation institution. Thus far, there is no effective upper limb sports rehabilitation training game based on the ego-perspective. Hence, with the objective of enhancing the enjoyment experience in rehabilitation and more effective remote rehabilitation training, this work aims to provide an AR Try to Move game and a convolutional neural network (CNN) for identifying and classifying user gestures from a self-collected AR multiple interactive gestures dataset. Utilizing an AR game scoring system, users are incentivized to enhance their upper limb muscle system through remote training with greater effectiveness and convenience.
ANNCRIPS: Artificial Neural Networks for Cancer Research In Prediction & Survival
paper_authors: Amit Mathapati
for: This research aims to develop and validate an intelligent mathematical model using Artificial Neural Networks (ANNs) to enhance the early detection of prostate cancer.
methods: The model uses ANNs to analyze clinical and laboratory data, such as PSA levels and DRE results, to improve the accuracy of prostate cancer detection and reduce false positives.
results: The model demonstrates promising potential to reduce false positives and improve patient outcomes, and with further refinement could become a robust, marketable solution for prostate cancer detection.Abstract
Prostate cancer is a prevalent malignancy among men aged 50 and older. Current diagnostic methods primarily rely on blood tests, Prostate-Specific Antigen (PSA) levels, and Digital Rectal Examinations (DRE). However, these methods suffer from a significant rate of false positive results. This study focuses on the development and validation of an intelligent mathematical model utilizing Artificial Neural Networks (ANNs) to enhance the early detection of prostate cancer. The primary objective of this research paper is to present a novel mathematical model designed to aid in the early detection of prostate cancer, facilitating prompt intervention by healthcare professionals. The model's implementation demonstrates promising potential in reducing the incidence of false positives, thereby improving patient outcomes. Furthermore, we envision that, with further refinement, extensive testing, and validation, this model can evolve into a robust, marketable solution for prostate cancer detection. The long-term goal is to make this solution readily available for deployment in various screening centers, hospitals, and research institutions, ultimately contributing to more effective cancer screening and patient care.
Legal Question-Answering in the Indian Context: Efficacy, Challenges, and Potential of Modern AI Models
results: The study finds that current AILQA approaches are proficient at interpreting natural language prompts and generating precise responses.Abstract
Legal QA platforms bear the promise to metamorphose the manner in which legal experts engage with jurisprudential documents. In this exposition, we embark on a comparative exploration of contemporary AI frameworks, gauging their adeptness in catering to the unique demands of the Indian legal milieu, with a keen emphasis on Indian Legal Question Answering (AILQA). Our discourse zeroes in on an array of retrieval and QA mechanisms, positioning the OpenAI GPT model as a reference point. The findings underscore the proficiency of prevailing AILQA paradigms in decoding natural language prompts and churning out precise responses. The ambit of this study is tethered to the Indian criminal legal landscape, distinguished by its intricate nature and associated logistical constraints. To ensure a holistic evaluation, we juxtapose empirical metrics with insights garnered from seasoned legal practitioners, thereby painting a comprehensive picture of AI's potential and challenges within the realm of Indian legal QA.
Effective Multi-Agent Deep Reinforcement Learning Control with Relative Entropy Regularization
results: MACDPP shows clear advantages in learning capability and sample efficiency on multi-agent cooperation and competition tasks as well as traditional control tasks such as OpenAI benchmarks and robot arm manipulation.Abstract
In this paper, a novel Multi-agent Reinforcement Learning (MARL) approach, Multi-Agent Continuous Dynamic Policy Gradient (MACDPP) was proposed to tackle the issues of limited capability and sample efficiency in various scenarios controlled by multiple agents. It alleviates the inconsistency of multiple agents' policy updates by introducing the relative entropy regularization to the Centralized Training with Decentralized Execution (CTDE) framework with the Actor-Critic (AC) structure. Evaluated by multi-agent cooperation and competition tasks and traditional control tasks including OpenAI benchmarks and robot arm manipulation, MACDPP demonstrates significant superiority in learning capability and sample efficiency compared with both related multi-agent and widely implemented signal-agent baselines and therefore expands the potential of MARL in effectively learning challenging control scenarios.
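The relative entropy regularization can be sketched as a KL penalty between the updated and previous policies inside the actor loss. The snippet below assumes discrete actions and a fixed coefficient beta for simplicity; MACDPP's full CTDE machinery (centralized critics, per-agent actors) is omitted.

```python
import torch
import torch.nn.functional as F

def actor_loss(logits, old_logits, q_values, beta=0.1):
    """Policy loss with a relative entropy (KL) penalty.

    Penalizing KL(new policy || old policy) keeps each agent's update
    close to its previous policy, the flavor of regularization MACDPP
    adds to CTDE actor-critic training. Discrete actions and the fixed
    coefficient beta are simplifying assumptions for this sketch.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    old_log_probs = F.log_softmax(old_logits, dim=-1).detach()
    expected_q = (probs * q_values).sum(dim=-1)        # critic-guided term
    kl = (probs * (log_probs - old_log_probs)).sum(dim=-1)
    return (-expected_q + beta * kl).mean()

logits = torch.randn(32, 4, requires_grad=True)
loss = actor_loss(logits, logits.detach() + 0.1 * torch.randn(32, 4),
                  q_values=torch.randn(32, 4))
loss.backward()
```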
results: Personal large models can deliver high-quality results across a wide range of applications such as language and vision tasks; they also respond in real time and can run on personal computers or mobile devices.Abstract
Inspired by Federated Learning, in this paper, we propose personal large models that are distilled from traditional large language models but more adaptive to local users' personal information such as education background and hobbies. We classify the large language models into three levels: the personal level, expert level and traditional level. The personal level models are adaptive to users' personal information. They encrypt the users' input and protect their privacy. The expert level models focus on merging specific knowledge such as finance, IT and art. The traditional models focus on the universal knowledge discovery and upgrading the expert models. In such classifications, the personal models directly interact with the user. For the whole system, the personal models have users' (encrypted) personal information. Moreover, such models must be small enough to be performed on personal computers or mobile devices. Finally, they also have to response in real-time for better user experience and produce high quality results. The proposed personal large models can be applied in a wide range of applications such as language and vision tasks.
Optimizing delegation between human and AI collaborative agents
paper_authors: Andrew Fuchs, Andrea Passarella, Marco Conti
for: This work aims to help humans and autonomous agents operate as a hybrid team by accurately deciding when to authorize team members to act.
methods: A manager model is learned from observations of team performance, without restricting agents to matching dynamics.
results: The manager learns to make effective delegation decisions with teams of agents operating under differing representations of the environment, significantly outperforming alternative methods of managing the team.Abstract
In the context of humans operating with artificial or autonomous agents in a hybrid team, it is essential to accurately identify when to authorize those team members to perform actions. Given past examples where humans and autonomous systems can either succeed or fail at tasks, we seek to train a delegating manager agent to make delegation decisions with respect to these potential performance deficiencies. Additionally, we cannot always expect the various agents to operate within the same underlying model of the environment. It is possible to encounter cases where the actions and transitions would vary between agents. Therefore, our framework provides a manager model which learns through observations of team performance without restricting agents to matching dynamics. Our results show our manager learns to perform delegation decisions with teams of agents operating under differing representations of the environment, significantly outperforming alternative methods to manage the team.
From Asset Flow to Status, Action and Intention Discovery: Early Malice Detection in Cryptocurrency
For: This work develops a model for early malice detection in cryptocurrency, addressing the shortcomings of existing detectors, which rely on uninterpretable deep learning and support only retrospective analysis of specific illicit types.
Methods: Decision-Tree based feature Selection and Complement (DT-SC) defines asset transfer paths; a Status/Action Proposal Module (S/A-PM) and an Intention-VAE module then generate the status, action, intent-snippet, and hidden intent-snippet embeddings.
Results: Experiments on three real-world datasets show the proposed algorithm outperforms prior methods in detection speed and interpretability, with well-designed loss functions further improving both.Abstract
Cryptocurrency has been subject to illicit activities probably more often than traditional financial assets due to the pseudo-anonymous nature of its transacting entities. An ideal detection model is expected to achieve all three critical properties of (I) early detection, (II) good interpretability, and (III) versatility for various illicit activities. However, existing solutions cannot meet all these requirements, as most of them heavily rely on deep learning without interpretability and are only available for retrospective analysis of a specific illicit type. To tackle all these challenges, we propose Intention-Monitor for early malice detection in Bitcoin (BTC), where the on-chain record data for a certain address are much scarcer than other cryptocurrency platforms. We first define asset transfer paths with the Decision-Tree based feature Selection and Complement (DT-SC) to build different feature sets for different malice types. Then, the Status/Action Proposal Module (S/A-PM) and the Intention-VAE module generate the status, action, intent-snippet, and hidden intent-snippet embedding. With all these modules, our model is highly interpretable and can detect various illegal activities. Moreover, well-designed loss functions further enhance the prediction speed and model's interpretability. Extensive experiments on three real-world datasets demonstrate that our proposed algorithm outperforms the state-of-the-art methods. Furthermore, additional case studies justify our model can not only explain existing illicit patterns but can also find new suspicious characters.
Are Human-generated Demonstrations Necessary for In-context Learning?
results: Across arithmetic reasoning, commonsense reasoning, multi-task language understanding, and code generation benchmarks, SEC performs strongly without any human-crafted demonstrations, achieving results comparable to ICL with hand-crafted demonstrations. This indicates that contemporary LLMs possess sufficient competence on many tasks to depend exclusively on their own capacity for decision making, removing the need for external training data.Abstract
Despite the promising few-shot ability of large language models (LLMs), the standard paradigm of In-context Learning (ICL) suffers the disadvantages of susceptibility to selected demonstrations and the intricacy to generate these demonstrations. In this paper, we raise the fundamental question that whether human-generated demonstrations are necessary for ICL. To answer this question, we propose self-contemplation prompting strategy (SEC), a paradigm free from human-crafted demonstrations. The key point of SEC is that, instead of using hand-crafted examples as demonstrations in ICL, SEC asks LLMs to first create demonstrations on their own, based on which the final output is generated. SEC is a flexible framework and can be adapted to both the vanilla ICL and the chain-of-thought (CoT), but with greater ease: as the manual-generation process of both examples and rationale can be saved. Extensive experiments in arithmetic reasoning, commonsense reasoning, multi-task language understanding, and code generation benchmarks, show that SEC, which does not require hand-crafted demonstrations, significantly outperforms the zero-shot learning strategy, and achieves comparable results to ICL with hand-crafted demonstrations. This demonstrates that, for many tasks, contemporary LLMs possess a sufficient level of competence to exclusively depend on their own capacity for decision making, removing the need for external training data. Code is available at https://github.com/ruili33/SEC.
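The two-step flow of SEC is simple to express. In the sketch below, llm is a stand-in for any text-completion call (it is an assumption, not part of the paper's released code): the model first writes its own demonstrations for the question, then answers conditioned on them.

```python
# Sketch of SEC's self-contemplation flow: the model creates its own
# demonstrations, then generates the final output based on them.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def sec_answer(question: str, n_demos: int = 3) -> str:
    demo_prompt = (
        f"Write {n_demos} example questions similar to the one below, "
        f"each followed by a correct step-by-step answer.\n"
        f"Question: {question}\n"
    )
    self_made_demos = llm(demo_prompt)       # step 1: self-generated demos
    answer_prompt = (
        f"{self_made_demos}\n"
        f"Question: {question}\nAnswer:"     # step 2: answer with them
    )
    return llm(answer_prompt)
```

The same wrapper adapts to chain-of-thought by asking for rationales in the self-generated demonstrations, which is the greater-ease point the abstract makes.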
XGV-BERT: Leveraging Contextualized Language Model and Graph Neural Network for Efficient Software Vulnerability Detection
results: XGV-BERT achieves higher detection accuracy than two existing methods, VulDeePecker and SySeVR. On the VulDeePecker dataset, XGV-BERT reaches an F1-score of 97.5% versus VulDeePecker's 78.3%; on the SySeVR dataset, it reaches 95.5%, surpassing SySeVR's 83.5%.Abstract
With the advancement of deep learning (DL) in various fields, there are many attempts to reveal software vulnerabilities by data-driven approach. Nonetheless, such existing works lack the effective representation that can retain the non-sequential semantic characteristics and contextual relationship of source code attributes. Hence, in this work, we propose XGV-BERT, a framework that combines the pre-trained CodeBERT model and Graph Neural Network (GCN) to detect software vulnerabilities. By jointly training the CodeBERT and GCN modules within XGV-BERT, the proposed model leverages the advantages of large-scale pre-training, harnessing vast raw data, and transfer learning by learning representations for training data through graph convolution. The research results demonstrate that the XGV-BERT method significantly improves vulnerability detection accuracy compared to two existing methods such as VulDeePecker and SySeVR. For the VulDeePecker dataset, XGV-BERT achieves an impressive F1-score of 97.5%, significantly outperforming VulDeePecker, which achieved an F1-score of 78.3%. Again, with the SySeVR dataset, XGV-BERT achieves an F1-score of 95.5%, surpassing the results of SySeVR with an F1-score of 83.5%.
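One way to pair CodeBERT with a GCN, in the spirit of XGV-BERT, is sketched below: each code slice becomes a node whose feature is CodeBERT's [CLS] embedding, and a single symmetric-normalized GCN layer propagates information over a toy adjacency. The graph here is an assumption; XGV-BERT's actual graph construction over code attributes and its joint training are more involved.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
codebert = AutoModel.from_pretrained("microsoft/codebert-base")

snippets = ["strcpy(buf, input);", "len = strlen(input);", "char buf[64];"]
enc = tok(snippets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    x = codebert(**enc).last_hidden_state[:, 0]   # [CLS] embedding per node

adj = torch.tensor([[1., 1., 1.],
                    [1., 1., 0.],
                    [1., 0., 1.]])                # toy graph, self-loops in
deg_inv_sqrt = adj.sum(1).pow(-0.5).diag()
a_norm = deg_inv_sqrt @ adj @ deg_inv_sqrt        # symmetric normalization

class GCNLayer(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.lin = nn.Linear(dim_in, dim_out)

    def forward(self, a_hat, h):
        return torch.relu(self.lin(a_hat @ h))

gcn = GCNLayer(x.size(1), 128)
node_repr = gcn(a_norm, x)   # would feed a vulnerability classification head
```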
Leveraging Herpangina Data to Enhance Hospital-level Prediction of Hand-Foot-and-Mouth Disease Admissions Using UPTST
results: The model shows a clear advantage in both long- and short-arm prediction accuracy of hospital-level HFMD admissions, and exploratory extension experiments suggest its capabilities extend beyond infectious disease prediction, indicating broader applicability.Abstract
Outbreaks of hand-foot-and-mouth disease(HFMD) have been associated with significant morbidity and, in severe cases, mortality. Accurate forecasting of daily admissions of pediatric HFMD patients is therefore crucial for aiding the hospital in preparing for potential outbreaks and mitigating nosocomial transmissions. To address this pressing need, we propose a novel transformer-based model with a U-net shape, utilizing the patching strategy and the joint prediction strategy that capitalizes on insights from herpangina, a disease closely correlated with HFMD. This model also integrates representation learning by introducing reconstruction loss as an auxiliary loss. The results show that our U-net Patching Time Series Transformer (UPTST) model outperforms existing approaches in both long- and short-arm prediction accuracy of HFMD at hospital-level. Furthermore, the exploratory extension experiments show that the model's capabilities extend beyond prediction of infectious disease, suggesting broader applicability in various domains.
ALEX: Towards Effective Graph Transfer Learning with Noisy Labels
methods: The study proposes a novel technique, Balance Alignment and Information-aware Examination (ALEX), which applies singular value decomposition to generate structural views, provides robust node representations through graph contrastive learning, and estimates a prior distribution to build subgraphs with balanced label distributions.
results: Extensive experiments on multiple benchmark datasets demonstrate ALEX's clear superiority across different settings.Abstract
Graph Neural Networks (GNNs) have garnered considerable interest due to their exceptional performance in a wide range of graph machine learning tasks. Nevertheless, the majority of GNN-based approaches have been examined using well-annotated benchmark datasets, leading to suboptimal performance in real-world graph learning scenarios. To bridge this gap, the present paper investigates the problem of graph transfer learning in the presence of label noise, which transfers knowledge from a noisy source graph to an unlabeled target graph. We introduce a novel technique termed Balance Alignment and Information-aware Examination (ALEX) to address this challenge. ALEX first employs singular value decomposition to generate different views with crucial structural semantics, which help provide robust node representations using graph contrastive learning. To mitigate both label shift and domain shift, we estimate a prior distribution to build subgraphs with balanced label distributions. Building on this foundation, an adversarial domain discriminator is incorporated for the implicit domain alignment of complex multi-modal distributions. Furthermore, we project node representations into a different space, optimizing the mutual information between the projected features and labels. Subsequently, the inconsistency of similarity structures is evaluated to identify noisy samples with potential overfitting. Comprehensive experiments on various benchmark datasets substantiate the outstanding superiority of the proposed ALEX in different settings.
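ALEX's first step, generating structure-preserving views via singular value decomposition, has a compact core. The sketch below builds a low-rank view of a toy adjacency matrix by truncating its SVD; the rank k and the random graph are illustrative assumptions.

```python
import torch

# Alternative graph "view": keep only the top-k singular components of
# the adjacency matrix, preserving dominant structural semantics for
# graph contrastive learning.
adj = (torch.rand(100, 100) < 0.05).float()
adj = ((adj + adj.t()) > 0).float()               # symmetrize

u, s, vh = torch.linalg.svd(adj)
k = 10
adj_view = u[:, :k] @ torch.diag(s[:k]) @ vh[:k]  # low-rank structural view
print(adj_view.shape)
```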
Learning Emergent Behavior in Robot Swarms with NEAT
results: We run simulations of the Georgia Tech Miniature Autonomous Blimps (GT-MABs) aerial robotics platforms and additionally test on simulated Anki Vector robots. The algorithm is evaluated on an Area Coverage task, a Surround Target task, and a Wall Climb task; comparing the evolved behaviors against 'designed policies', our algorithm better achieves the desired swarm behaviors.Abstract
When researching robot swarms, many studies observe complex group behavior emerging from the individual agents' simple local actions. However, the task of learning an individual policy to produce a desired emergent behavior remains a challenging and largely unsolved problem. We present a method of training distributed robotic swarm algorithms to produce emergent behavior. Inspired by the biological evolution of emergent behavior in animals, we use an evolutionary algorithm to train a 'population' of individual behaviors to approximate a desired group behavior. We perform experiments using simulations of the Georgia Tech Miniature Autonomous Blimps (GT-MABs) aerial robotics platforms conducted in the CoppeliaSim simulator. Additionally, we test on simulations of Anki Vector robots to display our algorithm's effectiveness on various modes of actuation. We evaluate our algorithm on various tasks where a somewhat complex group behavior is required for success. These tasks include an Area Coverage task, a Surround Target task, and a Wall Climb task. We compare behaviors evolved using our algorithm against 'designed policies', which we create in order to exhibit the emergent behaviors we desire.
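With neat-python, the evolutionary loop is compact. In the sketch below, each genome encodes a single individual's controller and fitness is scored at the group level by rolling out many copies of that controller in a swarm simulation; simulate_swarm and the standard NEAT config file are placeholders standing in for the GT-MAB/CoppeliaSim setup.

```python
import neat

# Each genome encodes ONE individual controller; fitness is measured
# at the GROUP level by running n_agents copies of it in a rollout.
def simulate_swarm(net, n_agents=10):
    # Placeholder: roll out n_agents copies of `net` and return an
    # emergent-behavior score such as fraction of area covered.
    return 0.0

def eval_genomes(genomes, config):
    for _, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        genome.fitness = simulate_swarm(net)

config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     "config-feedforward")  # standard neat-python config file
winner = neat.Population(config).run(eval_genomes, 50)
```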
CoFiI2P: Coarse-to-Fine Correspondences for Image-to-Point Cloud Registration
results: On the KITTI dataset, CoFiI2P achieves excellent results, with a relative rotation error of 1.14 degrees and a relative translation error of 0.29 meters, improvements of 84% and 89% over the current state-of-the-art method.Abstract
Image-to-point cloud (I2P) registration is a fundamental task in the field of autonomous vehicles and transportation systems for cross-modality data fusion and localization. Existing I2P registration methods estimate correspondences at the point/pixel level, often overlooking global alignment. However, I2P matching can easily converge to a local optimum when performed without high-level guidance from global constraints. To address this issue, this paper introduces CoFiI2P, a novel I2P registration network that extracts correspondences in a coarse-to-fine manner to achieve the globally optimal solution. First, the image and point cloud data are processed through a Siamese encoder-decoder network for hierarchical feature extraction. Second, a coarse-to-fine matching module is designed to leverage these features and establish robust feature correspondences. Specifically, In the coarse matching phase, a novel I2P transformer module is employed to capture both homogeneous and heterogeneous global information from the image and point cloud data. This enables the estimation of coarse super-point/super-pixel matching pairs with discriminative descriptors. In the fine matching module, point/pixel pairs are established with the guidance of super-point/super-pixel correspondences. Finally, based on matching pairs, the transform matrix is estimated with the EPnP-RANSAC algorithm. Extensive experiments conducted on the KITTI dataset demonstrate that CoFiI2P achieves impressive results, with a relative rotation error (RRE) of 1.14 degrees and a relative translation error (RTE) of 0.29 meters. These results represent a significant improvement of 84\% in RRE and 89\% in RTE compared to the current state-of-the-art (SOTA) method. Qualitative results are available at https://youtu.be/ovbedasXuZE. The source code will be publicly released at https://github.com/kang-1-2-3/CoFiI2P.
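The final pose-estimation step uses the standard EPnP-RANSAC combination, which is available directly in OpenCV. The sketch below fabricates matched 3D-2D pairs from a known pose with KITTI-like intrinsics (both assumptions, standing in for the fine matches) and recovers the transform with cv2.solvePnPRansac.

```python
import numpy as np
import cv2

# Fabricate correspondences: project random 3D points with a known
# pose, then recover that pose with EPnP inside a RANSAC loop.
pts3d = (np.random.rand(50, 3).astype(np.float32) * 10).reshape(-1, 3)
K = np.array([[718.8, 0, 607.1],
              [0, 718.8, 185.2],
              [0, 0, 1]], dtype=np.float32)      # KITTI-like intrinsics
rvec_gt = np.array([0.02, -0.01, 0.005], dtype=np.float32)
tvec_gt = np.array([[0.5], [0.1], [1.0]], dtype=np.float32)
pts2d, _ = cv2.projectPoints(pts3d, rvec_gt, tvec_gt, K, None)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts3d, pts2d, K, None,
    flags=cv2.SOLVEPNP_EPNP, reprojectionError=3.0, iterationsCount=100)
R, _ = cv2.Rodrigues(rvec)                       # 3x3 rotation matrix
print(ok, tvec.ravel())                          # tvec should be near tvec_gt
```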
Divide and Conquer in Video Anomaly Detection: A Comprehensive Review and New Approach
results: Building on the insights from the review, a new method that integrates human skeletal frameworks with video data analysis techniques achieves state-of-the-art performance on the ShanghaiTech dataset, surpassing all existing advanced methods.Abstract
Video anomaly detection is a complex task, and the principle of "divide and conquer" is often regarded as an effective approach to tackling intricate issues. It's noteworthy that recent methods in video anomaly detection have revealed the application of the divide and conquer philosophy (albeit with distinct perspectives from traditional usage), yielding impressive outcomes. This paper systematically reviews these literatures from six dimensions, aiming to enhance the use of the divide and conquer strategy in video anomaly detection. Furthermore, based on the insights gained from this review, a novel approach is presented, which integrates human skeletal frameworks with video data analysis techniques. This method achieves state-of-the-art performance on the ShanghaiTech dataset, surpassing all existing advanced methods.
Towards A Unified Utilitarian Ethics Framework for Healthcare Artificial Intelligence
paper_authors: Forhan Bin Emdad, Shuyuan Mary Ho, Benhur Ravuri, Shezin Hussain
for: This study aims to identify the major ethical principles influencing the utility performance of AI in healthcare settings and to propose a new utilitarian ethics-based theoretical framework for designing ethical AI.
methods: The study uses a thematic analysis of secondary survey data from 36 AI experts to identify the top ethical principles of AI design, and a meta-analysis to categorize the ethical issues in AI design.
results: The study found that justice, privacy, bias, lack of regulations, risks, and interpretability are the most important ethical principles to consider for ethical AI in healthcare settings. The proposed theoretical framework is based on utilitarian ethics and aims to resolve the ethical issues identified by the meta-analysis and domain experts.Abstract
Artificial Intelligence (AI) aims to elevate healthcare to a pinnacle by aiding clinical decision support. Overcoming the challenges related to the design of ethical AI will enable clinicians, physicians, healthcare professionals, and other stakeholders to use and trust AI in healthcare settings. This study attempts to identify the major ethical principles influencing the utility performance of AI at different technological levels such as data access, algorithms, and systems through a thematic analysis. We observed that justice, privacy, bias, lack of regulations, risks, and interpretability are the most important principles to consider for ethical AI. This data-driven study has analyzed secondary survey data from the Pew Research Center (2020) of 36 AI experts to categorize the top ethical principles of AI design. To resolve the ethical issues identified by the meta-analysis and domain experts, we propose a new utilitarian ethics-based theoretical framework for designing ethical AI for the healthcare domain.
Unsupervised Graph Deep Learning Reveals Emergent Flood Risk Profile of Urban Areas
results: Using data from multiple metropolitan statistical areas (MSAs) in the United States, the model classifies each MSA's flood risk into six distinct levels and enables feature analysis of areas within each level, identifying the three archetypes shaping the highest flood risk in each MSA. Flood risk is found to follow a hierarchical spatial structure within each MSA, with the core city bearing a disproportionate share of the risk.Abstract
Urban flood risk emerges from complex and nonlinear interactions among multiple features related to flood hazard, flood exposure, and social and physical vulnerabilities, along with the complex spatial flood dependence relationships. Existing approaches for characterizing urban flood risk, however, are primarily based on flood plain maps, focusing on a limited number of features, primarily hazard and exposure features, without consideration of feature interactions or the dependence relationships among spatial areas. To address this gap, this study presents an integrated urban flood-risk rating model based on a novel unsupervised graph deep learning model (called FloodRisk-Net). FloodRisk-Net is capable of capturing spatial dependence among areas and complex and nonlinear interactions among flood hazards and urban features for specifying emergent flood risk. Using data from multiple metropolitan statistical areas (MSAs) in the United States, the model characterizes their flood risk into six distinct city-specific levels. The model is interpretable and enables feature analysis of areas within each flood-risk level, allowing for the identification of the three archetypes shaping the highest flood risk within each MSA. Flood risk is found to be spatially distributed in a hierarchical structure within each MSA, where the core city disproportionately bears the highest flood risk. Multiple cities are found to have high overall flood-risk levels and low spatial inequality, indicating limited options for balancing urban development and flood-risk reduction. Relevant flood-risk reduction strategies are discussed considering ways that the highest flood risk and uneven spatial distribution of flood risk are formed.
Efficient Post-training Quantization with FP8 Formats
for: The paper studies the advantages of FP8 data formats for post-training quantization of deep learning models and develops a quantization workflow that generalizes across different network architectures.
methods: The paper examines three FP8 representations (E5M2, E4M3, and E3M4), comparing their effects on model accuracy, and uses Intel Neural Compressor for quantization.
results: The paper finds that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy, and suitability for a broader range of operations. Additionally, E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks.Abstract
Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor: https://github.com/intel/neural-compressor.
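The trade-off between the FP8 layouts comes down to how the 8 bits split between exponent and mantissa. The sketch below simulates round-to-nearest quantize-dequantize in float32 for any (exponent, mantissa) split; it flushes subnormals and ignores E4M3's special-value conventions, so it is an approximation for building intuition rather than a bit-exact cast.

```python
import numpy as np

def fp8_quantize(x, exp_bits=4, man_bits=3):
    """Simulate round-to-nearest FP8 quantize-dequantize in float32.

    Simplified model: subnormals are flushed to zero and special-value
    encodings are ignored; close enough to show the dynamic-range vs.
    precision trade-off the paper studies.
    """
    x = np.asarray(x, dtype=np.float32)
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = 2 ** exp_bits - 1 - bias           # crude top of the range
    mant, exp = np.frexp(np.abs(x))              # |x| = mant * 2**exp
    exp = exp - 1                                # renormalize mant to [1, 2)
    mant = mant * 2
    mant = np.round(mant * 2 ** man_bits) / 2 ** man_bits
    exp = np.clip(exp, -bias + 1, max_exp)
    y = np.sign(x) * mant * np.exp2(exp.astype(np.float32))
    y[np.abs(x) < 2.0 ** (-bias + 1)] = 0.0      # flush subnormals
    return y

w = np.random.randn(5).astype(np.float32)
print(w)
print(fp8_quantize(w, 4, 3))   # E4M3: finer steps, narrower range
print(fp8_quantize(w, 5, 2))   # E5M2: coarser steps, wider range
```

Running it on the same weights with (4, 3) and (5, 2) makes the precision-versus-range difference between E4M3 and E5M2 directly visible.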
Joint Communication and Computation Framework for Goal-Oriented Semantic Communication with Distortion Rate Resilience
methods: The study uses rate-distortion theory to analyze the distortions induced by communication and semantic compression, in order to estimate the empirical performance of AI tasks.
results: Experiments show that the proposed method preserves AI task accuracy while adhering to network constraints, a valuable contribution to goal-oriented semantic communication; the work also highlights the role of data-driven approaches in optimizing the performance of intelligent systems.Abstract
Recent research efforts on semantic communication have mostly considered accuracy as a main problem for optimizing goal-oriented communication systems. However, these approaches introduce a paradox: the accuracy of artificial intelligence (AI) tasks should naturally emerge through training rather than being dictated by network constraints. Acknowledging this dilemma, this work introduces an innovative approach that leverages the rate-distortion theory to analyze distortions induced by communication and semantic compression, thereby analyzing the learning process. Specifically, we examine the distribution shift between the original data and the distorted data, thus assessing its impact on the AI model's performance. Founding upon this analysis, we can preemptively estimate the empirical accuracy of AI tasks, making the goal-oriented semantic communication problem feasible. To achieve this objective, we present the theoretical foundation of our approach, accompanied by simulations and experiments that demonstrate its effectiveness. The experimental results indicate that our proposed method enables accurate AI task performance while adhering to network constraints, establishing it as a valuable contribution to the field of signal processing. Furthermore, this work advances research in goal-oriented semantic communication and highlights the significance of data-driven approaches in optimizing the performance of intelligent systems.
Speech Audio Synthesis from Tagged MRI and Non-Negative Matrix Factorization via Plastic Transformer
paper_authors: Xiaofeng Liu, Fangxu Xing, Maureen Stone, Jiachen Zhuo, Sidney Fels, Jerry L. Prince, Georges El Fakhri, Jonghye Woo
for: This paper studies speech synthesis, specifically translating weighting maps of tongue functional units into their corresponding audio waveforms.
methods: Non-negative matrix factorization estimates the motion features of the functional units, and a deep learning framework translates the resulting weighting maps into audio waveforms.
results: Experiments show the method can synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.Abstract
The tongue's intricate 3D structure, comprising localized functional units, plays a crucial role in the production of speech. When measured using tagged MRI, these functional units exhibit cohesive displacements and derived quantities that facilitate the complex process of speech production. Non-negative matrix factorization-based approaches have been shown to estimate the functional units through motion features, yielding a set of building blocks and a corresponding weighting map. Investigating the link between weighting maps and speech acoustics can offer significant insights into the intricate process of speech production. To this end, in this work, we utilize two-dimensional spectrograms as a proxy representation, and develop an end-to-end deep learning framework for translating weighting maps to their corresponding audio waveforms. Our proposed plastic light transformer (PLT) framework is based on directional product relative position bias and single-level spatial pyramid pooling, thus enabling flexible processing of weighting maps with variable size to fixed-size spectrograms, without input information loss or dimension expansion. Additionally, our PLT framework efficiently models the global correlation of wide matrix input. To improve the realism of our generated spectrograms with relatively limited training samples, we apply pair-wise utterance consistency with Maximum Mean Discrepancy constraint and adversarial training. Experimental results on a dataset of 29 subjects speaking two utterances demonstrated that our framework is able to synthesize speech audio waveforms from weighting maps, outperforming conventional convolution and transformer models.
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
methods: The study proposes a novel loss function called Continuously Weighted Contrastive Loss (CWCL), which employs a continuous measure of similarity to align the embedding space of one modality with that of another.
results: Across multiple models, datasets, and modalities, CWCL outperforms existing methods for zero-shot transfer, particularly in image and speech classification. Specifically, the models achieve 5-8% (absolute) improvement over previous state-of-the-art methods in 0-shot image classification and 20-30% (absolute) improvement in 0-shot speech-to-intent classification and keyword classification.Abstract
This paper considers contrastive training for cross-modal 0-shot transfer wherein a pre-trained model in one modality is used for representation learning in another domain using pairwise data. The learnt models in the latter domain can then be used for a diverse set of tasks in a zero-shot way, similar to ``Contrastive Language-Image Pre-training (CLIP)'' and ``Locked-image Tuning (LiT)'' that have recently gained considerable attention. Most existing works for cross-modal representation alignment (including CLIP and LiT) use the standard contrastive training objective, which employs sets of positive and negative examples to align similar and repel dissimilar training data samples. However, similarity amongst training examples has a more continuous nature, thus calling for a more `non-binary' treatment. To address this, we propose a novel loss function called Continuously Weighted Contrastive Loss (CWCL) that employs a continuous measure of similarity. With CWCL, we seek to align the embedding space of one modality with another. Owing to the continuous nature of similarity in the proposed loss function, these models outperform existing methods for 0-shot transfer across multiple models, datasets and modalities. Particularly, we consider the modality pairs of image-text and speech-text and our models achieve 5-8% (absolute) improvement over previous state-of-the-art methods in 0-shot image classification and 20-30% (absolute) improvement in 0-shot speech-to-intent classification and keyword classification.
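A plausible reading of the loss is sketched below in PyTorch: instead of a single positive per row, every cross-modal pair receives a continuous weight derived from similarities inside the frozen modality's embedding space. The weighting scheme (rescaling to [0, 1] and row-normalizing) is one reasonable choice for the sketch, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cwcl_loss(z_frozen, z_new, temperature=0.07):
    """Continuously weighted contrastive loss (one plausible reading).

    "Nearly similar" samples are pulled together in proportion to a
    continuous weight rather than being repelled as hard negatives.
    """
    a = F.normalize(z_frozen, dim=-1)   # pre-trained modality (e.g., text)
    b = F.normalize(z_new, dim=-1)      # modality being aligned
    # Continuous weights from intra-modal similarity, rescaled to
    # [0, 1] and normalized so each row's weights sum to one.
    w = (a @ a.t() + 1) / 2
    w = w / w.sum(dim=1, keepdim=True)
    logits = b @ a.t() / temperature    # cross-modal similarities
    log_probs = F.log_softmax(logits, dim=1)
    return -(w * log_probs).sum(dim=1).mean()

z_text = torch.randn(16, 128)                       # frozen encoder output
z_speech = torch.randn(16, 128, requires_grad=True)  # new encoder output
loss = cwcl_loss(z_text, z_speech)
loss.backward()
```

Setting the weight matrix to the identity recovers the standard one-positive-per-row contrastive objective, which makes the "continuous" relaxation easy to see.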