cs.AI - 2023-12-04

Stock Movement and Volatility Prediction from Tweets, Macroeconomic Factors and Historical Prices

  • paper_url: http://arxiv.org/abs/2312.03758
  • repo_url: https://github.com/hao1zhao/bigdata23
  • paper_authors: Shengkun Wang, YangXiao Bai, Taoran Ji, Kaiqun Fu, Linhan Wang, Chang-Tien Lu
  • for: Predicting stock market movements to give investors and policymakers a better read on economic health.
  • methods: Leverages social media data, a potent source of public sentiment, in tandem with government-compiled macroeconomic statistics to refine stock market predictions.
  • results: Proposes ECON, a new predictive model with an efficient tweet filter, a self-aware mechanism, and multi-level relationship analysis, yielding better predictions of stock market movement and volatility.
    Abstract Predicting the stock market is vital for investors and policymakers, acting as a barometer of economic health. We leverage social media data, a potent source of public sentiment, in tandem with macroeconomic indicators such as government-compiled statistics, to refine stock market predictions. However, prior research using tweet data for stock market prediction faces three challenges. First, the quality of tweets varies widely. While many are filled with noise and irrelevant details, only a few genuinely mirror the actual market scenario. Second, solely focusing on the historical data of a particular stock without considering its sector can lead to oversight. Stocks within the same industry often exhibit correlated price behaviors. Lastly, simply forecasting the direction of price movement without assessing its magnitude is of limited value, as the extent of the rise or fall truly determines profitability. In this paper, diverging from conventional methods, we pioneer ECON. The framework has the following advantages: First, ECON has an adept tweet filter that efficiently extracts and decodes the vast array of tweet data. Second, ECON discerns multi-level relationships among stocks, sectors, and macroeconomic factors through a self-aware mechanism in semantic space. Third, ECON offers enhanced accuracy in predicting substantial stock price fluctuations by capitalizing on stock price movement. We showcase the state-of-the-art performance of our proposed model using a dataset, specifically curated by us, for predicting stock market movements and volatility.

CityTFT: Temporal Fusion Transformer for Urban Building Energy Modeling

  • paper_url: http://arxiv.org/abs/2312.02375
  • repo_url: None
  • paper_authors: Ting-Yu Dai, Dev Niyogi, Zoltan Nagy
  • for: investigate urban design and energy systems against the increasing energy demand at urban and neighborhood levels
  • methods: data-driven UBEM framework, accurately model the energy demands in urban environments
  • results: predict heating and cooling triggers in unseen climate dynamics with an F1 score of 99.98% and RMSE of loads of 13.57 kWh
    Abstract Urban Building Energy Modeling (UBEM) is an emerging method to investigate urban design and energy systems against the increasing energy demand at urban and neighborhood levels. However, current UBEM methods are mostly physics-based and time-consuming across multiple climate change scenarios. This work proposes CityTFT, a data-driven UBEM framework, to accurately model the energy demands in urban environments. Empowered by the underlying TFT framework and an augmented loss function, CityTFT can predict heating and cooling triggers in unseen climate dynamics with an F1 score of 99.98% and an RMSE of loads of 13.57 kWh.
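The augmented loss is the key training detail here. Below is a minimal sketch of a loss of this kind — a classification term for heating/cooling triggers plus an RMSE term for load magnitude. The masking and weighting scheme is an assumption, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def augmented_ubem_loss(trigger_logits, load_pred, trigger_true, load_true, alpha=1.0):
    # Classification term: did a heating/cooling trigger fire at this timestep?
    cls_loss = F.binary_cross_entropy_with_logits(trigger_logits, trigger_true)
    # Regression term: load magnitude, counted only on active timesteps (assumption).
    mask = trigger_true
    mse = ((load_pred - load_true) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    return cls_loss + alpha * torch.sqrt(mse)

# toy usage over a batch of 32 timesteps
loss = augmented_ubem_loss(
    trigger_logits=torch.randn(32),
    load_pred=torch.rand(32) * 20,
    trigger_true=(torch.rand(32) > 0.5).float(),
    load_true=torch.rand(32) * 20,
)
```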

Towards General Purpose Vision Foundation Models for Medical Image Analysis: An Experimental Study of DINOv2 on Radiology Benchmarks

  • paper_url: http://arxiv.org/abs/2312.02366
  • repo_url: https://github.com/mohammedsb/dinov2formedical
  • paper_authors: Mohammed Baharoon, Waseem Qureshi, Jiahong Ouyang, Yanwu Xu, Kilian Phol, Abdulrhman Aljouie, Wei Peng
  • for: A comprehensive evaluation of the DINOv2 foundation model to assess its adaptability to radiological imaging and its generalization capabilities across diverse modalities.
  • methods: Tests the DINOv2 feature encoder under multiple settings, including kNN, few-shot learning, linear probing, end-to-end fine-tuning, and parameter-efficient fine-tuning, to measure the effectiveness and generalizability of its embeddings.
  • results: DINOv2 achieves competitive results on disease classification tasks and markedly superior performance on segmentation tasks compared with established medical image analysis models.
    Abstract The integration of deep learning systems into the medical domain has been hindered by the resource-intensive process of data annotation and the inability of these systems to generalize to different data distributions. Foundation models, which are models pre-trained on large datasets, have emerged as a solution to reduce reliance on annotated data and enhance model generalizability and robustness. DINOv2, an open-source foundation model pre-trained with self-supervised learning on 142 million curated natural images, excels in extracting general-purpose visual representations, exhibiting promising capabilities across various vision tasks. Nevertheless, a critical question remains unanswered regarding DINOv2's adaptability to radiological imaging, and the clarity on whether its features are sufficiently general to benefit radiology image analysis is yet to be established. Therefore, this study comprehensively evaluates DINOv2 for radiology, conducting over 100 experiments across diverse modalities (X-ray, CT, and MRI). Tasks include disease classification and organ segmentation on both 2D and 3D images, evaluated under different settings like kNN, few-shot learning, linear-probing, end-to-end fine-tuning, and parameter-efficient fine-tuning, to measure the effectiveness and generalizability of the DINOv2 feature embeddings. Comparative analyses with established medical image analysis models, U-Net and TransUnet for segmentation, and CNN and ViT models pre-trained via supervised, weakly supervised, and self-supervised learning for classification, reveal DINOv2's superior performance in segmentation tasks and competitive results in disease classification. The findings contribute insights to potential avenues for optimizing pre-training strategies for medical imaging and enhancing the broader understanding of DINOv2's role in bridging the gap between natural and radiological image analysis.
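Linear probing, one of the evaluation settings above, keeps the DINOv2 backbone frozen and trains only a linear classifier on its embeddings. A minimal sketch using the public DINOv2 release follows; the class count and toy batch are placeholders for a radiology dataset:

```python
import torch
import torch.nn as nn

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # frozen feature extractor

num_classes = 2  # placeholder, e.g. disease present / absent
probe = nn.Linear(384, num_classes)  # ViT-S/14 emits 384-d pooled embeddings
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)

# toy batch standing in for a radiology dataloader
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))

with torch.no_grad():
    feats = backbone(images)  # (8, 384)
loss = nn.functional.cross_entropy(probe(feats), labels)
opt.zero_grad(); loss.backward(); opt.step()
```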

Class-Discriminative Attention Maps for Vision Transformers

  • paper_url: http://arxiv.org/abs/2312.02364
  • repo_url: https://github.com/lenbrocki/cdam
  • paper_authors: Lennart Brocki, Neo Christopher Chung
  • for: Introduces class-discriminative attention maps (CDAM), a novel post-hoc explanation method for the predictions of vision transformer (ViT) models.
  • methods: A ViT trained with self-supervised learning (SSL) is explained by CDAM, which scales attention scores by how relevant the corresponding tokens are to the predictions of a classifier head; CDAM can also explain user-defined concepts by targeting similarity measures in the ViT's latent space.
  • results: CDAM is highly class-discriminative and semantically relevant, provides implicit regularization of relevance scores, and compares favorably with alternative explanation methods such as relevance propagation (RP) and token ablation maps (TAM).
    Abstract Interpretability methods are critical components for examining and exploring deep neural networks (DNN), as well as increasing our understanding of and trust in them. Vision transformers (ViT), which can be trained to state-of-the-art performance with a self-supervised learning (SSL) training method, provide built-in attention maps (AM). While AMs can provide high-quality semantic segmentation of input images, they do not account for any signal coming from a downstream classifier. We introduce class-discriminative attention maps (CDAM), a novel post-hoc explanation method that is highly sensitive to the target class. Our method essentially scales attention scores by how relevant the corresponding tokens are for the predictions of a classifier head. Alternative to classifier outputs, CDAM can also explain a user-defined concept by targeting similarity measures in the latent space of the ViT. This allows for explanations of arbitrary concepts, defined by the user through a few sample images. We investigate the operating characteristics of CDAM in comparison with relevance propagation (RP) and token ablation maps (TAM), an alternative to pixel occlusion methods. CDAM is highly class-discriminative and semantically relevant, while providing implicit regularization of relevance scores. PyTorch implementation: \url{https://github.com/lenbrocki/CDAM} Web live demo: \url{https://cdam.informatism.com/}
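Conceptually, CDAM scales the CLS token's attention over patches by each token's relevance to a chosen class. A rough sketch is below; the gradient-times-input relevance estimator and the variable names are assumptions, and the authors' repository holds the reference implementation:

```python
import torch

def cdam(tokens, attn_cls, classifier, target_class):
    """tokens: (N, D) patch embeddings with requires_grad=True;
    attn_cls: (N,) attention of the CLS token over patches;
    classifier: maps a pooled embedding to class logits."""
    logit = classifier(tokens.mean(dim=0))[target_class]
    grads, = torch.autograd.grad(logit, tokens)   # (N, D) sensitivity per token
    relevance = (grads * tokens).sum(dim=-1)      # gradient-times-input relevance
    return attn_cls * relevance                   # class-discriminative map, (N,)

tokens = torch.randn(196, 384, requires_grad=True)  # 14x14 patches from a ViT
attn_cls = torch.rand(196)
head = torch.nn.Linear(384, 10)
heatmap = cdam(tokens, attn_cls, head, target_class=3)
```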

Peer attention enhances student learning

  • paper_url: http://arxiv.org/abs/2312.02358
  • repo_url: https://github.com/songlinxu/peeredu
  • paper_authors: Songlin Xu, Dongyin Hu, Ru Wang, Xinyu Zhang
  • for: Investigating how students' visual attention while watching online course videos is shaped by social influence.
  • methods: Displays peers' visual attention regions to students as they watch course videos, helping them stay focused.
  • results: Showing peer visual attention regions improves students' focus and engagement, while students retain the adaptability to decide whether to follow peer attention cues. Overall, guided peer attention improves learning experiences and outcomes, informing the design of adaptive online learning interventions that leverage peer attention modelling to optimize student attentiveness and success.
    Abstract Human visual attention is susceptible to social influences. In education, peer effects impact student learning, but their precise role in modulating attention remains unclear. Our experiment (N=311) demonstrates that displaying peer visual attention regions when students watch online course videos enhances their focus and engagement. However, students retain adaptability in following peer attention cues. Overall, guided peer attention improves learning experiences and outcomes. These findings elucidate how peer visual attention shapes students' gaze patterns, deepening understanding of peer influence on learning. They also offer insights into designing adaptive online learning interventions leveraging peer attention modelling to optimize student attentiveness and success.

When is Offline Policy Selection Sample Efficient for Reinforcement Learning?

  • paper_url: http://arxiv.org/abs/2312.02355
  • repo_url: None
  • paper_authors: Vincent Liu, Prabhat Nagarajan, Andrew Patterson, Martha White
  • for: Characterizing the fundamental limits of offline policy selection (OPS) in reinforcement learning, i.e., choosing among candidate policies in the offline setting.
  • methods: Connects OPS to off-policy policy evaluation (OPE) and Bellman error (BE) estimation to clarify when sample-efficient OPS is possible.
  • results: First proves that in the worst case OPS is as hard as OPE, via a reduction of OPE to OPS, so no OPS method can be more sample efficient than OPE in the worst case. Then proposes a BE method for OPS, Identifiable BE Selection (IBES), with a straightforward way of selecting its own hyperparameters. Concludes with an empirical comparison of OPE and IBES and a demonstration of the difficulty of OPS on an offline Atari benchmark dataset.
    Abstract Offline reinforcement learning algorithms often require careful hyperparameter tuning. Consequently, before deployment, we need to select amongst a set of candidate policies. As yet, however, there is little understanding about the fundamental limits of this offline policy selection (OPS) problem. In this work we aim to provide clarity on when sample efficient OPS is possible, primarily by connecting OPS to off-policy policy evaluation (OPE) and Bellman error (BE) estimation. We first show a hardness result, that in the worst case, OPS is just as hard as OPE, by proving a reduction of OPE to OPS. As a result, no OPS method can be more sample efficient than OPE in the worst case. We then propose a BE method for OPS, called Identifiable BE Selection (IBES), that has a straightforward method for selecting its own hyperparameters. We highlight that using IBES for OPS generally has more requirements than OPE methods, but if satisfied, can be more sample efficient. We conclude with an empirical study comparing OPE and IBES, and by showing the difficulty of OPS on an offline Atari benchmark dataset.
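To make the BE-based selection idea concrete, the sketch below ranks candidate value functions by a naive empirical squared TD error. Note this single-sample estimator is biased (the double-sampling problem); IBES is the paper's more careful, identifiable estimator, which this sketch does not reproduce:

```python
import numpy as np

NUM_ACTIONS = 4  # placeholder discrete action space

def td_error_score(q, dataset, gamma=0.99):
    """q: callable (state, action) -> value; dataset: (s, a, r, s_next, done) tuples."""
    errs = []
    for s, a, r, s_next, done in dataset:
        target = r if done else r + gamma * max(q(s_next, b) for b in range(NUM_ACTIONS))
        errs.append((q(s, a) - target) ** 2)
    return float(np.mean(errs))

# toy data and candidate: states are ints, q is a dict lookup defaulting to 0
data = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 2, True)]
q = lambda s, a: {(0, 1): 0.9, (1, 0): 0.1}.get((s, a), 0.0)
print(td_error_score(q, data))  # lower scores rank a candidate higher
```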

Working Backwards: Learning to Place by Picking

  • paper_url: http://arxiv.org/abs/2312.02352
  • repo_url: None
  • paper_authors: Oliver Limoyo, Abhisek Konar, Trevor Ablett, Jonathan Kelly, Francois R. Hogan, Gregory Dudek
  • for: autonomously collecting demonstrations for a family of placing tasks
  • methods: using a combination of tactile sensing and compliant control for grasps, and training a policy directly from visual observations through behavior cloning
  • results: outperforming policies trained with kinesthetic teaching in terms of performance and data efficiency, while requiring no human supervision.
    Abstract We present Learning to Place by Picking (LPP), a method capable of autonomously collecting demonstrations for a family of placing tasks in which objects must be manipulated to specific locations. With LPP, we approach the learning of robotic object placement policies by reversing the grasping process and exploiting the inherent symmetry of the pick and place problems. Specifically, we obtain placing demonstrations from a set of grasp sequences of objects that are initially located at their target placement locations. Our system is capable of collecting hundreds of demonstrations without human intervention by using a combination of tactile sensing and compliant control for grasps. We train a policy directly from visual observations through behaviour cloning, using the autonomously-collected demonstrations. By doing so, the policy can generalize to object placement scenarios outside of the training environment without privileged information (e.g., placing a plate picked up from a table and not at the original placement location). We validate our approach on home robotic scenarios that include dishwasher loading and table setting. Our approach yields robotic placing policies that outperform policies trained with kinesthetic teaching, both in terms of performance and data efficiency, while requiring no human supervision.

Expressive Sign Equivariant Networks for Spectral Geometric Learning

  • paper_url: http://arxiv.org/abs/2312.02339
  • repo_url: https://github.com/cptq/sign-equivariant-nets
  • paper_authors: Derek Lim, Joshua Robinson, Stefanie Jegelka, Haggai Maron
  • for: Sign equivariant machine learning models for spectral geometric learning tasks such as building orthogonally equivariant models and learning node positional encodings for link prediction in graphs.
  • methods: Novel sign equivariant neural network architectures based on a new analytic characterization of sign equivariant polynomials, which inherit provable expressiveness properties.
  • results: Controlled synthetic experiments show that the networks achieve the theoretically predicted benefits of sign equivariant models.
    Abstract Recent work has shown the utility of developing machine learning models that respect the structure and symmetries of eigenvectors. These works promote sign invariance, since for any eigenvector v the negation -v is also an eigenvector. However, we show that sign invariance is theoretically limited for tasks such as building orthogonally equivariant models and learning node positional encodings for link prediction in graphs. In this work, we demonstrate the benefits of sign equivariance for these tasks. To obtain these benefits, we develop novel sign equivariant neural network architectures. Our models are based on a new analytic characterization of sign equivariant polynomials and thus inherit provable expressiveness properties. Controlled synthetic experiments show that our networks can achieve the theoretically predicted benefits of sign equivariant models. Code is available at https://github.com/cptq/Sign-Equivariant-Nets.
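Sign equivariance means f(-v) = -f(v) for any eigenvector input v. One simple way to guarantee the property — an illustration, not the paper's polynomial-based architecture — is to compose bias-free linear layers with odd activations:

```python
import torch
import torch.nn as nn

class OddMLP(nn.Module):
    """Bias-free linear layers composed with tanh give an odd map: f(-x) = -f(x)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden, bias=False), nn.Tanh(),
            nn.Linear(hidden, dim, bias=False),
        )
    def forward(self, x):
        return self.net(x)

f = OddMLP(dim=8)
v = torch.randn(8)
assert torch.allclose(f(v), -f(-v), atol=1e-6)  # sign equivariance holds
```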

A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A Study with Unified Text-to-Image Fidelity Metrics

  • paper_url: http://arxiv.org/abs/2312.02338
  • repo_url: https://github.com/zhuxiangru/winoground-t2i
  • paper_authors: Xiangru Zhu, Penglei Sun, Chengyu Wang, Jingping Liu, Zhixu Li, Yanghua Xiao, Jun Huang
  • for: Evaluating the compositionality of text-to-image (T2I) synthesis models, i.e., their ability to create new combinations from known components.
  • methods: Introduces Winoground-T2I, a benchmark of 11K complex, high-quality contrastive sentence pairs spanning 20 categories; the subtle differences within each pair enable fine-grained evaluation of T2I models. Also proposes a strategy that uses these contrastive pairs to assess the reliability of different evaluation metrics.
  • results: Using Winoground-T2I with a dual objective, evaluates both T2I models and the metrics used for their evaluation, revealing weaknesses of current T2I models on challenging compositional categories and offering insights into the strengths and weaknesses of the metrics.
    Abstract Text-to-image (T2I) synthesis has recently achieved significant advancements. However, challenges remain in the model's compositionality, which is the ability to create new combinations from known components. We introduce Winoground-T2I, a benchmark designed to evaluate the compositionality of T2I models. This benchmark includes 11K complex, high-quality contrastive sentence pairs spanning 20 categories. These contrastive sentence pairs with subtle differences enable fine-grained evaluations of T2I synthesis models. Additionally, to address the inconsistency across different metrics, we propose a strategy that evaluates the reliability of various metrics by using comparative sentence pairs. We use Winoground-T2I with a dual objective: to evaluate the performance of T2I models and the metrics used for their evaluation. Finally, we provide insights into the strengths and weaknesses of these metrics and the capabilities of current T2I models in tackling challenges across a range of complex compositional categories. Our benchmark is publicly available at https://github.com/zhuxiangru/Winoground-T2I .

An Evaluation Framework for Mapping News Headlines to Event Classes in a Knowledge Graph

  • paper_url: http://arxiv.org/abs/2312.02334
  • repo_url: https://github.com/mbouadeus/news-headline-event-linking
  • paper_authors: Steve Fonin Mbouadeu, Martin Lorenzo, Ken Barker, Oktie Hassanzadeh
  • for: Creating a benchmark dataset of news headlines mapped to event classes in Wikidata, along with resources for evaluating methods that perform the mapping.
  • methods: Studies two classes of unsupervised methods: adaptations of classic entity linking methods, and methods that treat the problem as zero-shot text classification using pre-trained natural language inference (NLI) models or large generative language models.
  • results: Presents an evaluation of off-the-shelf entity linking systems and the zero-shot approaches on the new dataset, along with lessons learned and directions for future work; the dataset and evaluation scripts are publicly available.
    Abstract Mapping ongoing news headlines to event-related classes in a rich knowledge base can be an important component in a knowledge-based event analysis and forecasting solution. In this paper, we present a methodology for creating a benchmark dataset of news headlines mapped to event classes in Wikidata, and resources for the evaluation of methods that perform the mapping. We use the dataset to study two classes of unsupervised methods for this task: 1) adaptations of classic entity linking methods, and 2) methods that treat the problem as a zero-shot text classification problem. For the first approach, we evaluate off-the-shelf entity linking systems. For the second approach, we explore a) pre-trained natural language inference (NLI) models, and b) pre-trained large generative language models. We present the results of our evaluation, lessons learned, and directions for future work. The dataset and scripts for evaluation are made publicly available.
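The zero-shot classification approach can be sketched with an off-the-shelf NLI model; the checkpoint and the candidate event classes below are placeholders:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
event_classes = ["election", "earthquake", "corporate merger", "armed conflict"]

result = classifier(
    "Tech giants announce plan to combine cloud divisions",
    candidate_labels=event_classes,
)
print(result["labels"][0])  # highest-scoring event class, e.g. "corporate merger"
```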

InstructBooth: Instruction-following Personalized Text-to-Image Generation

  • paper_url: http://arxiv.org/abs/2312.03011
  • repo_url: None
  • paper_authors: Daewon Chae, Nokyung Park, Jinkyu Kim, Kimin Lee
  • for: Improving image-text alignment in personalized text-to-image models so that they better follow user prompts.
  • methods: A new method, InstructBooth, that first personalizes a text-to-image model with a small number of subject-specific images using a unique identifier, then fine-tunes it with reinforcement learning to maximize a reward that quantifies image-text alignment.
  • results: InstructBooth shows superior image-text alignment over baselines while maintaining personalization ability, and outperforms DreamBooth in human evaluations when all comprehensive factors are considered.
    Abstract Personalizing text-to-image models using a limited set of images for a specific object has been explored in subject-specific image generation. However, existing methods often encounter challenges in aligning with text prompts due to overfitting to the limited training images. In this work, we introduce InstructBooth, a novel method designed to enhance image-text alignment in personalized text-to-image models. Our approach first personalizes text-to-image models with a small number of subject-specific images using a unique identifier. After personalization, we fine-tune personalized text-to-image models using reinforcement learning to maximize a reward that quantifies image-text alignment. Additionally, we propose complementary techniques to increase the synergy between these two processes. Our method demonstrates superior image-text alignment compared to baselines while maintaining personalization ability. In human evaluations, InstructBooth outperforms DreamBooth when considering all comprehensive factors.

GNN2R: Weakly-Supervised Rationale-Providing Question Answering over Knowledge Graphs

  • paper_url: http://arxiv.org/abs/2312.02317
  • repo_url: https://github.com/ruijie-wang-uzh/gnn2r
  • paper_authors: Ruijie Wang, Luca Rossetto, Michael Cochez, Abraham Bernstein
  • for: Explanation-providing multi-hop question answering over knowledge graphs, useful in real-world scenarios where users need to understand the reasoning behind the answers.
  • methods: The proposed Graph Neural Network-based Two-Step Reasoning model (GNN2R) uses only weak supervision from question-final answer pairs to efficiently generate both final answers and reasoning subgraphs.
  • results: GNN2R outperforms existing state-of-the-art methods in terms of effectiveness, efficiency, and quality of generated explanations.
    Abstract Most current methods for multi-hop question answering (QA) over knowledge graphs (KGs) only provide final conclusive answers without explanations, such as a set of KG entities that is difficult for normal users to review and comprehend. This issue severely limits the application of KG-based QA in real-world scenarios. However, it is non-trivial to solve due to two challenges: First, annotations of reasoning chains of multi-hop questions, which could serve as supervision for explanation generation, are usually lacking. Second, it is difficult to maintain high efficiency when explicit KG triples need to be retrieved to generate explanations. In this paper, we propose a novel Graph Neural Network-based Two-Step Reasoning model (GNN2R) to solve this issue. GNN2R can provide both final answers and reasoning subgraphs as a rationale behind final answers efficiently with only weak supervision that is available through question-final answer pairs. We extensively evaluated GNN2R with detailed analyses in experiments. The results demonstrate that, in terms of effectiveness, efficiency, and quality of generated explanations, GNN2R outperforms existing state-of-the-art methods that are applicable to this task. Our code and pre-trained models are available at https://github.com/ruijie-wang-uzh/GNN2R.

Fine-tuning pre-trained extractive QA models for clinical document parsing

  • paper_url: http://arxiv.org/abs/2312.02314
  • repo_url: None
  • paper_authors: Ashwyn Sharma, David I. Feldman, Aneesh Jain
  • for: Automating the parsing of echocardiogram reports in electronic health records (EHRs) to verify clinical markers such as EF (Ejection Fraction), so that Heart Failure (HF) patients eligible for a remote patient monitoring (RPM) program can be identified.
  • methods: A pre-trained extractive question answering (QA) transformer fine-tuned on custom-labeled data, with the deployment pipeline illustrated through experiments on the public MIMIC-IV-Note clinical dataset.
  • results: The system automated the task at scale, saving clinicians over 1,500 hours across 12 months.
    Abstract Electronic health records (EHRs) contain a vast amount of high-dimensional multi-modal data that can accurately represent a patient's medical history. Unfortunately, most of this data is either unstructured or semi-structured, rendering it unsuitable for real-time and retrospective analyses. A remote patient monitoring (RPM) program for Heart Failure (HF) patients needs to have access to clinical markers like EF (Ejection Fraction) or LVEF (Left Ventricular Ejection Fraction) in order to ascertain eligibility and appropriateness for the program. This paper explains a system that can parse echocardiogram reports and verify EF values. This system helps identify eligible HF patients who can be enrolled in such a program. At the heart of this system is a pre-trained extractive QA transformer model that is fine-tuned on custom-labeled data. The methods used to prepare such a model for deployment are illustrated by running experiments on a public clinical dataset like MIMIC-IV-Note. The pipeline can be used to generalize solutions to similar problems in a low-resource setting. We found that the system saved over 1500 hours for our clinicians over 12 months by automating the task at scale.
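The core extraction step amounts to asking an extractive QA model for the EF value inside a report. A hedged sketch with an off-the-shelf checkpoint follows; the paper instead fine-tunes its model on custom-labeled clinical notes:

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
report = "Normal LV size and wall thickness. Left ventricular ejection fraction is 55-60%."
answer = qa(question="What is the ejection fraction?", context=report)
print(answer["answer"], answer["score"])  # extracted span plus confidence, e.g. "55-60%"
```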

Visual Encoders for Data-Efficient Imitation Learning in Modern Video Games

  • paper_url: http://arxiv.org/abs/2312.02312
  • repo_url: None
  • paper_authors: Lukas Schäfer, Logan Jones, Anssi Kanervisto, Yuhan Cao, Tabish Rashid, Raluca Georgescu, Dave Bignell, Siddhartha Sen, Andrea Treviño Gavito, Sam Devlin
  • for: Studying the decision-making of game-playing AI agents in modern video games.
  • methods: Imitation learning with publicly available pre-trained visual encoders, compared against the typical task-specific, end-to-end training approach.
  • results: A systematic comparison in Minecraft, Minecraft Dungeons, and Counter-Strike: Global Offensive, probing whether publicly available visual encoders learn representations that retain the information critical for sequential decision making.
    Abstract Video games have served as useful benchmarks for the decision making community, but going beyond Atari games towards training agents in modern games has been prohibitively expensive for the vast majority of the research community. Recent progress in the research, development and open release of large vision models has the potential to amortize some of these costs across the community. However, it is currently unclear which of these models have learnt representations that retain information critical for sequential decision making. Towards enabling wider participation in the research of gameplaying agents in modern games, we present a systematic study of imitation learning with publicly available visual encoders compared to the typical, task-specific, end-to-end training approach in Minecraft, Minecraft Dungeons and Counter-Strike: Global Offensive.
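The study's central comparison — behavior cloning on top of a frozen, publicly available encoder versus end-to-end training — can be sketched as follows; the ResNet-18 encoder and action count are stand-ins for the encoders and games actually evaluated:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

encoder = resnet18(weights="IMAGENET1K_V1")
encoder.fc = nn.Identity()  # expose 512-d features instead of ImageNet logits
for p in encoder.parameters():
    p.requires_grad = False  # frozen, publicly available encoder

num_actions = 12  # placeholder game action space
policy_head = nn.Linear(512, num_actions)

frames = torch.randn(4, 3, 224, 224)           # toy batch of game frames
actions = torch.randint(0, num_actions, (4,))  # demonstrator actions
logits = policy_head(encoder(frames))
loss = nn.functional.cross_entropy(logits, actions)  # behavior cloning objective
```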

VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding

  • paper_url: http://arxiv.org/abs/2312.02310
  • repo_url: None
  • paper_authors: Yizhou Wang, Ruiyi Zhang, Haoliang Wang, Uttaran Bhattacharya, Yun Fu, Gang Wu
  • for: Enhancing LLM-assisted video understanding by refining the synergy between video and textual information through a new framework, VaQuitA.
  • methods: At the data level, frames are sampled according to CLIP-score rankings rather than uniformly; at the feature level, a trainable Video Perceiver and a Visual-Query Transformer (VQ-Former) strengthen the interplay between the input question and the video features. Adding the simple prompt "Please be critical" to the LLM input further boosts video comprehension.
  • results: VaQuitA sets a new benchmark for zero-shot video question-answering and produces high-quality, multi-turn video dialogues with users.
    Abstract Recent advancements in language-model-based video understanding have been progressing at a remarkable pace, spurred by the introduction of Large Language Models (LLMs). However, the focus of prior research has been predominantly on devising a projection layer that maps video features to tokens, an approach that is both rudimentary and inefficient. In our study, we introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings, which enables a more aligned selection of frames with the given question. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer (abbreviated as VQ-Former), which bolsters the interplay between the input question and the video features. We also discover that incorporating a simple prompt, "Please be critical", into the LLM input can substantially enhance its video comprehension capabilities. Our experimental results indicate that VaQuitA consistently sets a new benchmark for zero-shot video question-answering tasks and is adept at producing high-quality, multi-turn video dialogues with users.
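The data-level component, CLIP-score-guided frame selection, can be sketched by scoring every candidate frame against the question and keeping the top-k; the public CLIP checkpoint is an example, and the paper's sampling policy may differ in detail:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_frames(frames, question, k=8):
    """frames: list of PIL images; returns indices of the k best-matching frames."""
    inputs = proc(text=[question], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # (num_frames,)
    return scores.topk(min(k, len(frames))).indices.tolist()
```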

Training Reinforcement Learning Agents and Humans With Difficulty-Conditioned Generators

  • paper_url: http://arxiv.org/abs/2312.02309
  • repo_url: None
  • paper_authors: Sidney Tio, Jimmy Ho, Pradeep Varakantham
  • for: Training both reinforcement learning (RL) agents and human learners in parameterized environments.
  • methods: The Parameterized Environment Response Model (PERM), inspired by Item Response Theory (IRT), directly models environment difficulty and individual ability to create a Zone of Proximal Development-based curriculum; it operates without real-time RL updates and supports offline training.
  • results: A two-stage training process built on PERM's adaptability proves effective for training RL agents and humans in an empirical study.
    Abstract We adapt Parameterized Environment Response Model (PERM), a method for training both Reinforcement Learning (RL) Agents and human learners in parameterized environments by directly modeling difficulty and ability. Inspired by Item Response Theory (IRT), PERM aligns environment difficulty with individual ability, creating a Zone of Proximal Development-based curriculum. Remarkably, PERM operates without real-time RL updates and allows for offline training, ensuring its adaptability across diverse students. We present a two-stage training process that capitalizes on PERM's adaptability, and demonstrate its effectiveness in training RL agents and humans in an empirical study.
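In its simplest (Rasch) form, IRT models the probability that a learner of ability theta succeeds on an item of difficulty b as sigmoid(theta - b). The sketch below uses that reduction to pick the next difficulty near a target success rate — an illustrative ZPD-style rule, not PERM's full parameterized model:

```python
import numpy as np

def p_success(theta, b):
    """Rasch model: ability theta, difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def next_difficulty(theta, candidates, target=0.7):
    """Pick the difficulty whose predicted success rate is closest to the
    target rate -- a ZPD-style selection rule (assumption)."""
    return min(candidates, key=lambda b: abs(p_success(theta, b) - target))

print(next_difficulty(theta=1.2, candidates=np.linspace(-3.0, 3.0, 25)))
```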

AdsorbRL: Deep Multi-Objective Reinforcement Learning for Inverse Catalysts Design

  • paper_url: http://arxiv.org/abs/2312.02308
  • repo_url: https://github.com/rlacombe/adsorbrl
  • paper_authors: Romain Lacombe, Lucas Hendren, Khalid El-Awady
  • for: Deep-learning-driven inverse design of catalysts for low-emissions technologies.
  • methods: A Deep Reinforcement Learning (DRL) agent trained offline on the Open Catalyst 2020 and Materials Project datasets, using Deep Q-Network (DQN) agents with very sparse adsorption-energy rewards over the space of ~160,000 candidate compounds.
  • results: Random Edge Traversal over the known-states subgraph strengthens target binding energy by an average of 4.1 eV, and Objective Sub-Sampling in the multi-objective setting improves adsorption energy across all target adsorbates by an average of 0.8 eV, suggesting strong potential for DRL applied to inverse catalyst design.
    Abstract A central challenge of the clean energy transition is the development of catalysts for low-emissions technologies. Recent advances in Machine Learning for quantum chemistry drastically accelerate the computation of catalytic activity descriptors such as adsorption energies. Here we introduce AdsorbRL, a Deep Reinforcement Learning agent aiming to identify potential catalysts given a multi-objective binding energy target, trained using offline learning on the Open Catalyst 2020 and Materials Project data sets. We experiment with Deep Q-Network agents to traverse the space of all ~160,000 possible unary, binary and ternary compounds of 55 chemical elements, with very sparse rewards based on adsorption energy known for only between 2,000 and 3,000 catalysts per adsorbate. To constrain the actions space, we introduce Random Edge Traversal and train a single-objective DQN agent on the known states subgraph, which we find strengthens target binding energy by an average of 4.1 eV. We extend this approach to multi-objective, goal-conditioned learning, and train a DQN agent to identify materials with the highest (respectively lowest) adsorption energies for multiple simultaneous target adsorbates. We experiment with Objective Sub-Sampling, a novel training scheme aimed at encouraging exploration in the multi-objective setup, and demonstrate simultaneous adsorption energy improvement across all target adsorbates, by an average of 0.8 eV. Overall, our results suggest strong potential for Deep Reinforcement Learning applied to the inverse catalysts design problem.

LineConGraphs: Line Conversation Graphs for Effective Emotion Recognition using Graph Neural Networks

  • paper_url: http://arxiv.org/abs/2312.03756
  • repo_url: None
  • paper_authors: Gokul S Krishnan, Sarala Padi, Craig S. Greenberg, Balaraman Ravindran, Dinesh Manoch, Ram D. Sriram
  • for: A new approach to Emotion Recognition in Conversations (ERC), with practical applications in healthcare, education, chatbots, and social media platforms.
  • methods: Speaker-independent line conversation graphs (LineConGraphs) that use only short-term context (one preceding and one succeeding utterance), combined with graph convolutional (LineConGCN) and graph attention (LineConGAT) models.
  • results: On the IEMOCAP and MELD benchmarks, the LineConGAT model outperforms state-of-the-art methods with F1 scores of 64.58% and 76.50%; embedding sentiment shift information into the graphs further improves ERC performance for the GCN models.
    Abstract Emotion Recognition in Conversations (ERC) is a critical aspect of affective computing, and it has many practical applications in healthcare, education, chatbots, and social media platforms. Earlier approaches for ERC analysis involved modeling both speaker and long-term contextual information using graph neural network architectures. However, it is ideal to deploy speaker-independent models for real-world applications. Additionally, long context windows can potentially create confusion in recognizing the emotion of an utterance in a conversation. To overcome these limitations, we propose novel line conversation graph convolutional network (LineConGCN) and graph attention (LineConGAT) models for ERC analysis. These models are speaker-independent and built using a graph construction strategy for conversations -- line conversation graphs (LineConGraphs). The conversational context in LineConGraphs is short-term -- limited to one previous and future utterance, and speaker information is not part of the graph. We evaluate the performance of our proposed models on two benchmark datasets, IEMOCAP and MELD, and show that our LineConGAT model outperforms the state-of-the-art methods with an F1-score of 64.58% and 76.50%. Moreover, we demonstrate that embedding sentiment shift information into line conversation graphs further enhances the ERC performance in the case of GCN models.
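A line conversation graph is simply a speaker-free chain over utterances, each node linked to its immediate predecessor and successor. A sketch in PyTorch Geometric's edge_index convention (an implementation choice, not necessarily the authors') follows:

```python
import torch

def line_conversation_graph(num_utterances):
    """Undirected chain over utterances: node i connects only to i-1 and i+1."""
    src = torch.arange(num_utterances - 1)
    dst = src + 1
    edge_index = torch.stack([torch.cat([src, dst]), torch.cat([dst, src])])
    return edge_index  # shape (2, 2 * (num_utterances - 1))

print(line_conversation_graph(4))
# tensor([[0, 1, 2, 1, 2, 3],
#         [1, 2, 3, 0, 1, 2]])
```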

LLMs Accelerate Annotation for Medical Information Extraction

  • paper_url: http://arxiv.org/abs/2312.02296
  • repo_url: None
  • paper_authors: Akshay Goel, Almog Gueta, Omry Gilon, Chang Liu, Sofia Erell, Lan Huong Nguyen, Xiaohong Hao, Bolous Jaber, Shashir Reddy, Rupesh Kartha, Jean Steiner, Itay Laish, Amir Feder
  • for: Improving the extraction and interpretation of unstructured text in electronic health records.
  • methods: Combines Large Language Models (LLMs) with human expertise to efficiently generate ground truth labels for medical text annotation.
  • results: A rigorous evaluation on a medical information extraction task shows the approach substantially reduces human annotation effort while maintaining high accuracy.
    Abstract The unstructured nature of clinical notes within electronic health records often conceals vital patient-related information, making it challenging to access or interpret. To uncover this hidden information, specialized Natural Language Processing (NLP) models are required. However, training these models necessitates large amounts of labeled data, a process that is both time-consuming and costly when relying solely on human experts for annotation. In this paper, we propose an approach that combines Large Language Models (LLMs) with human expertise to create an efficient method for generating ground truth labels for medical text annotation. By utilizing LLMs in conjunction with human annotators, we significantly reduce the human annotation burden, enabling the rapid creation of labeled datasets. We rigorously evaluate our method on a medical information extraction task, demonstrating that our approach not only substantially cuts down on human intervention but also maintains high accuracy. The results highlight the potential of using LLMs to improve the utilization of unstructured clinical data, allowing for the swift deployment of tailored NLP solutions in healthcare.
    摘要 电子健康记录中的临床笔记具有不结构化的特点,常常隐藏着重要的病人信息,从而使得访问或解释困难。为了抽取这些隐藏的信息,需要特殊的自然语言处理(NLP)模型。然而,训练这些模型需要大量的标注数据,这是一项耗时和成本巨大的过程,当 solely 依靠人类专家进行标注时。在这篇论文中,我们提出一种方法,该方法结合大型自然语言模型(LLM)和人类专家知识来生成医疗文本标注的基准数据。通过在人类标注人员和LLM之间进行协同工作,我们可以减少人类标注劳动,并使得医疗NLP解决方案的速速投入。我们严格地评估了我们的方法在医疗信息抽取任务中的性能,结果表明,我们的方法不仅可以减少人类干预,同时也可以保持高度准确。这些结果表明,使用LLM可以改善医疗数据的利用效率,并允许快速部署适应性强的NLP解决方案。

I-PHYRE: Interactive Physical Reasoning

  • paper_url: http://arxiv.org/abs/2312.03009
  • repo_url: None
  • paper_authors: Shiqian Li, Kewen Wu, Chi Zhang, Yixin Zhu
  • for: Evaluating agents' physical reasoning abilities, particularly in interacting with dynamic events.
  • methods: Introduces I-PHYRE, a framework that requires agents to simultaneously exhibit intuitive physical reasoning, multi-step planning, and in-situ intervention, with four game splits probing learning and generalization.
  • results: Finds a notable gap between existing learning algorithms and human performance, underscoring the need for further research on agents with interactive physical reasoning capabilities.
    Abstract Current evaluation protocols predominantly assess physical reasoning in stationary scenes, creating a gap in evaluating agents' abilities to interact with dynamic events. While contemporary methods allow agents to modify initial scene configurations and observe consequences, they lack the capability to interact with events in real time. To address this, we introduce I-PHYRE, a framework that challenges agents to simultaneously exhibit intuitive physical reasoning, multi-step planning, and in-situ intervention. Here, intuitive physical reasoning refers to a quick, approximate understanding of physics to address complex problems; multi-step denotes the need for extensive sequence planning in I-PHYRE, considering each intervention can significantly alter subsequent choices; and in-situ implies the necessity for timely object manipulation within a scene, where minor timing deviations can result in task failure. We formulate four game splits to scrutinize agents' learning and generalization of essential principles of interactive physical reasoning, fostering learning through interaction with representative scenarios. Our exploration involves three planning strategies and examines several supervised and reinforcement agents' zero-shot generalization proficiency on I-PHYRE. The outcomes highlight a notable gap between existing learning algorithms and human performance, emphasizing the imperative for more research in enhancing agents with interactive physical reasoning capabilities. The environment and baselines will be made publicly available.

PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness

  • paper_url: http://arxiv.org/abs/2312.02158
  • repo_url: https://github.com/astra-vision/PaSCo
  • paper_authors: Anh-Quan Cao, Angela Dai, Raoul de Charette
  • for: Proposes the Panoptic Scene Completion (PSC) task, which extends the recently popular Semantic Scene Completion (SSC) task with instance-level information for a richer understanding of 3D scenes.
  • methods: A hybrid mask-based technique on the non-empty voxels of sparse multi-scale completions, with an efficient multi-input multi-output (MIMO) ensembling strategy for voxel-wise and instance-wise uncertainty estimation.
  • results: Surpasses all baselines in both Panoptic Scene Completion and uncertainty estimation on three large-scale autonomous driving datasets.
    Abstract We propose the task of Panoptic Scene Completion (PSC) which extends the recently popular Semantic Scene Completion (SSC) task with instance-level information to produce a richer understanding of the 3D scene. Our PSC proposal utilizes a hybrid mask-based technique on the non-empty voxels from sparse multi-scale completions. Whereas the SSC literature overlooks uncertainty which is critical for robotics applications, we instead propose an efficient ensembling to estimate both voxel-wise and instance-wise uncertainties along PSC. This is achieved by building on a multi-input multi-output (MIMO) strategy, while improving performance and yielding better uncertainty for little additional compute. Additionally, we introduce a technique to aggregate permutation-invariant mask predictions. Our experiments demonstrate that our method surpasses all baselines in both Panoptic Scene Completion and uncertainty estimation on three large-scale autonomous driving datasets. Our code and data are available at https://astra-vision.github.io/PaSCo .
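The uncertainty side can be illustrated generically: with MIMO-style ensembling, several heads' predictions are averaged and their disagreement serves as the uncertainty map. The sketch below uses predictive entropy over voxel class probabilities; PaSCo's actual architecture differs:

```python
import torch

def mimo_voxel_uncertainty(logits_list):
    """logits_list: list of (V, C) voxel logits from M ensemble heads."""
    probs = torch.stack([l.softmax(dim=-1) for l in logits_list]).mean(dim=0)
    entropy = -(probs * probs.clamp(min=1e-8).log()).sum(dim=-1)  # (V,)
    return probs.argmax(dim=-1), entropy  # semantic prediction + uncertainty

heads = [torch.randn(1000, 5) for _ in range(3)]  # 3 heads, 1000 voxels, 5 classes
pred, uncertainty = mimo_voxel_uncertainty(heads)
```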

Latent Feature-Guided Diffusion Models for Shadow Removal

  • paper_url: http://arxiv.org/abs/2312.02156
  • repo_url: None
  • paper_authors: Kangfu Mei, Luis Figueroa, Zhe Lin, Zhihong Ding, Scott Cohen, Vishal M. Patel
  • for: Improving the accuracy of texture restoration under shadows.
  • methods: Diffusion models conditioned on a learned latent feature space that inherits the characteristics of shadow-free images, avoiding the limitation of conventional methods that condition only on degraded images; noise features are fused with the diffusion network to alleviate potential local optima during training.
  • results: Outperforms the previous best method by 13% in RMSE on the AISTD dataset and by 82% in RMSE for instance-level shadow removal on the DESOBA dataset.
    Abstract Recovering textures under shadows has remained a challenging problem due to the difficulty of inferring shadow-free scenes from shadow images. In this paper, we propose the use of diffusion models as they offer a promising approach to gradually refine the details of shadow regions during the diffusion process. Our method improves this process by conditioning on a learned latent feature space that inherits the characteristics of shadow-free images, thus avoiding the limitation of conventional methods that condition on degraded images only. Additionally, we propose to alleviate potential local optima during training by fusing noise features with the diffusion network. We demonstrate the effectiveness of our approach which outperforms the previous best method by 13% in terms of RMSE on the AISTD dataset. Further, we explore instance-level shadow removal, where our model outperforms the previous best method by 82% in terms of RMSE on the DESOBA dataset.

Guarding Barlow Twins Against Overfitting with Mixed Samples

  • paper_url: http://arxiv.org/abs/2312.02151
  • repo_url: https://github.com/wgcban/mix-bt
  • paper_authors: Wele Gedara Chaminda Bandara, Celso M. De Melo, Vishal M. Patel
  • for: Improving self-supervised learning (SSL) with the Barlow Twins algorithm by addressing feature overfitting through a new method, Mixed Barlow Twins.
  • methods: Pre-trains a network with the Barlow Twins objective and adds a regularization term, computed from linearly interpolated (mixed) samples, to improve sample interaction during training.
  • results: Mitigates feature overfitting and improves downstream performance over the original Barlow Twins on CIFAR-10, CIFAR-100, TinyImageNet, STL-10, and ImageNet.
    Abstract Self-supervised Learning (SSL) aims to learn transferable feature representations for downstream applications without relying on labeled data. The Barlow Twins algorithm, renowned for its widespread adoption and straightforward implementation compared to its counterparts like contrastive learning methods, minimizes feature redundancy while maximizing invariance to common corruptions. Optimizing for the above objective forces the network to learn useful representations, while avoiding noisy or constant features, resulting in improved downstream task performance with limited adaptation. Despite Barlow Twins' proven effectiveness in pre-training, the underlying SSL objective can inadvertently cause feature overfitting due to the lack of strong interaction between the samples unlike the contrastive learning approaches. From our experiments, we observe that optimizing for the Barlow Twins objective doesn't necessarily guarantee sustained improvements in representation quality beyond a certain pre-training phase, and can potentially degrade downstream performance on some datasets. To address this challenge, we introduce Mixed Barlow Twins, which aims to improve sample interaction during Barlow Twins training via linearly interpolated samples. This results in an additional regularization term to the original Barlow Twins objective, assuming linear interpolation in the input space translates to linearly interpolated features in the feature space. Pre-training with this regularization effectively mitigates feature overfitting and further enhances the downstream performance on CIFAR-10, CIFAR-100, TinyImageNet, STL-10, and ImageNet datasets. The code and checkpoints are available at: https://github.com/wgcban/mix-bt.git
    摘要 然而,SSL 目标可能会导致特征过拟合,因为样本之间的强相互作用不够,与对比学习方法不同。我们的实验表明,仅仅优化 Barlow Twins 目标函数不能保证持续改善表示质量,可能会在某些数据集上降低下游性能。为解决这个挑战,我们引入混合巴罗姐妹(Mixed Barlow Twins),通过在 Barlow Twins 训练中 linearly interpolated samples 来提高样本之间的交互。这会增加一个额外的正则化项到原始 Barlow Twins 目标函数中,假设输入空间中的线性插值翻译为特征空间中的线性插值。在这种情况下,我们发现预训练时使用这种正则化可以有效避免特征过拟合,并进一步提高 CIFAR-10、CIFAR-100、TinyImageNet、STL-10 和 ImageNet 等数据集上的下游性能。我们的代码和检查点可以在 GitHub 上找到:https://github.com/wgcban/mix-bt.git。

Generative Powers of Ten

  • paper_url: http://arxiv.org/abs/2312.02149
  • repo_url: None
  • paper_authors: Xiaojuan Wang, Janne Kontkanen, Brian Curless, Steve Seitz, Ira Kemelmacher, Ben Mildenhall, Pratul Srinivasan, Dor Verbin, Aleksander Holynski
  • for: Generating consistent content across multiple image scales, enabling extreme semantic zooms into a scene.
  • methods: A text-to-image model with a joint multi-scale diffusion sampling approach that encourages consistency across scales while preserving the integrity of each individual sampling process; each generated scale is guided by a different text prompt.
  • results: Qualitative comparisons show the method is more effective at generating consistent multi-scale content than traditional super-resolution and outpainting techniques.
    Abstract We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt, our method enables deeper levels of zoom than traditional super-resolution methods that may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting, and show that our method is most effective at generating consistent multi-scale content.
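The cross-scale constraint can be written down directly: the next zoom level should match the central crop of the current one after resampling. A sketch with an assumed 2x zoom factor between levels:

```python
import torch
import torch.nn.functional as F

def zoom_consistency_loss(img_coarse, img_fine, zoom=2):
    """img_*: (1, C, H, W); img_fine depicts the center of img_coarse, zoomed in."""
    _, _, h, w = img_coarse.shape
    ch, cw = h // zoom, w // zoom
    top, left = (h - ch) // 2, (w - cw) // 2
    center = img_coarse[:, :, top:top + ch, left:left + cw]
    center_up = F.interpolate(center, size=img_fine.shape[-2:], mode="bilinear",
                              align_corners=False)
    return F.mse_loss(center_up, img_fine)

coarse, fine = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
print(zoom_consistency_loss(coarse, fine).item())
```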

Competition-Level Problems are Effective LLM Evaluators

  • paper_url: http://arxiv.org/abs/2312.02143
  • repo_url: None
  • paper_authors: Yiming Huang, Zhenghao Lin, Xiao Liu, Yeyun Gong, Shuai Lu, Fangyu Lei, Yaobo Liang, Yelong Shen, Chen Lin, Nan Duan, Weizhu Chen
  • for: evaluating the reasoning abilities of large language models (LLMs), particularly on competition-level programming problems from Codeforces.
  • methods: zero-shot evaluation of GPT-4, accounting for factors such as problem release time and difficulty, and studying the effect of various mitigation approaches.
  • results: GPT-4 shows a cliff-like performance decline on problems released after September 2021, consistent across difficulties and problem types, suggesting data contamination and underscoring how challenging unseen complex reasoning problems remain for existing LLMs.
    Abstract Large language models (LLMs) have demonstrated impressive reasoning capabilities, yet there is ongoing debate about these abilities and, recently, about the potential data contamination problem. This paper aims to evaluate the reasoning capacities of LLMs, specifically in solving recent competition-level programming problems in Codeforces, which are expert-crafted and unique, requiring deep understanding and robust reasoning skills. We first provide a comprehensive evaluation of GPT-4's perceived zero-shot performance on this task, considering various aspects such as problems' release time, difficulties, and types of errors encountered. Surprisingly, the perceived performance of GPT-4 has experienced a cliff-like decline on problems released after September 2021, consistently across all difficulties and types of problems, which shows the potential data contamination, as well as the challenges for any existing LLM to solve unseen complex reasoning problems. We further explore various approaches such as fine-tuning, Chain-of-Thought prompting and problem description simplification; unfortunately, none of them is able to consistently mitigate the challenges. Through our work, we emphasize the importance of this excellent data source for assessing the genuine reasoning capabilities of LLMs, and foster the development of LLMs with stronger reasoning abilities and better generalization in the future.
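
The zero-shot accuracy figures discussed here are typically reported as pass@k. A minimal implementation of the standard unbiased estimator (Chen et al., 2021) is sketched below; n is the number of generations per problem and c the number that pass the tests.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generations of which c are
    correct, solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3: 10 generations per problem, 3 correct
```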

EMDM: Efficient Motion Diffusion Model for Fast, High-Quality Motion Generation

  • paper_url: http://arxiv.org/abs/2312.02256
  • repo_url: None
  • paper_authors: Wenyang Zhou, Zhiyang Dou, Zeyu Cao, Zhouyingcheng Liao, Jingbo Wang, Wenjia Wang, Yuan Liu, Taku Komura, Wenping Wang, Lingjie Liu
  • for: faster, high-quality human motion generation.
  • methods: a Conditional Denoising Diffusion GAN that models multimodal data distributions, enabling high-quality motion with far fewer sampling steps.
  • results: rapid generation of high-quality human motion, with a motion geometric loss used during training to improve motion quality and training efficiency.
    Abstract We introduce Efficient Motion Diffusion Model (EMDM) for fast and high-quality human motion generation. Although previous motion diffusion models have shown impressive results, they struggle to achieve fast generation while maintaining high-quality human motions. Motion latent diffusion has been proposed for efficient motion generation. However, effectively learning a latent space can be non-trivial in such a two-stage manner. Meanwhile, accelerating motion sampling by increasing the step size, e.g., DDIM, typically leads to a decline in motion quality due to the inapproximation of complex data distributions when naively increasing the step size. In this paper, we propose EMDM that allows for much fewer sample steps for fast motion generation by modeling the complex denoising distribution during multiple sampling steps. Specifically, we develop a Conditional Denoising Diffusion GAN to capture multimodal data distributions conditioned on both control signals, i.e., textual description and denoising time step. By modeling the complex data distribution, a larger sampling step size and fewer steps are achieved during motion synthesis, significantly accelerating the generation process. To effectively capture the human dynamics and reduce undesired artifacts, we employ motion geometric loss during network training, which improves the motion quality and training efficiency. As a result, EMDM achieves a remarkable speed-up at the generation stage while maintaining high-quality motion generation in terms of fidelity and diversity.

DiffiT: Diffusion Vision Transformers for Image Generation

  • paper_url: http://arxiv.org/abs/2312.02139
  • repo_url: https://github.com/nvlabs/diffit
  • paper_authors: Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat
  • for: studying diffusion-based generative learning, in particular using vision transformers as the denoising backbone.
  • methods: proposes Diffusion Vision Transformers (DiffiT), a hybrid hierarchical architecture with a U-shaped encoder and decoder and a novel time-dependent self-attention mechanism.
  • results: DiffiT is surprisingly effective at generating high-quality images, achieving state-of-the-art (SOTA) results on a variety of class-conditional and unconditional synthesis tasks; in latent space it reaches a new SOTA FID of 1.73 on ImageNet-256.
    Abstract Diffusion models with their powerful expressivity and high sample quality have enabled many new applications and use-cases in various domains. For sample generation, these models rely on a denoising neural network that generates images by iterative denoising. Yet, the role of denoising network architecture is not well-studied with most efforts relying on convolutional residual U-Nets. In this paper, we study the effectiveness of vision transformers in diffusion-based generative learning. Specifically, we propose a new model, denoted as Diffusion Vision Transformers (DiffiT), which consists of a hybrid hierarchical architecture with a U-shaped encoder and decoder. We introduce a novel time-dependent self-attention module that allows attention layers to adapt their behavior at different stages of the denoising process in an efficient manner. We also introduce latent DiffiT which consists of transformer model with the proposed self-attention layers, for high-resolution image generation. Our results show that DiffiT is surprisingly effective in generating high-fidelity images, and it achieves state-of-the-art (SOTA) benchmarks on a variety of class-conditional and unconditional synthesis tasks. In the latent space, DiffiT achieves a new SOTA FID score of 1.73 on ImageNet-256 dataset. Repository: https://github.com/NVlabs/DiffiT
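
One simple way to make self-attention time-dependent, in the spirit of the module described above, is to add a learned projection of the diffusion timestep embedding to every token before the query/key/value computation. The PyTorch sketch below illustrates only that idea; the actual TMSA design in DiffiT differs in its details.

```python
import torch
import torch.nn as nn

class TimeDependentSelfAttention(nn.Module):
    """Illustrative module: inject a learned projection of the diffusion
    timestep embedding into every spatial token before attention, so the
    attention behaviour can change across denoising stages."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); t_emb: (batch, dim) timestep embedding
        h = x + self.time_mlp(t_emb).unsqueeze(1)   # broadcast time over tokens
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out                              # residual connection

x, t = torch.randn(2, 16, 64), torch.randn(2, 64)
print(TimeDependentSelfAttention(64)(x, t).shape)   # torch.Size([2, 16, 64])
```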

Hot PATE: Private Aggregation of Distributions for Diverse Tasks

  • paper_url: http://arxiv.org/abs/2312.02132
  • repo_url: None
  • paper_authors: Edith Cohen, Xin Lyu, Jelani Nelson, Tamas Sarlos, Uri Stemmer
  • for: extending the PATE privacy-preserving machine learning framework, which trains models on sensitive data under privacy guarantees, to open-ended generative tasks.
  • methods: in the PATE framework, multiple teacher models are trained on distinct portions of sensitive data and their predictions are privately aggregated to label new training examples for a student model.
  • results: on diverse tasks, hot PATE preserves both privacy and the diversity of responses, achieving privacy-utility tradeoffs comparable to, and in diverse settings significantly surpassing, the baseline "cold" PATE.
    Abstract The Private Aggregation of Teacher Ensembles (PATE) framework (Papernot et al., ICLR 2017) is a versatile approach to privacy-preserving machine learning. In PATE, teacher models are trained on distinct portions of sensitive data, and their predictions are privately aggregated to label new training examples for a student model. Until now, PATE has primarily been explored with classification-like tasks, where each example possesses a ground-truth label, and knowledge is transferred to the student by labeling public examples. Generative AI models, however, excel in open-ended *diverse* tasks with multiple valid responses and scenarios that may not align with traditional labeled examples. Furthermore, the knowledge of models is often encapsulated in the response distribution itself and may be transferred from teachers to student in a more fluid way. We propose *hot PATE*, tailored for the diverse setting. In hot PATE, each teacher model produces a response distribution and the aggregation method must preserve both privacy and diversity of responses. We demonstrate, analytically and empirically, that hot PATE achieves privacy-utility tradeoffs that are comparable to, and in diverse settings, significantly surpass, the baseline "cold" PATE.
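
To make the diversity-preserving aggregation concrete, the sketch below lets each teacher vote by sampling from its own response distribution (so multiple valid answers stay in play) and then applies standard noisy-max PATE aggregation to the vote histogram. This is a hedged illustration, not hot PATE's actual aggregator, whose mechanism and privacy analysis differ.

```python
import numpy as np

def aggregate_teacher_distributions(teacher_probs, noise_scale=1.0, rng=None):
    """Illustrative diversity-preserving aggregation: each teacher votes by
    *sampling* from its response distribution (keeping multiple valid answers
    alive), then Laplace noise is added to the vote histogram before the
    winner is released (noisy-max, as in standard PATE)."""
    rng = rng or np.random.default_rng()
    n_teachers, vocab = teacher_probs.shape
    votes = np.zeros(vocab)
    for p in teacher_probs:
        votes[rng.choice(vocab, p=p)] += 1
    noisy_votes = votes + rng.laplace(scale=noise_scale, size=vocab)
    return int(np.argmax(noisy_votes))

probs = np.full((10, 4), 0.25)   # 10 teachers, 4 equally plausible responses
print(aggregate_teacher_distributions(probs))
```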

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM

  • paper_url: http://arxiv.org/abs/2312.02126
  • repo_url: None
  • paper_authors: Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, Jonathon Luiten
  • for: dense simultaneous localization and mapping (SLAM) in embodied scene understanding
  • methods: using 3D Gaussians for high-quality reconstruction and real-time rendering of scenes with a single unposed monocular RGB-D camera
  • results: achieves up to 2X state-of-the-art performance in camera pose estimation, map construction, and novel-view synthesis, while allowing real-time rendering of a high-resolution dense 3D map
    Abstract Dense simultaneous localization and mapping (SLAM) is pivotal for embodied scene understanding. Recent work has shown that 3D Gaussians enable high-quality reconstruction and real-time rendering of scenes using multiple posed cameras. In this light, we show for the first time that representing a scene by 3D Gaussians can enable dense SLAM using a single unposed monocular RGB-D camera. Our method, SplaTAM, addresses the limitations of prior radiance field-based representations, including fast rendering and optimization, the ability to determine if areas have been previously mapped, and structured map expansion by adding more Gaussians. We employ an online tracking and mapping pipeline while tailoring it to specifically use an underlying Gaussian representation and silhouette-guided optimization via differentiable rendering. Extensive experiments show that SplaTAM achieves up to 2X state-of-the-art performance in camera pose estimation, map construction, and novel-view synthesis, demonstrating its superiority over existing approaches, while allowing real-time rendering of a high-resolution dense 3D map.
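
The silhouette-guided optimization can be sketched as a masked photometric/depth loss: only pixels that the current Gaussian map renders with high opacity contribute gradients, so unmapped regions do not distort tracking. The PyTorch helper below is illustrative; SplaTAM's exact loss terms and weights differ.

```python
import torch

def silhouette_guided_loss(rendered_rgb, rendered_depth, silhouette,
                           gt_rgb, gt_depth, thresh=0.99):
    """Illustrative masked loss: only pixels the current Gaussian map renders
    with high opacity (the silhouette) contribute to the colour/depth error,
    so not-yet-mapped regions do not pull gradients during tracking."""
    mask = (silhouette > thresh).float()                 # (H, W)
    rgb_err = (rendered_rgb - gt_rgb).abs()              # (H, W, 3)
    depth_err = (rendered_depth - gt_depth).abs()        # (H, W)
    return (mask.unsqueeze(-1) * rgb_err).mean() + (mask * depth_err).mean()
```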

TPPoet: Transformer-Based Persian Poem Generation using Minimal Data and Advanced Decoding Techniques

  • paper_url: http://arxiv.org/abs/2312.02125
  • repo_url: None
  • paper_authors: Amir Panahandeh, Hanie Asemi, Esmaeil Nourani
  • for: generating Persian classical poetry with a transformer architecture, together with a novel decoding method that improves the coherence and meaningfulness of the output.
  • methods: trains on a specialized dataset with no pretraining, and proposes a novel decoding method evaluated through both automatic and human assessments.
  • results: the proposed decoding method generates more coherent and meaningful Persian classical poetry, with clear advantages in both automatic and human evaluations.
    Abstract Recent advances in language models (LMs) have demonstrated significant efficacy in tasks related to the arts and humanities. While LMs have exhibited exceptional performance across a wide range of natural language processing tasks, there are notable challenges associated with their utilization on small datasets and their ability to replicate more creative human capacities. In this study, we aim to address these challenges by training a Persian classical poetry generation model using a transformer architecture on a specialized dataset with no pretraining. Additionally, we propose a novel decoding method to enhance coherence and meaningfulness in the generated poetry, effectively managing the tradeoff between diversity and quality. Furthermore, the results of our training approach and the proposed decoding method are evaluated through a comprehensive set of automatic and human evaluations, which show its superior capability to generate coherent and meaningful poetry compared to other decoding methods and an existing Persian large language model (LLM).

Magicoder: Source Code Is All You Need

  • paper_url: http://arxiv.org/abs/2312.02120
  • repo_url: https://github.com/ise-uiuc/magicoder
  • paper_authors: Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, Lingming Zhang
  • for: improving the quality and diversity of instruction data for code LLMs, to achieve better performance on code generation, multilingual coding, and data-science program completion.
  • methods: a new data-generation method, OSS-Instruct, that seeds instruction generation with open-source code snippets to reduce the inherent bias of LLM-generated synthetic data; combined with other generation methods such as Evol-Instruct to build the stronger MagicoderS.
  • results: Magicoder and MagicoderS perform strongly across a wide range of coding benchmarks, including Python text-to-code generation, multilingual coding, and data-science program completion; notably, MagicoderS-CL-7B based on CodeLlama even surpasses ChatGPT on HumanEval+ (66.5 vs. 65.9 pass@1). Overall, OSS-Instruct opens a new direction for low-bias, high-quality instruction tuning using abundant open-source references to produce more diverse, realistic, and controllable data.
    Abstract We introduce Magicoder, a series of fully open-source (code, weights, and data) Large Language Models (LLMs) for code that significantly closes the gap with top code models while having no more than 7B parameters. Magicoder models are trained on 75K synthetic instruction data using OSS-Instruct, a novel approach to enlightening LLMs with open-source code snippets to generate high-quality instruction data for code. Our main motivation is to mitigate the inherent bias of the synthetic data generated by LLMs by empowering them with a wealth of open-source references for the production of more diverse, realistic, and controllable data. The orthogonality of OSS-Instruct and other data generation methods like Evol-Instruct further enables us to build an enhanced MagicoderS. Both Magicoder and MagicoderS substantially outperform state-of-the-art code models with similar or even larger sizes on a wide range of coding benchmarks, including Python text-to-code generation, multilingual coding, and data-science program completion. Notably, MagicoderS-CL-7B based on CodeLlama even surpasses the prominent ChatGPT on HumanEval+ (66.5 vs. 65.9 in pass@1). Overall, OSS-Instruct opens a new direction for low-bias and high-quality instruction tuning using abundant open-source references.
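
The essence of OSS-Instruct is to seed an LLM with a random open-source snippet and ask it to invent a matching problem and solution. The prompt builder below is a hypothetical sketch of that pattern; the exact prompts live in the ise-uiuc/magicoder repository and will differ.

```python
def build_oss_instruct_prompt(seed_snippet: str) -> str:
    """Hypothetical OSS-Instruct-style prompt: seed the LLM with a real
    open-source snippet and ask for an inspired problem plus solution."""
    return (
        "Please gain inspiration from the following random code snippet to "
        "create a high-quality programming problem, then provide a complete "
        "solution.\n\n"
        f"Code snippet for inspiration:\n{seed_snippet}\n\n"
        "Reply with a [Problem Description] section and a [Solution] section."
    )

print(build_oss_instruct_prompt("def rolling_mean(xs, k):\n    ..."))
```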

Tree of Attacks: Jailbreaking Black-Box LLMs Automatically

  • paper_url: http://arxiv.org/abs/2312.02119
  • repo_url: https://github.com/ricommunity/tap
  • paper_authors: Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, Amin Karbasi
  • for: proposing an automated method for generating jailbreaks of large language models (LLMs), reducing reliance on human-designed attacks.
  • methods: iteratively refines candidate (attack) prompts against the target LLM using tree-of-thought reasoning to narrow the search space; before prompts are sent to the target, the method assesses them and prunes those unlikely to result in jailbreaks.
  • results: experiments show the method jailbreaks the target LLM for more than 80% of prompts using only a small number of queries, a large improvement over the previous black-box approach.
    Abstract While Large Language Models (LLMs) display versatile functionality, they continue to generate harmful, biased, and toxic content, as demonstrated by the prevalence of human-designed jailbreaks. In this work, we present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks that only requires black-box access to the target LLM. TAP utilizes an LLM to iteratively refine candidate (attack) prompts using tree-of-thoughts reasoning until one of the generated prompts jailbreaks the target. Crucially, before sending prompts to the target, TAP assesses them and prunes the ones unlikely to result in jailbreaks. Using tree-of-thought reasoning allows TAP to navigate a large search space of prompts and pruning reduces the total number of queries sent to the target. In empirical evaluations, we observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4 and GPT4-Turbo) for more than 80% of the prompts using only a small number of queries. This significantly improves upon the previous state-of-the-art black-box method for generating jailbreaks.
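
The abstract describes TAP's loop precisely enough to sketch: branch with an attacker LLM, prune off-topic prompts before they reach the target, score responses, and keep the best leaves. In the sketch below, attacker, target, judge_on_topic, and judge_score are hypothetical stand-ins for the component models; the depths, branching factors, and scoring in the real system differ.

```python
def tree_of_attacks(goal, attacker, target, judge_on_topic, judge_score,
                    depth=5, branching=4, success_score=10):
    """Sketch of the TAP loop. `attacker`, `target`, `judge_on_topic` and
    `judge_score` are stand-ins for the attacker LLM, the black-box target,
    the off-topic pruner, and the jailbreak scorer."""
    frontier = [goal]                               # root of the attack tree
    for _ in range(depth):
        # Branch: the attacker refines each prompt via tree-of-thought reasoning.
        children = [attacker(goal, p) for p in frontier for _ in range(branching)]
        # Phase-1 pruning: drop off-topic prompts *before* querying the target.
        children = [p for p in children if judge_on_topic(goal, p)]
        scored = [(judge_score(goal, p, target(p)), p) for p in children]
        for score, prompt in scored:
            if score >= success_score:              # target jailbroken
                return prompt
        # Phase-2 pruning: keep only the highest-scoring leaves.
        frontier = [p for _, p in sorted(scored, reverse=True)[:branching]]
    return None
```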

Innovations in Agricultural Forecasting: A Multivariate Regression Study on Global Crop Yield Prediction

  • paper_url: http://arxiv.org/abs/2312.02254
  • repo_url: None
  • paper_authors: Ishaan Gupta, Samyutha Ayalasomayajula, Yashas Shashidhara, Anish Kataria, Shreyas Shashidhara, Krishita Kataria, Aditya Undurti
  • for: predicting international crop yields, a crucial objective in agricultural research; the study applies 6 regression models (linear, tree, gradient descent, gradient boosting, K-nearest neighbors, and random forest) across 196 countries.
  • methods: the models are trained on 4 key parameters: pesticides (tonnes), rainfall (mm), temperature (Celsius), and yield (hg/ha), to identify the best-performing regressor.
  • results: the random forest regression model achieved a determination coefficient (r^2) of 0.94 with a margin of error (ME) of 0.03; models were trained and tested on Food and Agriculture Organization of the United Nations data and the World Bank Climate Change Data Catalog, and each parameter was analyzed to understand how varying factors impact overall yield.
    Abstract The prediction of crop yields internationally is a crucial objective in agricultural research. Thus, this study implements 6 regression models (Linear, Tree, Gradient Descent, Gradient Boosting, K-Nearest Neighbors, and Random Forest) to predict crop yields in 196 countries. Given 4 key training parameters, pesticides (tonnes), rainfall (mm), temperature (Celsius), and yield (hg/ha), it was found that our Random Forest Regression model achieved a determination coefficient (r^2) of 0.94, with a margin of error (ME) of 0.03. The models were trained and tested using the Food and Agricultural Organization of the United Nations data, along with the World Bank Climate Change Data Catalog. Furthermore, each parameter was analyzed to understand how varying factors could impact overall yield. We used unconventional models, contrary to generally used Deep Learning (DL) and Machine Learning (ML) models, combined with recently collected data to implement a unique approach in our research. Existing scholarship would benefit from understanding the most optimal model for agricultural research, specifically using the United Nations data.
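
A minimal scikit-learn reproduction of the paper's best model might look like the following; the CSV file and column names are hypothetical stand-ins for the merged FAO/World Bank table, and the study's preprocessing steps are omitted.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("yield_df.csv")   # hypothetical merged FAO + climate table
X = df[["pesticides_tonnes", "average_rainfall_mm", "avg_temp_celsius"]]
y = df["yield_hg_per_ha"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0)
model.fit(X_train, y_train)
print("r^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```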

TriDeNT: Triple Deep Network Training for Privileged Knowledge Distillation in Histopathology

  • paper_url: http://arxiv.org/abs/2312.02111
  • repo_url: None
  • paper_authors: Lucas Farndale, Robert Insall, Ke Yuan
  • for: improving the performance of computational pathology models.
  • methods: a novel self-supervised method that exploits privileged data available during training but not at inference.
  • results: across a range of paired data, TriDeNT outperforms other state-of-the-art methods on downstream tasks, with observed improvements of up to 101%.
    Abstract Computational pathology models rarely utilise data that will not be available for inference. This means most models cannot learn from highly informative data such as additional immunohistochemical (IHC) stains and spatial transcriptomics. We present TriDeNT, a novel self-supervised method for utilising privileged data that is not available during inference to improve performance. We demonstrate the efficacy of this method for a range of different paired data including immunohistochemistry, spatial transcriptomics and expert nuclei annotations. In all settings, TriDeNT outperforms other state-of-the-art methods in downstream tasks, with observed improvements of up to 101%. Furthermore, we provide qualitative and quantitative measurements of the features learned by these models and how they differ from baselines. TriDeNT offers a novel method to distil knowledge from scarce or costly data during training, to create significantly better models for routine inputs.

Diversify, Don’t Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

  • paper_url: http://arxiv.org/abs/2312.02253
  • repo_url: None
  • paper_authors: Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee
  • for: improving the performance and out-of-domain generalization of image recognition models with synthetic training images.
  • methods: uses off-the-shelf generative models (rather than generative fine-tuning) and improves synthetic-data quality via LLM-assisted prompt diversification and domain-adaptation techniques.
  • results: model performance improves consistently as synthetic data is scaled up to 6x the original ImageNet size, showing that synthetic data can strengthen recognition models and out-of-domain generalization.
    Abstract Recent advances in generative deep learning have enabled the creation of high-quality synthetic images in text-to-image generation. Prior work shows that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images from the finetuned model can enhance an ImageNet classifier's performance. However, performance degrades as synthetic images outnumber real ones. In this paper, we explore whether generative fine-tuning is essential for this improvement and whether it is possible to further scale up training using more synthetic data. We present a new framework leveraging off-the-shelf generative models to generate synthetic training images, addressing multiple challenges: class name ambiguity, lack of diversity in naive prompts, and domain shifts. Specifically, we leverage large language models (LLMs) and CLIP to resolve class name ambiguity. To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD) methods, also prompted by LLMs. Finally, to mitigate domain shifts, we leverage domain adaptation techniques with auxiliary batch normalization for synthetic images. Our framework consistently enhances recognition model performance with more synthetic data, up to 6x of original ImageNet size showcasing the potential of synthetic data for improved recognition models and strong out-of-domain generalization.

Authoring Worked Examples for Java Programming with Human-AI Collaboration

  • paper_url: http://arxiv.org/abs/2312.02105
  • repo_url: None
  • paper_authors: Mohammad Hassany, Peter Brusilovsky, Jiaze Ke, Kamil Akhuseyinoglu, Arun Balajiee Lekshmi Narayanan
  • for: addressing the authoring cost of worked examples (solutions to typical programming problems, presented as source code in a certain language and used to explain topics in a programming class), one of the most popular types of learning content.
  • methods: a human-AI collaboration approach for authoring Java worked examples: an authoring system generates an initial version of the code explanations, which the instructor can edit if necessary.
  • results: a study finds that this human-AI collaboration can produce high-quality code explanations while reducing instructor workload.
    Abstract Worked examples (solutions to typical programming problems presented as a source code in a certain language and are used to explain the topics from a programming class) are among the most popular types of learning content in programming classes. Most approaches and tools for presenting these examples to students are based on line-by-line explanations of the example code. However, instructors rarely have time to provide line-by-line explanations for a large number of examples typically used in a programming class. In this paper, we explore and assess a human-AI collaboration approach to authoring worked examples for Java programming. We introduce an authoring system for creating Java worked examples that generates a starting version of code explanations and presents it to the instructor to edit if necessary. We also present a study that assesses the quality of explanations created with this approach.

Physics simulation capabilities of LLMs

  • paper_url: http://arxiv.org/abs/2312.02091
  • repo_url: None
  • paper_authors: Mohamad Ali-Dib, Kristen Menou
  • for: evaluates the ability of state-of-the-art large language models (LLMs) to solve PhD-level to research-level computational physics problems.
  • methods: conditions LLM generation on well-documented and widely-used packages to elicit coding capabilities in the physics and astrophysics domains.
  • results: the current SOTA LLM (GPT-4) fails most of the problems, though about 40% of the solutions could plausibly earn a passing grade; the paper identifies several failure modes of GPT-4 in the computational physics domain and provides a snapshot of current computational capabilities in classical physics.
    Abstract [Abridged abstract] Large Language Models (LLMs) can solve some undergraduate-level to graduate-level physics textbook problems and are proficient at coding. Combining these two capabilities could one day enable AI systems to simulate and predict the physical world. We present an evaluation of state-of-the-art (SOTA) LLMs on PhD-level to research-level computational physics problems. We condition LLM generation on the use of well-documented and widely-used packages to elicit coding capabilities in the physics and astrophysics domains. We contribute ~50 original and challenging problems in celestial mechanics (with REBOUND), stellar physics (with MESA), 1D fluid dynamics (with Dedalus) and non-linear dynamics (with SciPy). Since our problems do not admit unique solutions, we evaluate LLM performance on several soft metrics: counts of lines that contain different types of errors (coding, physics, necessity and sufficiency) as well as a more "educational" Pass-Fail metric focused on capturing the salient physical ingredients of the problem at hand. As expected, today's SOTA LLM (GPT4) zero-shot fails most of our problems, although about 40% of the solutions could plausibly get a passing grade. About 70-90% of the code lines produced are necessary, sufficient and correct (coding & physics). Physics and coding errors are the most common, with some unnecessary or insufficient lines. We observe significant variations across problem class and difficulty. We identify several failure modes of GPT4 in the computational physics domain. Our reconnaissance work provides a snapshot of current computational capabilities in classical physics and points to obvious improvement targets if AI systems are ever to reach a basic level of autonomy in physics simulation capabilities.
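
For a flavor of the celestial-mechanics problems, the snippet below integrates a toy Sun-Jupiter system with REBOUND, one of the well-documented packages the evaluation conditions on. It is an illustrative warm-up only; the paper's ~50 problems are research-level and far harder.

```python
import rebound

sim = rebound.Simulation()
sim.units = ("yr", "AU", "Msun")
sim.add(m=1.0)                        # the Sun
sim.add(m=9.55e-4, a=5.2, e=0.048)    # a Jupiter-like planet
sim.integrate(11.86)                  # roughly one orbital period
planet = sim.particles[1]
print(planet.x, planet.y)             # planet position after one period
```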

Fine-Tuning Language Models for Context-Specific SQL Query Generation

  • paper_url: http://arxiv.org/abs/2312.02251
  • repo_url: None
  • paper_authors: Amine Rebei
  • for: generating SQL queries from natural language so that non-specialists can access data.
  • methods: fine-tunes open-source large language models (LLMs) on synthetic data tailored to the Snowflake SQL and GoogleSQL dialects to optimize performance.
  • results: three fine-tuned open-source LLMs (Starcoder Plus, Code-Llama, and Mistral) achieve high zero-shot accuracy, with Code-Llama reaching 81.58% on Snowflake SQL and 82.66% on GoogleSQL; this indicates that fine-tuning LLMs on domain-specific tasks is effective and a promising route to natural-language interfaces over relational databases.
    Abstract The ability to generate SQL queries from natural language has significant implications for making data accessible to non-specialists. This paper presents a novel approach to fine-tuning open-source large language models (LLMs) for the task of transforming natural language into SQL queries within the retail domain. We introduce models specialized in generating SQL queries, trained on synthetic datasets tailored to the Snowflake SQL and GoogleSQL dialects. Our methodology involves generating a context-specific dataset using GPT-4, then fine-tuning three open-source LLMs (Starcoder Plus, Code-Llama, and Mistral) employing the LoRA technique to optimize for resource constraints. The fine-tuned models demonstrate superior performance in zero-shot settings compared to the baseline GPT-4, with Code-Llama achieving the highest accuracy rates, at 81.58% for Snowflake SQL and 82.66% for GoogleSQL. These results underscore the effectiveness of fine-tuning LLMs on domain-specific tasks and suggest a promising direction for enhancing the accessibility of relational databases through natural language interfaces.
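
A minimal sketch of the LoRA setup with the Hugging Face peft library is shown below; the base model name and hyperparameters are illustrative choices, not the paper's reported configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")
config = LoraConfig(
    r=16,                                  # rank of the low-rank updates
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # only the small LoRA matrices train
```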

Integrating AI into CCTV Systems: A Comprehensive Evaluation of Smart Video Surveillance in Community Space

  • paper_url: http://arxiv.org/abs/2312.02078
  • repo_url: None
  • paper_authors: Shanle Yao, Babak Rahimi Ardabili, Armin Danesh Pazho, Ghazal Alinezhad Noghre, Christopher Neff, Hamed Tabkhi
  • for: enhancing safety in community spaces, including educational and recreational areas and small businesses.
  • methods: an AI-enabled system that integrates with existing CCTV and wired camera networks for adoption across varied community settings, and uses metadata rather than pixel data for activity recognition to preserve privacy.
  • results: experiments at a community college demonstrated robustness: the system processed video streams from 16 CCTV cameras at a sustained 16.5 frames per second over a 21-hour run, with an average end-to-end anomaly-alert latency of 26.76 seconds.
    Abstract This article presents an AI-enabled Smart Video Surveillance (SVS) designed to enhance safety in community spaces such as educational and recreational areas, and small businesses. The proposed system innovatively integrates with existing CCTV and wired camera networks, simplifying its adoption across various community cases to leverage recent AI advancements. Our SVS system, focusing on privacy, uses metadata instead of pixel data for activity recognition, aligning with ethical standards. It features cloud-based infrastructure and a mobile app for real-time, privacy-conscious alerts in communities. This article notably pioneers a comprehensive real-world evaluation of the SVS system, covering AI-driven visual processing, statistical analysis, database management, cloud communication, and user notifications. It's also the first to assess an end-to-end anomaly detection system's performance, vital for identifying potential public safety incidents. For our evaluation, we implemented the system in a community college, serving as an ideal model to exemplify the proposed system's capabilities. Our findings in this setting demonstrate the system's robustness, with throughput, latency, and scalability effectively managing 16 CCTV cameras. The system maintained a consistent 16.5 frames per second (FPS) over a 21-hour operation. The average end-to-end latency for detecting behavioral anomalies and alerting users was 26.76 seconds.

Know Your Audience: Do LLMs Adapt to Different Age and Education Levels?

  • paper_url: http://arxiv.org/abs/2312.02065
  • repo_url: None
  • paper_authors: Donya Rooein, Amanda Cercas Curry, Dirk Hovy
  • for: assessing whether large language models (LLMs) can adapt their answers to different audiences and their reading needs.
  • methods: four state-of-the-art LLMs (commercial and open-source) generate answers to science questions targeted at different age groups and education levels, and the readability of those answers is compared against the recommended comprehension level of each group.
  • results: readability varies widely across LLMs, and answers often fail to match the comprehension level of the intended audience; current LLMs have set readability ranges and adapt poorly to different audiences even when prompted, which limits their potential for educational use.
    Abstract Large language models (LLMs) offer a range of new possibilities, including adapting the text to different audiences and their reading needs. But how well do they adapt? We evaluate the readability of answers generated by four state-of-the-art LLMs (commercial and open-source) to science questions when prompted to target different age groups and education levels. To assess the adaptability of LLMs to diverse audiences, we compare the readability scores of the generated responses against the recommended comprehension level of each age and education group. We find large variations in the readability of the answers by different LLMs. Our results suggest LLM answers need to be better adapted to the intended audience demographics to be more comprehensible. They underline the importance of enhancing the adaptability of LLMs in education settings to cater to diverse age and education levels. Overall, current LLMs have set readability ranges and do not adapt well to different audiences, even when prompted. That limits their potential for educational purposes.
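
Readability comparisons of this kind are commonly automated with formulas such as Flesch-Kincaid, which maps a text to an approximate US school grade. The sketch below uses the textstat package on two made-up answers; the paper's exact readability metrics may differ.

```python
import textstat

answer_for_child = "Plants eat light. Leaves catch the sun and make food."
answer_for_expert = ("Photosynthesis converts photon energy into chemical "
                     "energy via chlorophyll-mediated electron transport.")

for text in (answer_for_child, answer_for_expert):
    # Higher grade = harder text; compare against the target group's level.
    print(textstat.flesch_kincaid_grade(text), "-", text[:45])
```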

Near-real-time Earthquake-induced Fatality Estimation using Crowdsourced Data and Large-Language Models

  • paper_url: http://arxiv.org/abs/2312.03755
  • repo_url: None
  • paper_authors: Chenguang Wang, Davis Engler, Xuechun Li, James Hou, David J. Wald, Kishor Jaiswal, Susu Xu
  • for: improving the accuracy and timeliness of forecasting earthquake-induced human losses worldwide.
  • methods: combines multilingual social media data with hierarchical casualty extraction models, physical-constraint-aware dynamic truth-discovery models, and Bayesian-updating loss-projection models.
  • results: tested in real time on a series of global earthquake events in 2021 and 2022, the framework delivers near-real-time casualty estimates with speed and accuracy comparable to the U.S. Geological Survey's (USGS) manual methods.
    Abstract When a damaging earthquake occurs, immediate information about casualties is critical for time-sensitive decision-making by emergency response and aid agencies in the first hours and days. Systems such as Prompt Assessment of Global Earthquakes for Response (PAGER) by the U.S. Geological Survey (USGS) were developed to provide a forecast within about 30 minutes of any significant earthquake globally. Traditional systems for estimating human loss in disasters often depend on manually collected early casualty reports from global media, a process that's labor-intensive and slow with notable time delays. Recently, some systems have employed keyword matching and topic modeling to extract relevant information from social media. However, these methods struggle with the complex semantics in multilingual texts and the challenge of interpreting ever-changing, often conflicting reports of death and injury numbers from various unverified sources on social media platforms. In this work, we introduce an end-to-end framework to significantly improve the timeliness and accuracy of global earthquake-induced human loss forecasting using multi-lingual, crowdsourced social media. Our framework integrates (1) a hierarchical casualty extraction model built upon large language models, prompt design, and few-shot learning to retrieve quantitative human loss claims from social media, (2) a physical constraint-aware, dynamic-truth discovery model that discovers the truthful human loss from massive noisy and potentially conflicting human loss claims, and (3) a Bayesian updating loss projection model that dynamically updates the final loss estimation using discovered truths. We test the framework in real-time on a series of global earthquake events in 2021 and 2022 and show that our framework streamlines casualty data retrieval, achieving speed and accuracy comparable to manual methods by USGS.
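
The Bayesian-updating idea can be illustrated with a toy conjugate update: treat the log of the death toll as normally distributed and refine the estimate as each newly discovered "truth" arrives. All numbers below are invented for illustration, and the paper's projection model is considerably more elaborate.

```python
import math

def update(prior_mu, prior_var, obs_log_loss, obs_var):
    """Posterior mean/variance after one noisy log-scale observation
    (conjugate normal-normal update)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mu = post_var * (prior_mu / prior_var + obs_log_loss / obs_var)
    return post_mu, post_var

mu, var = math.log(100), 1.0        # prior guess: ~100 fatalities
for report in (150, 180, 210):      # successively discovered "truths"
    mu, var = update(mu, var, math.log(report), obs_var=0.25)
print(f"estimate: {math.exp(mu):.0f} fatalities (log-variance {var:.3f})")
```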

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

  • paper_url: http://arxiv.org/abs/2312.02051
  • repo_url: https://github.com/renshuhuai-andy/timechat
  • paper_authors: Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou
  • for: proposing a time-sensitive multimodal large language model for long video understanding.
  • methods: two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content to the timestamp of each frame, and (2) a sliding video Q-Former that produces video token sequences of varying lengths to accommodate videos of different durations.
  • results: strong zero-shot temporal localization and reasoning across video understanding tasks, e.g., +9.2 F1 and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA over existing video LLMs, suggesting its potential as a versatile assistant for long-form video understanding and realistic user needs.
    Abstract This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations. Additionally, we construct an instruction-tuning dataset, encompassing 6 tasks and a total of 125K instances, to further enhance TimeChat's instruction-following performance. Experiment results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example, it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA, compared to state-of-the-art video large language models, holding the potential to serve as a versatile video assistant for long-form video comprehension tasks and satisfy realistic user requirements.

VLTSeg: Simple Transfer of CLIP-Based Vision-Language Representations for Domain Generalized Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2312.02021
  • repo_url: None
  • paper_authors: Christoph Hümmer, Manuel Schwonberg, Liangwei Zhong, Hu Cao, Alois Knoll, Hanno Gottschalk
  • for: improving domain generalization (DG) in deep neural network (DNN) semantic segmentation.
  • methods: a transfer-learning setup that replaces traditional vision-only backbones with pre-trained CLIP and EVA-CLIP encoders.
  • results: a new DG state of the art with a 7.6% mIoU improvement when training on the synthetic GTA5 dataset; 76.48% mIoU on the Cityscapes-to-ACDC benchmark, 6.9% mIoU above the previous SOTA; and strong in-domain generalization with 86.1% mIoU on the Cityscapes test set.
    Abstract Domain generalization (DG) remains a significant challenge for perception based on deep neural networks (DNN), where domain shifts occur due to lighting, weather, or geolocation changes. In this work, we propose VLTSeg to enhance domain generalization in semantic segmentation, where the network is solely trained on the source domain and evaluated on unseen target domains. Our method leverages the inherent semantic robustness of vision-language models. First, by substituting traditional vision-only backbones with pre-trained encoders from CLIP and EVA-CLIP as transfer learning setting we find that in the field of DG, vision-language pre-training significantly outperforms supervised and self-supervised vision pre-training. We thus propose a new vision-language approach for domain generalized segmentation, which improves the domain generalization SOTA by 7.6% mIoU when training on the synthetic GTA5 dataset. We further show the superior generalization capabilities of vision-language segmentation models by reaching 76.48% mIoU on the popular Cityscapes-to-ACDC benchmark, outperforming the previous SOTA approach by 6.9% mIoU on the test set at the time of writing. Additionally, our approach shows strong in-domain generalization capabilities indicated by 86.1% mIoU on the Cityscapes test set, resulting in a shared first place with the previous SOTA on the current leaderboard at the time of submission.
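
The backbone swap itself is mechanically simple: load a pre-trained CLIP vision tower and use it as the feature extractor under a segmentation head. The open_clip sketch below shows only the loading step; the model and checkpoint names are illustrative, and a real segmentation model would tap the patch-token feature map before pooling.

```python
import open_clip
import torch

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="openai")
visual = model.visual                 # vision tower only

image = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed image
with torch.no_grad():
    feats = visual(image)             # pooled CLIP image embedding
print(feats.shape)                    # torch.Size([1, 512]) for ViT-B-16
```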

Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models

  • paper_url: http://arxiv.org/abs/2312.02019
  • repo_url: https://github.com/argmax-ai/aime
  • paper_authors: Xingyuan Zhang, Philip Becker-Ehmck, Patrick van der Smagt, Maximilian Karl
  • for: a world-model-based imitation-from-observation method that learns new behaviors quickly, without online interaction with the environment after the demonstration.
  • methods: two phases: first, the agent learns a world model from its own past experience by maximizing the ELBO, coming to understand its embodiment; second, given observation-only demonstrations of an expert performing a novel task, it imitates the expert by defining a policy as an inference model and maximizing the evidence of the demonstration under the policy and world model.
  • results: on the Walker and Cheetah embodiments of the DeepMind Control Suite, the method's zero-shot imitation outperforms state-of-the-art baselines.
    Abstract Unlike most reinforcement learning agents which require an unrealistic amount of environment interactions to learn a new behaviour, humans excel at learning quickly by merely observing and imitating others. This ability highly depends on the fact that humans have a model of their own embodiment that allows them to infer the most likely actions that led to the observed behaviour. In this paper, we propose Action Inference by Maximising Evidence (AIME) to replicate this behaviour using world models. AIME consists of two distinct phases. In the first phase, the agent learns a world model from its past experience to understand its own body by maximising the ELBO. While in the second phase, the agent is given some observation-only demonstrations of an expert performing a novel task and tries to imitate the expert's behaviour. AIME achieves this by defining a policy as an inference model and maximising the evidence of the demonstration under the policy and world model. Our method is "zero-shot" in the sense that it does not require further training for the world model or online interactions with the environment after given the demonstration. We empirically validate the zero-shot imitation performance of our method on the Walker and Cheetah embodiment of the DeepMind Control Suite and find it outperforms the state-of-the-art baselines. Code is available at: https://github.com/argmax-ai/aime.

Towards Learning a Generalist Model for Embodied Navigation

  • paper_url: http://arxiv.org/abs/2312.02010
  • repo_url: https://github.com/zd11024/NaviLLM
  • paper_authors: Duo Zheng, Shijia Huang, Lin Zhao, Yiwu Zhong, Liwei Wang
  • for: building a generalist AI agent that can interact with the world and navigate adaptively across diverse embodied-navigation tasks.
  • methods: adapts LLMs to embodied navigation by introducing schema-based instructions, which cast diverse tasks as generation problems and thereby unify a wide range of tasks and data sources.
  • results: state-of-the-art performance on CVDN, SOON, and ScanQA, including a 29% improvement in goal progress on CVDN over the previous best method, along with strong generalization to unseen tasks such as embodied question answering and 3D captioning.
    Abstract Building a generalist agent that can interact with the world is the intriguing target of AI systems, thus spurring the research for embodied navigation, where an agent is required to navigate according to instructions or respond to queries. Despite the major progress attained, previous works primarily focus on task-specific agents and lack generalizability to unseen scenarios. Recently, LLMs have presented remarkable capabilities across various fields, and provided a promising opportunity for embodied navigation. Drawing on this, we propose the first generalist model for embodied navigation, NaviLLM. It adapts LLMs to embodied navigation by introducing schema-based instruction. The schema-based instruction flexibly casts various tasks into generation problems, thereby unifying a wide range of tasks. This approach allows us to integrate diverse data sources from various datasets into the training, equipping NaviLLM with a wide range of capabilities required by embodied navigation. We conduct extensive experiments to evaluate the performance and generalizability of our model. The experimental results demonstrate that our unified model achieves state-of-the-art performance on CVDN, SOON, and ScanQA. Specifically, it surpasses the previous stats-of-the-art method by a significant margin of 29% in goal progress on CVDN. Moreover, our model also demonstrates strong generalizability and presents impressive results on unseen tasks, e.g., embodied question answering and 3D captioning.

A Survey on Large Language Model (LLM) Security and Privacy: The Good, the Bad, and the Ugly

  • paper_url: http://arxiv.org/abs/2312.02003
  • repo_url: None
  • paper_authors: Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Eric Sun, Yue Zhang
  • for: explores the intersection of large language models (LLMs) with security and privacy: how LLMs benefit security and privacy, the potential risks and threats associated with their use, and the inherent vulnerabilities within LLMs.
  • methods: a comprehensive literature review that categorizes findings into "The Good" (beneficial LLM applications), "The Bad" (offensive applications), and "The Ugly" (vulnerabilities and their defenses).
  • results: LLMs have proven to enhance code and data security, outperforming traditional methods, but their human-like reasoning also lets them be harnessed for various (particularly user-level) attacks; areas needing further research include model and parameter extraction attacks and safe instruction tuning.
    Abstract Large Language Models (LLMs), such as GPT-3 and BERT, have revolutionized natural language understanding and generation. They possess deep language comprehension, human-like text generation capabilities, contextual awareness, and robust problem-solving skills, making them invaluable in various domains (e.g., search engines, customer support, translation). In the meantime, LLMs have also gained traction in the security community, revealing security vulnerabilities and showcasing their potential in security-related tasks. This paper explores the intersection of LLMs with security and privacy. Specifically, we investigate how LLMs positively impact security and privacy, potential risks and threats associated with their use, and inherent vulnerabilities within LLMs. Through a comprehensive literature review, the paper categorizes findings into "The Good" (beneficial LLM applications), "The Bad" (offensive applications), and "The Ugly" (vulnerabilities and their defenses). We have some interesting findings. For example, LLMs have proven to enhance code and data security, outperforming traditional methods. However, they can also be harnessed for various attacks (particularly user-level attacks) due to their human-like reasoning abilities. We have identified areas that require further research efforts. For example, research on model and parameter extraction attacks is limited and often theoretical, hindered by LLM parameter scale and confidentiality. Safe instruction tuning, a recent development, requires more exploration. We hope that our work can shed light on the LLMs' potential to both bolster and jeopardize cybersecurity.

SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention

  • paper_url: http://arxiv.org/abs/2312.01990
  • repo_url: None
  • paper_authors: Isabel Leal, Krzysztof Choromanski, Deepali Jain, Avinava Dubey, Jake Varley, Michael Ryoo, Yao Lu, Frederick Liu, Vikas Sindhwani, Quan Vuong, Tamas Sarlos, Ken Oslund, Karol Hausman, Kanishka Rao
  • for: scales up Robotics Transformers (RT) for on-robot deployment
  • methods: up-training, converting pre-trained or fine-tuned Transformer-based robotic policies of quadratic time complexity into efficient linear-attention counterparts
  • results: speeds up RT-2 models and Point Cloud Transformer (PCT) robotic policies operating on large point clouds
    Abstract We present Self-Adaptive Robust Attention for Robotics Transformers (SARA-RT): a new paradigm for addressing the emerging challenge of scaling up Robotics Transformers (RT) for on-robot deployment. SARA-RT relies on the new method of fine-tuning proposed by us, called up-training. It converts pre-trained or already fine-tuned Transformer-based robotic policies of quadratic time complexity (including massive billion-parameter vision-language-action models or VLAs), into their efficient linear-attention counterparts maintaining high quality. We demonstrate the effectiveness of SARA-RT by speeding up: (a) the class of recently introduced RT-2 models, the first VLA robotic policies pre-trained on internet-scale data, as well as (b) Point Cloud Transformer (PCT) robotic policies operating on large point clouds. We complement our results with the rigorous mathematical analysis providing deeper insight into the phenomenon of SARA.
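
The linear-attention counterpart referred to above replaces softmax(QK^T)V, which is quadratic in sequence length N, with a kernelized form phi(Q)(phi(K)^T V) that is linear in N. The sketch below uses a fixed positive feature map (elu + 1, as in Katharopoulos et al.); SARA instead learns its attention approximation via up-training, so this illustrates only the linear-time mechanism.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelised attention: phi(Q) (phi(K)^T V), linear in sequence length.
    q, k, v: (batch, heads, N, dim); phi is the positive map elu(x) + 1."""
    phi_q, phi_k = F.elu(q) + 1.0, F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", phi_k, v)     # (b, h, d, d_v)
    norm = torch.einsum("bhnd,bhd->bhn", phi_q, phi_k.sum(dim=2)) + eps
    return torch.einsum("bhnd,bhde,bhn->bhne", phi_q, kv, 1.0 / norm)

q = k = v = torch.randn(2, 4, 128, 32)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 4, 128, 32])
```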

Learning-Based Approaches to Predictive Monitoring with Conformal Statistical Guarantees

  • paper_url: http://arxiv.org/abs/2312.01959
  • repo_url: None
  • paper_authors: Francesca Cairoli, Luca Bortolussi, Nicola Paoletti
  • for: an efficient approach to predictive monitoring (PM): detecting at runtime, from a system's current state, future violations of a given requirement.
  • methods: machine learning is used to learn an accurate yet efficient surrogate (deep learning) model of the computationally expensive model checker.
  • results: a reliable runtime method for detecting impending requirement violations, with conformal-prediction-based uncertainty quantification that identifies unreliable predictions to be rejected.
    Abstract This tutorial focuses on efficient methods for predictive monitoring (PM), the problem of detecting at runtime future violations of a given requirement from the current state of a system. While performing model checking at runtime would offer a precise solution to the PM problem, it is generally computationally expensive. To address this scalability issue, several lightweight approaches based on machine learning have recently been proposed. These approaches work by learning an approximate yet efficient surrogate (deep learning) model of the expensive model checker. A key challenge remains to ensure reliable predictions, especially in safety-critical applications. We review our recent work on predictive monitoring, one of the first to propose learning-based approximations for CPS verification of temporal logic specifications and the first in this context to apply conformal prediction (CP) for rigorous uncertainty quantification. These CP-based uncertainty estimators offer statistical guarantees regarding the generalization error of the learning model, and they can be used to determine unreliable predictions that should be rejected. In this tutorial, we present a general and comprehensive framework summarizing our approach to the predictive monitoring of CPSs, examining in detail several variants determined by three main dimensions: system dynamics (deterministic, non-deterministic, stochastic), state observability, and semantics of requirements' satisfaction (Boolean or quantitative).
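
As a concrete illustration of the statistical guarantee mentioned above, here is a minimal split conformal prediction sketch for a classifier-style monitor; the nonconformity score and the toy calibration values are assumptions, not the paper's exact construction:

```python
# Split conformal prediction for a binary "violation" monitor. The score
# (1 - probability of the true label) and the calibration data are toy
# assumptions; the guarantee mechanism is the standard split-CP one.
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    # cal_scores: nonconformity on held-out calibration data
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q)

def predict_set(probs, qhat):
    # Keep every label whose nonconformity stays within the threshold.
    return [y for y, p in enumerate(probs) if 1 - p <= qhat]

cal_scores = 1 - np.array([0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.99, 0.75])
qhat = conformal_threshold(cal_scores, alpha=0.1)   # -> 0.4 here
print(predict_set([0.90, 0.10], qhat))  # [0]: confident, accept
print(predict_set([0.55, 0.45], qhat))  # []: ambiguous, reject as unreliable
```

The empty set in the second call is exactly the "unreliable prediction that should be rejected" behavior the tutorial describes: the monitor's confidence falls outside what calibration supports.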

Foundations for Transfer in Reinforcement Learning: A Taxonomy of Knowledge Modalities

  • paper_url: http://arxiv.org/abs/2312.01939
  • repo_url: None
  • paper_authors: Markus Wulfmeier, Arunkumar Byravan, Sarah Bechtle, Karol Hausman, Nicolas Heess
  • for: Examines the general capability and scalability of contemporary artificial intelligence systems, and how different methods and techniques can improve their generalisation and transfer of knowledge.
  • methods: Surveys the modalities through which knowledge is represented in reinforcement learning — dynamics and reward models, value functions, policies, and the original data — and frames transfer approaches around their inherent properties.
  • results: Generalisation and scalability in contemporary AI systems present substantial challenges and opportunities, and multiple methods and techniques are needed to improve them.
    Abstract Contemporary artificial intelligence systems exhibit rapidly growing abilities accompanied by the growth of required resources, expansive datasets and corresponding investments into computing infrastructure. Although earlier successes predominantly focus on constrained settings, recent strides in fundamental research and applications aspire to create increasingly general systems. This evolving landscape presents a dual panorama of opportunities and challenges in refining the generalisation and transfer of knowledge - the extraction from existing sources and adaptation as a comprehensive foundation for tackling new problems. Within the domain of reinforcement learning (RL), the representation of knowledge manifests through various modalities, including dynamics and reward models, value functions, policies, and the original data. This taxonomy systematically targets these modalities and frames its discussion based on their inherent properties and alignment with different objectives and mechanisms for transfer. Where possible, we aim to provide coarse guidance delineating approaches which address requirements such as limiting environment interactions, maximising computational efficiency, and enhancing generalisation across varying axes of change. Finally, we analyse reasons contributing to the prevalence or scarcity of specific forms of transfer, the inherent potential behind pushing these frontiers, and underscore the significance of transitioning from designed to learned transfer.

Federated Active Learning for Target Domain Generalisation

  • paper_url: http://arxiv.org/abs/2312.02247
  • repo_url: https://github.com/razvancaramalau/fedalv
  • paper_authors: Razvan Caramalau, Binod Bhattarai, Danail Stoyanov
  • for: Applies an Active Learning framework within Federated Learning for target domain generalisation: training an image classification model from limited source-domain client data, without sharing images, so that it achieves high accuracy on an unseen target domain.
  • methods: Proposes FEDALV, composed of Active Learning (AL) and Federated Domain Generalisation (FDG). It performs two optimisation updates, one at the client and one at the server: the client-side losses reduce feature complexity and align conditionals, while the server-side regularisation limits free-energy biases between source and target obtained by the global model. AL with variable budgets then queries the server to retrieve and sample the most informative local data for the target client.
  • results: Multiple experiments on FDG with and without AL, compared against conventional FDG and Federated Active Learning baselines, show that FEDALV is superior in accuracy and efficiency, matching full-training target accuracy while sampling as little as 5% of the source clients' data.
    Abstract In this paper, we introduce Active Learning framework in Federated Learning for Target Domain Generalisation, harnessing the strength from both learning paradigms. Our framework, FEDALV, composed of Active Learning (AL) and Federated Domain Generalisation (FDG), enables generalisation of an image classification model trained from limited source domain client's data without sharing images to an unseen target domain. To this end, our FDG, FEDA, consists of two optimisation updates during training, one at the client and another at the server level. For the client, the introduced losses aim to reduce feature complexity and condition alignment, while in the server, the regularisation limits free energy biases between source and target obtained by the global model. The remaining component of FEDAL is AL with variable budgets, which queries the server to retrieve and sample the most informative local data for the targeted client. We performed multiple experiments on FDG w/ and w/o AL and compared with both conventional FDG baselines and Federated Active Learning baselines. Our extensive quantitative experiments demonstrate the superiority of our method in accuracy and efficiency compared to the multiple contemporary methods. FEDALV manages to obtain the performance of the full training target accuracy while sampling as little as 5% of the source client's data.
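
The abstract does not specify FEDALV's informativeness criterion, so the sketch below uses predictive entropy — a common active-learning baseline — to illustrate the budgeted "query the most informative local data" step:

```python
# Entropy-based active-learning query under a fixed budget. This is a
# generic AL baseline, not FEDALV's exact server-side criterion.
import numpy as np

def query_most_informative(probs, budget):
    """probs: (n_unlabeled, n_classes) softmax outputs of the global model."""
    eps = 1e-12
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)
    # Indices of the most uncertain samples, up to the labeling budget.
    return np.argsort(entropy)[::-1][:budget]

probs = np.array([[0.98, 0.02], [0.55, 0.45], [0.70, 0.30]])
print(query_most_informative(probs, budget=1))  # -> [1], the ambiguous sample
```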

Conditional Variational Diffusion Models

  • paper_url: http://arxiv.org/abs/2312.02246
  • repo_url: https://github.com/wallyxie/varInferenceSoilBiogeoModelSyntheticData
  • paper_authors: Gabriel della Maggiora, Luis Alberto Croquevielle, Nikita Desphande, Harry Horsley, Thomas Heinis, Artur Yakimovich
  • for: Proposes learning the variance (noise) schedule to improve the performance of diffusion models on inverse problems.
  • methods: Learns the schedule as part of the training process, avoiding the difficult and time-consuming manual fine-tuning of the variance schedule.
  • results: Tested on two unrelated inverse problems, yielding comparable or superior results to previous methods and fine-tuned diffusion models; the schedule can be learned stably during training and adapts to different applications.
    Abstract Inverse problems aim to determine parameters from observations, a crucial task in engineering and science. Lately, generative models, especially diffusion models, have gained popularity in this area for their ability to produce realistic solutions and their good mathematical properties. Despite their success, an important drawback of diffusion models is their sensitivity to the choice of variance schedule, which controls the dynamics of the diffusion process. Fine-tuning this schedule for specific applications is crucial but time-costly and does not guarantee an optimal result. We propose a novel approach for learning the schedule as part of the training process. Our method supports probabilistic conditioning on data, provides high-quality solutions, and is flexible, proving able to adapt to different applications with minimum overhead. This approach is tested in two unrelated inverse problems: super-resolution microscopy and quantitative phase imaging, yielding comparable or superior results to previous methods and fine-tuned diffusion models. We conclude that fine-tuning the schedule by experimentation should be avoided because it can be learned during training in a stable way that yields better results.
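
One way to make a variance schedule learnable, in the spirit of this paper, is to parameterize a monotone function of time and train it jointly with the denoiser. The positive-weight network below is an illustrative assumption, not the paper's exact parameterization:

```python
# A monotone, learnable noise schedule gamma(t), in the spirit of
# variational diffusion models. The two-layer squared-weight
# parameterization is an assumption, not the paper's exact form.
import torch
import torch.nn as nn

class LearnedSchedule(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.l1 = nn.Linear(1, hidden)
        self.l2 = nn.Linear(hidden, 1)

    def forward(self, t):  # t in [0, 1], shape (batch, 1)
        # Squaring the weights keeps every layer monotone in t.
        h = torch.sigmoid(t @ self.l1.weight.square().t() + self.l1.bias)
        return h @ self.l2.weight.square().t() + self.l2.bias

sched = LearnedSchedule()
t = torch.linspace(0, 1, 5).unsqueeze(-1)
gamma = sched(t)                   # trained jointly with the denoising loss
alpha2 = torch.sigmoid(-gamma)     # signal variance under a VDM-style convention
print(gamma.squeeze(), alpha2.squeeze())
```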

Deep Reinforcement Learning for Community Battery Scheduling under Uncertainties of Load, PV Generation, and Energy Prices

  • paper_url: http://arxiv.org/abs/2312.03008
  • repo_url: None
  • paper_authors: Jiarong Fan, Hao Wang
  • For: Tackles the community battery scheduling problem with a deep reinforcement learning strategy (the soft actor-critic algorithm), in response to the growing uptake of distributed energy resources (DERs), to support renewable integration, reduce peak load, and improve grid reliability.
  • Methods: A deep RL approach, centred on soft actor-critic, that schedules the community battery under uncertainties such as solar photovoltaic (PV) generation, local demand, and real-time energy prices.
  • Results: RL effectively solves the community battery scheduling problem, and among the RL algorithms compared, soft actor-critic achieves the best performance.
    Abstract In response to the growing uptake of distributed energy resources (DERs), community batteries have emerged as a promising solution to support renewable energy integration, reduce peak load, and enhance grid reliability. This paper presents a deep reinforcement learning (RL) strategy, centered around the soft actor-critic (SAC) algorithm, to schedule a community battery system in the presence of uncertainties, such as solar photovoltaic (PV) generation, local demand, and real-time energy prices. We position the community battery to play a versatile role, in integrating local PV energy, reducing peak load, and exploiting energy price fluctuations for arbitrage, thereby minimizing the system cost. To improve exploration and convergence during RL training, we utilize the noisy network technique. This paper conducts a comparative study of different RL algorithms, including proximal policy optimization (PPO) and deep deterministic policy gradient (DDPG) algorithms, to evaluate their effectiveness in the community battery scheduling problem. The results demonstrate the potential of RL in addressing community battery scheduling challenges and show that the SAC algorithm achieves the best performance compared to RL and optimization benchmarks.
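
A toy one-step environment makes the scheduling problem concrete: the agent chooses a charge/discharge power, and the reward is the negated cost of net grid import at the real-time price. All constants and the cost structure are illustrative assumptions, not the paper's model:

```python
# One step of a toy community-battery environment. The capacity,
# efficiency, and cost structure are illustrative assumptions.
def battery_step(soc, action_kw, load_kw, pv_kw, price, dt_h=1.0,
                 capacity_kwh=500.0, eta=0.95):
    charge = max(action_kw, 0.0) * eta           # kW flowing into the battery
    discharge = -min(action_kw, 0.0) / eta       # kW drawn from the battery
    soc = min(max(soc + (charge - discharge) * dt_h / capacity_kwh, 0.0), 1.0)
    net_import = load_kw - pv_kw + charge - discharge   # grid power, kW
    cost = price * net_import * dt_h             # $ (negative = arbitrage revenue)
    return soc, -cost                            # next state-of-charge, RL reward

# Discharging 50 kW while PV covers part of the load earns revenue.
soc, reward = battery_step(soc=0.5, action_kw=-50.0, load_kw=120.0,
                           pv_kw=80.0, price=0.30)
print(round(soc, 3), round(reward, 2))
```

An RL agent such as SAC would be trained on sequences of such steps, with the price, load, and PV signals drawn from data rather than fixed as here.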

Correlation and Unintended Biases on Univariate and Multivariate Decision Trees

  • paper_url: http://arxiv.org/abs/2312.01884
  • repo_url: https://github.com/msetzu/univariate-vs-multivariate-decision-trees
  • paper_authors: Mattia Setzu, Salvatore Ruggieri
  • for: Studies different variants of Decision Trees (DTs) and their performance across datasets.
  • methods: Contrasts univariate DTs, whose split functions partition data through axis-parallel hyperplanes, with multivariate DTs, whose splits partition data through oblique hyperplanes.
  • results: Although multivariate DTs are in principle more expressive, univariate DTs achieve comparable performance in practice; the authors attribute this to biases in existing benchmark datasets.
    Abstract Decision Trees are accessible, interpretable, and well-performing classification models. A plethora of variants with increasing expressiveness has been proposed in the last forty years. We contrast the two families of univariate DTs, whose split functions partition data through axis-parallel hyperplanes, and multivariate DTs, whose splits instead partition data through oblique hyperplanes. The latter include the former, hence multivariate DTs are in principle more powerful. Surprisingly enough, however, univariate DTs consistently show comparable performances in the literature. We analyze the reasons behind this, both with synthetic and real-world benchmark datasets. Our research questions test whether the pre-processing phase of removing correlation among features in datasets has an impact on the relative performances of univariate vs multivariate DTs. We find that existing benchmark datasets are likely biased towards favoring univariate DTs.
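
The two split families under comparison differ only in the test applied at each internal node; an oblique split with a one-hot weight vector reduces to the univariate case, which is why multivariate DTs subsume univariate ones:

```python
# The two node tests compared in the paper: a univariate (axis-parallel)
# split uses a single feature; an oblique split uses a linear combination.
import numpy as np

def univariate_split(x, feature, threshold):
    return x[feature] <= threshold          # axis-parallel hyperplane

def oblique_split(x, w, threshold):
    # With w one-hot, this reduces to the univariate test above.
    return np.dot(w, x) <= threshold        # oblique hyperplane

x = np.array([1.0, 2.0, -0.5])
print(univariate_split(x, feature=1, threshold=1.5))                   # False
print(oblique_split(x, w=np.array([0.5, 0.5, 1.0]), threshold=1.5))    # True
```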

Unleashing the Potential of Large Language Model: Zero-shot VQA for Flood Disaster Scenario

  • paper_url: http://arxiv.org/abs/2312.01882
  • repo_url: None
  • paper_authors: Yimin Sun, Chao Wang, Yan Peng
  • for: Proposes ZFDDA, a zero-shot VQA model for flood disaster damage assessment, together with a new Freestyle Flood Disaster Image Question Answering dataset (FFD-IQA) for evaluating such models.
  • methods: Uses chain-of-thought (CoT) demonstrations to unlock the potential of large language models, with the new dataset expanding the question types to free-form, multiple-choice, and yes-no questions.
  • results: With CoT demonstrations, accuracy on complex questions improves substantially, providing a basis for subsequent VQA research on other disaster scenarios.
    Abstract Visual question answering (VQA) is a fundamental and essential AI task, and VQA-based disaster scenario understanding is a hot research topic. For instance, we can ask questions about a disaster image by the VQA model and the answer can help identify whether anyone or anything is affected by the disaster. However, previous VQA models for disaster damage assessment have some shortcomings, such as limited candidate answer space, monotonous question types, and limited answering capability of existing models. In this paper, we propose a zero-shot VQA model named Zero-shot VQA for Flood Disaster Damage Assessment (ZFDDA). It is a VQA model for damage assessment without pre-training. Also, with flood disaster as the main research object, we build a Freestyle Flood Disaster Image Question Answering dataset (FFD-IQA) to evaluate our VQA model. This new dataset expands the question types to include free-form, multiple-choice, and yes-no questions. At the same time, we expand the size of the previous dataset to contain a total of 2,058 images and 22,422 question-meta ground truth pairs. Most importantly, our model uses well-designed chain of thought (CoT) demonstrations to unlock the potential of the large language model, allowing zero-shot VQA to show better performance in disaster scenarios. The experimental results show that the accuracy in answering complex questions is greatly improved with CoT prompts. Our study provides a research basis for subsequent research of VQA for other disaster scenarios.
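
A rough sketch of how a CoT demonstration can be prepended to a VQA query; the demonstration text and the caption-based interface are hypothetical, since the paper's actual prompts are not reproduced in the abstract:

```python
# Generic chain-of-thought VQA prompt layout. The demonstration text is
# hypothetical, not the paper's actual CoT demonstrations.
COT_DEMO = """Q: Is anyone in the image endangered by the flood?
Reasoning: First locate people. Two people stand on a rooftop. Then check
the water level: water reaches the first floor, so they cannot leave safely.
A: Yes."""

def build_prompt(question, caption):
    return (f"{COT_DEMO}\n\n"
            f"Image description: {caption}\n"
            f"Q: {question}\n"
            f"Reasoning:")  # the LLM continues with its reasoning, then an answer

print(build_prompt("How deep is the water near the houses?",
                   "A flooded street with partially submerged cars."))
```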

Modular Control Architecture for Safe Marine Navigation: Reinforcement Learning and Predictive Safety Filters

  • paper_url: http://arxiv.org/abs/2312.01855
  • repo_url: None
  • paper_authors: Aksel Vaaler, Svein Jostein Husa, Daniel Menges, Thomas Nakken Larsen, Adil Rasheed
  • for: Addresses the safety of autonomous systems, which require robust closed-loop control to handle physical limitations and safety constraints.
  • methods: Reinforcement learning adapts to complex scenarios but standard frameworks lack safety and stability guarantees, so Predictive Safety Filters (PSF) are used to ensure that learning-based control actions satisfy physical and safety constraints.
  • results: The PSF maintains safety without hindering the RL agent's learning rate or performance; on a simulated Cybership II model for marine navigation, the agent is trained for path following and collision avoidance while the PSF monitors and modifies its control actions.
    Abstract Many autonomous systems face safety challenges, requiring robust closed-loop control to handle physical limitations and safety constraints. Real-world systems, like autonomous ships, encounter nonlinear dynamics and environmental disturbances. Reinforcement learning is increasingly used to adapt to complex scenarios, but standard frameworks ensuring safety and stability are lacking. Predictive Safety Filters (PSF) offer a promising solution, ensuring constraint satisfaction in learning-based control without explicit constraint handling. This modular approach allows using arbitrary control policies, with the safety filter optimizing proposed actions to meet physical and safety constraints. We apply this approach to marine navigation, combining RL with PSF on a simulated Cybership II model. The RL agent is trained on path following and collision avoidance, while the PSF monitors and modifies control actions for safety. Results demonstrate the PSF's effectiveness in maintaining safety without hindering the RL agent's learning rate and performance, evaluated against a standard RL agent without PSF.
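
A predictive safety filter can be read as a projection: find the action sequence closest to the RL proposal whose predicted trajectory satisfies the constraints, then apply the first action. The sketch below uses double-integrator dynamics and a position bound as stand-ins for the ship model and its safety constraints:

```python
# Predictive safety filter as constrained projection of the RL action.
# Double-integrator dynamics and the position bound are illustrative
# stand-ins for the Cybership II model and its constraints.
import numpy as np
from scipy.optimize import minimize

def rollout_positions(pos, vel, accels, dt=0.5):
    out = []
    for a in accels:
        vel = vel + a * dt
        pos = pos + vel * dt
        out.append(pos)
    return np.array(out)

def safety_filter(a_rl, pos, vel, pos_max=10.0, horizon=5, a_bound=2.0):
    def objective(a):            # stay close to the RL agent's proposal
        return np.sum((a - a_rl) ** 2)
    def margin(a):               # predicted positions must stay below pos_max
        return pos_max - rollout_positions(pos, vel, a)
    res = minimize(objective, x0=np.zeros(horizon),
                   bounds=[(-a_bound, a_bound)] * horizon,
                   constraints=[{"type": "ineq", "fun": margin}])
    return res.x[0]              # apply only the first action (receding horizon)

a_rl = np.full(5, 2.0)           # the RL agent proposes full acceleration
print(safety_filter(a_rl, pos=8.0, vel=1.5))   # filtered first action
```

The modularity claimed in the abstract is visible here: the filter never inspects the policy, only its proposed actions.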

Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking

  • paper_url: http://arxiv.org/abs/2312.01842
  • repo_url: https://github.com/jihyunlee1/e2e-dst
  • paper_authors: Jihyun Lee, Yejin Jeon, Wonjun Lee, Yunsu Kim, Gary Geunbae Lee
  • for: Studies dialogue state tracking (DST) in the audio modality.
  • methods: Develops cascading and end-to-end models and trains them on a synthetic audio dataset created by the authors.
  • results: Models trained solely on synthetic audio generalise to real human speech, easing the practical barriers of audio-based DST. Data and code are available at https://github.com/JihyunLee1/E2E-DST.
    Abstract Dialogue state tracking plays a crucial role in extracting information in task-oriented dialogue systems. However, preceding research are limited to textual modalities, primarily due to the shortage of authentic human audio datasets. We address this by investigating synthetic audio data for audio-based DST. To this end, we develop cascading and end-to-end models, train them with our synthetic audio dataset, and test them on actual human speech data. To facilitate evaluation tailored to audio modalities, we introduce a novel PhonemeF1 to capture pronunciation similarity. Experimental results showed that models trained solely on synthetic datasets can generalize their performance to human voice data. By eliminating the dependency on human speech data collection, these insights pave the way for significant practical advancements in audio-based DST. Data and code are available at https://github.com/JihyunLee1/E2E-DST.
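
The abstract introduces PhonemeF1 without a formula; one plausible reading is an F1 over the multisets of predicted and reference phonemes, sketched below (the paper's exact definition may differ):

```python
# One plausible reading of a phoneme-level F1: precision/recall over the
# multisets of phonemes in predicted vs. reference pronunciations. The
# paper's exact PhonemeF1 definition may differ.
from collections import Counter

def phoneme_f1(pred, ref):
    """pred, ref: lists of phoneme symbols, e.g. from a G2P front end."""
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(phoneme_f1(["HH", "AH", "L", "OW"], ["HH", "EH", "L", "OW"]))  # 0.75
```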

Integrated Drill Boom Hole-Seeking Control via Reinforcement Learning

  • paper_url: http://arxiv.org/abs/2312.01836
  • repo_url: None
  • paper_authors: Haoqi Yan, Haoyuan Xu, Hongbo Gao, Fei Ma, Shengbo Eben Li, Jingliang Duan
  • for: Improving drilling efficiency, mitigating potential safety hazards, and relieving the burden on human operators.
  • methods: An integrated drill boom hole-seeking control method based on reinforcement learning (RL).
  • results: Improved hole-seeking accuracy throughout the drilling process, along with better time efficiency.
    Abstract Intelligent drill boom hole-seeking is a promising technology for enhancing drilling efficiency, mitigating potential safety hazards, and relieving human operators. Most existing intelligent drill boom control methods rely on a hierarchical control framework based on inverse kinematics. However, these methods are generally time-consuming due to the computational complexity of inverse kinematics and the inefficiency of the sequential execution of multiple joints. To tackle these challenges, this study proposes an integrated drill boom control method based on Reinforcement Learning (RL). We develop an integrated drill boom control framework that utilizes a parameterized policy to directly generate control inputs for all joints at each time step, taking advantage of joint posture and target hole information. By formulating the hole-seeking task as a Markov decision process, contemporary mainstream RL algorithms can be directly employed to learn a hole-seeking policy, thus eliminating the need for inverse kinematics solutions and promoting cooperative multi-joint control. To enhance the drilling accuracy throughout the entire drilling process, we devise a state representation that combines Denavit-Hartenberg joint information and preview hole-seeking discrepancy data. Simulation results show that the proposed method significantly outperforms traditional methods in terms of hole-seeking accuracy and time efficiency.

Learning Machine Morality through Experience and Interaction

  • paper_url: http://arxiv.org/abs/2312.01818
  • repo_url: None
  • paper_authors: Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
  • for: Embedding morality into autonomous agents to ensure the safety of next-generation Artificial Intelligence (AI) systems.
  • methods: Learning from experience (Reinforcement Learning) to provide moral principles to learning agents.
  • results: Hybrid approaches can create more adaptable, controllable, and interpretable agents, and can represent classical ethical frameworks.
    Abstract Increasing interest in ensuring safety of next-generation Artificial Intelligence (AI) systems calls for novel approaches to embedding morality into autonomous agents. Traditionally, this has been done by imposing explicit top-down rules or hard constraints on systems, for example by filtering system outputs through pre-defined ethical rules. Recently, instead, entirely bottom-up methods for learning implicit preferences from human behavior have become increasingly popular, such as those for training and fine-tuning Large Language Models. In this paper, we provide a systematization of existing approaches to the problem of introducing morality in machines - modeled as a continuum, and argue that the majority of popular techniques lie at the extremes - either being fully hard-coded, or entirely learned, where no explicit statement of any moral principle is required. Given the relative strengths and weaknesses of each type of methodology, we argue that more hybrid solutions are needed to create adaptable and robust, yet more controllable and interpretable agents. In particular, we present three case studies of recent works which use learning from experience (i.e., Reinforcement Learning) to explicitly provide moral principles to learning agents - either as intrinsic rewards, moral logical constraints or textual principles for language models. For example, using intrinsic rewards in Social Dilemma games, we demonstrate how it is possible to represent classical moral frameworks for agents. We also present an overview of the existing work in this area in order to provide empirical evidence for the potential of this hybrid approach. We then discuss strategies for evaluating the effectiveness of moral learning agents. Finally, we present open research questions and implications for the future of AI safety and ethics which are emerging from this framework.
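
The abstract's example of intrinsic rewards in Social Dilemma games can be made concrete with an iterated Prisoner's Dilemma: the shaped reward adds a framework-specific moral term to the environment payoff. The payoff matrix and the weighting below are illustrative choices, not the paper's exact setup:

```python
# Intrinsic moral rewards in a Prisoner's Dilemma, following the abstract's
# idea of encoding classical frameworks as rewards. The payoff matrix and
# lambda weighting are illustrative choices.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def shaped_reward(my_act, other_act, framework="utilitarian", lam=0.5):
    r_me, r_other = PAYOFF[(my_act, other_act)]
    if framework == "utilitarian":      # value the collective outcome
        r_moral = r_me + r_other
    elif framework == "deontological":  # penalise the act of defecting itself
        r_moral = -5.0 if my_act == "D" else 0.0
    else:
        raise ValueError(framework)
    return r_me + lam * r_moral         # extrinsic + intrinsic moral term

print(shaped_reward("D", "C"))                    # 5 + 0.5*5  = 7.5
print(shaped_reward("D", "C", "deontological"))   # 5 - 0.5*5  = 2.5
```

Note how the same action earns different shaped rewards under the two frameworks, which is what lets an RL agent internalize one moral stance rather than another.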

Energy-based Potential Games for Joint Motion Forecasting and Control

  • paper_url: http://arxiv.org/abs/2312.01811
  • repo_url: https://github.com/rst-tu-dortmund/diff_epo_planner
  • paper_authors: Christopher Diehl, Tobias Klosek, Martin Krüger, Nils Murzyn, Timo Osterburg, Torsten Bertram
  • for: Uses game theory as a mathematical framework to model multi-agent interaction in motion forecasting and control.
  • methods: Proposes an Energy-based Potential Game formulation that unifies existing approaches, with neural networks inferring game parameters and a differentiable game-theoretic optimisation layer acting as an inductive bias.
  • results: Analysis shows that the game-theoretic layer adds interpretability and improves the predictive performance of various neural network backbones.
    Abstract This work uses game theory as a mathematical framework to address interaction modeling in multi-agent motion forecasting and control. Despite its interpretability, applying game theory to real-world robotics, like automated driving, faces challenges such as unknown game parameters. To tackle these, we establish a connection between differential games, optimal control, and energy-based models, demonstrating how existing approaches can be unified under our proposed Energy-based Potential Game formulation. Building upon this, we introduce a new end-to-end learning application that combines neural networks for game-parameter inference with a differentiable game-theoretic optimization layer, acting as an inductive bias. The analysis provides empirical evidence that the game-theoretic layer adds interpretability and improves the predictive performance of various neural network backbones using two simulations and two real-world driving datasets.

Cone Ranking for Multi-Criteria Decision Making

  • paper_url: http://arxiv.org/abs/2312.03006
  • repo_url: None
  • paper_authors: Andreas H Hamel, Daniel Kostner
  • for: Turns cone distribution functions from statistics into multi-criteria decision making (MCDM) tools.
  • methods: Shows that the procedure can be regarded as an upgrade of weighted sum scalarization, absorbing a whole collection of weighted sum scalarizations at once instead of fixing a particular one in advance.
  • results: Characterises situations in which different types of rank reversal occur and explains why this can be useful for analysing the ranking procedure.
    Abstract Recently introduced cone distribution functions from statistics are turned into multi-criteria decision making (MCDM) tools. It is demonstrated that this procedure can be considered as an upgrade of the weighted sum scalarization insofar as it absorbs a whole collection of weighted sum scalarizations at once instead of fixing a particular one in advance. Moreover, situations are characterized in which different types of rank reversal occur, and it is explained why this might even be useful for analyzing the ranking procedure. A few examples will be discussed and a potential application in machine learning is outlined.

LLM A*: Human in the Loop Large Language Models Enabled A* Search for Robotics

  • paper_url: http://arxiv.org/abs/2312.01797
  • repo_url: None
  • paper_authors: Hengjia Xiao, Peng Wang
  • for: Explores how Large Language Models (LLMs) can assist path planning for mobile embodied agents such as robots, in a human-in-the-loop and interactive manner.
  • methods: Proposes LLM A*, which leverages the commonsense of LLMs together with the utility-optimal A* algorithm for few-shot near-optimal path planning. Prompts (1) provide the LLM with essential information such as the environment, cost function, and heuristics, and (2) relay human feedback on intermediate planning results. This makes the whole planning process a "white box" in which human feedback guides LLM A* to converge quickly compared with data-driven methods such as reinforcement-learning-based (RL) path planning.
  • results: LLM A* is more efficient than A* and RL in terms of search space, achieves paths on a par with A* and better than RL, and its interactive nature makes it a promising tool for collaborative human-robot tasks.
    Abstract This research focuses on how Large Language Models (LLMs) can help with path planning for mobile embodied agents such as robots, in a human-in-the-loop and interactive manner. A novel framework named LLM A* is proposed, leveraging the commonsense of LLMs together with the utility-optimal A* to facilitate few-shot near-optimal path planning. Prompts are used to 1) provide LLMs with essential information like environment, cost, heuristics, etc.; 2) communicate human feedback to LLMs on intermediate planning results. This makes the whole path planning process a "white box", and human feedback guides LLM A* to converge quickly compared to other data-driven methods such as reinforcement learning-based (RL) path planning. In addition, it makes code-free path planning practical, thereby promoting the inclusiveness of artificial intelligence techniques. Comparative analysis against A* and RL shows that LLM A* is more efficient in terms of search space and achieves a path on a par with A* and a better path than RL. The interactive nature of LLM A* also makes it a promising tool for deployment in collaborative human-robot tasks.
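
The core loop is ordinary A* with the heuristic (and, in the paper, intermediate human feedback) supplied via the LLM. In the sketch below, `llm_heuristic` is a stub standing in for a prompted estimate, and the human-feedback channel is omitted:

```python
# A* on a grid with an externally supplied heuristic. `llm_heuristic` is a
# placeholder for a prompted LLM estimate; Manhattan distance keeps the
# search admissible in this toy setting.
import heapq

def llm_heuristic(node, goal):
    return abs(node[0] - goal[0]) + abs(node[1] - goal[1])

def a_star(start, goal, passable, h=llm_heuristic):
    frontier = [(h(start, goal), 0, start, [start])]
    seen = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        x, y = node
        for nxt in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
            if passable(nxt) and nxt not in seen:
                heapq.heappush(frontier,
                               (g + 1 + h(nxt, goal), g + 1, nxt, path + [nxt]))
    return None

passable = lambda p: 0 <= p[0] < 5 and 0 <= p[1] < 5 and p != (2, 2)
print(a_star((0, 0), (4, 4), passable))   # path around the blocked cell
```

A sharper LLM-provided heuristic prunes more of the frontier, which is the source of the search-space savings the abstract reports.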

Contrastive Learning-Based Spectral Knowledge Distillation for Multi-Modality and Missing Modality Scenarios in Semantic Segmentation

  • paper_url: http://arxiv.org/abs/2312.02240
  • repo_url: None
  • paper_authors: Aniruddh Sikdar, Jayant Teotia, Suresh Sundaram
  • for: Improving the performance of semantic segmentation models, especially in low-light and adverse conditions.
  • methods: Multi-modal fusion of optical (EO) and infrared (IR) images via contrastive-learning-based spectral knowledge distillation and an automatic mixed feature exchange mechanism, improving robustness and reliability.
  • results: CSK-Net surpasses state-of-the-art models on three public benchmark datasets and improves missing-modality performance without additional computational cost.
    Abstract Improving the performance of semantic segmentation models using multispectral information is crucial, especially for environments with low-light and adverse conditions. Multi-modal fusion techniques either learn cross-modality features to generate a fused image or engage in knowledge distillation, but they address multimodal and missing-modality scenarios as distinct issues, which is not an optimal approach for multi-sensor models. To address this, a novel multi-modal fusion approach called CSK-Net is proposed, which uses a contrastive learning-based spectral knowledge distillation technique along with an automatic mixed feature exchange mechanism for semantic segmentation in optical (EO) and infrared (IR) images. The distillation scheme extracts detailed textures from the optical images and distills them into the optical branch of CSK-Net. The model encoder consists of shared convolution weights with separate batch norm (BN) layers for both modalities, to capture the multi-spectral information from different modalities of the same objects. A novel Gated Spectral Unit (GSU) and mixed feature exchange strategy are proposed to increase the correlation of modality-shared information and decrease the modality-specific information during the distillation process. Comprehensive experiments show that CSK-Net surpasses state-of-the-art models in multi-modal tasks and for missing modalities when exclusively utilizing IR data for inference across three public benchmarking datasets. For missing modality scenarios, the performance increase is achieved without additional computational costs compared to the baseline segmentation models.

Developing Linguistic Patterns to Mitigate Inherent Human Bias in Offensive Language Detection

  • paper_url: http://arxiv.org/abs/2312.01787
  • repo_url: https://github.com/tanyelai/lingda
  • paper_authors: Toygar Tanyel, Besher Alkurdi, Serkan Ayvaz
  • for: Seeks to reduce human bias in labelling so as to improve the accuracy and fairness of offensive language detection on social media.
  • methods: Proposes a linguistic data augmentation approach that leverages the power of machines to improve the accuracy and fairness of the labelling process.
  • results: The approach can improve offensive language classification across multiple languages and reduce the prevalence of offensive content on social media.
    Abstract With the proliferation of social media, there has been a sharp increase in offensive content, particularly targeting vulnerable groups, exacerbating social problems such as hatred, racism, and sexism. Detecting offensive language use is crucial to prevent offensive language from being widely shared on social media. However, the accurate detection of irony, implication, and various forms of hate speech on social media remains a challenge. Natural language-based deep learning models require extensive training with large, comprehensive, and labeled datasets. Unfortunately, manually creating such datasets is both costly and error-prone. Additionally, the presence of human-bias in offensive language datasets is a major concern for deep learning models. In this paper, we propose a linguistic data augmentation approach to reduce bias in labeling processes, which aims to mitigate the influence of human bias by leveraging the power of machines to improve the accuracy and fairness of labeling processes. This approach has the potential to improve offensive language classification tasks across multiple languages and reduce the prevalence of offensive content on social media.

CZL-CIAE: CLIP-driven Zero-shot Learning for Correcting Inverse Age Estimation

  • paper_url: http://arxiv.org/abs/2312.01758
  • repo_url: None
  • paper_authors: Yuntao Shou, Wei Ai, Tao Meng, Keqin Li
  • for: zero-shot age estimation for improving efficiency and accuracy of various applications such as age verification and secure access control, and promoting research on multi-modal and zero-shot learning in the social media field.
  • methods: CLIP model for extracting image features and text semantic information, and a new Transformer architecture (FourierFormer) for fusing image and text semantic information, and reversible age estimation with end-to-end error feedback.
  • results: better age prediction results through extensive experiments on multiple data sets.
    Abstract Zero-shot age estimation aims to learn feature information about age from input images and make inferences about a given person's image or video frame without specific sample data. The development of zero-shot age estimation can improve the efficiency and accuracy of various applications (e.g., age verification and secure access control, etc.), while also promoting research on multi-modal and zero-shot learning in the social media field. For example, zero-sample age estimation can be used to create social networks focused on specific age groups. However, existing methods mainly focus on supervised, labeled age estimation learning, and the prediction effect of zero-shot learning is very poor. To tackle the above issues, we propose a novel CLIP-driven Zero-shot Learning for Correcting Inverse Age Estimation (CZL-CIAE). Specifically, we first introduce the CLIP model to extract image features and text semantic information respectively, and map them into a highly semantically aligned high-dimensional feature space. Next, we designed a new Transformer architecture (i.e., FourierFormer) to achieve channel evolution and spatial interaction of images, and to fuse image and text semantic information. Finally, we introduce reversible age estimation, which uses end-to-end error feedback to reduce the error rate of age predictions. Through extensive experiments on multiple data sets, CZL-CIAE has achieved better age prediction results.
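
The CLIP zero-shot baseline that CZL-CIAE builds on can be sketched as scoring an image against age-bin text prompts; this omits the paper's FourierFormer fusion and reversible error-feedback stages, and the prompt wording, bins, and input file are assumptions:

```python
# CLIP zero-shot age-bin classification: the baseline idea CZL-CIAE builds
# on. Omits the paper's FourierFormer and reversible estimation stages;
# prompts, bins, and "face.jpg" are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

bins = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-80"]
prompts = [f"a photo of a person aged {b} years" for b in bins]

image = Image.open("face.jpg")                       # hypothetical input
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(bins[probs.argmax().item()])                   # predicted age bin
```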

A Comprehensive Literature Review on Sweet Orange Leaf Diseases

  • paper_url: http://arxiv.org/abs/2312.01756
  • repo_url: None
  • paper_authors: Yousuf Rayhan Emon, Md Golam Rabbani, Dr. Md. Taimur Ahad, Faruk Ahmed
  • for: Reviews work toward automated systems for early detection and diagnosis of sweet orange leaf diseases, to improve agricultural productivity.
  • methods: Surveys image-processing techniques and machine learning models, including Vision Transformer (ViT), CNNs, CNN with SoftMax and RBF SVM, hybrid CNN-SVM, HLB-ConvMLP, EfficientNet-b0, YOLOv5, YOLOv7, and deep CNNs, tested on various datasets for disease detection.
  • results: Compares the models' accuracy, precision, recall, and related metrics as reported in existing studies, highlighting their broad application prospects.
    Abstract Sweet orange leaf diseases pose a significant threat to agricultural productivity and affect fruit quality in the citrus industry. The advent of machine learning has enabled the development of automated disease detectors, for which early detection and diagnosis are essential to leaf management. Automated systems for predicting sweet orange leaf disease have already been developed using different image-processing techniques. This comprehensive literature review systematically covers leaf diseases and the machine learning methodologies applied to detecting damaged leaves via image classification, including the benefits and limitations of different machine learning models: Vision Transformer (ViT), Convolutional Neural Network (CNN), CNN with SoftMax and RBF SVM, hybrid CNN-SVM, HLB-ConvMLP, EfficientNet-b0, YOLOv5, YOLOv7, and deep CNNs. These models were tested on various datasets and detected the diseases. The review compares the performance of the models using the accuracy, precision, recall, and related metrics reported in existing studies.

Model-based Deep Learning for Beam Prediction based on a Channel Chart

  • paper_url: http://arxiv.org/abs/2312.02239
  • repo_url: None
  • paper_authors: Taha Yassine, Baptiste Chatelier, Vincent Corlay, Matthieu Crussière, Stephane Paquelet, Olav Tirkkonen, Luc Le Magoarou
  • for: Channel charting builds a map of the radio environment in an unsupervised way, which can be used for various applications such as beam prediction.
  • methods: Advanced model-based neural network architectures are proposed for both channel charting and beam prediction.
  • results: Promising results are yielded on realistic synthetic channels.
    Abstract Channel charting builds a map of the radio environment in an unsupervised way. The obtained chart locations can be seen as low-dimensional compressed versions of channel state information that can be used for a wide variety of applications, including beam prediction. In non-standalone or cell-free systems, chart locations computed at a given base station can be transmitted to several other base stations (possibly operating at different frequency bands) for them to predict which beams to use. This potentially yields a dramatic reduction of the overhead due to channel estimation or beam management, since only the base station performing charting requires channel state information, the others directly predicting the beam from the chart location. In this paper, advanced model-based neural network architectures are proposed for both channel charting and beam prediction. The proposed methods are assessed on realistic synthetic channels, yielding promising results.

Cybersecurity threats in FinTech: A systematic review

  • paper_url: http://arxiv.org/abs/2312.01752
  • repo_url: None
  • paper_authors: Danial Javaheri, Mahdi Fahmideh, Hassan Chizari, Pooia Lalbakhsh, Junbeom Hur
  • for: Describes cybersecurity threats and defensive strategies in FinTech to help stakeholders, from banks and enterprises to global governmental bodies, understand current challenges and effective countermeasures.
  • methods: Applies the PRISMA methodology and topic modelling to 74 selected studies, identifying 11 central cyber threats (detailed in 43 papers) and 9 corresponding defence strategies (covered in 31 papers).
  • results: The in-depth analysis offers stakeholders valuable insights into current challenges and effective countermeasures, as well as directions for future research.
    Abstract The rapid evolution of the Smart-everything movement and Artificial Intelligence (AI) advancements have given rise to sophisticated cyber threats that traditional methods cannot counteract. Cyber threats are extremely critical in financial technology (FinTech) as a data-centric sector expected to provide 24/7 services. This paper introduces a novel and refined taxonomy of security threats in FinTech and conducts a comprehensive systematic review of defensive strategies. Through PRISMA methodology applied to 74 selected studies and topic modeling, we identified 11 central cyber threats, with 43 papers detailing them, and pinpointed 9 corresponding defense strategies, as covered in 31 papers. This in-depth analysis offers invaluable insights for stakeholders ranging from banks and enterprises to global governmental bodies, highlighting both the current challenges in FinTech and effective countermeasures, as well as directions for future research.

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

  • paper_url: http://arxiv.org/abs/2312.02238
  • repo_url: https://github.com/showlab/X-Adapter
  • paper_authors: Lingmin Ran, Xiaodong Cun, Jia-Wei Liu, Rui Zhao, Song Zijie, Xintao Wang, Jussi Keppo, Mike Zheng Shou
  • For: The paper aims to enable pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with an upgraded text-to-image diffusion model (e.g., SDXL) without further retraining.
  • Methods: The proposed method, called X-Adapter, keeps a frozen copy of the old model to preserve the connectors of different plugins, and adds trainable mapping layers that bridge the decoders of models of different versions for feature remapping.
  • Results: X-Adapter demonstrates universal compatibility with various plugins and enables plugins of different versions to work together, expanding the functionalities of the diffusion community; extensive experiments show its effectiveness in the upgraded foundational diffusion model.
    Abstract We introduce X-Adapter, a universal upgrader to enable the pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this goal by training an additional network to control the frozen upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. Additionally, X-Adapter adds trainable mapping layers that bridge the decoders from models of different versions for feature remapping. The remapped features will be used as guidance for the upgraded model. To enhance the guidance ability of X-Adapter, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to our strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionalities of diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments and the results show that X-Adapter may facilitate wider application in the upgraded foundational diffusion model.

Divide-and-Conquer Strategy for Large-Scale Dynamic Bayesian Network Structure Learning

  • paper_url: http://arxiv.org/abs/2312.01739
  • repo_url: None
  • paper_authors: Hui Ouyang, Cheng Chen, Ke Tang
  • For: The paper is written for researchers and practitioners who work with large-scale Bayesian networks, particularly in the fields of gene expression analysis, healthcare, and traffic prediction.
  • Methods: The paper introduces a novel divide-and-conquer strategy for large-scale structure learning of dynamic Bayesian networks (DBNs), originally developed for static Bayesian networks (BNs). The approach leverages prior knowledge of 2-time-sliced Bayesian networks (2-TBNs) to enhance performance.
  • Results: The paper shows that the proposed approach significantly improves the scalability and accuracy of 2-TBN structure learning. Experimental results demonstrate substantial improvements over existing algorithms in both computational efficiency and structure learning accuracy, with an average runtime reduction of 93.65% and average improvements of 74.45% and 110.94% in two accuracy metrics, respectively.
    Abstract Dynamic Bayesian Networks (DBNs), renowned for their interpretability, have become increasingly vital in representing complex stochastic processes in various domains such as gene expression analysis, healthcare, and traffic prediction. Structure learning of DBNs from data is challenging, particularly for datasets with thousands of variables. Most current algorithms for DBN structure learning are adaptations from those used in static Bayesian Networks (BNs), and are typically focused on small-scale problems. In order to solve large-scale problems while taking full advantage of existing algorithms, this paper introduces a novel divide-and-conquer strategy, originally developed for static BNs, and adapts it for large-scale DBN structure learning. In this work, we specifically concentrate on 2 Time-sliced Bayesian Networks (2-TBNs), a special class of DBNs. Furthermore, we leverage the prior knowledge of 2-TBNs to enhance the performance of the strategy we introduce. Our approach significantly improves the scalability and accuracy of 2-TBN structure learning. Experimental results demonstrate the effectiveness of our method, showing substantial improvements over existing algorithms in both computational efficiency and structure learning accuracy. On problem instances with more than 1,000 variables, our approach improves two accuracy metrics by 74.45% and 110.94% on average , respectively, while reducing runtime by 93.65% on average.

Learning Multi-graph Structure for Temporal Knowledge Graph Reasoning

  • paper_url: http://arxiv.org/abs/2312.03004
  • repo_url: None
  • paper_authors: Jinchuan Zhang, Bei Hui, Chong Mu, Ling Tian
  • for: Temporal Knowledge Graph (TKG) reasoning that forecasts future events from historical snapshots distributed over timestamps (extrapolation).
  • methods: Proposes Learning Multi-graph Structure (LMS), comprising three distinct modules that cover concurrent and evolutional patterns along timestamps, query-specific correlations across timestamps, and the semantics of timestamps, capturing TKG features from multiple perspectives.
  • results: On five event-based benchmark datasets, LMS outperforms state-of-the-art extrapolation models, demonstrating the benefit of modelling a multi-graph perspective for TKG reasoning.
    Abstract Temporal Knowledge Graph (TKG) reasoning that forecasts future events based on historical snapshots distributed over timestamps is denoted as extrapolation and has gained significant attention. Owing to its extreme versatility and variation in spatial and temporal correlations, TKG reasoning presents a challenging task, demanding efficient capture of concurrent structures and evolutional interactions among facts. While existing methods have made strides in this direction, they still fall short of harnessing the diverse forms of intrinsic expressive semantics of TKGs, which encompass entity correlations across multiple timestamps and periodicity of temporal information. This limitation constrains their ability to thoroughly reflect historical dependencies and future trends. In response to these drawbacks, this paper proposes an innovative reasoning approach that focuses on Learning Multi-graph Structure (LMS). Concretely, it comprises three distinct modules concentrating on multiple aspects of graph structure knowledge within TKGs, including concurrent and evolutional patterns along timestamps, query-specific correlations across timestamps, and semantic dependencies of timestamps, which capture TKG features from various perspectives. Besides, LMS incorporates an adaptive gate for merging entity representations both along and across timestamps effectively. Moreover, it integrates timestamp semantics into graph attention calculations and time-aware decoders, in order to impose temporal constraints on events and narrow down prediction scopes with historical statistics. Extensive experimental results on five event-based benchmark datasets demonstrate that LMS outperforms state-of-the-art extrapolation models, indicating the superiority of modeling a multi-graph perspective for TKG reasoning.

Rethinking Adversarial Training with Neural Tangent Kernel

  • paper_url: http://arxiv.org/abs/2312.02236
  • repo_url: None
  • paper_authors: Guanlin Li, Han Qiu, Shangwei Guo, Jiwei Li, Tianwei Zhang
  • for: An in-depth investigation of adversarial training (AT) in deep learning security, using the Neural Tangent Kernel (NTK) to characterise neural network training dynamics.
  • methods: Studies the AT process through NTK evolution, uncovering three new findings: the impact of data normalization on AT and the importance of unbiased estimators in batch normalization layers; experimentally explored kernel dynamics that lead to more time-saving AT methods; and spectral features inside the kernel used to address the catastrophic overfitting problem.
  • results: Leveraging these observations of kernel dynamics improves the effectiveness and stability of existing AT methods, offering a new direction for deep learning security research.
    Abstract Adversarial training (AT) is an important and attractive topic in deep learning security, exhibiting mysteries and odd properties. Recent studies of neural network training dynamics based on Neural Tangent Kernel (NTK) make it possible to reacquaint AT and deeply analyze its properties. In this paper, we perform an in-depth investigation of AT process and properties with NTK, such as NTK evolution. We uncover three new findings that are missed in previous works. First, we disclose the impact of data normalization on AT and the importance of unbiased estimators in batch normalization layers. Second, we experimentally explore the kernel dynamics and propose more time-saving AT methods. Third, we study the spectrum feature inside the kernel to address the catastrophic overfitting problem. To the best of our knowledge, it is the first work leveraging the observations of kernel dynamics to improve existing AT methods.
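
The analyses above revolve around tracking the empirical NTK as training proceeds. As a reminder of the object being studied, the sketch below computes a single empirical NTK entry for a toy scalar-output network; it is a generic illustration, not the authors' instrumentation.

```python
import torch

def empirical_ntk_entry(model: torch.nn.Module, x1: torch.Tensor, x2: torch.Tensor) -> float:
    """k(x1, x2) = <grad_theta f(x1), grad_theta f(x2)> for a scalar output."""
    def flat_grad(x):
        out = model(x).sum()
        grads = torch.autograd.grad(out, tuple(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])
    return torch.dot(flat_grad(x1), flat_grad(x2)).item()

net = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
k = empirical_ntk_entry(net, torch.randn(1, 10), torch.randn(1, 10))
```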

Data Management For Large Language Models: A Survey

  • paper_url: http://arxiv.org/abs/2312.01700
  • repo_url: https://github.com/Aryia-Behroziuan/Other-sources
  • paper_authors: Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, Qun Liu
  • for: This paper aims to provide a comprehensive overview of current research in data management for Large Language Models (LLMs), covering various aspects of data management strategy design, including data quantity, data quality, domain/task composition, etc.
  • methods: The paper reviews and discusses the existing research on data management for LLMs, including the rationale behind management strategy selection, the consequential effects of data management, and methodologies for evaluating curated datasets.
  • results: The survey highlights the challenges and limitations of existing approaches, outlines promising directions for future development, and compiles a collection of the latest papers on data management for LLMs that can serve as a guiding resource for practitioners aspiring to construct powerful LLMs through effective data management practices.
    Abstract Data plays a fundamental role in the training of Large Language Models (LLMs). Effective data management, particularly in the formulation of a well-suited training dataset, holds significance for enhancing model performance and improving training efficiency during pretraining and supervised fine-tuning phases. Despite the considerable importance of data management, the current research community still falls short in providing a systematic analysis of the rationale behind management strategy selection, its consequential effects, methodologies for evaluating curated datasets, and the ongoing pursuit of improved strategies. Consequently, the exploration of data management has attracted more and more attention among the research community. This survey provides a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs, covering various noteworthy aspects of data management strategy design: data quantity, data quality, domain/task composition, etc. Looking toward the future, we extrapolate existing challenges and outline promising directions for development in this field. Therefore, this survey serves as a guiding resource for practitioners aspiring to construct powerful LLMs through effective data management practices. The collection of the latest papers is available at https://github.com/ZigeW/data_management_LLM.

Rethinking Urban Mobility Prediction: A Super-Multivariate Time Series Forecasting Approach

  • paper_url: http://arxiv.org/abs/2312.01699
  • repo_url: https://github.com/chengyui/sumformer
  • paper_authors: Jinguo Cheng, Ke Li, Yuxuan Liang, Lijun Sun, Junchi Yan, Yuankai Wu
  • for: This work aims to improve the accuracy and reliability of urban mobility prediction, enabling better management of urban facilities and services.
  • methods: The study proposes the Super-Multivariate Urban Mobility Transformer (SUMformer), which uses a specially designed attention mechanism to compute temporal and cross-variable correlations, and low-frequency filters to extract the information essential for long-term prediction.
  • results: Compared with state-of-the-art methods on three real-world datasets, SUMformer achieves superior performance in urban mobility pattern modeling and long-term prediction, with clear improvements in computational cost and efficiency.
    Abstract Long-term urban mobility predictions play a crucial role in the effective management of urban facilities and services. Conventionally, urban mobility data has been structured as spatiotemporal videos, treating longitude and latitude grids as fundamental pixels. Consequently, video prediction methods, relying on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), have been instrumental in this domain. In our research, we introduce a fresh perspective on urban mobility prediction. Instead of oversimplifying urban mobility data as traditional video data, we regard it as a complex multivariate time series. This perspective involves treating the time-varying values of each grid in each channel as individual time series, necessitating a thorough examination of temporal dynamics, cross-variable correlations, and frequency-domain insights for precise and reliable predictions. To address this challenge, we present the Super-Multivariate Urban Mobility Transformer (SUMformer), which utilizes a specially designed attention mechanism to calculate temporal and cross-variable correlations and reduce computational costs stemming from a large number of time series. SUMformer also employs low-frequency filters to extract essential information for long-term predictions. Furthermore, SUMformer is structured with a temporal patch merge mechanism, forming a hierarchical framework that enables the capture of multi-scale correlations. Consequently, it excels in urban mobility pattern modeling and long-term prediction, outperforming current state-of-the-art methods across three real-world datasets.
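
The two ideas easiest to illustrate are the "super-multivariate" view of the mobility tensor and the low-frequency filtering. The toy sketch below reshapes a spatiotemporal tensor into one time series per grid cell and channel, then keeps only low FFT frequencies; the shapes and cutoff are made-up assumptions, not the paper's configuration.

```python
import numpy as np

T, C, H, W = 256, 2, 8, 8               # time, channels, grid height/width
video = np.random.rand(T, C, H, W)      # stand-in mobility "video"

series = video.reshape(T, C * H * W).T  # (num_series, T): super-multivariate view

def low_pass(x, keep: int):
    """Zero out all but the `keep` lowest frequencies of each series."""
    f = np.fft.rfft(x, axis=-1)
    f[:, keep:] = 0.0
    return np.fft.irfft(f, n=x.shape[-1], axis=-1)

smooth = low_pass(series, keep=16)      # input emphasizing long-term structure
```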

Hulk: A Universal Knowledge Translator for Human-Centric Tasks

  • paper_url: http://arxiv.org/abs/2312.01697
  • repo_url: https://github.com/opengvlab/humanbench
  • paper_authors: Yizhou Wang, Yixuan Wu, Shixiang Tang, Weizhen He, Xun Guo, Feng Zhu, Lei Bai, Rui Zhao, Jian Wu, Tong He, Wanli Ouyang
  • for: This paper proposes Hulk, a multimodal human-centric generalist model that can address most mainstream human-centric tasks simultaneously without task-specific fine-tuning.
  • methods: The proposed method condenses task-specific heads into two general heads, one for discrete representations (e.g., languages) and the other for continuous representations (e.g., location coordinates), integrating knowledge across a wide range of tasks and treating human-centric tasks as modality translation.
  • results: Experimental results on 11 benchmarks across 8 human-centric tasks demonstrate the superiority of the proposed method, surpassing previous methods substantially. The code will be available on GitHub.
    Abstract Human-centric perception tasks, e.g., human mesh recovery, pedestrian detection, skeleton-based action recognition, and pose estimation, have wide industrial applications, such as metaverse and sports analysis. There is a recent surge to develop human-centric foundation models that can benefit a broad range of human-centric perception tasks. While many human-centric foundation models have achieved success, most of them only excel in 2D vision tasks or require extensive fine-tuning for practical deployment in real-world scenarios. These limitations severely restrict their usability across various downstream tasks and situations. To tackle these problems, we present Hulk, the first multimodal human-centric generalist model, capable of addressing most of the mainstream tasks simultaneously without task-specific finetuning, covering 2D vision, 3D vision, skeleton-based, and vision-language tasks. The key to achieving this is condensing various task-specific heads into two general heads, one for discrete representations, e.g., languages, and the other for continuous representations, e.g., location coordinates. The outputs of two heads can be further stacked into four distinct input and output modalities. This uniform representation enables Hulk to treat human-centric tasks as modality translation, integrating knowledge across a wide range of tasks. To validate the effectiveness of our proposed method, we conduct comprehensive experiments on 11 benchmarks across 8 human-centric tasks. Experimental results surpass previous methods substantially, demonstrating the superiority of our proposed method. The code will be available on https://github.com/OpenGVLab/HumanBench.
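
The core architectural idea, two general heads shared across tasks, can be sketched in a few lines. Below, a toy decoder emits discrete token logits and continuous coordinates from the same features; the sizes and names are illustrative assumptions, not Hulk's actual configuration.

```python
import torch
import torch.nn as nn

class TwoHeadDecoder(nn.Module):
    """One head for discrete outputs (e.g., words, labels), one for
    continuous outputs (e.g., 2D/3D location coordinates)."""
    def __init__(self, dim=256, vocab=30000, coord_dim=3):
        super().__init__()
        self.discrete_head = nn.Linear(dim, vocab)
        self.continuous_head = nn.Linear(dim, coord_dim)

    def forward(self, feats):
        return self.discrete_head(feats), self.continuous_head(feats)

dec = TwoHeadDecoder()
token_logits, coords = dec(torch.randn(4, 10, 256))  # (batch, seq, dim)
```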

Risk-Controlling Model Selection via Guided Bayesian Optimization

  • paper_url: http://arxiv.org/abs/2312.01692
  • repo_url: None
  • paper_authors: Bracha Laufer-Goldshtein, Adam Fisch, Regina Barzilay, Tommi Jaakkola
  • for: The goal of this work is to find machine learning model configurations that adhere to user-specified limits on certain risks while remaining useful with respect to other, conflicting metrics.
  • methods: The study combines Bayesian Optimization (BO) with rigorous risk-controlling procedures, the core idea being to steer BO toward an efficient testing strategy.
  • results: The approach is shown to be effective on a range of tasks with multiple desiderata, including low error rates, equitable predictions, handling spurious correlations, managing rate and distortion in generative models, and reducing computational costs.
    Abstract Adjustable hyperparameters of machine learning models typically impact various key trade-offs such as accuracy, fairness, robustness, or inference cost. Our goal in this paper is to find a configuration that adheres to user-specified limits on certain risks while being useful with respect to other conflicting metrics. We solve this by combining Bayesian Optimization (BO) with rigorous risk-controlling procedures, where our core idea is to steer BO towards an efficient testing strategy. Our BO method identifies a set of Pareto optimal configurations residing in a designated region of interest. The resulting candidates are statistically verified and the best-performing configuration is selected with guaranteed risk levels. We demonstrate the effectiveness of our approach on a range of tasks with multiple desiderata, including low error rates, equitable predictions, handling spurious correlations, managing rate and distortion in generative models, and reducing computational costs.
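
To make the "statistically verified, then select" step concrete, the sketch below certifies candidate configurations with a Hoeffding upper bound on their risk before picking the best one by utility. The bound and candidate format are stand-in assumptions; the paper pairs such verification with Bayesian optimization over a designated region of interest.

```python
import math

def risk_controlled_select(candidates, alpha=0.1, delta=0.05):
    """Keep configurations whose risk is certified below alpha with
    probability >= 1 - delta, then return the best one by utility.
    Each candidate is (utility, empirical_risk, n_validation_samples)."""
    certified = []
    for utility, risk_hat, n in candidates:
        ucb = risk_hat + math.sqrt(math.log(1.0 / delta) / (2.0 * n))  # Hoeffding
        if ucb <= alpha:
            certified.append((utility, risk_hat))
    return max(certified, default=None)  # highest utility among safe configs

best = risk_controlled_select(
    [(0.81, 0.04, 2000), (0.86, 0.09, 2000), (0.88, 0.12, 2000)]
)  # -> (0.81, 0.04): the only configuration whose risk bound clears alpha
```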

ResEnsemble-DDPM: Residual Denoising Diffusion Probabilistic Models for Ensemble Learning

  • paper_url: http://arxiv.org/abs/2312.01682
  • repo_url: None
  • paper_authors: Shi Zhenning, Dong Changsheng, Xie Xueshuo, Pan Bin, He Along, Li Tao
  • for: This paper aims to improve performance on image segmentation tasks.
  • methods: The paper integrates denoising diffusion probabilistic models with end-to-end models through ensemble learning, implicitly introducing a residual term into the diffusion process.
  • results: Experimental results show that ResEnsemble-DDPM further improves the performance of existing models, and that its ensemble learning strategy generalizes to other downstream image generation tasks with strong competitiveness.
    Abstract Nowadays, denoising diffusion probabilistic models have been adapted for many image segmentation tasks. However, existing end-to-end models have already demonstrated remarkable capabilities. Rather than using denoising diffusion probabilistic models alone, integrating the abilities of both denoising diffusion probabilistic models and existing end-to-end models can better improve the performance of image segmentation. Based on this, we implicitly introduce residual term into the diffusion process and propose ResEnsemble-DDPM, which seamlessly integrates the diffusion model and the end-to-end model through ensemble learning. The output distributions of these two models are strictly symmetric with respect to the ground truth distribution, allowing us to integrate the two models by reducing the residual term. Experimental results demonstrate that our ResEnsemble-DDPM can further improve the capabilities of existing models. Furthermore, its ensemble learning strategy can be generalized to other downstream tasks in image generation and get strong competitiveness.
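
The symmetry claim admits a tiny numerical illustration: if the two models' outputs sit symmetrically around the ground truth, combining them cancels the residual term. The sketch below fakes both outputs around a known ground truth; it demonstrates only the averaging argument, not the paper's training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
gt = rng.random((64, 64))            # ground-truth segmentation map
residual = 0.1 * rng.standard_normal((64, 64))

pred_e2e = gt + residual             # end-to-end model output
pred_ddpm = gt - residual            # diffusion model output (mirrored error)

ensemble = 0.5 * (pred_e2e + pred_ddpm)
assert np.allclose(ensemble, gt)     # the residual term is reduced away
```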

Jellyfish: A Large Language Model for Data Preprocessing

  • paper_url: http://arxiv.org/abs/2312.01678
  • repo_url: None
  • paper_authors: Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
  • for: This paper presents an open-source LLM, built on the Llama 2 13B model, that serves as a universal task solver for data preprocessing (DP).
  • methods: The model is instruction-tuned on datasets of several typical DP tasks, including error detection, data imputation, schema matching, and entity matching, and generalizes to other tasks.
  • results: The model runs on a local, single, low-priced GPU, ensuring data security and allowing further tuning, and its natural language understanding lets users manually craft instructions for DP tasks. Unlike many existing methods that rely heavily on prior knowledge, it acquires domain knowledge during tuning and supports optional injection of task- and dataset-specific knowledge at inference; an interpreter explains its output decisions.
    Abstract In this paper, we present Jellyfish, an open-source LLM as a universal task solver for DP. Built on the Llama 2 13B model, Jellyfish is instruction-tuned with the datasets of several typical DP tasks including error detection, data imputation, schema matching, and entity matching, and delivers generalizability to other tasks. Remarkably, Jellyfish can operate on a local, single, and low-priced GPU with its 13 billion parameters, ensuring data security and enabling further tuning. Its proficiency in understanding natural language allows users to manually craft instructions for DP tasks. Unlike many existing methods that heavily rely on prior knowledge, Jellyfish acquires domain knowledge during its tuning process and integrates optional knowledge injection during inference. A distinctive feature of Jellyfish is its interpreter, which elucidates its output decisions. To construct Jellyfish, we develop a series of pre-tuning and DP-tuning techniques. Jellyfish is equipped with an instance serializer, which automatically translates raw data into model prompts, and a knowledge injector, which optionally introduces task- and dataset-specific knowledge to enhance DP performance. Our evaluation of Jellyfish, using a range of real datasets, shows its competitiveness compared to state-of-the-art methods and its strong generalizability to unseen tasks. Jellyfish's performance rivals that of GPT series models, and its interpreter offers enhanced reasoning capabilities compared to GPT-3.5. Furthermore, our evaluation highlights the effectiveness of the techniques employed in constructing Jellyfish. Our model is available at Hugging Face: https://huggingface.co/NECOUDBFM/Jellyfish .
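
The instance serializer is the component most easily pictured in code. Below is a guess at what serializing a record plus an instruction into a model prompt might look like; the prompt format, field names, and task wording are assumptions, not Jellyfish's actual templates.

```python
def serialize_instance(task: str, record: dict, instruction: str) -> str:
    """Turn a raw record and a natural-language instruction into a prompt."""
    attrs = "; ".join(f"{k}: {v}" for k, v in record.items())
    return f"Task: {task}\nRecord: [{attrs}]\n{instruction}\nAnswer:"

prompt = serialize_instance(
    task="error detection",
    record={"name": "Jon Smith", "age": "213", "city": "Osaka"},
    instruction="Is any attribute value of this record erroneous? Reply yes or no.",
)
print(prompt)
```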

STADEE: STAtistics-based DEEp Detection of Machine Generated Text

  • paper_url: http://arxiv.org/abs/2312.01672
  • repo_url: https://github.com/hmgithub111/stadee
  • paper_authors: Zheng Chen, Huming Liu
  • for: This work proposes a statistics-based deep detection method for identifying machine-generated text, addressing the heavy reliance of existing methods on fine-tuning pre-trained language models (PLMs).
  • methods: The method integrates key statistical text features with a deep classifier, focusing on token probability and cumulative probability, which are crucial for handling nucleus sampling.
  • results: Tested across diverse datasets and scenarios (in-domain, out-of-domain, and in-the-wild), STADEE outperforms both traditional statistical methods and fine-tuned PLMs, achieving an 87.05% F1 score in-domain and showing particular effectiveness and generalizability in out-of-domain and in-the-wild settings.
    Abstract We present STADEE, a STAtistics-based DEEp detection method to identify machine-generated text, addressing the limitations of current methods that rely heavily on fine-tuning pre-trained language models (PLMs). STADEE integrates key statistical text features with a deep classifier, focusing on aspects like token probability and cumulative probability, crucial for handling nucleus sampling. Tested across diverse datasets and scenarios (in-domain, out-of-domain, and in-the-wild), STADEE demonstrates superior performance, achieving an 87.05% F1 score in-domain and outperforming both traditional statistical methods and fine-tuned PLMs, especially in out-of-domain and in-the-wild settings, highlighting its effectiveness and generalizability.
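
The statistical features in question can be computed directly from a language model's output distribution. The sketch below derives, for each position, the observed token's probability and the cumulative probability mass of all tokens ranked at least as likely (the quantity nucleus sampling truncates on); random logits stand in for a real proxy model, and the feature set is a simplification of STADEE's.

```python
import torch

def stat_features(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Per-token features: probability of the observed token, and the
    cumulative probability up to its rank in the sorted distribution.
    logits: (seq, vocab); token_ids: (seq,)."""
    probs = torch.softmax(logits, dim=-1)
    tok_p = probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    sorted_p, _ = probs.sort(dim=-1, descending=True)
    cdf = sorted_p.cumsum(dim=-1)
    rank = (sorted_p >= tok_p.unsqueeze(-1)).sum(dim=-1) - 1  # rank of observed token
    cum_p = cdf.gather(-1, rank.unsqueeze(-1)).squeeze(-1)
    return torch.stack([tok_p, cum_p], dim=-1)                # (seq, 2)

feats = stat_features(torch.randn(12, 50257), torch.randint(0, 50257, (12,)))
```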

Analyze Drivers’ Intervention Behavior During Autonomous Driving – A VR-incorporated Approach

  • paper_url: http://arxiv.org/abs/2312.01669
  • repo_url: None
  • paper_authors: Zheng Xu
  • for: This paper seeks to understand human drivers' intervention behavior during autonomous vehicle (AV) operation and to use this knowledge to improve automated driving in similar scenarios.
  • methods: The study implements experiment environments that integrate virtual reality (VR) with traffic micro-simulation and runs tests under typical and diverse traffic scenes.
  • results: The study characterizes drivers' intervention behavior through indicators such as intervention probability and accident rates, offering insights for improving automated control; the integrated, immersive tool is also valuable for research on human-to-automation trust.
    Abstract Given the rapid advance in ITS technologies, future mobility is pointing to vehicular autonomy. However, there is still a long way before full automation, and human intervention is required. This work sheds light on understanding human drivers' intervention behavior involved in the operation of autonomous vehicles (AVs) and utilizes this knowledge to improve the perception of critical driving scenarios. Experiment environments were implemented where the virtual reality (VR) and traffic micro-simulation are integrated, and tests were carried out under typical and diverse traffic scenes. Performance indicators such as the probability of intervention, accident rates are defined and used to quantify and compare the risk levels. By offering novel insights into drivers' intervention behavior, this work will help improve the performances of the automated control under similar scenarios. Furthermore, such an integrated and immersive tool for autonomous driving studies will be valuable for research on human-to-automation trust. To the best knowledge of the authors, this work is among the pioneer works making efforts into such types of tools.

Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training

  • paper_url: http://arxiv.org/abs/2312.01663
  • repo_url: None
  • paper_authors: Runze He, Shaofei Huang, Xuecheng Nie, Tianrui Hui, Luoqi Liu, Jiao Dai, Jizhong Han, Guanbin Li, Si Liu
  • for: This paper targets the adaptive source-driven 3D scene editing task, proposing a CustomNeRF model that unifies a text description or a reference image as the editing prompt.
  • methods: To address the two key challenges, accurately editing only foreground regions and maintaining multi-view consistency given a single-view reference image, the paper proposes a Local-Global Iterative Editing (LGIE) training scheme and a class-guided regularization.
  • results: Experimental results show that CustomNeRF produces precise editing results in real scenes under both text-driven and image-driven settings.
    Abstract In this paper, we target the adaptive source driven 3D scene editing task by proposing a CustomNeRF model that unifies a text description or a reference image as the editing prompt. However, obtaining desired editing results conformed with the editing prompt is nontrivial since there exist two significant challenges, including accurate editing of only foreground regions and multi-view consistency given a single-view reference image. To tackle the first challenge, we propose a Local-Global Iterative Editing (LGIE) training scheme that alternates between foreground region editing and full-image editing, aimed at foreground-only manipulation while preserving the background. For the second challenge, we also design a class-guided regularization that exploits class priors within the generation model to alleviate the inconsistency problem among different views in image-driven editing. Extensive experiments show that our CustomNeRF produces precise editing results under various real scenes for both text- and image-driven settings.
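
The LGIE schedule itself is simple to sketch: alternate steps that supervise only the masked foreground with steps that supervise the full image. Everything below (the render and loss stand-ins, the masking) is a placeholder illustrating the alternation, not the paper's NeRF pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
fg_mask = (rng.random((32, 32)) > 0.7).astype(float)  # foreground region to edit

def render(view):
    return rng.random((32, 32))        # stand-in for a NeRF render

def edit_loss(img):
    return float(np.mean(img ** 2))    # stand-in for the prompt-guided loss

def lgie_step(step: int) -> float:
    """Alternate local (foreground-only) and global (full-image) supervision."""
    pred = render("some_view")
    if step % 2 == 0:                  # local step: manipulate foreground only
        return edit_loss(pred * fg_mask)
    return edit_loss(pred)             # global step: preserve the background

losses = [lgie_step(t) for t in range(10)]
```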

ChatGPT as a Math Questioner? Evaluating ChatGPT on Generating Pre-university Math Questions

  • paper_url: http://arxiv.org/abs/2312.01661
  • repo_url: https://github.com/dxlong2000/ChatGPT-as-a-Math-Questioner
  • paper_authors: Phuoc Pham Van Long, Duc Anh Vu, Nhat M. Hoang, Xuan Long Do, Anh Tuan Luu
  • for: This paper investigates whether ChatGPT can generate high-quality pre-university math questions and how it performs across educational levels and teaching resources.
  • methods: The analysis covers two settings: context-aware, where ChatGPT is evaluated on existing math question-answering benchmarks, and context-unaware, where ChatGPT generates questions for each lesson of pre-university math curriculums crawled by the authors (TopicMath).
  • results: The study finds that ChatGPT underperforms in the context-aware setting but does better in the context-unaware setting, where it can generate high-quality math questions; the authors also suggest ways to improve ChatGPT's performance as a math question generator.
    Abstract Mathematical questioning is crucial for assessing students problem-solving skills. Since manually creating such questions requires substantial effort, automatic methods have been explored. Existing state-of-the-art models rely on fine-tuning strategies and struggle to generate questions that heavily involve multiple steps of logical and arithmetic reasoning. Meanwhile, large language models(LLMs) such as ChatGPT have excelled in many NLP tasks involving logical and arithmetic reasoning. Nonetheless, their applications in generating educational questions are underutilized, especially in the field of mathematics. To bridge this gap, we take the first step to conduct an in-depth analysis of ChatGPT in generating pre-university math questions. Our analysis is categorized into two main settings: context-aware and context-unaware. In the context-aware setting, we evaluate ChatGPT on existing math question-answering benchmarks covering elementary, secondary, and ternary classes. In the context-unaware setting, we evaluate ChatGPT in generating math questions for each lesson from pre-university math curriculums that we crawl. Our crawling results in TopicMath, a comprehensive and novel collection of pre-university math curriculums collected from 121 math topics and 428 lessons from elementary, secondary, and tertiary classes. Through this analysis, we aim to provide insight into the potential of ChatGPT as a math questioner.
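
For the context-unaware setting, generation boils down to prompting the model with a curriculum entry. The sketch below shows one plausible prompt template; the exact prompts used in the study are not reproduced here.

```python
def question_prompt(topic: str, lesson: str, grade: str, n: int = 3) -> str:
    """A plausible context-unaware prompt for eliciting math questions."""
    return (
        f"You are a math teacher preparing a {grade} class.\n"
        f"Topic: {topic}. Lesson: {lesson}.\n"
        f"Write {n} word problems that require multi-step arithmetic and "
        f"logical reasoning, each with a worked solution."
    )

print(question_prompt("Fractions", "Adding unlike denominators", "grade 5"))
```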

On Tuning Neural ODE for Stability, Consistency and Faster Convergence

  • paper_url: http://arxiv.org/abs/2312.01657
  • repo_url: None
  • paper_authors: Sheikh Waqas Akhtar
  • for: This paper addresses the constant-memory-cost appeal of Neural-ODE models and proposes a Nesterov accelerated gradient (NAG) based ODE solver to improve their stability, consistency, and convergence.
  • methods: The approach parameterizes a differential equation with a continuous-depth neural network solved by a numerical ODE integrator, and introduces a first-order NAG-based ODE solver that is proven to be tuned with respect to stability, consistency, and convergence (CCS) conditions.
  • results: The method is demonstrated on three tasks, supervised classification, density estimation, and time-series modelling, training faster while achieving better or comparable performance against Neural-ODEs using other fixed-step explicit ODE solvers as well as discrete-depth models such as ResNet.
    Abstract Neural-ODE parameterize a differential equation using continuous depth neural network and solve it using numerical ODE-integrator. These models offer a constant memory cost compared to models with discrete sequence of hidden layers in which memory cost increases linearly with the number of layers. In addition to memory efficiency, other benefits of neural-ode include adaptability of evaluation approach to input, and flexibility to choose numerical precision or fast training. However, despite having all these benefits, it still has some limitations. We identify the ODE-integrator (also called ODE-solver) as the weakest link in the chain as it may have stability, consistency and convergence (CCS) issues and may suffer from slower convergence or may not converge at all. We propose a first-order Nesterov's accelerated gradient (NAG) based ODE-solver which is proven to be tuned vis-a-vis CCS conditions. We empirically demonstrate the efficacy of our approach by training faster, while achieving better or comparable performance against neural-ode employing other fixed-step explicit ODE-solvers as well discrete depth models such as ResNet in three different tasks including supervised classification, density estimation, and time-series modelling.
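
To give a feel for what a momentum-based fixed-step solver looks like, the sketch below adds a Nesterov-style lookahead and velocity to an explicit integrator and checks it on dy/dt = -y. The update rule, damping value, and scaling are illustrative assumptions; the paper's solver and its CCS tuning are more careful than this.

```python
import numpy as np

def nag_ode_solve(f, y0, t0, t1, n_steps, mu=0.9):
    """Explicit fixed-step integration where each increment is an
    exponentially averaged, Nesterov-lookahead Euler step."""
    h = (t1 - t0) / n_steps
    y, v, t = float(y0), 0.0, t0
    for _ in range(n_steps):
        y_look = y + mu * v                          # lookahead point
        v = mu * v + (1.0 - mu) * h * f(t, y_look)   # velocity update
        y += v
        t += h
    return y

# dy/dt = -y, y(0) = 1: approaches exp(-1) ~ 0.368 as n_steps grows
print(nag_ode_solve(lambda t, y: -y, 1.0, 0.0, 1.0, 1000))
```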

The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model

  • paper_url: http://arxiv.org/abs/2312.01656
  • repo_url: None
  • paper_authors: Yilin Ye, Qian Zhu, Shishi Xiao, Kang Zhang, Wei Zeng
  • for: This work aims to improve the user search experience by helping users express their search intentions more accurately.
  • methods: The framework uses vision-language models to parse and compose multi-modal user inputs, improving the accuracy of search results and user satisfaction.
  • results: Implemented in an NFT (non-fungible token) search system, the framework delivers a better search experience, and contextualized interactions let users iteratively refine and adjust their detailed search intents.
    Abstract Image search is an essential and user-friendly method to explore vast galleries of digital images. However, existing image search methods heavily rely on proximity measurements like tag matching or image similarity, requiring precise user inputs for satisfactory results. To meet the growing demand for a contemporary image search engine that enables accurate comprehension of users' search intentions, we introduce an innovative user intent expansion framework. Our framework leverages visual-language models to parse and compose multi-modal user inputs to provide more accurate and satisfying results. It comprises two-stage processes: 1) a parsing stage that incorporates a language parsing module with large language models to enhance the comprehension of textual inputs, along with a visual parsing module that integrates an interactive segmentation module to swiftly identify detailed visual elements within images; and 2) a logic composition stage that combines multiple user search intents into a unified logic expression for more sophisticated operations in complex searching scenarios. Moreover, the intent expansion framework enables users to perform flexible contextualized interactions with the search results to further specify or adjust their detailed search intents iteratively. We implemented the framework into an image search system for NFT (non-fungible token) search and conducted a user study to evaluate its usability and novel properties. The results indicate that the proposed framework significantly improves users' image search experience. Particularly the parsing and contextualized interactions prove useful in allowing users to express their search intents more accurately and engage in a more enjoyable iterative search experience.

Quantum Polar Metric Learning: Efficient Classically Learned Quantum Embeddings

  • paper_url: http://arxiv.org/abs/2312.01655
  • repo_url: None
  • paper_authors: Vinayak Sharma, Aviral Shrivastava
  • for: This work aims to improve the multi-class classification ability of quantum computers by enhancing the separation of data in Hilbert space.
  • methods: The method builds on Quantum Metric Learning (QMeL): a classical model learns the parameters of the polar form of each qubit, a shallow Parameterized Quantum Circuit (PQC) with $R_y$ and $R_z$ gates creates the state, and a trainable layer of $ZZ(\theta)$ gates learns entanglement.
  • results: Compared with QMeL approaches, the method achieves 3x better multi-class separation while using only half the number of gates and depth, and it also outperforms classical networks with similar configurations.
    Abstract Deep metric learning has recently shown extremely promising results in the classical data domain, creating well-separated feature spaces. This idea was also adapted to quantum computers via Quantum Metric Learning (QMeL). QMeL consists of a 2-step process with a classical model to compress the data to fit into the limited number of qubits, then train a Parameterized Quantum Circuit (PQC) to create better separation in Hilbert Space. However, on Noisy Intermediate Scale Quantum (NISQ) devices, QMeL solutions result in high circuit width and depth, both of which limit scalability. We propose Quantum Polar Metric Learning (QPMeL) that uses a classical model to learn the parameters of the polar form of a qubit. We then utilize a shallow PQC with $R_y$ and $R_z$ gates to create the state and a trainable layer of $ZZ(\theta)$-gates to learn entanglement. The circuit also computes fidelity via a SWAP Test for our proposed Fidelity Triplet Loss function, used to train both classical and quantum components. When compared to QMeL approaches, QPMeL achieves 3X better multi-class separation, while using only 1/2 the number of gates and depth. We also demonstrate that QPMeL outperforms classical networks with similar configurations, presenting a promising avenue for future research on fully classical models with quantum loss functions.
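
The polar-form encoding and the fidelity triplet loss can be simulated classically for a single qubit. In the sketch below, a qubit state is built from polar angles (Ry then Rz), fidelity replaces the hardware SWAP test, and a margin loss pulls same-class states together; the entangling ZZ layer and the classical network producing the angles are omitted.

```python
import numpy as np

def qubit_state(theta: float, phi: float) -> np.ndarray:
    """Single-qubit state from its polar form, Ry(theta) then Rz(phi) on |0>:
    |psi> = [cos(theta/2), e^{i phi} sin(theta/2)] up to global phase."""
    return np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])

def fidelity(a: np.ndarray, b: np.ndarray) -> float:
    """State fidelity |<a|b>|^2, the quantity a SWAP test estimates."""
    return abs(np.vdot(a, b)) ** 2

def fidelity_triplet_loss(anchor, positive, negative, margin=0.3) -> float:
    """Pull same-class states together, push different-class states apart."""
    f_ap = fidelity(qubit_state(*anchor), qubit_state(*positive))
    f_an = fidelity(qubit_state(*anchor), qubit_state(*negative))
    return max(0.0, margin + f_an - f_ap)

loss = fidelity_triplet_loss((0.2, 0.1), (0.25, 0.05), (2.8, 1.5))
```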

Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation

  • paper_url: http://arxiv.org/abs/2312.03003
  • repo_url: None
  • paper_authors: Sunjae Lee, Junyoung Choi, Jungjae Lee, Hojun Choi, Steven Y. Ko, Sangeun Oh, Insik Shin
  • for: This work aims to improve the reliability and efficiency of mobile task automation, using large language models (LLMs) to automate complex and repetitive tasks.
  • methods: The paper proposes MemoDroid, an LLM-based mobile task automator that emulates the human cognitive process of interacting with a mobile app (explore, select, derive, and recall), decomposing task procedures into smaller modular components that can be reused, rearranged, and adapted for various objectives.
  • results: Evaluated on 50 unique mobile tasks across 5 widely used mobile apps, MemoDroid adapts learned tasks to varying contexts with 100% accuracy and reduces latency and cost by 69.22% and 77.36% compared to a GPT-4 powered baseline.
    Abstract The advent of large language models (LLMs) has opened up new opportunities in the field of mobile task automation. Their superior language understanding and reasoning capabilities allow users to automate complex and repetitive tasks. However, due to the inherent unreliability and high operational cost of LLMs, their practical applicability is quite limited. To address these issues, this paper introduces MemoDroid, an innovative LLM-based mobile task automator enhanced with a unique app memory. MemoDroid emulates the cognitive process of humans interacting with a mobile app -- explore, select, derive, and recall. This approach allows for a more precise and efficient learning of a task's procedure by breaking it down into smaller, modular components that can be re-used, re-arranged, and adapted for various objectives. We implement MemoDroid using online LLMs services (GPT-3.5 and GPT-4) and evaluate its performance on 50 unique mobile tasks across 5 widely used mobile apps. The results indicate that MemoDroid can adapt learned tasks to varying contexts with 100% accuracy and reduces their latency and cost by 69.22% and 77.36% compared to a GPT-4 powered baseline.

Characterizing Large Language Model Geometry Solves Toxicity Detection and Generation

  • paper_url: http://arxiv.org/abs/2312.01648
  • repo_url: https://github.com/randallbalestriero/splinellm
  • paper_authors: Randall Balestriero, Romain Cosentino, Sarath Shekkizhar
  • for: This research aims to characterize the internal representations of large language models (LLMs) from a geometric perspective in order to provide practical and principled answers.
  • methods: The paper derives in closed form the intrinsic dimension in which Multi-Head Attention embeddings are constrained to exist, as well as the partition and per-region affine mappings of the per-layer feedforward networks.
  • results: The results show that Llama 2's RLHF can be bypassed by controlling the embedding's intrinsic dimension through informed prompt manipulation, and that 7 interpretable spline features extracted from each layer provide a rich abstract representation sufficient for toxicity detection, inferring the domain of a prompt, and tackling the Jigsaw challenge of characterizing prompt toxicity.
    Abstract Large Language Models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations, e.g., how to extract a few informative features to solve various downstream tasks. To provide a practical and principled answer, we propose to characterize LLMs from a geometric perspective. We obtain in closed form (i) the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and (ii) the partition and per-region affine mappings of the per-layer feedforward networks. Our results are informative, do not rely on approximations, and are actionable. First, we show that, motivated by our geometric interpretation, we can bypass Llama 2's RLHF by controlling its embedding's intrinsic dimension through informed prompt manipulation. Second, we derive 7 interpretable spline features that can be extracted from any (pre-trained) LLM layer, providing a rich abstract representation of their inputs. Those features alone (224 for Mistral-7B and Llama 2-7B) are sufficient to help solve toxicity detection, infer the domain of the prompt, and even tackle the Jigsaw challenge, which aims at characterizing the type of toxicity of various prompts. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in language models. Code: https://github.com/RandallBalestriero/SplineLLM.

Quality Diversity in the Amorphous Fortress (QD-AF): Evolving for Complexity in 0-Player Games

  • paper_url: http://arxiv.org/abs/2312.02231
  • repo_url: None
  • paper_authors: Sam Earle, M Charity, Dipika Rajesh, Mayu Wilson, Julian Togelius
  • for: This paper aims to generate diverse environments for training and testing learning algorithms.
  • methods: It uses quality diversity evolutionary search, recombining FSM nodes and edges to control the behavior of agents in the fortress grid-world.
  • results: The generated environments exhibit both competitive and cooperative multi-agent and multi-species survival dynamics, measured through agents' FSM architectures, activations, and collective behaviors; the generated worlds can collectively serve as training and testing grounds for learning algorithms.
    Abstract We explore the generation of diverse environments using the Amorphous Fortress (AF) simulation framework. AF defines a set of Finite State Machine (FSM) nodes and edges that can be recombined to control the behavior of agents in the `fortress' grid-world. The behaviors and conditions of the agents within the framework are designed to capture the common building blocks of multi-agent artificial life and reinforcement learning environments. Using quality diversity evolutionary search, we generate diverse sets of environments. These environments exhibit certain types of complexity according to measures of agents' FSM architectures and activations, and collective behaviors. Our approach, Quality Diversity in Amorphous Fortress (QD-AF) generates families of 0-player games akin to simplistic ecological models, and we identify the emergence of both competitive and co-operative multi-agent and multi-species survival dynamics. We argue that these generated worlds can collectively serve as training and testing grounds for learning algorithms.
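
The search itself follows the usual quality-diversity recipe: an archive keeps the best genome per behavior bin. The sketch below runs a MAP-Elites-style loop over toy FSM genomes; the fitness function, behavior descriptor, and mutation operator are stand-ins, since the fortress simulation is not reproduced here.

```python
import random

ACTIONS, CONDS = ["move", "eat", "attack", "spawn"], ["near_food", "near_agent"]

def random_edge():
    return (random.randint(0, 3), random.choice(CONDS), random.choice(ACTIONS))

def evaluate(fsm):
    """Stand-in for running the Amorphous Fortress simulation."""
    fitness = len(set(e[2] for e in fsm))     # proxy for behavioral complexity
    descriptor = (len(fsm) // 2, fsm[0][0])   # discretized behavior bin
    return fitness, descriptor

archive = {}                                  # behavior bin -> (fitness, fsm)
for _ in range(1000):
    if archive:
        parent = random.choice(list(archive.values()))[1]
        fsm = parent[:-1] + [random_edge()]   # mutate: replace one edge
    else:
        fsm = [random_edge() for _ in range(random.randint(2, 8))]
    fit, desc = evaluate(fsm)
    if desc not in archive or fit > archive[desc][0]:
        archive[desc] = (fit, fsm)            # keep the elite per bin

print(len(archive), "diverse elite environments")
```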

GVFs in the Real World: Making Predictions Online for Water Treatment

  • paper_url: http://arxiv.org/abs/2312.01624
  • repo_url: None
  • paper_authors: Muhammad Kamran Janjua, Haseeb Shah, Martha White, Erfan Miahi, Marlos C. Machado, Adam White
  • for: This paper investigates reinforcement-learning-based prediction methods for a real drinking-water treatment plant; developing such a prediction system is a critical step toward optimizing and automating water treatment.
  • methods: The paper uses General Value Function (GVF) predictions learned with temporal-difference (TD) algorithms to tackle the prediction problems posed by the plant's data.
  • results: The study finds that the GVF prediction agent obtains a lower normalized mean-squared error than n-step prediction, and that learning online lets the predictive model adapt to the continually changing system.
    Abstract In this paper we investigate the use of reinforcement-learning based prediction approaches for a real drinking-water treatment plant. Developing such a prediction system is a critical step on the path to optimizing and automating water treatment. Before that, there are many questions to answer about the predictability of the data, suitable neural network architectures, how to overcome partial observability and more. We first describe this dataset, and highlight challenges with seasonality, nonstationarity, partial observability, and heterogeneity across sensors and operation modes of the plant. We then describe General Value Function (GVF) predictions -- discounted cumulative sums of observations -- and highlight why they might be preferable to classical n-step predictions common in time series prediction. We discuss how to use offline data to appropriately pre-train our temporal difference learning (TD) agents that learn these GVF predictions, including how to select hyperparameters for online fine-tuning in deployment. We find that the TD-prediction agent obtains an overall lower normalized mean-squared error than the n-step prediction agent. Finally, we show the importance of learning in deployment, by comparing a TD agent trained purely offline with no online updating to a TD agent that learns online. This final result is one of the first to motivate the importance of adapting predictions in real-time, for non-stationary high-volume systems in the real world.
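
A GVF prediction is a discounted cumulative sum of a chosen cumulant, learned online with TD methods. The sketch below performs textbook linear TD(0) updates on a toy feature stream; the features, cumulant, and step sizes are illustrative assumptions rather than the paper's plant setup.

```python
import numpy as np

def td_gvf_update(w, x, x_next, cumulant, gamma=0.99, alpha=0.01):
    """One online TD(0) update for a linear GVF: v(x) = w . x approximates
    the discounted cumulative sum of future cumulants."""
    delta = cumulant + gamma * np.dot(w, x_next) - np.dot(w, x)  # TD error
    return w + alpha * delta * x

# toy stream: predict the discounted future value of one noisy sensor channel
rng = np.random.default_rng(1)
w = np.zeros(8)
x = rng.random(8)
for _ in range(10000):
    x_next = rng.random(8)
    w = td_gvf_update(w, x, x_next, cumulant=x[0])
    x = x_next
```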

xNeuSM: Explainable Neural Subgraph Matching with Graph Learnable Multi-hop Attention Networks

  • paper_url: http://arxiv.org/abs/2312.01612
  • repo_url: https://github.com/martinakaduc/xneusm
  • paper_authors: Duc Q. Nguyen, Thanh Toan Nguyen, Tho Quan
  • for: This paper is written for researchers and practitioners working with graph-based data and subgraph matching, particularly in the fields of database systems, biochemistry, and cognitive science.
  • methods: The paper proposes a new method called xNeuSM, which uses Graph Learnable Multi-hop Attention Networks (GLeMA) to adaptively learn the attention factor decay for each node across hops, rather than relying on fixed hyperparameters.
  • results: The paper reports substantial improvements in prediction accuracy of up to 34% compared to approximate baselines, and at least a seven-fold faster query time than exact algorithms, in empirical evaluations on real-world datasets.
    Abstract Subgraph matching is a challenging problem with a wide range of applications in database systems, biochemistry, and cognitive science. It involves determining whether a given query graph is present within a larger target graph. Traditional graph-matching algorithms provide precise results but face challenges in large graph instances due to the NP-complete problem, limiting their practical applicability. In contrast, recent neural network-based approximations offer more scalable solutions, but often lack interpretable node correspondences. To address these limitations, this article presents xNeuSM: Explainable Neural Subgraph Matching which introduces Graph Learnable Multi-hop Attention Networks (GLeMA) that adaptively learns the parameters governing the attention factor decay for each node across hops rather than relying on fixed hyperparameters. We provide a theoretical analysis establishing error bounds for GLeMA's approximation of multi-hop attention as a function of the number of hops. Additionally, we prove that learning distinct attention decay factors for each node leads to a correct approximation of multi-hop attention. Empirical evaluation on real-world datasets shows that xNeuSM achieves substantial improvements in prediction accuracy of up to 34% compared to approximate baselines and, notably, at least a seven-fold faster query time than exact algorithms. The source code of our implementation is available at https://github.com/martinakaduc/xNeuSM.
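
The distinguishing idea, a per-node attention decay that is learned rather than fixed, is easy to state in matrix form. The sketch below combines powers of a 1-hop attention matrix, weighting node i's k-hop term by decay[i]**k; it illustrates the decay mechanism only, not GLeMA's full architecture.

```python
import numpy as np

def multi_hop_attention(A: np.ndarray, decay: np.ndarray, K: int = 3) -> np.ndarray:
    """Combine 1..K hop attention with node-specific decay factors.
    A: row-normalized (n, n) 1-hop attention; decay: (n,) in (0, 1)."""
    out, A_k = np.zeros_like(A), np.eye(A.shape[0])
    for k in range(1, K + 1):
        A_k = A_k @ A                              # k-hop attention
        out += (decay[:, None] ** k) * A_k         # per-node decay, not global
    return out / out.sum(axis=1, keepdims=True)    # renormalize rows

n = 5
A = np.random.rand(n, n)
A /= A.sum(axis=1, keepdims=True)
mha = multi_hop_attention(A, decay=np.random.rand(n))
```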

A Simple and Scalable Representation for Graph Generation

  • paper_url: http://arxiv.org/abs/2312.02230
  • repo_url: https://github.com/djdprogramming/adfa2
  • paper_authors: Yunhui Jang, Seul Lee, Sungsoo Ahn
  • for: This research proposes a new, simple, and scalable graph representation for graph generation, addressing the limitations of existing methods on large-scale graphs.
  • methods: The method introduces the gap encoded edge list (GEEL), a representation whose size aligns with the number of edges rather than the number of nodes, and further reduces the vocabulary size through gap encoding and bandwidth restriction schemes.
  • results: Adopting this compact representation improves both scalability and performance by simplifying the graph generation process, as demonstrated on ten non-attributed and two molecular graph generation tasks.
    Abstract Recently, there has been a surge of interest in employing neural networks for graph generation, a fundamental statistical learning problem with critical applications like molecule design and community analysis. However, most approaches encounter significant limitations when generating large-scale graphs. This is due to their requirement to output the full adjacency matrices whose size grows quadratically with the number of nodes. In response to this challenge, we introduce a new, simple, and scalable graph representation named gap encoded edge list (GEEL) that has a small representation size that aligns with the number of edges. In addition, GEEL significantly reduces the vocabulary size by incorporating the gap encoding and bandwidth restriction schemes. GEEL can be autoregressively generated with the incorporation of node positional encoding, and we further extend GEEL to deal with attributed graphs by designing a new grammar. Our findings reveal that the adoption of this compact representation not only enhances scalability but also bolsters performance by simplifying the graph generation process. We conduct a comprehensive evaluation across ten non-attributed and two molecular graph generation tasks, demonstrating the effectiveness of GEEL.
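
To see why a gap encoding shrinks the vocabulary, consider one plausible scheme (the exact encoding and bandwidth restriction GEEL uses are defined in the paper): sort the edge list and store small deltas instead of absolute node ids, so token values scale with typical gaps rather than the node count.

```python
def gap_encode(edge_list):
    """Replace absolute node ids with (inter-edge gap, intra-edge gap)."""
    encoded, prev = [], (0, 0)
    for s, t in sorted(edge_list):
        encoded.append((s - prev[0], s - t))
        prev = (s, t)
    return encoded

def gap_decode(encoded):
    edges, s = [], 0
    for ds, dt in encoded:
        s += ds
        edges.append((s, s - dt))
    return edges

e = [(0, 1), (1, 0), (2, 1), (2, 0)]
assert gap_decode(gap_encode(e)) == sorted(e)  # lossless round trip
```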

Local-Global History-aware Contrastive Learning for Temporal Knowledge Graph Reasoning

  • paper_url: http://arxiv.org/abs/2312.01601
  • repo_url: None
  • paper_authors: Wei Chen, Huaiyu Wan, Yuting Wu, Shuyuan Zhao, Jiayaqi Cheng, Yuxin Li, Youfang Lin
  • for: This research aims to improve forecasting in temporal knowledge graphs (TKGs), particularly predicting unknown facts occurring in the future.
  • methods: The paper proposes the Local-global history-aware Contrastive Learning (LogCL) model, which uses contrastive learning to better guide the fusion of local and global historical information and to improve robustness to noise.
  • results: Experimental results show that LogCL delivers better and more robust performance than baseline models across four benchmark datasets.
    Abstract Temporal knowledge graphs (TKGs) have been identified as a promising approach to represent the dynamics of facts along the timeline. The extrapolation of TKG is to predict unknowable facts happening in the future, holding significant practical value across diverse fields. Most extrapolation studies in TKGs focus on modeling global historical fact repeating and cyclic patterns, as well as local historical adjacent fact evolution patterns, showing promising performance in predicting future unknown facts. Yet, existing methods still face two major challenges: (1) They usually neglect the importance of historical information in KG snapshots related to the queries when encoding the local and global historical information; (2) They exhibit weak anti-noise capabilities, which hinders their performance when the inputs are contaminated with noise. To this end, we propose a novel Local-global history-aware Contrastive Learning model (LogCL) for TKG reasoning, which adopts contrastive learning to better guide the fusion of local and global historical information and enhance the ability to resist interference. Specifically, for the first challenge, LogCL proposes an entity-aware attention mechanism applied to the local and global historical facts encoder, which captures the key historical information related to queries. For the latter issue, LogCL designs four historical query contrast patterns, effectively improving the robustness of the model. The experimental results on four benchmark datasets demonstrate that LogCL delivers better and more robust performance than the state-of-the-art baselines.
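
The contrastive component can be sketched with a standard InfoNCE objective: for each query, its local-history representation should match the global-history representation of the same query and mismatch those of other queries in the batch. The function below is a generic sketch under that assumption, not LogCL's exact loss.

```python
import torch
import torch.nn.functional as F

def local_global_nce(z_local: torch.Tensor, z_global: torch.Tensor, tau: float = 0.1):
    """InfoNCE between local and global query representations, (batch, dim)."""
    z_l = F.normalize(z_local, dim=-1)
    z_g = F.normalize(z_global, dim=-1)
    logits = z_l @ z_g.t() / tau              # pairwise similarities
    targets = torch.arange(z_l.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = local_global_nce(torch.randn(32, 128), torch.randn(32, 128))
```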

Synthetic Data Generation Techniques for Developing AI-based Speech Assessments for Parkinson’s Disease (A Comparative Study)

  • paper_url: http://arxiv.org/abs/2312.02229
  • repo_url: None
  • paper_authors: Mahboobeh Parsapoor
  • for: This research aims to improve the accuracy of AI-based speech assessment systems for the early diagnosis of Parkinson's disease (PD).
  • methods: The study uses deep learning-based data generation techniques to improve the accuracy of the machine learning classifiers at the core of such systems.
  • results: The study finds that deep learning-based data generation can improve classifier accuracy and thereby the diagnostic precision of AI-based speech assessments.
    Abstract Changes in speech and language are among the first signs of Parkinson's disease (PD). Thus, clinicians have tried to identify individuals with PD from their voices for years. Doctors can leverage AI-based speech assessments to spot PD thanks to advancements in artificial intelligence (AI). Such AI systems can be developed using machine learning classifiers that have been trained using individuals' voices. Although several studies have shown reasonable results in developing such AI systems, these systems would need more data samples to achieve promising performance. This paper explores using deep learning-based data generation techniques on the accuracy of machine learning classifiers that are the core of such systems.

OCGEC: One-class Graph Embedding Classification for DNN Backdoor Detection

  • paper_url: http://arxiv.org/abs/2312.01585
  • repo_url: https://github.com/jhy549/ocgec
  • paper_authors: Haoyu Jiang, Haiyang Yu, Nan Li, Ping Yi
  • for: This work proposes a novel one-class classification framework for model-level detection of backdoor attacks on deep neural networks (DNNs), requiring only a small amount of clean data for training.
  • methods: The framework uses graph neural networks (GNNs) to convert a model's structural details and weight features into graph data, then pre-trains a generative self-supervised graph autoencoder (GAE) to better learn the features of benign models, so that backdoor models can be detected without knowledge of the attack strategy.
  • results: Across multiple tasks, OCGEC achieves AUC scores above 98%, exceeding existing methods even when they rely on large numbers of positive and negative training samples; it also offers new perspectives for improving other backdoor defense tasks.
    Abstract Deep neural networks (DNNs) have been found vulnerable to backdoor attacks, raising security concerns about their deployment in mission-critical applications. There are various approaches to detect backdoor attacks; however, they all make certain assumptions about the target attack to be detected and require equal and huge numbers of clean and backdoor samples for training, which renders these detection methods quite limited in real-world circumstances. This study proposes a novel one-class classification framework called One-class Graph Embedding Classification (OCGEC) that uses GNNs for model-level backdoor detection with only a small amount of clean data. First, we train thousands of tiny models as raw datasets from a small number of clean datasets. Following that, we design an ingenious model-to-graph method for converting each model's structural details and weight features into graph data. We then pre-train a generative self-supervised graph autoencoder (GAE) to better learn the features of benign models, in order to detect backdoor models without knowing the attack strategy. After that, we dynamically combine the GAE and one-class classifier optimization goals to form classification boundaries that distinguish backdoor models from benign models. Our OCGEC combines the powerful representation capabilities of graph neural networks with the utility of one-class classification techniques in the field of anomaly detection. In comparison to other baselines, it achieves AUC scores of more than 98% on a number of tasks, far exceeding existing detection methods even when they rely on a huge number of positive and negative samples. Our pioneering application of graph-based scenarios for generic backdoor detection can provide new insights for improving other backdoor defense tasks. Code is available at https://github.com/jhy549/OCGEC.
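The "model-to-graph" step can be illustrated with a toy conversion, assuming a small MLP: each neuron becomes a node and each weight a directed, weighted edge. This is only a rough sketch of the idea; OCGEC's actual structural and weight feature extraction is richer.

```python
import torch
import torch.nn as nn

def mlp_to_graph(model: nn.Sequential):
    """Convert an MLP's structure and weights into graph tensors
    (one node per neuron, one weighted edge per connection). A rough
    illustration of a model-to-graph step, not OCGEC's exact method."""
    edge_src, edge_dst, edge_w = [], [], []
    offset = 0
    for layer in model:
        if not isinstance(layer, nn.Linear):
            continue
        w = layer.weight.detach()          # shape (out, in)
        n_in, n_out = w.shape[1], w.shape[0]
        for j in range(n_out):
            for i in range(n_in):
                edge_src.append(offset + i)           # input neuron
                edge_dst.append(offset + n_in + j)    # output neuron
                edge_w.append(w[j, i].item())
        offset += n_in
    edge_index = torch.tensor([edge_src, edge_dst])
    return edge_index, torch.tensor(edge_w)

tiny = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
edge_index, edge_weight = mlp_to_graph(tiny)  # feed to a GNN/GAE encoder
```

A GAE pre-trained on many such graphs from benign models can then score an unseen model by reconstruction error, with the one-class boundary separating high-error (suspect) models from benign ones.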

Signed Binarization: Unlocking Efficiency Through Repetition-Sparsity Trade-Off

  • paper_url: http://arxiv.org/abs/2312.01581
  • repo_url: None
  • paper_authors: Sachit Kuhar, Yash Jain, Alexey Tumanov
  • for: This paper proposes an efficient method for deep neural network (DNN) inference on resource-constrained edge devices.
  • methods: It combines quantization and sparsity, mapping network weights to signed binary values, together with representation learning techniques and hardware-software co-design, to exploit the repetition-sparsity trade-off.
  • results: The method is more accurate than binarization with the same number of non-zero weights, and achieves a 26% speedup, doubled energy efficiency, and a 2.8x density reduction.
    Abstract Efficient inference of Deep Neural Networks (DNNs) on resource-constrained edge devices is essential. Quantization and sparsity are key algorithmic techniques that translate to repetition and sparsity within tensors at the hardware-software interface. This paper introduces the concept of repetition-sparsity trade-off that helps explain computational efficiency during inference. We propose Signed Binarization, a unified co-design framework that synergistically integrates hardware-software systems, quantization functions, and representation learning techniques to address this trade-off. Our results demonstrate that Signed Binarization is more accurate than binarization with the same number of non-zero weights. Detailed analysis indicates that signed binarization generates a smaller distribution of effectual (non-zero) parameters nested within a larger distribution of total parameters, both of the same type, for a DNN block. Finally, our approach achieves a 26% speedup on real hardware, doubles energy efficiency, and reduces density by 2.8x compared to binary methods for ResNet 18, presenting an alternative solution for deploying efficient models in resource-limited environments.
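A minimal sketch of what a signed, sparsity-inducing quantizer could look like follows; the magnitude threshold and the single shared scale are assumptions for illustration, not the paper's co-designed quantization function.

```python
import torch

def signed_binarize(w: torch.Tensor, sparsity=0.5):
    """Illustrative signed quantizer: small-magnitude weights become 0
    (sparsity), the rest become +/-alpha (repetition). The thresholding
    rule and scale are assumptions; the paper co-designs quantization
    with hardware-software systems and representation learning."""
    k = int(sparsity * w.numel())
    thresh = w.abs().flatten().kthvalue(k).values if k > 0 else w.new_tensor(0.)
    mask = (w.abs() > thresh).float()
    # One shared scale: mean magnitude of the weights that survive.
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1)
    return alpha * torch.sign(w) * mask

w = torch.randn(64, 64)
wq = signed_binarize(w, sparsity=0.5)   # values in {-alpha, 0, +alpha}
```

The trade-off the paper names is visible here: more zeros means more sparsity the hardware can skip, while fewer distinct non-zero values means more repetition it can reuse.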

How to Configure Good In-Context Sequence for Visual Question Answering

  • paper_url: http://arxiv.org/abs/2312.01571
  • repo_url: https://github.com/garyjiajia/ofv2_icl_vqa
  • paper_authors: Li Li, Jiawei Peng, Huiyi Chen, Chongyang Gao, Xu Yang
  • for: This work aims to improve the In-Context Learning (ICL) performance of Large Vision-Language Models (LVLMs) and to gain a deeper understanding of their inner properties.
  • methods: Using Visual Question Answering (VQA) as a case study, the authors explore diverse in-context configurations to find powerful ones, designing several retrieval methods and applying different strategies to manipulate the retrieved demonstrations.
  • results: Extensive experiments on three VQA datasets (VQAv2, VizWiz, OK-VQA) uncover three important inner properties of the applied LVLM and show which strategies consistently improve ICL VQA performance.
    Abstract Inspired by the success of Large Language Models in dealing with new tasks via In-Context Learning (ICL) in NLP, researchers have also developed Large Vision-Language Models (LVLMs) with ICL capabilities. However, when implementing ICL using these LVLMs, researchers usually resort to the simplest way like random sampling to configure the in-context sequence, thus leading to sub-optimal results. To enhance the ICL performance, in this study, we use Visual Question Answering (VQA) as case study to explore diverse in-context configurations to find the powerful ones. Additionally, through observing the changes of the LVLM outputs by altering the in-context sequence, we gain insights into the inner properties of LVLMs, improving our understanding of them. Specifically, to explore in-context configurations, we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations. Through exhaustive experiments on three VQA datasets: VQAv2, VizWiz, and OK-VQA, we uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance. Our code is provided in: https://github.com/GaryJiajia/OFv2_ICL_VQA.
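One of the retrieval strategies such a study compares, similarity-based demonstration selection, can be sketched as follows, assuming demonstration and query embeddings are precomputed (e.g. with CLIP image or text encoders); the function names and the choice of cosine similarity are illustrative.

```python
import numpy as np

def retrieve_demos(query_emb, pool_embs, pool_items, k=4):
    """Pick the k most similar demonstrations for an ICL prompt by
    cosine similarity -- one alternative to the random sampling
    baseline. Embeddings are assumed precomputed."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    top = np.argsort(-sims)[:k]
    return [pool_items[i] for i in top]

pool = [f"demo_{i}" for i in range(100)]       # (image, question, answer) triples in practice
embs = np.random.randn(100, 512)
demos = retrieve_demos(np.random.randn(512), embs, pool, k=4)
```

The retrieved demonstrations are then concatenated ahead of the query, and the paper's manipulation strategies (e.g. reordering or editing the demos) operate on this list before prompting the LVLM.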

APoLLo: Unified Adapter and Prompt Learning for Vision Language Models

  • paper_url: http://arxiv.org/abs/2312.01564
  • repo_url: None
  • paper_authors: Sanjoy Chowdhury, Sayan Nag, Dinesh Manocha
  • for: Improving the generalization of vision-language models fine-tuned in few-shot settings.
  • methods: Combining adapter and prompt learning, and strengthening the alignment between the vision and language modalities, to improve the generalization of vision-language models.
  • results: Across three representative tasks, achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
    Abstract The choice of input text prompt plays a critical role in the performance of Vision-Language Pretrained (VLP) models such as CLIP. We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models. Our method is designed to substantially improve the generalization capabilities of VLP models when they are fine-tuned in a few-shot setting. We introduce trainable cross-attention-based adapter layers in conjunction with vision and language encoders to strengthen the alignment between the two modalities. We enforce consistency between the respective encoder branches (receiving augmented inputs) to prevent overfitting in downstream tasks. Our method is evaluated on three representative tasks: generalization to novel classes, cross-dataset evaluation, and unseen domain shifts. In practice, APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
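The cross-attention-based adapter idea can be sketched as a small trainable module between frozen encoder branches. The dimensions, residual placement, and projection layer below are assumptions; APoLLo additionally couples this with prompt learning and a consistency objective over augmented inputs.

```python
import torch
import torch.nn as nn

class CrossAttnAdapter(nn.Module):
    """Sketch of a trainable cross-attention adapter between frozen
    vision and language encoder branches. Only illustrative of the
    adapter component described in the abstract."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_tokens, y_tokens):
        # x attends to the other modality's tokens; the residual keeps
        # the frozen backbone's features dominant.
        attended, _ = self.attn(x_tokens, y_tokens, y_tokens)
        return x_tokens + self.proj(attended)

adapter = CrossAttnAdapter()
img_tokens = torch.randn(2, 50, 512)   # frozen vision encoder outputs
txt_tokens = torch.randn(2, 16, 512)   # frozen language encoder outputs
fused = adapter(img_tokens, txt_tokens)   # (2, 50, 512)
```

Because only the adapter (and prompts) receive gradients, the number of trainable parameters stays small, which is what makes the few-shot fine-tuning setting tractable.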

Explainable AI is Responsible AI: How Explainability Creates Trustworthy and Socially Responsible Artificial Intelligence

  • paper_url: http://arxiv.org/abs/2312.01555
  • repo_url: None
  • paper_authors: Stephanie Baker, Wei Xiang
  • for: This work examines the development and deployment of responsible artificial intelligence (AI), emphasizing trustworthy AI systems that minimize bias, protect privacy, support security, and enhance transparency and accountability.
  • methods: The paper analyzes state-of-the-art literature on responsible AI (RAI) and explainable AI (XAI) technologies, demonstrating that XAI can be used to ensure fairness, robustness, privacy, security, and transparency in a wide range of contexts.
  • results: The study concludes that XAI is an essential foundation for every pillar of RAI, providing a critical tool for developing trustworthy AI systems.
    Abstract Artificial intelligence (AI) has been clearly established as a technology with the potential to revolutionize fields from healthcare to finance - if developed and deployed responsibly. This is the topic of responsible AI, which emphasizes the need to develop trustworthy AI systems that minimize bias, protect privacy, support security, and enhance transparency and accountability. Explainable AI (XAI) has been broadly considered as a building block for responsible AI (RAI), with most of the literature considering it as a solution for improved transparency. This work proposes that XAI and responsible AI are significantly more deeply entwined. In this work, we explore state-of-the-art literature on RAI and XAI technologies. Based on our findings, we demonstrate that XAI can be utilized to ensure fairness, robustness, privacy, security, and transparency in a wide range of contexts. Our findings lead us to conclude that XAI is an essential foundation for every pillar of RAI.

The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

  • paper_url: http://arxiv.org/abs/2312.01552
  • repo_url: None
  • paper_authors: Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi
  • for: This work investigates what alignment tuning actually changes in base LLMs, and whether alignment can be achieved without fine-tuning or human feedback.
  • methods: The study examines alignment via supervised fine-tuning on as few as 1K examples and preference tuning via human feedback, and proposes URIAL, a simple, tuning-free alignment method based on in-context learning (ICL).
  • results: Alignment via fine-tuning or human feedback substantially improves performance; however, with strategic prompting and ICL, the tuning-free method reaches a comparable level of performance, supporting the view that alignment tuning is largely superficial.
    Abstract The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al. 2023), shows that using merely 1K examples for SFT can achieve significant alignment performance as well, suggesting that the effect of alignment tuning might be "superficial." This raises questions about how exactly the alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterpart. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions. Most distribution shifts occur with stylistic tokens. These direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and results with URIAL suggest that deeper analysis and theoretical understanding of alignment is crucial to future LLM research.
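A URIAL-style prompt is easy to sketch: a system-style preamble plus a handful of constant stylistic (instruction, response) pairs, prepended to the user query and fed to a base (untuned) LLM. The wording below is invented for illustration; the paper's actual templates differ.

```python
# Minimal sketch of tuning-free alignment via in-context learning.
# All example text is hypothetical, not the paper's real templates.
STYLISTIC_EXAMPLES = [
    ("What is the boiling point of water?",
     "At standard atmospheric pressure, water boils at 100°C (212°F)."),
    ("Give me one tip for writing clearly.",
     "Prefer short sentences: one idea per sentence is easier to follow."),
    ("How do I reverse a list in Python?",
     "Use slicing: `my_list[::-1]` returns a reversed copy."),
]

def build_urial_prompt(query: str) -> str:
    parts = ["Below are conversations between a user and a helpful, "
             "honest assistant.\n"]
    for inst, resp in STYLISTIC_EXAMPLES:
        parts.append(f"# User:\n{inst}\n# Assistant:\n{resp}\n")
    parts.append(f"# User:\n{query}\n# Assistant:\n")
    return "\n".join(parts)

print(build_urial_prompt("Explain what alignment tuning does."))
```

This matches the paper's finding: if alignment mostly shifts stylistic tokens, then a few constant stylistic demonstrations can supply that shift at inference time, with no gradient updates at all.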

KEEC: Embed to Control on An Equivariant Geometry

  • paper_url: http://arxiv.org/abs/2312.01544
  • repo_url: None
  • paper_authors: Xiaoyuan Cheng, Yiming Yang, Wei Jiang, Yukun Hu
  • for: This work investigates how representation learning can enable optimal control of unknown and complex dynamical systems, such as chaotic and nonlinear ones, without relying on prior knowledge of the dynamics.
  • methods: It proposes Koopman Embed to Equivariant Control (KEEC), which, drawing on Lie theory, learns the nonlinear dynamics defined on a manifold and performs optimal control on the corresponding equivariant geometry.
  • results: Isometric and isomorphic loss functions, which ensure the compactness and smoothness of the geometry, outperform loss functions lacking these properties, and the method achieves quadratic convergence for the optimal equivariant value function.
    Abstract This paper investigates how representation learning can enable optimal control in unknown and complex dynamics, such as chaotic and non-linear systems, without relying on prior domain knowledge of the dynamics. The core idea is to establish an equivariant geometry that is diffeomorphic to the manifold defined by a dynamical system and to perform optimal control within this corresponding geometry, which is a non-trivial task. To address this challenge, Koopman Embed to Equivariant Control (KEEC) is introduced for model learning and control. Inspired by Lie theory, KEEC begins by learning a non-linear dynamical system defined on a manifold and embedding trajectories into a Lie group. Subsequently, KEEC formulates an equivariant value function equation in reinforcement learning on the equivariant geometry, ensuring an invariant effect as the value function on the original manifold. By deriving analytical-form optimal actions on the equivariant value function, KEEC theoretically achieves quadratic convergence for the optimal equivariant value function by leveraging the differential information on the equivariant geometry. The effectiveness of KEEC is demonstrated in challenging dynamical systems, including chaotic ones like Lorenz-63. Notably, our findings indicate that isometric and isomorphic loss functions, ensuring the compactness and smoothness of geometry, outperform loss functions without these properties.
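The Koopman-embedding idea at the heart of KEEC, a nonlinear encoder into a latent space where the dynamics act (approximately) linearly, can be caricatured as below. The architecture and loss are assumptions; KEEC's equivariant geometry, value function equation, and analytical optimal actions are not captured here.

```python
import torch
import torch.nn as nn

class KoopmanEmbedding(nn.Module):
    """Toy Koopman-style embedding: a nonlinear encoder maps states to a
    latent space where one step of the dynamics is a fixed linear map.
    Only a caricature of the embedding step described in the abstract."""
    def __init__(self, d_state=3, d_latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_state, 64), nn.Tanh(),
                                 nn.Linear(64, d_latent))
        self.K = nn.Linear(d_latent, d_latent, bias=False)  # linear latent dynamics

    def prediction_loss(self, x_t, x_next):
        z_t, z_next = self.enc(x_t), self.enc(x_next)
        return ((self.K(z_t) - z_next) ** 2).mean()

model = KoopmanEmbedding()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# Stand-in (state, next-state) pairs; in practice these come from
# observed trajectories, e.g. of the Lorenz-63 system.
x_t, x_next = torch.randn(256, 3), torch.randn(256, 3)
loss = model.prediction_loss(x_t, x_next)
loss.backward(); opt.step()
```

On top of such an embedding, KEEC's contribution is to impose an equivariant geometry on the latent space and exploit its differential structure so that optimal actions have analytical form and the value function converges quadratically.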