results: Comparative experiments show that the proposed method achieves a marked improvement in average recognition accuracy on two public datasets (2a and 2b), reaching 74.61% and 81.19%, respectively. Compared with the baseline algorithm (FBCSP), the proposed algorithm improves accuracy by 11.44% and 7.11% on the two datasets.
Abstract
As a typical self-paced brain-computer interface (BCI) system, the motor imagery (MI) BCI has been widely applied in fields such as robot control, stroke rehabilitation, and assistance for patients with stroke or spinal cord injury. Many studies have focused on the traditional spatial filters obtained through the common spatial pattern (CSP) method. However, the CSP method can only obtain fixed spatial filters for specific input signals. Besides, the CSP method only focuses on the variance difference between two types of electroencephalogram (EEG) signals, so its ability to decode EEG signals is limited. To obtain more effective spatial filters for better extraction of spatial features and improved classification of MI-EEG, this paper proposes an adaptive spatial filter solving method based on the particle swarm optimization (PSO) algorithm. A training and testing framework based on a filter bank and spatial filters (FBCSP-ASP) is designed for MI-EEG signal classification. Comparative experiments conducted on two public datasets (2a and 2b) from BCI Competition IV show the outstanding average recognition accuracy of FBCSP-ASP. The proposed method achieves significant performance improvement on MI-BCI: its classification accuracy reaches 74.61% and 81.19% on datasets 2a and 2b, respectively. Compared with the baseline algorithm (FBCSP), the proposed algorithm improves accuracy by 11.44% and 7.11% on the two datasets, respectively. Furthermore, analyses based on mutual information, t-SNE, and Shapley values further prove that ASP features have excellent decoding ability for MI-EEG signals, and explain the improvement in classification performance brought by the introduction of ASP features.
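For readers unfamiliar with how an evolutionary search can replace the closed-form CSP solution, the sketch below uses a plain particle swarm to optimize a single spatial filter against a CSP-style variance-ratio objective on synthetic class covariances. It is only an illustration of the idea: the covariances, swarm hyperparameters, and objective are assumptions, and the paper's FBCSP-ASP pipeline additionally applies this per filter-bank sub-band and feeds the resulting features into a classifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class EEG covariance matrices (stand-ins for real band-passed trials).
n_ch = 8
A1, A2 = rng.standard_normal((n_ch, n_ch)), rng.standard_normal((n_ch, n_ch))
C1 = A1 @ A1.T + np.eye(n_ch)          # class-1 average covariance
C2 = A2 @ A2.T + np.eye(n_ch)          # class-2 average covariance

def fitness(w):
    """CSP-like objective: variance ratio of the filtered signal between classes."""
    w = w / (np.linalg.norm(w) + 1e-12)
    return (w @ C1 @ w) / (w @ C2 @ w + 1e-12)

# Plain PSO over the filter coefficients.
n_particles, n_iters = 30, 200
pos = rng.standard_normal((n_particles, n_ch))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[np.argmax(pbest_val)].copy()

for _ in range(n_iters):
    r1, r2 = rng.random((2, n_particles, 1))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([fitness(p) for p in pos])
    improved = vals > pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[np.argmax(pbest_val)].copy()

print("best variance ratio:", fitness(gbest))
```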
Enhancing Motor Imagery Decoding in Brain Computer Interfaces using Riemann Tangent Space Mapping and Cross Frequency Coupling
paper_authors: Xiong Xiong, Li Su, Jinguo Huang, Guixia Kang
for: This paper aims to improve the representation and decoding of motor imagery (MI) features.
methods: The paper proposes a new method based on Riemannian geometry and Cross-Frequency Coupling (CFC), termed Riemann Tangent Space Mapping using Dichotomous Filter Bank with Convolutional Neural Network (DFBRTS), to enhance the representation quality and decoding capability of MI features. DFBRTS first refines EEG signals with a dichotomous filter bank structured as a complete binary tree, then applies Riemann Tangent Space Mapping to extract salient EEG features within each sub-band. Finally, a lightweight convolutional neural network performs further feature extraction and classification under the joint supervision of cross-entropy and center loss.
results: Extensive experiments on the BCI Competition IV 2a (BCIC-IV-2a) dataset and the OpenBMI dataset show that DFBRTS clearly outperforms existing benchmarks, achieving 78.16% accuracy for four-class and 71.58% for two-class hold-out classification.
Abstract
Objective: Motor Imagery (MI) serves as a crucial experimental paradigm within the realm of Brain Computer Interfaces (BCIs), aiming to decode motor intentions from electroencephalogram (EEG) signals. Method: Drawing inspiration from Riemannian geometry and Cross-Frequency Coupling (CFC), this paper introduces a novel approach termed Riemann Tangent Space Mapping using Dichotomous Filter Bank with Convolutional Neural Network (DFBRTS) to enhance the representation quality and decoding capability pertaining to MI features. DFBRTS first initiates the process by meticulously filtering EEG signals through a Dichotomous Filter Bank, structured in the fashion of a complete binary tree. Subsequently, it employs Riemann Tangent Space Mapping to extract salient EEG signal features within each sub-band. Finally, a lightweight convolutional neural network is employed for further feature extraction and classification, operating under the joint supervision of cross-entropy and center loss. To validate the efficacy, extensive experiments were conducted using DFBRTS on two well-established benchmark datasets: the BCI competition IV 2a (BCIC-IV-2a) dataset and the OpenBMI dataset. The performance of DFBRTS was benchmarked against several state-of-the-art MI decoding methods, alongside other Riemannian geometry-based MI decoding approaches. Results: DFBRTS significantly outperforms other MI decoding algorithms on both datasets, achieving a remarkable classification accuracy of 78.16% for four-class and 71.58% for two-class hold-out classification, as compared to the existing benchmarks.
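A minimal sketch of the Riemann tangent space mapping step, assuming sample covariance matrices as inputs and the arithmetic mean as the reference point (Riemannian toolboxes typically use the geometric mean and rescale off-diagonal terms); the filter bank, CNN, and center loss from DFBRTS are not reproduced here.

```python
import numpy as np
from scipy.linalg import sqrtm, logm

def tangent_space_features(covs, ref):
    """Project SPD covariance matrices onto the tangent space at a reference point.

    covs : (n_trials, n_ch, n_ch) SPD matrices (e.g., per-sub-band EEG covariances)
    ref  : (n_ch, n_ch) reference SPD matrix (arithmetic mean here for simplicity)
    """
    ref_inv_sqrt = np.linalg.inv(sqrtm(ref).real)
    feats = []
    for C in covs:
        S = logm(ref_inv_sqrt @ C @ ref_inv_sqrt).real   # matrix log of the whitened covariance
        iu = np.triu_indices_from(S)
        feats.append(S[iu])                              # upper triangle as a flat feature vector
    return np.array(feats)

# Toy usage on random covariances standing in for one filter-bank sub-band.
rng = np.random.default_rng(0)
n_trials, n_ch = 20, 6
X = rng.standard_normal((n_trials, n_ch, 128))
covs = np.einsum("tcs,tds->tcd", X, X) / X.shape[-1]
features = tangent_space_features(covs, covs.mean(axis=0))
print(features.shape)  # (20, 21)
```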
Conformal Normalization in Recurrent Neural Network of Grid Cells
results: Numerical experiments show that, with conformal normalization, grid cells form hexagonal response patterns, and these patterns directly tie the neural representation to the agent's actual position in 2D physical space.
Abstract
Grid cells in the entorhinal cortex of the mammalian brain exhibit striking hexagon firing patterns in their response maps as the animal (e.g., a rat) navigates in a 2D open environment. The responses of the population of grid cells collectively form a vector in a high-dimensional neural activity space, and this vector represents the self-position of the agent in the 2D physical space. As the agent moves, the vector is transformed by a recurrent neural network that takes the velocity of the agent as input. In this paper, we propose a simple and general conformal normalization of the input velocity for the recurrent neural network, so that the local displacement of the position vector in the high-dimensional neural space is proportional to the local displacement of the agent in the 2D physical space, regardless of the direction of the input velocity. Our numerical experiments on the minimally simple linear and non-linear recurrent networks show that conformal normalization leads to the emergence of the hexagon grid patterns. Furthermore, we derive a new theoretical understanding that connects conformal normalization to the emergence of hexagon grid patterns in navigation tasks.
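One way to read "conformal normalization of the input velocity" numerically is to rescale the recurrent update so that its norm depends only on the speed, not the heading. The toy sketch below does exactly that for an assumed linear recurrent model x_{t+1} = x_t + (v1*B1 + v2*B2) x_t; the matrices, scale constant, and update form are illustrative assumptions rather than the paper's parameterization or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def conformal_step(x, v, B1, B2, c=0.05):
    """One recurrent update in which the velocity input is rescaled so the neural-state
    displacement has norm proportional to the physical speed ||v||, independent of the
    movement direction (a hand-written numerical stand-in, not the paper's scheme)."""
    speed = np.linalg.norm(v)
    if speed < 1e-12:
        return x
    u = v / speed                                  # unit direction of motion
    dx_dir = (u[0] * B1 + u[1] * B2) @ x           # raw, direction-dependent displacement
    gain = np.linalg.norm(dx_dir) + 1e-12
    return x + c * speed * dx_dir / gain           # now ||x_{t+1} - x_t|| = c * speed

# Displacements for different headings now have identical norms.
d = 32
B1, B2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
x = rng.standard_normal(d)
for angle in (0.0, 1.0, 2.5):
    v = 0.3 * np.array([np.cos(angle), np.sin(angle)])
    print(angle, np.linalg.norm(conformal_step(x, v, B1, B2) - x))
```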
The Power of Explainability in Forecast-Informed Deep Learning Models for Flood Mitigation
for: This paper proposes a deep-learning-based flood management approach that optimizes pre-release decisions at hydraulic structures.
methods: The paper uses FIDLAR, a Forecast Informed Deep Learning Architecture that combines weather forecasts with deep learning for flood management.
results: Experiments show that FIDLAR is several orders of magnitude faster than the current state of the art while producing provably better pre-release schedules; these speedups make FIDLAR usable for real-time flood management. The paper also applies model-explainability tools to understand the contribution of various environmental factors to its decisions.
Abstract
Floods can cause horrific harm to life and property. However, they can be mitigated or even avoided by the effective use of hydraulic structures such as dams, gates, and pumps. By pre-releasing water via these structures in advance of extreme weather events, water levels are sufficiently lowered to prevent floods. In this work, we propose FIDLAR, a Forecast Informed Deep Learning Architecture, achieving flood management in watersheds with hydraulic structures in an optimal manner by balancing out flood mitigation and unnecessary wastage of water via pre-releases. We perform experiments with FIDLAR using data from the South Florida Water Management District, which manages a coastal area that is highly prone to frequent storms and floods. Results show that FIDLAR performs better than the current state-of-the-art with several orders of magnitude speedup and with provably better pre-release schedules. The dramatic speedups make it possible for FIDLAR to be used for real-time flood management. The main contribution of this paper is the effective use of tools for model explainability, allowing us to understand the contribution of the various environmental factors towards its decisions.
RAIFLE: Reconstruction Attacks on Interaction-based Federated Learning with Active Data Manipulation
paper_authors: Dzung Pham, Shreyas Kulkarni, Amir Houmansadr
for: This paper studies user privacy in federated learning (FL), specifically in interaction-based domains such as recommender systems (RS) and online learning to rank (OLTR).
methods: The paper introduces RAIFLE, a general optimization-based reconstruction attack framework targeting user privacy in interaction-based FL (IFL) systems. RAIFLE employs a novel attack technique called Active Data Manipulation (ADM), in which the server manipulates the training features of items to induce adversarial behaviors in the local FL updates.
results: The paper shows that RAIFLE is more impactful than existing FL privacy attacks in the IFL context and can undermine privacy defenses such as secure aggregation and private information retrieval. Based on these findings, the paper proposes countermeasure guidelines to mitigate the attack.
Abstract
Federated learning (FL) has recently emerged as a privacy-preserving approach for machine learning in domains that rely on user interactions, particularly recommender systems (RS) and online learning to rank (OLTR). While there has been substantial research on the privacy of traditional FL, little attention has been paid to studying the privacy properties of these interaction-based FL (IFL) systems. In this work, we show that IFL can introduce unique challenges concerning user privacy, particularly when the central server has knowledge and control over the items that users interact with. Specifically, we demonstrate the threat of reconstructing user interactions by presenting RAIFLE, a general optimization-based reconstruction attack framework customized for IFL. RAIFLE employs Active Data Manipulation (ADM), a novel attack technique unique to IFL, where the server actively manipulates the training features of the items to induce adversarial behaviors in the local FL updates. We show that RAIFLE is more impactful than existing FL privacy attacks in the IFL context, and describe how it can undermine privacy defenses like secure aggregation and private information retrieval. Based on our findings, we propose and discuss countermeasure guidelines to mitigate our attack in the context of federated RS/OLTR specifically and IFL more broadly.
Transfer Learning in Transformer-Based Demand Forecasting For Home Energy Management System
results: The authors find that the transfer learning setup forecasts household load better than using a single household's data alone; concretely, it reduces forecasting error (MAE) by about 15% and reduces energy costs by about 2%.
Abstract
Increasingly, homeowners opt for photovoltaic (PV) systems and/or battery storage to minimize their energy bills and maximize renewable energy usage. This has spurred the development of advanced control algorithms that maximally achieve those goals. However, a common challenge faced while developing such controllers is the unavailability of accurate forecasts of household power consumption, especially for shorter time resolutions (15 minutes) and in a data-efficient manner. In this paper, we analyze how transfer learning can help by exploiting data from multiple households to improve a single house's load forecasting. Specifically, we train an advanced forecasting model (a temporal fusion transformer) using data from multiple different households, and then finetune this global model on a new household with limited data (i.e. only a few days). The obtained models are used for forecasting power consumption of the household for the next 24 hours~(day-ahead) at a time resolution of 15 minutes, with the intention of using these forecasts in advanced controllers such as Model Predictive Control. We show the benefit of this transfer learning setup versus solely using the individual new household's data, both in terms of (i) forecasting accuracy ($\sim$15\% MAE reduction) and (ii) control performance ($\sim$2\% energy cost reduction), using real-world household data.
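A minimal sketch of the pretrain-then-finetune recipe, with a small MLP standing in for the temporal fusion transformer and synthetic load windows standing in for smart-meter data; the model size, learning rates, and data generator are all assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small MLP stands in for the temporal fusion transformer used in the paper:
# it maps the past 96 quarter-hours of load to the next 96 (day-ahead at 15 min).
model = nn.Sequential(nn.Linear(96, 128), nn.ReLU(), nn.Linear(128, 96))
loss_fn = nn.L1Loss()

def make_batches(n_households, n_days):
    """Synthetic stand-in for smart-meter data: (past 96 steps, next 96 steps)."""
    past = torch.rand(n_households * n_days, 96)
    future = past + 0.1 * torch.randn_like(past)   # toy persistence-like target
    return past, future

# 1) Global pre-training on data pooled from many households.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = make_batches(n_households=50, n_days=100)
for _ in range(5):
    opt.zero_grad(); loss_fn(model(x), y).backward(); opt.step()

# 2) Fine-tuning on a new household with only a few days of data,
#    typically with a smaller learning rate.
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x_new, y_new = make_batches(n_households=1, n_days=5)
for _ in range(20):
    opt.zero_grad(); loss_fn(model(x_new), y_new).backward(); opt.step()
```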
Real-World Implementation of Reinforcement Learning Based Energy Coordination for a Cluster of Households
paper_authors: Gargya Gokhale, Niels Tiben, Marie-Sophie Verwee, Manu Lahariya, Bert Claessens, Chris Develder
for: This paper studies how aggregated control of a set of residential buildings can provide grid-supporting services, eventually including ancillary services, to the modern power grid.
methods: The paper uses reinforcement learning (RL) to coordinate the power consumption of 8 residential buildings without requiring any building models or simulators, making it practical to implement and easy to scale.
results: In a real-life pilot, an RL-based ranking system selects which households to activate flexibility assets from, and a real-time PI-control-based dispatch mechanism controls the selected assets, demonstrating satisfactory power tracking and the effectiveness of the purely data-driven RL-based ranks.
Abstract
Given its substantial contribution of 40\% to global power consumption, the built environment has received increasing attention to serve as a source of flexibility to assist the modern power grid. In that respect, previous research mainly focused on energy management of individual buildings. In contrast, in this paper, we focus on aggregated control of a set of residential buildings, to provide grid supporting services, that eventually should include ancillary services. In particular, we present a real-life pilot study that studies the effectiveness of reinforcement-learning (RL) in coordinating the power consumption of 8 residential buildings to jointly track a target power signal. Our RL approach relies solely on observed data from individual households and does not require any explicit building models or simulators, making it practical to implement and easy to scale. We show the feasibility of our proposed RL-based coordination strategy in a real-world setting. In a 4-week case study, we demonstrate a hierarchical control system, relying on an RL-based ranking system to select which households to activate flex assets from, and a real-time PI control-based power dispatch mechanism to control the selected assets. Our results demonstrate satisfactory power tracking, and the effectiveness of the RL-based ranks which are learnt in a purely data-driven manner.
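The hierarchical scheme can be pictured as a two-layer loop: a (learned) ranking picks which homes to activate, and a PI controller turns the tracking error into a flexibility setpoint for the activated homes. The sketch below hard-codes the ranking scores and uses toy baseline loads; the real pilot's RL ranking, asset models, and dispatch details differ.

```python
import numpy as np

rng = np.random.default_rng(0)

n_houses, horizon, dt = 8, 96, 0.25              # 8 homes, one day at 15-minute steps
scores = rng.random(n_houses)                    # stand-in for learned RL ranking scores
target = 20 + 5 * np.sin(np.linspace(0, 2 * np.pi, horizon))   # target power signal (kW)
baseline = np.full(n_houses, 2.5)                # uncontrolled consumption per home (kW)
flex_max, n_active = 1.0, 4                      # per-home flexibility limit, homes per step

kp, ki, integral = 0.5, 0.1, 0.0
power = np.zeros(horizon)
for t in range(horizon):
    # Upper layer: the ranking decides which homes are activated at this step.
    active = np.argsort(scores)[-n_active:]

    # Lower layer: a PI controller converts the gap between the target and the
    # uncontrolled baseline into a flexibility request, split over the active homes.
    error = target[t] - baseline.sum()
    integral += error * dt
    per_home = np.clip((kp * error + ki * integral) / n_active, -flex_max, flex_max)

    consumption = baseline.copy()
    consumption[active] += per_home
    power[t] = consumption.sum()

print("mean absolute tracking error (kW):", np.abs(power - target).mean())
```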
methods: A search algorithm selects a small subset of subgraphs, and a reinforcement learning (RL) agent iteratively updates the subgraph set to improve the expressivity of GNNs.
results: Extensive experiments on many datasets show that MAG-GNN is competitive with state-of-the-art methods, even outperforms several subgraph GNNs, and reduces the running time of subgraph GNNs.
Abstract
While Graph Neural Networks (GNNs) recently became powerful tools in graph learning tasks, considerable efforts have been spent on improving GNNs' structural encoding ability. A particular line of work proposed subgraph GNNs that use subgraph information to improve GNNs' expressivity and achieved great success. However, such effectivity sacrifices the efficiency of GNNs by enumerating all possible subgraphs. In this paper, we analyze the necessity of complete subgraph enumeration and show that a model can achieve a comparable level of expressivity by considering a small subset of the subgraphs. We then formulate the identification of the optimal subset as a combinatorial optimization problem and propose Magnetic Graph Neural Network (MAG-GNN), a reinforcement learning (RL) boosted GNN, to solve the problem. Starting with a candidate subgraph set, MAG-GNN employs an RL agent to iteratively update the subgraphs to locate the most expressive set for prediction. This reduces the exponential complexity of subgraph enumeration to the constant complexity of a subgraph search algorithm while keeping good expressivity. We conduct extensive experiments on many datasets, showing that MAG-GNN achieves competitive performance to state-of-the-art methods and even outperforms many subgraph GNNs. We also demonstrate that MAG-GNN effectively reduces the running time of subgraph GNNs.
Worst-case Performance of Popular Approximate Nearest Neighbor Search Implementations: Guarantees and Limitations
results: We find that the "slow preprocessing" version of DiskANN supports approximate nearest neighbor queries with constant approximation ratio and poly-logarithmic query time on datasets of bounded "intrinsic" dimension. For the other data structure variants studied, including the "fast preprocessing" version of DiskANN, HNSW, and NSG, we present a family of instances on which the query time needed to achieve a "reasonable" accuracy is linear in the instance size. For example, for DiskANN we show that on instances of size n the query procedure can take at least 0.1 n steps before it encounters any of the 5 nearest neighbors of the query.
Abstract
Graph-based approaches to nearest neighbor search are popular and powerful tools for handling large datasets in practice, but they have limited theoretical guarantees. We study the worst-case performance of recent graph-based approximate nearest neighbor search algorithms, such as HNSW, NSG and DiskANN. For DiskANN, we show that its "slow preprocessing" version provably supports approximate nearest neighbor search query with constant approximation ratio and poly-logarithmic query time, on data sets with bounded "intrinsic" dimension. For the other data structure variants studied, including DiskANN with "fast preprocessing", HNSW and NSG, we present a family of instances on which the empirical query time required to achieve a "reasonable" accuracy is linear in instance size. For example, for DiskANN, we show that the query procedure can take at least $0.1 n$ steps on instances of size $n$ before it encounters any of the $5$ nearest neighbors of the query.
Software engineering for deep learning applications: usage of SWEng and MLops tools in GitHub repositories
results: The study finds that about 70% of the mined GitHub repositories contain at least one conventional SE tool; software configuration management tools are the most adopted, while maintenance tools are the least used. MLOps tools are used far less, with only 9 tools out of a sample of 80 appearing in at least one repository; TensorBoard, one of these, was adopted in about half of the repositories.
Abstract
The rising popularity of deep learning (DL) methods and techniques has invigorated interest in the topic of SE4DL, the application of software engineering (SE) practices on deep learning software. Despite the novel engineering challenges brought on by the data-driven and non-deterministic paradigm of DL software, little work has been invested into developing AI-targeted SE tools. On the other hand, tools tackling more general engineering issues in DL are actively used and referred to under the umbrella term of ``MLOps tools''. Furthermore, the available literature supports the utility of conventional SE tooling in DL software development. Building upon previous MSR research on tool usage in open-source software works, we identify conventional and MLOps tools adopted in popular applied DL projects that use Python as the main programming language. About 70% of the GitHub repositories mined contained at least one conventional SE tool. Software configuration management tools are the most adopted, while the opposite applies to maintenance tools. Substantially fewer MLOps tools were in use, with only 9 tools out of a sample of 80 used in at least one repository. The majority of them were open-source rather than proprietary. One of these tools, TensorBoard, was found to be adopted in about half of the repositories in our study. Consequently, the use of conventional SE tooling demonstrates its relevance to DL software. Further research is recommended on the adoption of MLOps tooling by open-source projects, focusing on the relevance of particular tool types, the development of required tools, as well as ways to promote the use of already available tools.
Proving Linear Mode Connectivity of Neural Networks via Optimal Transport
results: We provide upper and lower bounds on the width of each layer needed for two networks to be linearly connected. We also show empirically that the dimension of the support of the neuron weight distribution correlates with linear mode connectivity.
Abstract
The energy landscape of high-dimensional non-convex optimization problems is crucial to understanding the effectiveness of modern deep neural network architectures. Recent works have experimentally shown that two different solutions found after two runs of a stochastic training are often connected by very simple continuous paths (e.g., linear) modulo a permutation of the weights. In this paper, we provide a framework theoretically explaining this empirical observation. Based on convergence rates in Wasserstein distance of empirical measures, we show that, with high probability, two wide enough two-layer neural networks trained with stochastic gradient descent are linearly connected. Additionally, we express upper and lower bounds on the width of each layer of two deep neural networks with independent neuron weights to be linearly connected. Finally, we empirically demonstrate the validity of our approach by showing how the dimension of the support of the weight distribution of neurons, which dictates Wasserstein convergence rates is correlated with linear mode connectivity.
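The permutation symmetry "modulo" which the paths are linear can be made concrete with a toy experiment: align the hidden units of two two-layer networks with a maximum-similarity assignment and then interpolate the weights. In the sketch below the second network is a permuted, lightly perturbed copy of the first, so the aligned path is flat by construction; this only illustrates the symmetry and the alignment step, not the paper's Wasserstein-based argument.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

def forward(x, W1, b1, W2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2      # two-layer ReLU network

# Network B is a permuted, slightly perturbed copy of network A (the symmetry at play).
d, h = 32, 32
W1a, b1a, W2a = rng.standard_normal((d, h)), rng.standard_normal(h), rng.standard_normal((h, 1))
perm = rng.permutation(h)
W1b, b1b, W2b = W1a[:, perm] + 0.01 * rng.standard_normal((d, h)), b1a[perm], W2a[perm]

# Align B to A by matching hidden units (maximum-similarity assignment).
cost = -(W1a.T @ W1b)                             # negative similarity of incoming weights
_, col = linear_sum_assignment(cost)
W1b_al, b1b_al, W2b_al = W1b[:, col], b1b[col], W2b[col]

# Loss along the linear path between the aligned weights.
x = rng.standard_normal((256, d))
y = forward(x, W1a, b1a, W2a)
for alpha in np.linspace(0, 1, 5):
    W1 = (1 - alpha) * W1a + alpha * W1b_al
    b1 = (1 - alpha) * b1a + alpha * b1b_al
    W2 = (1 - alpha) * W2a + alpha * W2b_al
    print(f"alpha={alpha:.2f}  mse={np.mean((forward(x, W1, b1, W2) - y) ** 2):.4f}")

# For contrast: the midpoint of the *unaligned* weights sits behind a large barrier.
mid = np.mean((forward(x, 0.5 * (W1a + W1b), 0.5 * (b1a + b1b), 0.5 * (W2a + W2b)) - y) ** 2)
print("unaligned midpoint mse:", mid)
```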
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
results: Atom improves end-to-end serving throughput by up to $7.73\times$ over FP16 and $2.53\times$ over INT8 quantization, while meeting the same latency target with negligible accuracy loss.
Abstract
The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to $7.73\times$ compared to the FP16 and by $2.53\times$ compared to INT8 quantization, while maintaining the same latency target.
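To make the memory argument concrete, the sketch below performs plain symmetric group-wise 4-bit quantization of a weight matrix. The group size and clipping are assumptions; Atom's actual scheme additionally handles outliers with mixed precision and fuses low-bit operators into the serving kernels, none of which is shown here.

```python
import numpy as np

def quantize_int4(x, group_size=128):
    """Symmetric 4-bit group-wise quantization: each group of `group_size` consecutive
    values shares one FP16 scale, and values are stored as integers in [-7, 7]."""
    flat = x.reshape(-1, group_size)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(flat / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int4(q, scale, shape):
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 4096)).astype(np.float32)
q, s = quantize_int4(W)
W_hat = dequantize_int4(q, s, W.shape)
print("relative reconstruction error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```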
Bridging the Gap: Towards an Expanded Toolkit for ML-Supported Decision-Making in the Public Sector
paper_authors: Unai Fischer Abaigar, Christoph Kern, Noam Barda, Frauke Kreuter
for: This paper aims to bridge the gap between machine learning (ML) and public sector decision-making by addressing key technical challenges that arise when aligning intricate policy objectives with the precise formalization requirements of ML models.
methods: The paper concentrates on pivotal points of the ML pipeline that connect the model to its operational environment, including the significance of representative training data and the importance of a model setup that facilitates effective decision-making. It links these challenges with emerging methodological advancements such as causal ML, domain adaptation, uncertainty quantification, and multi-objective optimization.
results: The paper provides a comprehensive overview of the challenges that arise when using ML in the public sector, highlights the importance of addressing them in order to harmonize ML and public sector objectives, and illustrates a path forward via the emerging methodological advancements.
Abstract
Machine Learning (ML) systems are becoming instrumental in the public sector, with applications spanning areas like criminal justice, social welfare, financial fraud detection, and public health. While these systems offer great potential benefits to institutional decision-making processes, such as improved efficiency and reliability, they still face the challenge of aligning intricate and nuanced policy objectives with the precise formalization requirements necessitated by ML models. In this paper, we aim to bridge the gap between ML and public sector decision-making by presenting a comprehensive overview of key technical challenges where disjunctions between policy goals and ML models commonly arise. We concentrate on pivotal points of the ML pipeline that connect the model to its operational environment, delving into the significance of representative training data and highlighting the importance of a model setup that facilitates effective decision-making. Additionally, we link these challenges with emerging methodological advancements, encompassing causal ML, domain adaptation, uncertainty quantification, and multi-objective optimization, illustrating the path forward for harmonizing ML and public sector objectives.
Efficient Cluster Selection for Personalized Federated Learning: A Multi-Armed Bandit Approach
results: Experimental results show that, under varying data distributions and device capabilities, the algorithm effectively handles dynamic federated learning scenarios.
Abstract
Federated learning (FL) offers a decentralized training approach for machine learning models, prioritizing data privacy. However, the inherent heterogeneity in FL networks, arising from variations in data distribution, size, and device capabilities, poses challenges in user federation. Recognizing this, Personalized Federated Learning (PFL) emphasizes tailoring learning processes to individual data profiles. In this paper, we address the complexity of clustering users in PFL, especially in dynamic networks, by introducing a dynamic Upper Confidence Bound (dUCB) algorithm inspired by the multi-armed bandit (MAB) approach. The dUCB algorithm ensures that new users can effectively find the best cluster for their data distribution by balancing exploration and exploitation. The performance of our algorithm is evaluated in various cases, showing its effectiveness in handling dynamic federated learning scenarios.
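A stripped-down version of the bandit view of cluster selection: each cluster is an arm, the reward is a (simulated) proxy for how well the cluster's model fits the new user, and an upper-confidence-bound rule trades off exploration and exploitation. This is standard UCB1 rather than the paper's dynamic dUCB variant, and the reward model is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n_clusters, horizon = 4, 500
true_gain = np.array([0.2, 0.5, 0.35, 0.8])     # hidden per-cluster benefit for this user

counts = np.zeros(n_clusters)
means = np.zeros(n_clusters)
for t in range(1, horizon + 1):
    # UCB score: exploit the best-looking cluster, but keep exploring under-sampled ones.
    ucb = means + np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1e-9))
    k = int(np.argmax(ucb)) if counts.min() > 0 else int(np.argmin(counts))

    # Reward is a noisy proxy for how well cluster k's model fits the user's data
    # (e.g., negative validation loss); here it is simulated.
    reward = true_gain[k] + 0.1 * rng.standard_normal()
    counts[k] += 1
    means[k] += (reward - means[k]) / counts[k]

print("chosen cluster:", int(np.argmax(counts)), "pull counts:", counts)
```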
Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile Streaming
paper_authors: Gregory Dexter, Petros Drineas, David P. Woodruff, Taisuke Yasuda
for: This paper extends the applicability of sketching algorithms to sparse dictionary learning and the Euclidean $k$-means clustering problem.
methods: The paper develops new techniques to broaden sketching-based approaches, including a new approach for designing PTASs and new streaming upper and lower bounds.
results: The paper obtains the first PTAS for sparse dictionary learning, a new PTAS for $k$-means clustering, and new turnstile-streaming space upper and lower bounds for dictionary learning and $k$-means clustering.
Abstract
Sketching algorithms have recently proven to be a powerful approach both for designing low-space streaming algorithms as well as fast polynomial time approximation schemes (PTAS). In this work, we develop new techniques to extend the applicability of sketching-based approaches to the sparse dictionary learning and the Euclidean $k$-means clustering problems. In particular, we initiate the study of the challenging setting where the dictionary/clustering assignment for each of the $n$ input points must be output, which has surprisingly received little attention in prior work. On the fast algorithms front, we obtain a new approach for designing PTAS's for the $k$-means clustering problem, which generalizes to the first PTAS for the sparse dictionary learning problem. On the streaming algorithms front, we obtain new upper bounds and lower bounds for dictionary learning and $k$-means clustering. In particular, given a design matrix $\mathbf A\in\mathbb R^{n\times d}$ in a turnstile stream, we show an $\tilde O(nr/\epsilon^2 + dk/\epsilon)$ space upper bound for $r$-sparse dictionary learning of size $k$, an $\tilde O(n/\epsilon^2 + dk/\epsilon)$ space upper bound for $k$-means clustering, as well as an $\tilde O(n)$ space upper bound for $k$-means clustering on random order row insertion streams with a natural "bounded sensitivity" assumption. On the lower bounds side, we obtain a general $\tilde\Omega(n/\epsilon + dk/\epsilon)$ lower bound for $k$-means clustering, as well as an $\tilde\Omega(n/\epsilon^2)$ lower bound for algorithms which can estimate the cost of a single fixed set of candidate centers.
results: The study finds that choosing the best algorithm depends critically on the LLP variant and the model selection method; an extensive benchmark of several well-known LLP algorithms demonstrates the need for the proposed approach.
Abstract
Learning from Label Proportions (LLP) is an established machine learning problem with numerous real-world applications. In this setting, data items are grouped into bags, and the goal is to learn individual item labels, knowing only the features of the data and the proportions of labels in each bag. Although LLP is a well-established problem, it has several unusual aspects that create challenges for benchmarking learning methods. Fundamental complications arise because of the existence of different LLP variants, i.e., dependence structures that can exist between items, labels, and bags. Accordingly, the first algorithmic challenge is the generation of variant-specific datasets capturing the diversity of dependence structures and bag characteristics. The second methodological challenge is model selection, i.e., hyperparameter tuning; due to the nature of LLP, model selection cannot easily use the standard machine learning paradigm. The final benchmarking challenge consists of properly evaluating LLP solution methods across various LLP variants. We note that there is very little consideration of these issues in prior work, and there are no general solutions for these challenges proposed to date. To address these challenges, we develop methods capable of generating LLP datasets meeting the requirements of different variants. We use these methods to generate a collection of datasets encompassing the spectrum of LLP problem characteristics, which can be used in future evaluation studies. Additionally, we develop guidelines for benchmarking LLP algorithms, including the model selection and evaluation steps. Finally, we illustrate the new methods and guidelines by performing an extensive benchmark of a set of well-known LLP algorithms. We show that choosing the best algorithm depends critically on the LLP variant and model selection method, demonstrating the need for our proposed approach.
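For readers new to the setting, the snippet below shows the most common LLP training signal: an instance-level classifier is fit so that the mean predicted positive probability in each bag matches the observed bag proportion. The model, bag generator, and loss are illustrative assumptions and not one of the benchmarked algorithms.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Instance-level classifier trained only from bag-level label proportions.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

def proportion_loss(bag_x, bag_prop):
    p = torch.sigmoid(model(bag_x)).mean()          # predicted positive proportion for the bag
    return nn.functional.binary_cross_entropy(p.view(1), bag_prop.view(1))

# Synthetic bags: instance features plus an observed label proportion per bag.
bags = []
for _ in range(50):
    x = torch.randn(20, 10)
    y = (x[:, 0] > 0).float()                       # hidden instance labels
    bags.append((x, y.mean()))

for epoch in range(100):
    for x, prop in bags:
        opt.zero_grad()
        proportion_loss(x, prop).backward()
        opt.step()
```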
results: In the realizable setting, the paper shows that the expected number of mistakes of any learner under apple tasting feedback can only be $\Theta(1)$, $\Theta(\sqrt{T})$, or $\Theta(T)$.
Abstract
In online binary classification under \textit{apple tasting} feedback, the learner only observes the true label if it predicts "1". First studied by \cite{helmbold2000apple}, we revisit this classical partial-feedback setting and study online learnability from a combinatorial perspective. We show that the Littlestone dimension continues to prove a tight quantitative characterization of apple tasting in the agnostic setting, closing an open question posed by \cite{helmbold2000apple}. In addition, we give a new combinatorial parameter, called the Effective width, that tightly quantifies the minimax expected mistakes in the realizable setting. As a corollary, we use the Effective width to establish a \textit{trichotomy} of the minimax expected number of mistakes in the realizable setting. In particular, we show that in the realizable setting, the expected number of mistakes for any learner under apple tasting feedback can only be $\Theta(1), \Theta(\sqrt{T})$, or $\Theta(T)$.
Feature Aggregation in Joint Sound Classification and Localization Neural Networks
methods: We adapt feature aggregation techniques from computer vision networks, including the Path Aggregation Network (PANet), the Weighted Bi-directional Feature Pyramid Network (BiFPN), and the proposed Scale Encoding Network (SEN). These sub-architectures are integrated into an SSL control architecture and evaluated with two metrics for sound classification and two metrics for direction-of-arrival regression. PANet and BiFPN are established aggregators in computer vision models, while the proposed SEN is a more compact aggregator.
results: The results show that models incorporating feature aggregation outperform the control model, the Sound Event Localization and Detection network (SELDnet), in both sound classification and localization. Feature aggregation improves the performance of sound detection neural networks, particularly for direction-of-arrival regression.
Abstract
This study addresses the application of deep learning techniques in joint sound signal classification and localization networks. Current state-of-the-art sound source localization deep learning networks lack feature aggregation within their architecture. Feature aggregation enhances model performance by enabling the consolidation of information from different feature scales, thereby improving feature robustness and invariance. This is particularly important in SSL networks, which must differentiate direct and indirect acoustic signals. To address this gap, we adapt feature aggregation techniques from computer vision neural networks to signal detection neural networks. Additionally, we propose the Scale Encoding Network (SEN) for feature aggregation to encode features from various scales, compressing the network for more computationally efficient aggregation. To evaluate the efficacy of feature aggregation in SSL networks, we integrated the following computer vision feature aggregation sub-architectures into a SSL control architecture: Path Aggregation Network (PANet), Weighted Bi-directional Feature Pyramid Network (BiFPN), and SEN. These sub-architectures were evaluated using two metrics for signal classification and two metrics for direction-of-arrival regression. PANet and BiFPN are established aggregators in computer vision models, while the proposed SEN is a more compact aggregator. The results suggest that models incorporating feature aggregations outperformed the control model, the Sound Event Localization and Detection network (SELDnet), in both sound signal classification and localization. The feature aggregation techniques enhance the performance of sound detection neural networks, particularly in direction-of-arrival regression.
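A minimal example of weighted multi-scale feature fusion in the spirit of BiFPN's fast normalized fusion, written for 1-D feature maps for brevity; the channel counts, resizing choice, and convolution are assumptions, and the paper's PANet and SEN variants are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuse feature maps from several scales with learnable non-negative weights."""
    def __init__(self, num_inputs: int, channels: int):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feats):
        # Resize every input to the temporal length of the first (finest) one.
        target_len = feats[0].shape[-1]
        feats = [F.interpolate(f, size=target_len, mode="linear", align_corners=False)
                 for f in feats]
        w = F.relu(self.w)
        w = w / (w.sum() + 1e-4)                  # fast normalized fusion weights
        fused = sum(wi * fi for wi, fi in zip(w, feats))
        return self.conv(fused)

# Toy usage: three scales of a 32-channel feature map.
feats = [torch.randn(4, 32, 100), torch.randn(4, 32, 50), torch.randn(4, 32, 25)]
print(WeightedFusion(num_inputs=3, channels=32)(feats).shape)  # torch.Size([4, 32, 100])
```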
Escaping Saddle Points in Heterogeneous Federated Learning via Distributed SGD with Communication Compression
for: This paper addresses communication efficiency and learning accuracy in federated learning (FL) with heterogeneous client data.
methods: The paper proposes Power-EF, a distributed SGD algorithm that communicates only compressed information via a novel error-feedback scheme.
results: Power-EF provably escapes saddle points and reaches second-order stationary points under heterogeneous client data without any data homogeneity assumptions, with a convergence rate that exhibits a linear speedup in the number of workers.
Abstract
We consider the problem of finding second-order stationary points of heterogeneous federated learning (FL). Previous works in FL mostly focus on first-order convergence guarantees, which do not rule out the scenario of unstable saddle points. Meanwhile, it is a key bottleneck of FL to achieve communication efficiency without compensating the learning accuracy, especially when local data are highly heterogeneous across different clients. Given this, we propose a novel algorithm Power-EF that only communicates compressed information via a novel error-feedback scheme. To our knowledge, Power-EF is the first distributed and compressed SGD algorithm that provably escapes saddle points in heterogeneous FL without any data homogeneity assumptions. In particular, Power-EF improves to second-order stationary points after visiting first-order (possibly saddle) points, using additional gradient queries and communication rounds only of almost the same order required by first-order convergence, and the convergence rate exhibits a linear speedup in terms of the number of workers. Our theory improves/recovers previous results, while extending to much more tolerant settings on the local data. Numerical experiments are provided to complement the theory.
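The error-feedback idea itself is compact: whatever the compressor drops this round is added back before compressing next round. The sketch below shows it with top-k compression on a toy quadratic objective per worker; it is a generic error-feedback distributed SGD loop under assumed objectives and hyperparameters, not Power-EF's exact scheme or its second-order analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k(v, k):
    """Keep the k largest-magnitude entries of v; zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

# Each worker sends a compressed gradient and carries the compression error forward.
n_workers, dim, k, lr = 8, 100, 10, 0.1
x = rng.standard_normal(dim)
targets = rng.standard_normal((n_workers, dim))          # heterogeneous local objectives
errors = np.zeros((n_workers, dim))                      # per-worker error memory

for step in range(200):
    msgs = np.zeros((n_workers, dim))
    for i in range(n_workers):
        grad = x - targets[i]                            # gradient of 0.5*||x - t_i||^2
        corrected = grad + errors[i]                     # add back what was lost before
        msgs[i] = top_k(corrected, k)                    # compress
        errors[i] = corrected - msgs[i]                  # remember what was dropped
    x = x - lr * msgs.mean(axis=0)                       # server averages and updates

print("distance to optimum:", np.linalg.norm(x - targets.mean(axis=0)))
```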
results: The approach successfully disentangles the properties of a set of objects in a series of simple image-based disentanglement experiments, and requires significantly fewer perturbations than a comparable approach.
Abstract
Causal representation learning has showed a variety of settings in which we can disentangle latent variables with identifiability guarantees (up to some reasonable equivalence class). Common to all of these approaches is the assumption that (1) the latent variables are represented as $d$-dimensional vectors, and (2) that the observations are the output of some injective generative function of these latent variables. While these assumptions appear benign, we show that when the observations are of multiple objects, the generative function is no longer injective and disentanglement fails in practice. We can address this failure by combining recent developments in object-centric learning and causal representation learning. By modifying the Slot Attention architecture arXiv:2006.15055, we develop an object-centric architecture that leverages weak supervision from sparse perturbations to disentangle each object's properties. This approach is more data-efficient in the sense that it requires significantly fewer perturbations than a comparable approach that encodes to a Euclidean space and we show that this approach successfully disentangles the properties of a set of objects in a series of simple image-based disentanglement experiments.
Datasets and Benchmarks for Nanophotonic Structure and Parametric Design Simulations
results: By comparing electrodynamic simulations at different grid sizes, the authors show how evaluation fidelity can be strategically leveraged to enhance structure designs, and they introduce frameworks and benchmarks for parametric nanophotonic structure design problems.
Abstract
Nanophotonic structures have versatile applications including solar cells, anti-reflective coatings, electromagnetic interference shielding, optical filters, and light emitting diodes. To design and understand these nanophotonic structures, electrodynamic simulations are essential. These simulations enable us to model electromagnetic fields over time and calculate optical properties. In this work, we introduce frameworks and benchmarks to evaluate nanophotonic structures in the context of parametric structure design problems. The benchmarks are instrumental in assessing the performance of optimization algorithms and identifying an optimal structure based on target optical properties. Moreover, we explore the impact of varying grid sizes in electrodynamic simulations, shedding light on how evaluation fidelity can be strategically leveraged in enhancing structure designs.
Differentially Private Permutation Tests: Applications to Kernel Methods
for: The paper targets privacy-preserving data analysis, specifically hypothesis testing under differential privacy.
methods: The paper introduces differentially private permutation tests, extending classical non-private permutation tests to private settings while maintaining both finite-sample validity and differential privacy.
results: The paper proposes two differentially private kernel tests (dpMMD and dpHSIC) that attain minimax optimal power across different privacy regimes, and demonstrates their competitive power on synthetic and real-world data.
Abstract
Recent years have witnessed growing concerns about the privacy of sensitive data. In response to these concerns, differential privacy has emerged as a rigorous framework for privacy protection, gaining widespread recognition in both academic and industrial circles. While substantial progress has been made in private data analysis, existing methods often suffer from impracticality or a significant loss of statistical efficiency. This paper aims to alleviate these concerns in the context of hypothesis testing by introducing differentially private permutation tests. The proposed framework extends classical non-private permutation tests to private settings, maintaining both finite-sample validity and differential privacy in a rigorous manner. The power of the proposed test depends on the choice of a test statistic, and we establish general conditions for consistency and non-asymptotic uniform power. To demonstrate the utility and practicality of our framework, we focus on reproducing kernel-based test statistics and introduce differentially private kernel tests for two-sample and independence testing: dpMMD and dpHSIC. The proposed kernel tests are straightforward to implement, applicable to various types of data, and attain minimax optimal power across different privacy regimes. Our empirical evaluations further highlight their competitive power under various synthetic and real-world scenarios, emphasizing their practical value. The code is publicly available to facilitate the implementation of our framework.
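The overall shape of a private permutation test can be sketched as follows: compute a bounded-sensitivity statistic, add Laplace noise to the observed and every permuted value, and report the usual finite-sample-valid permutation p-value. The statistic, clipping, and noise scale below are placeholders; calibrating the noise so the released p-value is end-to-end differentially private (and doing so for MMD/HSIC statistics) is precisely what the paper works out and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_permutation_test(x, y, noise_scale, n_perm=500, clip=1.0):
    """Two-sample permutation test with a noise-perturbed, clipped difference-of-means
    statistic. `noise_scale` is a placeholder for the privacy calibration in the paper."""
    x, y = np.clip(x, -clip, clip), np.clip(y, -clip, clip)
    n = len(x)

    def noisy_stat(a, b):
        return abs(a.mean() - b.mean()) + rng.laplace(scale=noise_scale)

    observed = noisy_stat(x, y)
    pooled = np.concatenate([x, y])
    perm_stats = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_stats.append(noisy_stat(pooled[:n], pooled[n:]))
    # Finite-sample valid p-value: (1 + #{permuted >= observed}) / (1 + n_perm)
    return (1 + np.sum(np.array(perm_stats) >= observed)) / (1 + n_perm)

p = noisy_permutation_test(rng.normal(0, 1, 200), rng.normal(0.5, 1, 200), noise_scale=0.01)
print("p-value:", p)
```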
On Linear Separation Capacity of Self-Supervised Representation Learning
results: The study finds that data augmentation provides information beyond the observed data and improves the linear separation capacity of learned representations; under a multi-manifold model, self-supervised learning can linearly separate manifolds with a smaller distance than unsupervised learning. It also shows that efficient downstream linear classifiers can be built with limited labeled data amid an expansive unlabeled data set.
Abstract
Recent advances in self-supervised learning have highlighted the efficacy of data augmentation in learning data representation from unlabeled data. Training a linear model atop these enhanced representations can yield an adept classifier. Despite the remarkable empirical performance, the underlying mechanisms that enable data augmentation to unravel nonlinear data structures into linearly separable representations remain elusive. This paper seeks to bridge this gap by investigating under what conditions learned representations can linearly separate manifolds when data is drawn from a multi-manifold model. Our investigation reveals that data augmentation offers additional information beyond observed data and can thus improve the information-theoretic optimal rate of linear separation capacity. In particular, we show that self-supervised learning can linearly separate manifolds with a smaller distance than unsupervised learning, underscoring the additional benefits of data augmentation. Our theoretical analysis further underscores that the performance of downstream linear classifiers primarily hinges on the linear separability of data representations rather than the size of the labeled data set, reaffirming the viability of constructing efficient classifiers with limited labeled data amid an expansive unlabeled data set.
Machine Learning for the identification of phase-transitions in interacting agent-based systems
results: Using this data-driven approach, the paper successfully constructs a bifurcation diagram exhibiting the phase transition.
Abstract
Deriving closed-form, analytical expressions for reduced-order models, and judiciously choosing the closures leading to them, has long been the strategy of choice for studying phase- and noise-induced transitions for agent-based models (ABMs). In this paper, we propose a data-driven framework that pinpoints phase transitions for an ABM in its mean-field limit, using a smaller number of variables than traditional closed-form models. To this end, we use the manifold learning algorithm Diffusion Maps to identify a parsimonious set of data-driven latent variables, and show that they are in one-to-one correspondence with the expected theoretical order parameter of the ABM. We then utilize a deep learning framework to obtain a conformal reparametrization of the data-driven coordinates that facilitates, in our example, the identification of a single parameter-dependent ODE in these coordinates. We identify this ODE through a residual neural network inspired by a numerical integration scheme (forward Euler). We then use the identified ODE -- enabled through an odd symmetry transformation -- to construct the bifurcation diagram exhibiting the phase transition.
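The "residual network inspired by forward Euler" amounts to learning the right-hand side f of a parameter-dependent ODE through the update x_{t+1} = x_t + dt * f(x_t, p). The sketch below trains such a block on data generated from a pitchfork normal form, which stands in for the latent Diffusion Maps coordinate; the data generator, architecture, and the omission of the odd-symmetry transformation are assumptions of this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class EulerResidualNet(nn.Module):
    """Residual block mirroring one forward-Euler step: x_{t+1} = x_t + dt * f(x_t, p),
    so the learned f plays the role of a parameter-dependent ODE right-hand side."""
    def __init__(self, dt=0.01):
        super().__init__()
        self.dt = dt
        self.f = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, x, p):
        return x + self.dt * self.f(torch.cat([x, p], dim=-1))

# Toy training data from the pitchfork normal form dx/dt = p*x - x^3, standing in
# for trajectories of the data-driven order parameter.
dt = 0.01
x = torch.empty(4096, 1).uniform_(-1.5, 1.5)
p = torch.empty(4096, 1).uniform_(-1.0, 1.0)
x_next = x + dt * (p * x - x ** 3)

model = EulerResidualNet(dt)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x, p), x_next)
    loss.backward()
    opt.step()
print("final loss:", float(loss))
```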
Does Invariant Graph Learning via Environment Augmentation Learn Invariance?
paper_authors: Yongqiang Chen, Yatao Bian, Kaiwen Zhou, Binghui Xie, Bo Han, James Cheng
for: This paper aims to learn invariance on graphs for out-of-distribution (OOD) generalization.
methods: Environment augmentation is commonly used to aid invariant graph learning, but its usefulness has never been verified; the paper therefore establishes a set of minimal assumptions, including variation sufficiency and variation consistency, under which invariant graph learning is feasible.
results: The paper proposes a new framework, Graph invAriant Learning Assistant (GALA), which adds an assistant model that must be sensitive to graph environment changes or distribution shifts. The assistant model's proxy predictions differentiate variations in spurious subgraphs, and extracting the subgraph maximally invariant to these proxy predictions provably identifies the underlying invariant subgraph for successful OOD generalization under the established minimal assumptions. Extensive experiments on datasets including DrugOOD with various graph distribution shifts confirm the effectiveness of GALA.
Abstract
Invariant graph representation learning aims to learn the invariance among data from different environments for out-of-distribution generalization on graphs. As the graph environment partitions are usually expensive to obtain, augmenting the environment information has become the de facto approach. However, the usefulness of the augmented environment information has never been verified. In this work, we find that it is fundamentally impossible to learn invariant graph representations via environment augmentation without additional assumptions. Therefore, we develop a set of minimal assumptions, including variation sufficiency and variation consistency, for feasible invariant graph learning. We then propose a new framework Graph invAriant Learning Assistant (GALA). GALA incorporates an assistant model that needs to be sensitive to graph environment changes or distribution shifts. The correctness of the proxy predictions by the assistant model hence can differentiate the variations in spurious subgraphs. We show that extracting the maximally invariant subgraph to the proxy predictions provably identifies the underlying invariant subgraph for successful OOD generalization under the established minimal assumptions. Extensive experiments on datasets including DrugOOD with various graph distribution shifts confirm the effectiveness of GALA.
An Improved Relaxation for Oracle-Efficient Adversarial Contextual Bandits
methods: The paper studies the setting where contexts are drawn i.i.d. from a known distribution and the cost sequence is chosen by an online adversary, and gives an oracle-efficient relaxation for it.
results: The paper achieves a regret bound of $O(T^{\frac{2}{3}}(K\log(|\Pi|))^{\frac{1}{3}})$, improving the previous best bound of $O((TK)^{\frac{2}{3}}(\log(|\Pi|))^{\frac{1}{3}})$, and is the first to match the original bound of Langford and Zhang at NeurIPS 2007, which was obtained for the stochastic case.
Abstract
We present an oracle-efficient relaxation for the adversarial contextual bandits problem, where the contexts are sequentially drawn i.i.d from a known distribution and the cost sequence is chosen by an online adversary. Our algorithm has a regret bound of $O(T^{\frac{2}{3}}(K\log(|\Pi|))^{\frac{1}{3}})$ and makes at most $O(K)$ calls per round to an offline optimization oracle, where $K$ denotes the number of actions, $T$ denotes the number of rounds and $\Pi$ denotes the set of policies. This is the first result to improve the prior best bound of $O((TK)^{\frac{2}{3}}(\log(|\Pi|))^{\frac{1}{3}})$ as obtained by Syrgkanis et al. at NeurIPS 2016, and the first to match the original bound of Langford and Zhang at NeurIPS 2007 which was obtained for the stochastic case.
Optimization Landscape of Policy Gradient Methods for Discrete-time Static Output Feedback
results: The paper presents new findings on three policy gradient methods, including their convergence (at nearly dimension-free rates) to stationary points for static output feedback control of discrete-time LTI systems. It also proves that the vanilla policy gradient method converges linearly to local minima when initialized nearby.
Abstract
In recent times, significant advancements have been made in delving into the optimization landscape of policy gradient methods for achieving optimal control in linear time-invariant (LTI) systems. Compared with state-feedback control, output-feedback control is more prevalent since the underlying state of the system may not be fully observed in many practical settings. This paper analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback (SOF) control in discrete-time LTI systems subject to quadratic cost. We begin by establishing crucial properties of the SOF cost, encompassing coercivity, L-smoothness, and M-Lipschitz continuous Hessian. Despite the absence of convexity, we leverage these properties to derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods, including the vanilla policy gradient method, the natural policy gradient method, and the Gauss-Newton method. Moreover, we provide proof that the vanilla policy gradient method exhibits linear convergence towards local minima when initialized near such minima. The paper concludes by presenting numerical examples that validate our theoretical findings. These results not only characterize the performance of gradient descent for optimizing the SOF problem but also provide insights into the effectiveness of general policy gradient methods within the realm of reinforcement learning.
Behavior Alignment via Reward Function Optimization
results: The results show that the proposed method can automatically adapt an RL agent's policy optimization process to mitigate suboptimalities arising from limitations and biases of the underlying algorithms. The approach is shown to apply across diverse tasks and environments and to consistently yield high-performing solutions, even when the auxiliary reward functions are misaligned or poorly specified. Abstract
Designing reward functions for efficiently guiding reinforcement learning (RL) agents toward specific behaviors is a complex task. This is challenging since it requires the identification of reward structures that are not sparse and that avoid inadvertently inducing undesirable behaviors. Naively modifying the reward structure to offer denser and more frequent feedback can lead to unintended outcomes and promote behaviors that are not aligned with the designer's intended goal. Although potential-based reward shaping is often suggested as a remedy, we systematically investigate settings where deploying it often significantly impairs performance. To address these issues, we introduce a new framework that uses a bi-level objective to learn \emph{behavior alignment reward functions}. These functions integrate auxiliary rewards reflecting a designer's heuristics and domain knowledge with the environment's primary rewards. Our approach automatically determines the most effective way to blend these types of feedback, thereby enhancing robustness against heuristic reward misspecification. Remarkably, it can also adapt an agent's policy optimization process to mitigate suboptimalities resulting from limitations and biases inherent in the underlying RL algorithms. We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges. We investigate heuristic auxiliary rewards of varying quality -- some of which are beneficial and others detrimental to the learning process. Our results show that our framework offers a robust and principled way to integrate designer-specified heuristics. It not only addresses key shortcomings of existing approaches but also consistently leads to high-performing solutions, even when given misaligned or poorly-specified auxiliary reward functions.
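For context on the remedy the abstract questions, here is a minimal sketch of potential-based reward shaping itself, the baseline often suggested as a fix; it is not the paper's bi-level behavior-alignment method. The grid-world goal, discount factor, and Manhattan-distance potential are illustrative assumptions.

```python
# Minimal sketch of potential-based reward shaping (the commonly suggested
# remedy discussed in the abstract, not the paper's bi-level method).
import numpy as np

GOAL = np.array([4, 4])
GAMMA = 0.99

def potential(state):
    """Heuristic potential: negative Manhattan distance to the goal."""
    return -float(np.abs(GOAL - state).sum())

def shaped_reward(state, next_state, env_reward):
    # F(s, s') = gamma * Phi(s') - Phi(s); adding F preserves optimal policies,
    # which is why shaping is often proposed as a safe way to densify feedback.
    return env_reward + GAMMA * potential(next_state) - potential(state)

# Example transition: one step toward the goal under a sparse environment reward.
s, s_next = np.array([0, 0]), np.array([0, 1])
print(shaped_reward(s, s_next, env_reward=0.0))
```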
Kernel-based Joint Multiple Graph Learning and Clustering of Graph Signals
results: Numerical experiments demonstrate that the proposed method outperforms state-of-the-art approaches. Abstract
Within the context of Graph Signal Processing (GSP), Graph Learning (GL) is concerned with the inference of a graph's topology from nodal observations, i.e., graph signals. However, data is often in mixed form, relating to different underlying structures. This heterogeneity necessitates the joint clustering and learning of multiple graphs. In many real-life applications, there are available node-side covariates (i.e., kernels) that imperatively should be incorporated, which has not been addressed by the rare graph signal clustering approaches. To this end and inspired by the rich K-means framework, we propose a novel kernel-based algorithm to incorporate this node-side information as we jointly partition the signals and learn a graph for each cluster. Numerical experiments demonstrate its effectiveness over the state-of-the-art.
A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning
results: The study observes that as the parameter count grows, test error appears to descend twice rather than follow the traditional U-shaped curve; with a change of perspective, this phenomenon is shown to arise from the overlap of multiple underlying complexity axes, with the second descent appearing exactly where the transition between these axes occurs. Abstract
Conventional statistical wisdom established a well-understood relationship between model complexity and prediction error, typically presented as a U-shaped curve reflecting a transition between under- and overfitting regimes. However, motivated by the success of overparametrized neural networks, recent influential work has suggested this theory to be generally incomplete, introducing an additional regime that exhibits a second descent in test error as the parameter count p grows past sample size n - a phenomenon dubbed double descent. While most attention has naturally been given to the deep-learning setting, double descent was shown to emerge more generally across non-neural models: known cases include linear regression, trees, and boosting. In this work, we take a closer look at evidence surrounding these more classical statistical machine learning methods and challenge the claim that observed cases of double descent truly extend the limits of a traditional U-shaped complexity-generalization curve therein. We show that once careful consideration is given to what is being plotted on the x-axes of their double descent plots, it becomes apparent that there are implicitly multiple complexity axes along which the parameter count grows. We demonstrate that the second descent appears exactly (and only) when and where the transition between these underlying axes occurs, and that its location is thus not inherently tied to the interpolation threshold p=n. We then gain further insight by adopting a classical nonparametric statistics perspective. We interpret the investigated methods as smoothers and propose a generalized measure for the effective number of parameters they use on unseen examples, using which we find that their apparent double descent curves indeed fold back into more traditional convex shapes - providing a resolution to tensions between double descent and statistical intuition.
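The smoother viewpoint in the closing sentences can be made concrete with the classical effective degrees of freedom of a ridge smoother, $\mathrm{tr}(S)$ with $S = X(X^\top X + \lambda I)^{-1}X^\top$. The sketch below only illustrates how the raw parameter count $p$ and the effective complexity can diverge once $p$ grows past $n$; the paper's own measure is a generalization defined on unseen examples, which this in-sample quantity does not reproduce.

```python
# Raw parameter count p versus the effective degrees of freedom of a ridge
# "smoother", df(lambda) = trace(S) with S = X (X^T X + lambda I)^{-1} X^T.
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 1.0

for p in (10, 50, 200, 1000):            # parameter count grows past n
    X = rng.standard_normal((n, p))
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    print(f"p = {p:5d}   effective df = {np.trace(S):6.2f}  (capped near n = {n})")
```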
TRIAGE: Characterizing and auditing training data for improved regression
results: Applying TRIAGE across multiple regression tasks, the authors show that its characterization is consistent and that it can also be used for dataset selection and feature acquisition. Overall, TRIAGE highlights the value of data characterization in real-world regression applications. Abstract
Data quality is crucial for robust machine learning algorithms, with the recent interest in data-centric AI emphasizing the importance of training data characterization. However, current data characterization methods are largely focused on classification settings, with regression settings largely understudied. To address this, we introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors. TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score. We operationalize the score to analyze individual samples' training dynamics and characterize samples as under-, over-, or well-estimated by the model. We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering in multiple regression settings. Additionally, beyond the sample level, we show TRIAGE enables new approaches to dataset selection and feature acquisition. Overall, TRIAGE highlights the value unlocked by data characterization in real-world regression applications.
Playing in the Dark: No-regret Learning with Adversarial Constraints
For: The paper studies an extension of the Online Convex Optimization (OCO) framework that includes additional long-term adversarial constraints. Specifically, after the online policy decides its action on a round, the adversary reveals not only a convex cost function but also a set of $k$ convex constraints. The cost and constraint functions may change arbitrarily over time, and no information about future functions is assumed.* Methods: The paper proposes a meta-policy that simultaneously achieves sublinear cumulative constraint violation and sublinear regret. This is obtained via a black-box reduction of the constrained problem to the standard OCO problem for a recursively constructed sequence of surrogate cost functions; the surrogate problem can be solved with any adaptive OCO policy satisfying a standard data-dependent regret bound.* Results: A new Lyapunov-based proof technique reveals a connection between regret and certain sequential inequalities through a novel decomposition result, from which optimal performance bounds follow. Finally, the framework is applied to online multi-task learning and network control problems. Abstract
We study a generalization of the classic Online Convex Optimization (OCO) framework by considering additional long-term adversarial constraints. Specifically, after an online policy decides its action on a round, in addition to a convex cost function, the adversary also reveals a set of $k$ convex constraints. The cost and the constraint functions could change arbitrarily with time, and no information about the future functions is assumed to be available. In this paper, we propose a meta-policy that simultaneously achieves a sublinear cumulative constraint violation and a sublinear regret. This is achieved via a black box reduction of the constrained problem to the standard OCO problem for a recursively constructed sequence of surrogate cost functions. We show that optimal performance bounds can be achieved by solving the surrogate problem using any adaptive OCO policy enjoying a standard data-dependent regret bound. A new Lyapunov-based proof technique is presented that reveals a connection between regret and certain sequential inequalities through a novel decomposition result. We conclude the paper by highlighting applications to online multi-task learning and network control problems.
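To make the problem setting concrete, the sketch below runs a simple primal-dual update on synthetic linear costs and constraints. This is not the paper's black-box reduction to surrogate cost functions; it is only a minimal illustration of what sublinear regret with sublinear cumulative constraint violation asks of an algorithm, with all problem data invented for the example.

```python
# Minimal sketch of OCO with long-term constraints via a primal-dual update.
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 5, 1000, 0.05
x, lam = np.zeros(d), 0.0                      # decision and dual variable

cum_violation = 0.0
for t in range(T):
    c_t = rng.standard_normal(d)               # adversarial cost gradient
    a_t, b_t = rng.standard_normal(d), 0.5     # constraint g_t(x) = a_t.x - b_t <= 0

    g_val = a_t @ x - b_t
    cum_violation += max(g_val, 0.0)

    # Gradient of the Lagrangian-style surrogate c_t.x + lam * g_t(x)
    grad = c_t + lam * a_t
    x = np.clip(x - eta * grad, -1.0, 1.0)     # projection onto a box
    lam = max(0.0, lam + eta * g_val)          # dual ascent on the constraint

print("cumulative constraint violation:", cum_violation)
```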
Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data
results: For nearly-orthogonal training data, gradient descent with the leaky ReLU activation finds a network whose stable rank converges to 1, whereas with the ReLU activation the stable rank remains upper bounded by a constant. In addition, gradient descent finds a network in which all training points asymptotically share the same normalized margin. Experiments match the theoretical results. Abstract
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well. While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks. Therefore, implicit bias in non-smooth neural networks trained by gradient descent remains an open question. In this paper, we aim to answer this question by studying the implicit bias of gradient descent for training two-layer fully connected (leaky) ReLU neural networks. We show that when the training data are nearly-orthogonal, for the leaky ReLU activation function, gradient descent will find a network with a stable rank that converges to $1$, whereas for the ReLU activation function, gradient descent will find a neural network with a stable rank that is upper bounded by a constant. Additionally, we show that gradient descent will find a neural network such that all the training data points have the same normalized margin asymptotically. Experiments on both synthetic and real data back up our theoretical findings.
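The two quantities the result statement tracks, the stable rank $\|W\|_F^2/\|W\|_2^2$ of the first-layer weights and the normalized margins $y_i f(x_i)/\|W\|_F$, are easy to monitor in a toy replication. The sketch below trains a tiny two-layer leaky-ReLU network on nearly-orthogonal synthetic data; the data, width, fixed second layer, and the particular normalization are illustrative choices rather than the paper's exact setup.

```python
# Track the first-layer stable rank and the normalized margins of a small
# two-layer leaky-ReLU network trained by gradient descent on toy data.
import torch

torch.manual_seed(0)
n, d, m = 20, 500, 64
X = torch.randn(n, d) / d**0.5                   # high dimension -> nearly-orthogonal rows
y = torch.sign(torch.randn(n))

W = (0.01 * torch.randn(m, d)).requires_grad_(True)
a = torch.sign(torch.randn(m)) / m               # fixed second layer (illustrative choice)

def f(X):
    return torch.nn.functional.leaky_relu(X @ W.T, negative_slope=0.1) @ a

opt = torch.optim.SGD([W], lr=0.5)
for step in range(2001):
    loss = torch.nn.functional.softplus(-y * f(X)).mean()   # logistic loss
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 500 == 0:
        with torch.no_grad():
            s = torch.linalg.svdvals(W)
            stable_rank = (s.pow(2).sum() / s[0] ** 2).item()
            margins = y * f(X) / W.norm()
            print(f"step {step:5d}  stable rank {stable_rank:6.2f}  "
                  f"margin spread {(margins.max() - margins.min()).item():.5f}")
```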
Remaining Useful Life Prediction of Lithium-ion Batteries using Spatio-temporal Multimodal Attention Networks
paper_authors: Sungho Suh, Dhruv Aditya Mittal, Hymalai Bello, Bo Zhou, Mayank Shekhar Jha, Paul Lukowicz
For: The paper aims to predict the remaining useful life of Lithium-ion batteries in real-world scenarios, addressing the limitations of existing methods and improving the reliability and efficiency of battery operations.* Methods: The proposed method uses a two-stage remaining useful life prediction scheme based on a spatio-temporal multimodal attention network (ST-MAN), which captures complex spatio-temporal dependencies in the battery data and neglected features such as temperature, internal resistance, and material type.* Results: The proposed ST-MAN model outperforms existing CNN and LSTM-based methods, achieving state-of-the-art performance in predicting the remaining useful life of Li-ion batteries.Abstract
Lithium-ion batteries are widely used in various applications, including electric vehicles and renewable energy storage. The prediction of the remaining useful life (RUL) of batteries is crucial for ensuring reliable and efficient operation, as well as reducing maintenance costs. However, determining the life cycle of batteries in real-world scenarios is challenging, and existing methods have limitations in predicting the number of cycles iteratively. In addition, existing works often oversimplify the datasets, neglecting important features of the batteries such as temperature, internal resistance, and material type. To address these limitations, this paper proposes a two-stage remaining useful life prediction scheme for Lithium-ion batteries using a spatio-temporal multimodal attention network (ST-MAN). The proposed model is designed to iteratively predict the number of cycles required for the battery to reach the end of its useful life, based on available data. The proposed ST-MAN is to capture the complex spatio-temporal dependencies in the battery data, including the features that are often neglected in existing works. Experimental results demonstrate that the proposed ST-MAN model outperforms existing CNN and LSTM-based methods, achieving state-of-the-art performance in predicting the remaining useful life of Li-ion batteries. The proposed method has the potential to improve the reliability and efficiency of battery operations and is applicable in various industries, including automotive and renewable energy.
Hyperbolic Graph Neural Networks at Scale: A Meta Learning Approach
methods: The method learns generalizable inductive biases from the local subgraphs of nodes and edges and transfers them for few-shot learning on new subgraphs. It introduces a new approach, the Hyperbolic GRAph Meta Learner (H-GRAM), which learns and transfers this inductive information for node classification and link prediction tasks to enable faster learning of new tasks.
results: Across multiple challenging few-shot settings, H-GRAM learns and transfers information effectively. Unlike standard HNNs, the method scales to large graph datasets and improves performance over its Euclidean counterparts. Abstract
The progress in hyperbolic neural networks (HNNs) research is hindered by their absence of inductive bias mechanisms, which are essential for generalizing to new tasks and facilitating scalable learning over large datasets. In this paper, we aim to alleviate these issues by learning generalizable inductive biases from the nodes' local subgraph and transfer them for faster learning over new subgraphs with a disjoint set of nodes, edges, and labels in a few-shot setting. We introduce a novel method, Hyperbolic GRAph Meta Learner (H-GRAM), that, for the tasks of node classification and link prediction, learns transferable information from a set of support local subgraphs in the form of hyperbolic meta gradients and label hyperbolic protonets to enable faster learning over a query set of new tasks dealing with disjoint subgraphs. Furthermore, we show that an extension of our meta-learning framework also mitigates the scalability challenges seen in HNNs faced by existing approaches. Our comparative analysis shows that H-GRAM effectively learns and transfers information in multiple challenging few-shot settings compared to other state-of-the-art baselines. Additionally, we demonstrate that, unlike standard HNNs, our approach is able to scale over large graph datasets and improve performance over its Euclidean counterparts.
Estimating the Rate-Distortion Function by Wasserstein Gradient Descent
results: Experiments show that the method obtains comparable or tighter bounds than neural-network baselines on low-rate sources while requiring considerably less tuning and computational effort. The paper also highlights a connection to maximum-likelihood deconvolution and introduces a new class of test sources with known rate-distortion solutions. Abstract
In the theory of lossy compression, the rate-distortion (R-D) function $R(D)$ describes how much a data source can be compressed (in bit-rate) at any given level of fidelity (distortion). Obtaining $R(D)$ for a given data source establishes the fundamental performance limit for all compression algorithms. We propose a new method to estimate $R(D)$ from the perspective of optimal transport. Unlike the classic Blahut--Arimoto algorithm which fixes the support of the reproduction distribution in advance, our Wasserstein gradient descent algorithm learns the support of the optimal reproduction distribution by moving particles. We prove its local convergence and analyze the sample complexity of our R-D estimator based on a connection to entropic optimal transport. Experimentally, we obtain comparable or tighter bounds than state-of-the-art neural network methods on low-rate sources while requiring considerably less tuning and computation effort. We also highlight a connection to maximum-likelihood deconvolution and introduce a new class of sources that can be used as test cases with known solutions to the R-D problem.
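For contrast with the proposed particle-based estimator, the sketch below implements the classic Blahut–Arimoto baseline the abstract mentions, which fixes the reproduction support in advance and iterates on a discrete source. The source distribution, reproduction grid, and trade-off parameter $\beta$ are illustrative; the output is a single $(R, D)$ point on the curve.

```python
# Blahut--Arimoto baseline for a discrete source with squared-error distortion.
import numpy as np

src = np.linspace(-1, 1, 21)                  # source alphabet
p_x = np.exp(-src**2 / 0.25); p_x /= p_x.sum()
rep = np.linspace(-1, 1, 21)                  # fixed reproduction support
D = (src[:, None] - rep[None, :])**2          # distortion matrix
beta = 8.0                                    # Lagrange multiplier (rate/distortion trade-off)

q_y = np.full(len(rep), 1.0 / len(rep))       # marginal over reproductions
for _ in range(500):
    # Optimal conditional q(y|x) for the current marginal, then update the marginal.
    q_y_given_x = q_y[None, :] * np.exp(-beta * D)
    q_y_given_x /= q_y_given_x.sum(axis=1, keepdims=True)
    q_y = p_x @ q_y_given_x

rate = np.sum(p_x[:, None] * q_y_given_x *
              np.log(q_y_given_x / q_y[None, :] + 1e-300)) / np.log(2)
dist = np.sum(p_x[:, None] * q_y_given_x * D)
print(f"one (R, D) point: rate ~ {rate:.3f} bits, distortion ~ {dist:.4f}")
```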
Topological, or Non-topological? A Deep Learning Based Prediction
results: The model achieves an accuracy of 91.4% and an F1 score of 88.5% in classifying topological versus non-topological materials, outperforming other state-of-the-art classifiers. Abstract
Prediction and discovery of new materials with desired properties are at the forefront of quantum science and technology research. A major bottleneck in this field is the computational resources and time complexity related to finding new materials from ab initio calculations. In this work, an effective and robust deep learning-based model is proposed by incorporating persistent homology and graph neural network which offers an accuracy of 91.4% and an F1 score of 88.5% in classifying topological vs. non-topological materials, outperforming the other state-of-the-art classifier models. The incorporation of the graph neural network encodes the underlying relation between the atoms into the model based on their own crystalline structures and thus proved to be an effective method to represent and process non-euclidean data like molecules with a relatively shallow network. The persistent homology pipeline in the suggested neural network is capable of integrating the atom-specific topological information into the deep learning model, increasing robustness, and gain in performance. It is believed that the presented work will be an efficacious tool for predicting the topological class and therefore enable the high-throughput search for novel materials in this field.
Learning Subgrid-Scale Models in Discontinuous Galerkin Methods with Neural Ordinary Differential Equations for Compressible Navier–Stokes Equations
results: The authors demonstrate the method on multidimensional Taylor–Green vortex examples covering laminar, transitional, and turbulent regimes, showing that it not only reconstructs the subgrid scales missing from the low-order model but also accelerates the filtered high-order DG simulation, improving both accuracy and efficiency. Abstract
The growing computing power over the years has enabled simulations to become more complex and accurate. However, high-fidelity simulations, while immensely valuable for scientific discovery and problem solving, come with significant computational demands. As a result, it is common to run a low-fidelity model with a subgrid-scale model to reduce the computational cost, but selecting the appropriate subgrid-scale models and tuning them are challenging. We propose a novel method for learning the subgrid-scale model effects when simulating partial differential equations using neural ordinary differential equations in the context of discontinuous Galerkin (DG) spatial discretization. Our approach learns the missing scales of the low-order DG solver at a continuous level and hence improves the accuracy of the low-order DG approximations as well as accelerates the filtered high-order DG simulations with a certain degree of precision. We demonstrate the performance of our approach through multidimensional Taylor--Green vortex examples at different Reynolds numbers and times, which cover laminar, transitional, and turbulent regimes. The proposed method not only reconstructs the subgrid-scale from the low-order (1st-order) approximation but also speeds up the filtered high-order DG (6th-order) simulation by two orders of magnitude.
D2NO: Efficient Handling of Heterogeneous Input Function Spaces with Distributed Deep Neural Operators
results: The proposed distributed approach reduces the number of gradient-descent back-propagation steps, improving efficiency without sacrificing accuracy, and is validated on four numerical examples. Abstract
Neural operators have been applied in various scientific fields, such as solving parametric partial differential equations, dynamical systems with control, and inverse problems. However, challenges arise when dealing with input functions that exhibit heterogeneous properties, requiring multiple sensors to handle functions with minimal regularity. To address this issue, discretization-invariant neural operators have been used, allowing the sampling of diverse input functions with different sensor locations. However, existing frameworks still require an equal number of sensors for all functions. In our study, we propose a novel distributed approach to further relax the discretization requirements and solve the heterogeneous dataset challenges. Our method involves partitioning the input function space and processing individual input functions using independent and separate neural networks. A centralized neural network is used to handle shared information across all output functions. This distributed methodology reduces the number of gradient descent back-propagation steps, improving efficiency while maintaining accuracy. We demonstrate that the corresponding neural network is a universal approximator of continuous nonlinear operators and present four numerical examples to validate its performance.
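A hedged reading of the distributed design is sketched below: input functions are partitioned into groups, each group gets its own independent encoder network, and a single shared network supplies the information common to all output functions, combined DeepONet-style by an inner product. The architecture sizes, number of groups, and the toy antiderivative data are assumptions for illustration, not the paper's configuration.

```python
# Sketch: per-group encoders plus one shared (centralized) network.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_groups, n_sensors, n_query = 2, 32, 16

class GroupEncoder(nn.Module):               # independent network for one partition
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_sensors, 64), nn.Tanh(), nn.Linear(64, 32))
    def forward(self, u):                    # u: (batch, n_sensors) sampled input function
        return self.net(u)

shared_trunk = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 32))
encoders = nn.ModuleList([GroupEncoder() for _ in range(n_groups)])
params = list(shared_trunk.parameters()) + list(encoders.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def predict(group, u, x):
    # Output function at query points x: inner product of the group-specific
    # encoding with the shared trunk features.
    b = encoders[group](u)                   # (batch, 32)
    t = shared_trunk(x.unsqueeze(-1))        # (n_query, 32)
    return b @ t.T                           # (batch, n_query)

# One illustrative training step per group on synthetic operator data.
x = torch.linspace(0, 1, n_query)
for group in range(n_groups):
    u = torch.randn(8, n_sensors)            # sampled input functions for this group
    target = torch.cumsum(u[:, :n_query], dim=1) / n_sensors   # toy "antiderivative" target
    loss = ((predict(group, u, x) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    print(f"group {group}: loss {loss.item():.4f}")
```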
A foundational neural operator that continuously learns without forgetting
results: The model can simultaneously learn solution operators for multiple parametric PDEs and quickly adapt to new ones, while maintaining positive transfer and avoiding catastrophic forgetting. Across an extensive set of benchmarks, it outperforms task-specific baseline operator learning frameworks at the prediction stage with minimal hyperparameter tuning. Abstract
Machine learning has witnessed substantial growth, leading to the development of advanced artificial intelligence models crafted to address a wide range of real-world challenges spanning various domains, such as computer vision, natural language processing, and scientific computing. Nevertheless, the creation of custom models for each new task remains a resource-intensive undertaking, demanding considerable computational time and memory resources. In this study, we introduce the concept of the Neural Combinatorial Wavelet Neural Operator (NCWNO) as a foundational model for scientific computing. This model is specifically designed to excel in learning from a diverse spectrum of physics and continuously adapt to the solution operators associated with parametric partial differential equations (PDEs). The NCWNO leverages a gated structure that employs local wavelet experts to acquire shared features across multiple physical systems, complemented by a memory-based ensembling approach among these local wavelet experts. This combination enables rapid adaptation to new challenges. The proposed foundational model offers two key advantages: (i) it can simultaneously learn solution operators for multiple parametric PDEs, and (ii) it can swiftly generalize to new parametric PDEs with minimal fine-tuning. The proposed NCWNO is the first foundational operator learning algorithm distinguished by its (i) robustness against catastrophic forgetting, (ii) the maintenance of positive transfer for new parametric PDEs, and (iii) the facilitation of knowledge transfer across dissimilar tasks. Through an extensive set of benchmark examples, we demonstrate that the NCWNO can outperform task-specific baseline operator learning frameworks with minimal hyperparameter tuning at the prediction stage. We also show that with minimal fine-tuning, the NCWNO performs accurate combinatorial learning of new parametric PDEs.
Simple and Asymmetric Graph Contrastive Learning without Augmentations
results: Experimental results show that GraphACL achieves excellent performance on heterophilic graphs and generalizes well on both homophilic and heterophilic graphs. Abstract
Graph Contrastive Learning (GCL) has shown superior performance in representation learning in graph-structured data. Despite their success, most existing GCL methods rely on prefabricated graph augmentation and homophily assumptions. Thus, they fail to generalize well to heterophilic graphs where connected nodes may have different class labels and dissimilar features. In this paper, we study the problem of conducting contrastive learning on homophilic and heterophilic graphs. We find that we can achieve promising performance simply by considering an asymmetric view of the neighboring nodes. The resulting simple algorithm, Asymmetric Contrastive Learning for Graphs (GraphACL), is easy to implement and does not rely on graph augmentations and homophily assumptions. We provide theoretical and empirical evidence that GraphACL can capture one-hop local neighborhood information and two-hop monophily similarity, which are both important for modeling heterophilic graphs. Experimental results show that the simple GraphACL significantly outperforms state-of-the-art graph contrastive learning and self-supervised learning methods on homophilic and heterophilic graphs. The code of GraphACL is available at https://github.com/tengxiao1/GraphACL.
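A simplified reading of the "asymmetric view of neighboring nodes" is sketched below: a node's representation is passed through an extra predictor and asked to match its neighbor's stop-gradient representation, with no graph augmentations. This is not the exact GraphACL objective (the linked repository has that); the toy graph, encoder, and negative-sampling scheme are illustrative.

```python
# Augmentation-free, asymmetric neighbor-contrastive sketch on a toy graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_nodes, d_in, d_hid = 100, 32, 64
X = torch.randn(n_nodes, d_in)
edges = torch.randint(0, n_nodes, (2, 400))          # random toy edge list (u, v)

encoder = nn.Sequential(nn.Linear(d_in, d_hid), nn.ReLU(), nn.Linear(d_hid, d_hid))
predictor = nn.Sequential(nn.Linear(d_hid, d_hid), nn.ReLU(), nn.Linear(d_hid, d_hid))
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

for step in range(100):
    h = encoder(X)
    h_u = predictor(h[edges[0]])                     # asymmetric side: gets the predictor
    h_v = h[edges[1]].detach()                       # target side: stop-gradient, no predictor
    pos = F.cosine_similarity(h_u, h_v, dim=-1).mean()
    # Push apart random (likely non-adjacent) node pairs as simple negatives.
    rnd = torch.randint(0, n_nodes, (edges.shape[1],))
    neg = F.cosine_similarity(h_u, h[rnd].detach(), dim=-1).mean()
    loss = -pos + neg
    opt.zero_grad(); loss.backward(); opt.step()

print("final loss:", loss.item())
```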
Correlation Aware Sparsified Mean Estimation Using Random Projection
results: Experiments on real-world distributed optimization tasks show that Rand-Proj-Spatial outperforms Rand-$k$-Spatial and other more sophisticated sparsification techniques. The paper also proposes a flexible variant of Rand-Proj-Spatial that incorporates varying degrees of cross-client correlation and demonstrates its effectiveness empirically. Abstract
We study the problem of communication-efficient distributed vector mean estimation, a commonly used subroutine in distributed optimization and Federated Learning (FL). Rand-$k$ sparsification is a commonly used technique to reduce communication cost, where each client sends $k < d$ of its coordinates to the server. However, Rand-$k$ is agnostic to any correlations, that might exist between clients in practical scenarios. The recently proposed Rand-$k$-Spatial estimator leverages the cross-client correlation information at the server to improve Rand-$k$'s performance. Yet, the performance of Rand-$k$-Spatial is suboptimal. We propose the Rand-Proj-Spatial estimator with a more flexible encoding-decoding procedure, which generalizes the encoding of Rand-$k$ by projecting the client vectors to a random $k$-dimensional subspace. We utilize Subsampled Randomized Hadamard Transform (SRHT) as the projection matrix and show that Rand-Proj-Spatial with SRHT outperforms Rand-$k$-Spatial, using the correlation information more efficiently. Furthermore, we propose an approach to incorporate varying degrees of correlation and suggest a practical variant of Rand-Proj-Spatial when the correlation information is not available to the server. Experiments on real-world distributed optimization tasks showcase the superior performance of Rand-Proj-Spatial compared to Rand-$k$-Spatial and other more sophisticated sparsification techniques.
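The projection primitive the estimator builds on can be sketched in a few lines: a client encodes its vector with a Subsampled Randomized Hadamard Transform (SRHT) and the server applies an unbiased decode. The dimensions and the $d/k$ rescaling below are standard SRHT choices used for illustration; the paper's contribution lies in how the decoded vectors are then combined using cross-client correlation, which is not modeled here.

```python
# SRHT encode/decode sketch for one client vector.
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
d, k = 64, 8                                   # d must be a power of two for hadamard()

x = rng.standard_normal(d)                     # a client's local vector
signs = rng.choice([-1.0, 1.0], size=d)        # random diagonal D
H = hadamard(d) / np.sqrt(d)                   # orthonormal Hadamard matrix
rows = rng.choice(d, size=k, replace=False)    # subsample k rows
P = (H * signs[None, :])[rows]                 # k x d SRHT matrix, P = (H D)[rows]

y = P @ x                                      # what the client transmits (k numbers)
x_hat = (d / k) * P.T @ y                      # server-side unbiased decode

print("relative error of one decode:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
# Averaged over many clients and rounds, E[x_hat] = x; the paper then exploits
# cross-client correlation to combine the decoded vectors more efficiently.
```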
methods: The algorithm uses max norm synchronization to catalyze training, retains on-device deep model training to preserve privacy, and relies on local inter-device communication to implement distributed consensus. Each device iteratively alternates between two phases: 1) on-device learning and 2) distributed cooperation, in which it combines model parameters with nearby devices.
results: All participating devices reach the same test performance attained by federated and centralized training, even with 100 devices and relaxed singly stochastic consensus weights. The results further extend to diverse network topologies, sparse and intermittent communication, and non-IID data distributions. Abstract
We present P2PL, a practical multi-device peer-to-peer deep learning algorithm that, unlike the federated learning paradigm, does not require coordination from edge servers or the cloud. This makes P2PL well-suited for the sheer scale of beyond-5G computing environments like smart cities that otherwise create range, latency, bandwidth, and single point of failure issues for federated approaches. P2PL introduces max norm synchronization to catalyze training, retains on-device deep model training to preserve privacy, and leverages local inter-device communication to implement distributed consensus. Each device iteratively alternates between two phases: 1) on-device learning and 2) distributed cooperation where they combine model parameters with nearby devices. We empirically show that all participating devices achieve the same test performance attained by federated and centralized training -- even with 100 devices and relaxed singly stochastic consensus weights. We extend these experimental results to settings with diverse network topologies, sparse and intermittent communication, and non-IID data distributions.
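The two alternating phases are easy to caricature with a toy consensus experiment: each device takes a local gradient step on its own objective and then gossip-averages parameters with its ring neighbors. The local quadratic objectives, topology, and mixing weights below are illustrative, and the max norm synchronization step from the abstract is not modeled.

```python
# Toy two-phase loop: local gradient steps followed by neighbor averaging.
import numpy as np

rng = np.random.default_rng(0)
n_dev, dim, lr = 8, 5, 0.1
targets = rng.standard_normal((n_dev, dim))          # each device's local optimum
params = np.zeros((n_dev, dim))

# Ring topology: each device mixes with itself and its two neighbors.
W = np.zeros((n_dev, n_dev))
for i in range(n_dev):
    W[i, i] = 0.5
    W[i, (i - 1) % n_dev] = 0.25
    W[i, (i + 1) % n_dev] = 0.25

for rnd in range(200):
    # Phase 1: on-device learning (one gradient step on a local quadratic loss).
    grads = params - targets
    params = params - lr * grads
    # Phase 2: distributed cooperation (gossip averaging with neighbors only).
    params = W @ params

print("spread across devices:", np.max(np.abs(params - params.mean(axis=0))))
print("distance to average of local optima:",
      np.linalg.norm(params.mean(axis=0) - targets.mean(axis=0)))
```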
Bayes beats Cross Validation: Efficient and Accurate Ridge Regression via Expectation Maximization
results: The study shows that, for large enough $n$, the method is guaranteed to find a unique optimal solution without requiring any hard-to-determine hyperparameters, and that with a suitable preprocessing step a single iteration of the main EM loop costs $O(\min(n, p))$ operations. By contrast, evaluating a single candidate $\lambda$ with fast LOOCV costs $O(n \min(n, p))$ operations under the same preprocessing, so the method avoids sweeping over $l$ candidate values of $\lambda$. Abstract
We present a novel method for tuning the regularization hyper-parameter, $\lambda$, of a ridge regression that is faster to compute than leave-one-out cross-validation (LOOCV) while yielding estimates of the regression parameters of equal, or particularly in the setting of sparse covariates, superior quality to those obtained by minimising the LOOCV risk. The LOOCV risk can suffer from multiple and bad local minima for finite $n$ and thus requires the specification of a set of candidate $\lambda$, which can fail to provide good solutions. In contrast, we show that the proposed method is guaranteed to find a unique optimal solution for large enough $n$, under relatively mild conditions, without requiring the specification of any difficult to determine hyper-parameters. This is based on a Bayesian formulation of ridge regression that we prove to have a unimodal posterior for large enough $n$, allowing for both the optimal $\lambda$ and the regression coefficients to be jointly learned within an iterative expectation maximization (EM) procedure. Importantly, we show that by utilizing an appropriate preprocessing step, a single iteration of the main EM loop can be implemented in $O(\min(n, p))$ operations, for input data with $n$ rows and $p$ columns. In contrast, evaluating a single value of $\lambda$ using fast LOOCV costs $O(n \min(n, p))$ operations when using the same preprocessing. This advantage amounts to an asymptotic improvement of a factor of $l$ for $l$ candidate values for $\lambda$ (in the regime $q, p \in O(\sqrt{n})$ where $q$ is the number of regression targets).
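The core idea, learning the regularizer jointly with the coefficients by EM on a Bayesian formulation of ridge regression, can be sketched with the textbook updates for a Gaussian-prior linear model with weight precision $\alpha$ and noise precision $\beta$ (so $\lambda = \alpha/\beta$). The sketch below does not reproduce the paper's exact parameterization, its unimodality argument, or the $O(\min(n, p))$-per-iteration preprocessing; it only shows the shape of the EM loop on synthetic data.

```python
# Textbook EM for a Gaussian-prior linear model; lambda = alpha / beta.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
w_true = np.zeros(p); w_true[:5] = rng.standard_normal(5)
y = X @ w_true + 0.5 * rng.standard_normal(n)

alpha, beta = 1.0, 1.0                      # prior and noise precisions
for _ in range(100):
    # E-step: Gaussian posterior over the weights.
    Sigma = np.linalg.inv(alpha * np.eye(p) + beta * X.T @ X)
    m = beta * Sigma @ X.T @ y
    # M-step: re-estimate the precisions from posterior expectations.
    alpha = p / (m @ m + np.trace(Sigma))
    beta = n / (np.sum((y - X @ m) ** 2) + np.trace(X @ Sigma @ X.T))

print("learned lambda = alpha / beta =", alpha / beta)
print("relative error of the posterior-mean (ridge) coefficients:",
      np.linalg.norm(m - w_true) / np.linalg.norm(w_true))
```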
SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
results: SiDA substantially speeds up MoE inference while keeping model quality essentially unchanged: it achieves up to a 3.93X throughput increase, up to 75% latency reduction, and up to 80% GPU memory savings, with a performance drop of no more than 1%. Abstract
Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet, the realization of such benefits often results in ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. Addressing this, we introduce SiDA (Sparsity-inspired Data-Aware), an efficient inference approach tailored for large MoE models. SiDA judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity on expert activation in MoE models. By adopting a data-aware perspective, SiDA achieves enhanced model efficiency with a neglectable performance drop. Specifically, SiDA attains a remarkable speedup in MoE inference with up to 3.93X throughput increasing, up to 75% latency reduction, and up to 80% GPU memory saving with down to 1% performance drop. This work paves the way for scalable and efficient deployment of large MoE models, even in memory-constrained systems.