cs.LG - 2023-08-26

Unveiling the Role of Message Passing in Dual-Privacy Preservation on GNNs

  • paper_url: http://arxiv.org/abs/2308.13513
  • repo_url: None
  • paper_authors: Tianyi Zhao, Hui Hu, Lu Cheng
  • for: This paper aims to address the privacy leakage issue in Graph Neural Networks (GNNs) and propose a principled privacy-preserving GNN framework.
  • methods: The proposed framework consists of three major modules: Sensitive Information Obfuscation Module, Dynamic Structure Debiasing Module, and Adversarial Learning Module.
  • results: Experimental results on four benchmark datasets show that the proposed model effectively protects both node and link privacy while preserving high utility for downstream tasks such as node classification.
    Abstract Graph Neural Networks (GNNs) are powerful tools for learning representations on graphs, such as social networks. However, their vulnerability to privacy inference attacks restricts their practicality, especially in high-stake domains. To address this issue, privacy-preserving GNNs have been proposed, focusing on preserving node and/or link privacy. This work takes a step back and investigates how GNNs contribute to privacy leakage. Through theoretical analysis and simulations, we identify message passing under structural bias as the core component that allows GNNs to propagate and amplify privacy leakage. Building upon these findings, we propose a principled privacy-preserving GNN framework that effectively safeguards both node and link privacy, referred to as dual-privacy preservation. The framework comprises three major modules: a Sensitive Information Obfuscation Module that removes sensitive information from node embeddings, a Dynamic Structure Debiasing Module that dynamically corrects the structural bias, and an Adversarial Learning Module that optimizes the privacy-utility trade-off. Experimental results on four benchmark datasets validate the effectiveness of the proposed model in protecting both node and link privacy while preserving high utility for downstream tasks, such as node classification.
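The Adversarial Learning Module described in the abstract pits the GNN encoder against an attacker that tries to infer a sensitive attribute from node embeddings. The sketch below is a minimal, hypothetical illustration of that privacy-utility trade-off, not the authors' implementation; the one-layer mean-aggregation encoder, the 0.5 adversarial weight, and all dimensions and names are assumptions made for the example.

```python
# Minimal sketch of adversarial privacy-utility training for a GNN (assumed setup).
import torch
import torch.nn as nn

class SimpleGNNEncoder(nn.Module):
    """One round of message passing (adjacency-weighted aggregation) plus a linear map."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)

    def forward(self, x, adj):
        # adj is assumed to be a normalized (N, N) adjacency matrix with self-loops.
        return torch.relu(self.lin(adj @ x))

# Hypothetical dimensions and random data, for illustration only.
N, in_dim, hid_dim, n_classes = 100, 16, 32, 4
x = torch.randn(N, in_dim)
adj = torch.eye(N)                          # placeholder graph structure
labels = torch.randint(0, n_classes, (N,))  # downstream task labels (utility)
sensitive = torch.randint(0, 2, (N,))       # private node attribute to protect

encoder = SimpleGNNEncoder(in_dim, hid_dim)
task_head = nn.Linear(hid_dim, n_classes)   # utility objective (node classification)
adversary = nn.Linear(hid_dim, 2)           # tries to recover the sensitive attribute

opt_main = torch.optim.Adam(list(encoder.parameters()) + list(task_head.parameters()), lr=1e-2)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-2)
ce = nn.CrossEntropyLoss()

for step in range(100):
    # 1) Train the adversary to infer the sensitive attribute from frozen embeddings.
    z = encoder(x, adj).detach()
    opt_adv.zero_grad()
    ce(adversary(z), sensitive).backward()
    opt_adv.step()

    # 2) Train encoder + task head: keep task utility high while fooling the adversary.
    opt_main.zero_grad()
    z = encoder(x, adj)
    loss = ce(task_head(z), labels) - 0.5 * ce(adversary(z), sensitive)
    loss.backward()
    opt_main.step()
```

Alternating the two updates keeps the adversary calibrated while the encoder learns embeddings that stay useful for node classification but reveal little about the sensitive attribute.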

TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs

  • paper_url: http://arxiv.org/abs/2308.13490
  • repo_url: https://github.com/google-research-datasets/tpu_graphs
  • paper_authors: Phitchaya Mangpo Phothilimthana, Sami Abu-El-Haija, Kaidi Cao, Bahare Fatemi, Charith Mendis, Bryan Perozzi
  • for: This paper provides a large-scale performance prediction dataset on tensor computational graphs, intended to help optimize machine learning compilers and autotuners.
  • methods: The dataset represents machine learning workloads as computational graphs and collects them from open-source machine learning programs covering a variety of popular model architectures.
  • results: The resulting dataset, TpuGraphs, contains graphs that each represent a main computation, e.g., a training epoch or an inference step. It offers over 25x more graphs than comparable datasets, with graphs over 770x larger on average, and introduces new challenges in scalability, training efficiency, and model quality.
    Abstract Precise hardware performance models play a crucial role in code optimizations. They can assist compilers in making heuristic decisions or aid autotuners in identifying the optimal configuration for a given program. For example, the autotuner for XLA, a machine learning compiler, discovered 10-20% speedup on state-of-the-art models serving substantial production traffic at Google. Although there exist a few datasets for program performance prediction, they target small sub-programs such as basic blocks or kernels. This paper introduces TpuGraphs, a performance prediction dataset on full tensor programs, represented as computational graphs, running on Tensor Processing Units (TPUs). Each graph in the dataset represents the main computation of a machine learning workload, e.g., a training epoch or an inference step. Each data sample contains a computational graph, a compilation configuration, and the execution time of the graph when compiled with the configuration. The graphs in the dataset are collected from open-source machine learning programs, featuring popular model architectures, e.g., ResNet, EfficientNet, Mask R-CNN, and Transformer. TpuGraphs provides 25x more graphs than the largest graph property prediction dataset (with comparable graph sizes), and 770x larger graphs on average compared to existing performance prediction datasets on machine learning programs. This graph-level prediction task on large graphs introduces new challenges in learning, ranging from scalability and training efficiency to model quality.
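Each sample in the dataset pairs a computational graph and a compilation configuration with a measured execution time, and the task is graph-level runtime regression. The sketch below illustrates that setup with a deliberately simplified model; it does not use the dataset's actual schema or loaders (see the repo_url above for those), and the feature dimensions, pooling choice, and names are assumptions.

```python
# Hypothetical sketch of the TpuGraphs-style prediction task: regress a graph's
# runtime from its op-node features plus the compilation configuration.
import torch
import torch.nn as nn

class RuntimePredictor(nn.Module):
    """Graph-level regressor: mean-pool node features, concatenate the config vector."""
    def __init__(self, node_dim, config_dim, hid_dim=64):
        super().__init__()
        self.node_mlp = nn.Sequential(nn.Linear(node_dim, hid_dim), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(hid_dim + config_dim, hid_dim),
                                  nn.ReLU(), nn.Linear(hid_dim, 1))

    def forward(self, node_feats, config):
        pooled = self.node_mlp(node_feats).mean(dim=0)   # (hid_dim,) graph summary
        return self.head(torch.cat([pooled, config]))    # predicted runtime

# Toy sample with assumed shapes: 500 op nodes, 10-dim node features,
# 8-dim compilation configuration, runtime in milliseconds.
node_feats = torch.randn(500, 10)
config = torch.randn(8)
runtime_ms = torch.tensor([3.2])

model = RuntimePredictor(node_dim=10, config_dim=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = nn.functional.mse_loss(model(node_feats, config), runtime_ms)
loss.backward()
opt.step()
```

In practice a message-passing model over the graph structure would replace the plain mean pooling, but the input/output contract per sample stays the same: (graph, configuration) in, runtime out.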

Staleness-Alleviated Distributed GNN Training via Online Dynamic-Embedding Prediction

  • paper_url: http://arxiv.org/abs/2308.13466
  • repo_url: None
  • paper_authors: Guangji Bai, Ziyang Yu, Zheng Chai, Yue Cheng, Liang Zhao
  • for: This paper tackles the challenges of training Graph Neural Networks (GNNs) on large-scale graphs, especially the difficulty of synchronizing work across multiple machines in distributed training.
  • methods: The paper uses distributed computing to address this problem and relies on historical value approximation (cached historical embeddings) to achieve high concurrency.
  • results: The paper proposes SAT (Staleness-Alleviated Training), a new distributed GNN training framework that effectively reduces the staleness of cached node embeddings. Experiments show that SAT achieves better performance and faster training on multiple large-scale graph datasets.
    Abstract Despite the recent success of Graph Neural Networks (GNNs), it remains challenging to train GNNs on large-scale graphs due to neighbor explosions. As a remedy, distributed computing becomes a promising solution by leveraging abundant computing resources (e.g., GPU). However, the node dependency of graph data increases the difficulty of achieving high concurrency in distributed GNN training, which suffers from the massive communication overhead. To address this, historical value approximation is deemed a promising class of distributed training techniques. It utilizes an offline memory to cache historical information (e.g., node embeddings) as an affordable approximation of the exact value and achieves high concurrency. However, such benefits come at the cost of involving dated training information, leading to staleness, imprecision, and convergence issues. To overcome these challenges, this paper proposes SAT (Staleness-Alleviated Training), a novel and scalable distributed GNN training framework that reduces the embedding staleness adaptively. The key idea of SAT is to model the GNN's embedding evolution as a temporal graph and build a model upon it to predict future embeddings, which effectively alleviates the staleness of the cached historical embeddings. We propose an online algorithm to train the embedding predictor and the distributed GNN alternately and further provide a convergence analysis. Empirically, we demonstrate that SAT can effectively reduce embedding staleness and thus achieve better performance and convergence speed on multiple large-scale graph datasets.
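The core idea of SAT is to treat cached historical embeddings as a time series and predict what the current embedding should be, rather than consuming stale values directly. The sketch below is a stand-in illustration only: it replaces the paper's learned temporal-graph predictor with a simple linear extrapolation over the two most recent cached values, and all class and parameter names are hypothetical.

```python
# Minimal sketch of staleness alleviation for a historical-embedding cache
# (assumed design; SAT itself learns a predictor over a temporal graph).
import torch

class HistoricalEmbeddingCache:
    def __init__(self, num_nodes, dim):
        self.prev = torch.zeros(num_nodes, dim)   # embeddings from two syncs ago
        self.curr = torch.zeros(num_nodes, dim)   # most recently cached embeddings

    def update(self, node_ids, new_embeddings):
        """Called whenever a worker computes fresh embeddings for some nodes."""
        self.prev[node_ids] = self.curr[node_ids]
        self.curr[node_ids] = new_embeddings.detach()

    def predict(self, node_ids):
        """Extrapolate one step ahead to reduce the staleness of cached values."""
        return self.curr[node_ids] + (self.curr[node_ids] - self.prev[node_ids])

# Usage: remote neighbors' embeddings are read from the cache instead of being
# recomputed, trading exactness for concurrency as in historical-value methods.
cache = HistoricalEmbeddingCache(num_nodes=1000, dim=64)
cache.update(torch.tensor([0, 1, 2]), torch.randn(3, 64))
approx = cache.predict(torch.tensor([0, 1, 2]))   # staleness-corrected estimates
```

In SAT the extrapolation step would be a trained model updated online alongside the distributed GNN, which is what the convergence analysis in the abstract refers to.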