
Angel: A Large-Scale Distributed Machine Learning Platform Based on the Parameter Server

一尘 跳动的数据 2023-09-29

At the end of last December, Angel graduated from the LF AI Foundation, the first open-source project from China to do so. This means Angel has earned recognition from technical experts worldwide and now ranks among the world's top open-source AI projects.

Now that it has graduated from LF AI, the license headers in the code will need to be updated; keep an eye on the open-source community for the details.


LF AI is the Linux Foundation's top-level foundation for the AI field.


The image above is the LF AI website's introduction to Angel. Interestingly, Alink, open-sourced by Alibaba, has also joined LF AI, as shown in the image below.



Overview

Angel is a large-scale distributed machine learning platform open-sourced by Tencent, focused on training high-dimensional models over sparse data. Angel is currently a Linux Foundation AI (LF AI) project. Compared with industry peers such as TensorFlow, PyTorch, and Spark, it has the following characteristics:

  • Angel is a high-performance distributed machine learning platform built on the Parameter Server (PS) paradigm. It offers flexible, customizable PS Functions (PSFs) that push part of the computation down to the PS side. The strong horizontal scalability of the PS architecture lets Angel efficiently handle models with hundreds of billions of parameters (a minimal sketch of the PSF idea follows Figure 1 below).

  • Angel ships with a math library specifically optimized for high-dimensional sparse features, with performance more than 10x that of the Breeze math library. Both Angel's PS and its built-in algorithm kernels are built on top of this library.

  • Angel excels in recommendation models and graph models (e.g., social network analysis). Figure 1 compares Angel with several mainstream industry platforms along five dimensions: sparse data, model dimensionality, performance, deep models, and ecosystem. TensorFlow and PyTorch hold clear advantages in deep learning and ecosystem building, but their handling of sparse data and high-dimensional models is relatively weak; Angel complements them, and PyTorch On Angel, introduced in version 3.0, attempts to combine the strengths of PyTorch and Angel.

Figure 1: Comparison between Angel and mainstream industry platforms
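To make the PSF idea concrete, here is a minimal, self-contained Scala sketch; all names in it are illustrative assumptions, not Angel's actual API. Instead of pulling a huge sparse row to the worker and reducing it locally, the worker ships a small function that the server evaluates against its local, hash-backed partition (the same style of sparse storage the math-library point above alludes to), and only a scalar crosses the network:

```scala
import scala.collection.mutable

// Illustrative stand-in for one server-side partition of a model row:
// a sparse vector stored as featureIndex -> weight in a hash-backed map.
class ServerPartition(val weights: mutable.LongMap[Double]) {

  // A "PS function" in miniature: the worker ships the logic, the server
  // runs it against its local partition and returns only the small result.
  def runPsf(psf: Iterator[(Long, Double)] => Double): Double =
    psf(weights.iterator)
}

object PsfDemo {
  def main(args: Array[String]): Unit = {
    val part = new ServerPartition(
      mutable.LongMap(1L -> 0.5, 100000L -> -1.2, 999999999L -> 0.3))

    // Push an L2-norm computation down to the "server": eight bytes cross
    // the wire instead of a row with billions of potential dimensions.
    val sqSum = part.runPsf(it => it.map { case (_, w) => w * w }.sum)
    println(math.sqrt(sqSum)) // ~1.33
  }
}
```

For a model with hundreds of billions of potential dimensions, returning an aggregate instead of the row itself is exactly the traffic saving that lets the PS architecture scale horizontally.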



Angel 3.0 Overall Architecture



Angel ChangeLog

Release-2.2.0 - 2019-05-06

In this release, we have enhanced the graph algorithms: (1) we refactored the existing K-Core algorithm, significantly improving its performance and stability; (2) we added the Louvain algorithm, also known as Fast-Unfolding. Test results show that both K-Core and Louvain run 10x faster than their GraphX counterparts. This release also officially introduces Vero, a new GBDT implementation over Spark on Angel; its feature-parallel design gives it clear advantages on high-dimensional models and multi-classification problems. We also add Kerberos support in this release.
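To unpack "feature-parallel": each worker owns a vertical slice of the feature columns, finds the best split among its own columns, and the workers then exchange just one candidate apiece to agree on the global best, so split-finding traffic does not grow with the feature dimension. A minimal sketch of that reduce step, with illustrative names rather than Vero's actual code:

```scala
// One worker's best split over the vertical slice of features it owns.
case class SplitCandidate(featureId: Int, threshold: Double, gain: Double)

object FeatureParallelSplit {
  // Under feature parallelism, each worker scans only its own columns and
  // proposes one candidate; the reduce exchanges one record per worker,
  // not per-feature histograms, so traffic is independent of dimension.
  def globalBest(perWorkerBest: Seq[SplitCandidate]): SplitCandidate =
    perWorkerBest.maxBy(_.gain)

  def main(args: Array[String]): Unit = {
    val best = globalBest(Seq(
      SplitCandidate(featureId = 3, threshold = 0.7, gain = 1.9),
      SplitCandidate(featureId = 120004, threshold = 2.5, gain = 3.2),
      SplitCandidate(featureId = 98, threshold = 0.1, gain = 2.4)))
    println(best) // SplitCandidate(120004,2.5,3.2)
  }
}
```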

New features in Release-2.2.0:

  • Add Fast Unfolding algorithm in Spark-on-Angel

  • Support prediction for FTRL-LR in Spark-on-Angel

  • Support prediction for FTRL-FM in Spark-on-Angel

  • Add Vero, a feature-parallel GBDT implementation on Spark-on-Angel

  • Support regression for GBDT on Spark-on-Angel

  • Add a new data split input format, BalanceInputFormatV2

  • Support running over Kubernetes

Bugs fixed in Release-2.2.0:

  • Fix the failure to load a model after the model has been moved, and disable the csc check

  • Fix the problem that parameter servers exit with errors in Spark-on-Angel

  • Fix the problem that the sparse index pull interface might block when the given parameters are invalid

  • Fix the problem that saving results would fail if the parent path does not exist

  • Fix the problem that BalanceInputFormat would sometimes return empty splits

  • Fix a problem when saving JSON configuration files

  • Fix a problem when requesting resources for Angel workers

Release-2.1.0 - 2019-03-08

In this release, we add an intelligent model-partitioning method in Spark-on-Angel, named "LoadBalancePartitioner". By analyzing the distribution of features in the training data in advance, the number of features on each partition can be precisely controlled, which yields a balanced load on each server. Empirical tests demonstrate that training efficiency can be greatly improved in many cases; a sketch of the idea follows below. Further, we add three algorithms in this release: FM solved by the FTRL optimizer, the K-Core algorithm, and a feature-parallel GBDT that can support high-dimensional tree models.
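The partitioning idea is simple enough to sketch: one preliminary pass over the training data counts how often each feature index occurs, and the index range is then cut so that every partition carries roughly the same observed load, so no parameter server becomes a hotspot. A hypothetical illustration (all names assumed, not the actual LoadBalancePartitioner code):

```scala
object LoadBalanceSketch {
  // stats: (featureIndex, occurrenceCount) pairs, sorted by featureIndex,
  // gathered in a preliminary pass over the training data.
  // Returns the exclusive upper bound of each partition's index range.
  def partitionBounds(stats: Seq[(Long, Long)], numPartitions: Int): Seq[Long] = {
    val total = stats.map(_._2).sum.toDouble
    val bounds = scala.collection.mutable.ArrayBuffer.empty[Long]
    var acc = 0.0
    for ((idx, cnt) <- stats if bounds.size < numPartitions - 1) {
      acc += cnt
      // Cut whenever the accumulated load reaches the next equal share.
      if (acc >= total / numPartitions * (bounds.size + 1)) bounds += idx + 1
    }
    bounds += Long.MaxValue // the last partition takes the remaining tail
    bounds.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Skewed counts, as is typical for frequency-ordered sparse features.
    val stats = Seq(0L -> 500L, 1L -> 400L, 2L -> 100L,
                    3L -> 500L, 4L -> 300L, 5L -> 200L)
    // Two partitions of equal load (1000 each): [0, 3) and [3, MaxValue)
    println(partitionBounds(stats, 2))
  }
}
```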

New features in Release-2.1.0:

  • Add a load-balanced model partitioner, "LoadBalancePartitioner", in Spark-on-Angel

  • Add the FTRL-FM algorithm

  • Add the K-Core algorithm

  • Add a feature-parallel version of the GBDT algorithm

Release-2.0.2 - 2019-01-30

In this release, we optimize the performance of the FTRL algorithm and add support for the float data type. We limit the maximum number of retries for remote requests to avoid unrecoverable blocking (the pattern is sketched below). We also improve the performance of the math library.
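The retry cap is the standard defence against a request that can never succeed, for example one aimed at a permanently lost server: try a few times with a backoff, then surface the failure rather than block forever. A generic sketch of the pattern (not Angel's internal RPC code):

```scala
import scala.util.{Failure, Try}

object BoundedRetry {
  // Attempt `request` at most `maxRetries` times with a fixed backoff,
  // then surface the last failure instead of blocking forever.
  def withRetry[T](maxRetries: Int, backoffMs: Long)(request: => T): Try[T] = {
    var attempt = 0
    var result: Try[T] = Failure(new IllegalStateException("not attempted"))
    while (attempt < maxRetries && result.isFailure) {
      result = Try(request)
      attempt += 1
      if (result.isFailure && attempt < maxRetries) Thread.sleep(backoffMs)
    }
    result
  }

  def main(args: Array[String]): Unit = {
    var calls = 0
    val r = withRetry(maxRetries = 3, backoffMs = 10) {
      calls += 1
      if (calls < 3) throw new RuntimeException("transient failure") else "ok"
    }
    println(r) // Success(ok): succeeded on the third attempt
  }
}
```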

New features in Release-2.0.2:

  • Optimize the model partitioning for the FTRL algorithm

  • Support the float data type for the FTRL algorithm

  • Avoid rehashing in the math library to improve performance (see the sketch after this list)

  • Add a maximum retry count for remote requests on servers
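The rehashing point deserves a word: if a hash-backed sparse vector grows entry by entry, the table repeatedly resizes, and every resize rehashes all existing entries. When the number of non-zeros is known up front, pre-sizing the table removes that cost entirely. A generic illustration (assumed code, not the Angel math library itself):

```scala
import scala.collection.mutable

object RehashDemo {
  // Accumulate a sparse gradient whose number of non-zeros is known.
  def accumulate(indices: Array[Long], values: Array[Double]): mutable.LongMap[Double] = {
    // Pre-sizing the table for the known number of entries means inserts
    // never trigger a resize, which would rehash every existing entry.
    val acc = new mutable.LongMap[Double](indices.length)
    var i = 0
    while (i < indices.length) {
      acc(indices(i)) = acc.getOrElse(indices(i), 0.0) + values(i)
      i += 1
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    val m = accumulate(Array(7L, 42L, 7L), Array(1.0, 2.0, 0.5))
    println(m(7L)) // 1.5
  }
}
```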

Release-2.0.1 - 2019-01-11

In this release, we add support for incremental training with FTRL. We implement some new optimizers and learning-rate scheduling strategies. Documentation on how to choose optimizers and scheduling strategies, and on how to accelerate deep learning algorithms with OpenBLAS, is also provided in this release.
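Since FTRL recurs throughout these release notes, the update itself is worth showing. This is the standard per-coordinate FTRL-Proximal rule in textbook form (a generic sketch, independent of Angel's implementation); its L1 term keeps small coordinates at exactly zero, which is why it suits the sparse, high-dimensional models Angel targets:

```scala
// The standard per-coordinate FTRL-Proximal update (McMahan et al., 2013);
// a generic sketch, not Angel's code.
class FtrlCoordinate(alpha: Double, beta: Double, l1: Double, l2: Double) {
  private var z = 0.0 // accumulated shifted gradients
  private var n = 0.0 // accumulated squared gradients

  // Current weight: the L1 term keeps small coordinates at exactly zero,
  // which is where FTRL's model sparsity comes from.
  def weight: Double =
    if (math.abs(z) <= l1) 0.0
    else -(z - math.signum(z) * l1) / ((beta + math.sqrt(n)) / alpha + l2)

  // Online update of this coordinate with gradient g.
  def update(g: Double): Unit = {
    val sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / alpha
    z += g - sigma * weight // uses the weight from before this update
    n += g * g
  }
}

object FtrlDemo {
  def main(args: Array[String]): Unit = {
    val c = new FtrlCoordinate(alpha = 0.1, beta = 1.0, l1 = 1.0, l2 = 0.1)
    Seq(0.3, -0.2, 0.5).foreach(c.update)
    println(c.weight) // still exactly 0.0: |z| has not yet exceeded l1
  }
}
```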

New features in Release-2.0.1:

  • Add documentation on how to use OpenBLAS to accelerate deep learning algorithms

  • Optimize the performance for FTRL

  • Support incremental training for FTRL

  • Add optimizers with L1 penalty: Adagrad/Adadelta

  • Add some scheduling strategies for learning rate

Bugs fixed in Release-2.0.1:

  • Fix inconsistent node counts in network embedding

  • Fix a casting problem in quantile compression


Owner

Angel is a big project: it consists of a series of sub-projects, and each sub-project has an owner and a backup owner:

  • Angel: paynie, leleyu

  • sona: fitzwang, leleyu

  • mlcore: fitzwang, endymecy

  • math: rachelsunrh, fitzwang

  • serving: ouyangwen, fitzwang

  • format: paynie, raohuaming

  • PyTorchOnAngel: leleyu, ouyangwen
