
Angel: A Large-Scale Distributed Machine Learning Platform Based on the Parameter Server

一尘 跳动的数据 2023-09-29

At the end of last December, Angel graduated from the LF AI Foundation, the first open-source project from China to do so. This means Angel has earned recognition from technical experts worldwide and now ranks among the world's top open-source AI projects.

Now that it has graduated from LF AI, the license headers in the code will need to be updated; keep an eye on the open-source community for the details.


LF AI is the Linux Foundation's top-level foundation for the AI field.


The image above is the LF AI website's introduction to Angel. Interestingly, Alink, open-sourced by Alibaba, has also joined LF AI, as shown in the image below.



Overview

Angel is a large-scale distributed machine learning platform open-sourced by Tencent, focused on training high-dimensional models over sparse data. Angel is currently a Linux Foundation AI (LF AI) project. Compared with industry peers such as TensorFlow, PyTorch, and Spark, it has the following characteristics:

  • Angel is a high-performance distributed machine learning platform built on the Parameter Server (PS) paradigm. It offers flexible, customizable PS Functions (PSFs) that push part of the computation down to the PS side. The strong horizontal scalability of the PS architecture lets Angel efficiently handle models with hundreds of billions of parameters (a minimal sketch of the PSF idea follows Figure 1 below).

  • Angel ships with a math library specifically optimized for high-dimensional sparse features, with performance more than 10x that of the Breeze math library. Both Angel's PS and its built-in algorithm kernels are built on top of this library.

  • Angel excels in recommendation models and graph models (e.g., social network analysis). Figure 1 compares Angel with several mainstream industry platforms along five dimensions: sparse data, model dimensionality, performance, deep models, and ecosystem. TensorFlow and PyTorch hold clear advantages in deep learning and ecosystem building, but their handling of sparse data and high-dimensional models is relatively weak; Angel complements them, and PyTorch On Angel, introduced in version 3.0, attempts to combine the strengths of PyTorch and Angel.

Figure 1: Comparison between Angel and mainstream industry platforms
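To make the PSF idea concrete, here is a minimal, self-contained Scala sketch; all names in it are illustrative assumptions, not Angel's actual API. Instead of pulling a huge sparse row to the worker and reducing it locally, the worker ships a small function that the server evaluates against its local, hash-backed partition (the same style of sparse storage the math-library point above alludes to), and only a scalar crosses the network:

```scala
import scala.collection.mutable

// Illustrative stand-in for one server-side partition of a model row:
// a sparse vector stored as featureIndex -> weight in a hash-backed map.
class ServerPartition(val weights: mutable.LongMap[Double]) {

  // A "PS function" in miniature: the worker ships the logic, the server
  // runs it against its local partition and returns only the small result.
  def runPsf(psf: Iterator[(Long, Double)] => Double): Double =
    psf(weights.iterator)
}

object PsfDemo {
  def main(args: Array[String]): Unit = {
    val part = new ServerPartition(
      mutable.LongMap(1L -> 0.5, 100000L -> -1.2, 999999999L -> 0.3))

    // Push an L2-norm computation down to the "server": eight bytes cross
    // the wire instead of a row with billions of potential dimensions.
    val sqSum = part.runPsf(it => it.map { case (_, w) => w * w }.sum)
    println(math.sqrt(sqSum)) // ~1.33
  }
}
```

For a model with hundreds of billions of potential dimensions, returning an aggregate instead of the row itself is exactly the traffic saving that lets the PS architecture scale horizontally.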



Angel 3.0 Overall Architecture



Angel ChangeLog

Release-2.2.0 - 2019-05-06

In this release, we have enhanced the graph algorithms: (1) we refactored the existing K-Core algorithm, significantly improving its performance and stability; (2) we added the Louvain algorithm, also known as Fast-Unfolding. Test results show that both K-Core and Louvain run 10x faster than their GraphX counterparts. This release also officially introduces Vero, a new GBDT implementation over Spark on Angel; its feature-parallel design gives it clear advantages on high-dimensional models and multi-classification problems. We also add Kerberos support in this release.
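To unpack "feature-parallel": each worker owns a vertical slice of the feature columns, finds the best split among its own columns, and the workers then exchange just one candidate apiece to agree on the global best, so split-finding traffic does not grow with the feature dimension. A minimal sketch of that reduce step, with illustrative names rather than Vero's actual code:

```scala
// One worker's best split over the vertical slice of features it owns.
case class SplitCandidate(featureId: Int, threshold: Double, gain: Double)

object FeatureParallelSplit {
  // Under feature parallelism, each worker scans only its own columns and
  // proposes one candidate; the reduce exchanges one record per worker,
  // not per-feature histograms, so traffic is independent of dimension.
  def globalBest(perWorkerBest: Seq[SplitCandidate]): SplitCandidate =
    perWorkerBest.maxBy(_.gain)

  def main(args: Array[String]): Unit = {
    val best = globalBest(Seq(
      SplitCandidate(featureId = 3, threshold = 0.7, gain = 1.9),
      SplitCandidate(featureId = 120004, threshold = 2.5, gain = 3.2),
      SplitCandidate(featureId = 98, threshold = 0.1, gain = 2.4)))
    println(best) // SplitCandidate(120004,2.5,3.2)
  }
}
```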

New features in Release-2.2.0:

  • Add Fast Unfolding algorithm in Spark-on-Angel

  • Support prediction for FTRL-LR in Spark-on-Angel

  • Support prediction for FTRL-FM in Spark-on-Angel

  • Add Vero, a feature-parallel GBDT implementation on Spark-on-Angel

  • Support regression for GBDT on Spark-on-Angel

  • Add a new data split input format, BalanceInputFormatV2

  • Support running over Kubernetes

Bugs fixed in Release-2.2.0:

  • Fix the failure to load a model after the model has been moved, and disable the csc check

  • Fix the problem that parameter servers exit with errors in Spark-on-Angel

  • Fix the problem that the sparse index pull interface might block when the given parameters are invalid

  • Fix the problem that saving results would fail if the parent path does not exist

  • Fix the problem that BalanceInputFormat would sometimes return empty splits

  • Fix a problem when saving JSON configuration files

  • Fix a problem when requesting resources for Angel workers

Release-2.1.0 - 2019-03-08

In this release, we add an intelligent model-partitioning method in Spark-on-Angel, named "LoadBalancePartitioner". By analyzing the distribution of features in the training data in advance, the number of features on each partition can be precisely controlled, which yields a balanced load on each server. Empirical tests demonstrate that training efficiency can be greatly improved in many cases; a sketch of the idea follows below. Further, we add three algorithms in this release: FM solved by the FTRL optimizer, the K-Core algorithm, and a feature-parallel GBDT that can support high-dimensional tree models.
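The partitioning idea is simple enough to sketch: one preliminary pass over the training data counts how often each feature index occurs, and the index range is then cut so that every partition carries roughly the same observed load, so no parameter server becomes a hotspot. A hypothetical illustration (all names assumed, not the actual LoadBalancePartitioner code):

```scala
object LoadBalanceSketch {
  // stats: (featureIndex, occurrenceCount) pairs, sorted by featureIndex,
  // gathered in a preliminary pass over the training data.
  // Returns the exclusive upper bound of each partition's index range.
  def partitionBounds(stats: Seq[(Long, Long)], numPartitions: Int): Seq[Long] = {
    val total = stats.map(_._2).sum.toDouble
    val bounds = scala.collection.mutable.ArrayBuffer.empty[Long]
    var acc = 0.0
    for ((idx, cnt) <- stats if bounds.size < numPartitions - 1) {
      acc += cnt
      // Cut whenever the accumulated load reaches the next equal share.
      if (acc >= total / numPartitions * (bounds.size + 1)) bounds += idx + 1
    }
    bounds += Long.MaxValue // the last partition takes the remaining tail
    bounds.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Skewed counts, as is typical for frequency-ordered sparse features.
    val stats = Seq(0L -> 500L, 1L -> 400L, 2L -> 100L,
                    3L -> 500L, 4L -> 300L, 5L -> 200L)
    // Two partitions of equal load (1000 each): [0, 3) and [3, MaxValue)
    println(partitionBounds(stats, 2))
  }
}
```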

New features in Release-2.1.0:

  • Add a load-balanced model partitioner, "LoadBalancePartitioner", in Spark-on-Angel

  • Add the FTRL-FM algorithm

  • Add the K-Core algorithm

  • Add a feature-parallel version of the GBDT algorithm

Release-2.0.2 - 2019-01-30

In this release, we optimize the performance of the FTRL algorithm and add support for the float data type. We limit the maximum number of retries for remote requests to avoid unrecoverable blocking (the pattern is sketched below). We also improve the performance of the math library.
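The retry cap is the standard defence against a request that can never succeed, for example one aimed at a permanently lost server: try a few times with a backoff, then surface the failure rather than block forever. A generic sketch of the pattern (not Angel's internal RPC code):

```scala
import scala.util.{Failure, Try}

object BoundedRetry {
  // Attempt `request` at most `maxRetries` times with a fixed backoff,
  // then surface the last failure instead of blocking forever.
  def withRetry[T](maxRetries: Int, backoffMs: Long)(request: => T): Try[T] = {
    var attempt = 0
    var result: Try[T] = Failure(new IllegalStateException("not attempted"))
    while (attempt < maxRetries && result.isFailure) {
      result = Try(request)
      attempt += 1
      if (result.isFailure && attempt < maxRetries) Thread.sleep(backoffMs)
    }
    result
  }

  def main(args: Array[String]): Unit = {
    var calls = 0
    val r = withRetry(maxRetries = 3, backoffMs = 10) {
      calls += 1
      if (calls < 3) throw new RuntimeException("transient failure") else "ok"
    }
    println(r) // Success(ok): succeeded on the third attempt
  }
}
```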

New features in Release-2.0.2:

  • Optimize the model partitioning for the FTRL algorithm

  • Support the float data type for the FTRL algorithm

  • Avoid rehashing in the math library to improve performance (see the sketch after this list)

  • Add a maximum retry count for remote requests on servers
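The rehashing point deserves a word: if a hash-backed sparse vector grows entry by entry, the table repeatedly resizes, and every resize rehashes all existing entries. When the number of non-zeros is known up front, pre-sizing the table removes that cost entirely. A generic illustration (assumed code, not the Angel math library itself):

```scala
import scala.collection.mutable

object RehashDemo {
  // Accumulate a sparse gradient whose number of non-zeros is known.
  def accumulate(indices: Array[Long], values: Array[Double]): mutable.LongMap[Double] = {
    // Pre-sizing the table for the known number of entries means inserts
    // never trigger a resize, which would rehash every existing entry.
    val acc = new mutable.LongMap[Double](indices.length)
    var i = 0
    while (i < indices.length) {
      acc(indices(i)) = acc.getOrElse(indices(i), 0.0) + values(i)
      i += 1
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    val m = accumulate(Array(7L, 42L, 7L), Array(1.0, 2.0, 0.5))
    println(m(7L)) // 1.5
  }
}
```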

Release-2.0.1 - 2019-01-11

In this release, we add support for incremental training with FTRL. We implement some new optimizers and learning-rate scheduling strategies. Documentation on how to choose optimizers and scheduling strategies, and on how to accelerate deep learning algorithms with OpenBLAS, is also provided in this release.
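Since FTRL recurs throughout these release notes, the update itself is worth showing. This is the standard per-coordinate FTRL-Proximal rule in textbook form (a generic sketch, independent of Angel's implementation); its L1 term keeps small coordinates at exactly zero, which is why it suits the sparse, high-dimensional models Angel targets:

```scala
// The standard per-coordinate FTRL-Proximal update (McMahan et al., 2013);
// a generic sketch, not Angel's code.
class FtrlCoordinate(alpha: Double, beta: Double, l1: Double, l2: Double) {
  private var z = 0.0 // accumulated shifted gradients
  private var n = 0.0 // accumulated squared gradients

  // Current weight: the L1 term keeps small coordinates at exactly zero,
  // which is where FTRL's model sparsity comes from.
  def weight: Double =
    if (math.abs(z) <= l1) 0.0
    else -(z - math.signum(z) * l1) / ((beta + math.sqrt(n)) / alpha + l2)

  // Online update of this coordinate with gradient g.
  def update(g: Double): Unit = {
    val sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / alpha
    z += g - sigma * weight // uses the weight from before this update
    n += g * g
  }
}

object FtrlDemo {
  def main(args: Array[String]): Unit = {
    val c = new FtrlCoordinate(alpha = 0.1, beta = 1.0, l1 = 1.0, l2 = 0.1)
    Seq(0.3, -0.2, 0.5).foreach(c.update)
    println(c.weight) // still exactly 0.0: |z| has not yet exceeded l1
  }
}
```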

New features in Release-2.0.1:

  • Add documentation on how to use OpenBLAS to accelerate deep learning algorithms

  • Optimize the performance for FTRL

  • Support incremental training for FTRL

  • Add optimizers with L1 penalty: Adagrad/Adadelta

  • Add some scheduling strategies for learning rate

Bugs fixed in Release-2.0.1:

  • Fix inconsistent node counts in network embedding

  • Fix a casting problem in quantile compression


Owner

Angel is a big project: it consists of a series of sub-projects, and each sub-project has an owner and a backup owner:

  • Angel: paynie, leleyu

  • sona: fitzwang, leleyu

  • mlcore: fitzwang, endymecy

  • math: rachelsunrh, fitzwang

  • serving: ouyangwen, fitzwang

  • format: paynie, raohuaming

  • PyTorchOnAngel: leleyu, ouyangwen
