gonghanxuan / AscendSpeed

forked from Ascend / MindSpeed 
README
Apache-2.0

AscendSpeed is a large-model acceleration library for Huawei Ascend devices.

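Training large models is a complex process with many technical challenges. One of the hardest is the amount of device memory required, which places heavy demands on the compute cards. To make training possible when a single card does not have enough memory, third-party acceleration libraries such as Megatron and DeepSpeed have emerged. They partition the model, the input data, and other state across multiple cards, and then combine the results through collective communication.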

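Ascend provides the AscendSpeed acceleration library so that customers can migrate large-model workloads to Ascend devices quickly. It also supports Ascend-specific algorithms and is intended to work out of the box.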

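Software requirements for model training

Software     Version
Python       3.8.18
driver       2023 Q4 commercial release
firmware     2023 Q4 commercial release
CANN         2023 Q4 commercial release
kernel       2023 Q4 commercial release
torch        2.1.0
torch_npu    2023 Q4 commercial release
apex         2023 Q4 commercial release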
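
The two entries above that carry plain version numbers (Python and torch) can be checked with a short script. This is a minimal sketch: the names REQUIRED, parse_version and matches_required are hypothetical helpers, and treating the listed versions as exact requirements is an assumption, since the table does not say whether other versions also work.

```python
import re
import sys

# Versions copied from the table above; exact matching is an assumption.
REQUIRED = {"python": (3, 8, 18), "torch": (2, 1, 0)}


def parse_version(text):
    """Turn '2.1.0' or '2.1.0+cpu' into a tuple of ints such as (2, 1, 0)."""
    core = text.split("+", 1)[0]  # drop a local build suffix like '+cpu'
    numbers = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def matches_required(name, installed):
    """Return True when the installed version equals the listed one."""
    return parse_version(installed) == REQUIRED[name]


if __name__ == "__main__":
    running = "%d.%d.%d" % sys.version_info[:3]
    print("python", running, matches_required("python", running))
```

For torch, the installed version string is available as torch.__version__ and can be passed to matches_required("torch", ...) in the same way.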

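Supported features

AscendSpeed adapts the basic functionality of Megatron and currently supports the following acceleration features:

  • Tensor parallelism
  • Pipeline parallelism
  • Sequence parallelism
  • Recomputation
  • Distributed optimizer
  • Asynchronous distributed data parallelism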
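
As a rough illustration of how these parallel dimensions combine: in Megatron-style training, the devices are split into tensor-parallel and pipeline-parallel groups, and the remaining factor is the data-parallel size. The sketch below assumes only those three dimensions are in play, and data_parallel_size is a hypothetical helper rather than an AscendSpeed or Megatron function.

```python
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel):
    """Return how many data-parallel replicas fit in world_size devices.

    Assumes world_size = tensor_parallel * pipeline_parallel * data_parallel.
    """
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            "world_size must be divisible by "
            "tensor_parallel * pipeline_parallel"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # 16 devices, TP=2, PP=4 leaves 2 data-parallel replicas.
    print(data_parallel_size(16, 2, 4))
```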

Quick start

  1. Install AscendSpeed

    Install directly from git:

    pip install git+https://gitee.com/ascend/AscendSpeed.git

    Or download the source and install it:

    git clone https://gitee.com/ascend/AscendSpeed.git
    cd AscendSpeed
    pip install -e .
  2. Get Megatron-LM and check out the specified commit id

    git clone https://github.com/NVIDIA/Megatron-LM.git
    cd Megatron-LM
    git checkout bcce6f54e075e3c3374ea67adefe54f3f2da2b07
  3. In the Megatron-LM directory, edit pretrain_gpt.py and add the line import ascendspeed.megatron_adaptor directly below import torch

     import os
     import torch
    +import ascendspeed.megatron_adaptor
     from torch import Tensor
     from functools import partial
     from typing import Union
  4. In the Megatron-LM directory, prepare the training data, fill in the corresponding paths in the example script, and then run it.

    bash examples/pretrain_gpt_distributed.sh
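
The manual edit in step 3 can also be scripted. The following is a minimal sketch, not part of AscendSpeed: add_adaptor_import is a hypothetical helper that works on the text of pretrain_gpt.py and assumes the file contains a line that is exactly import torch.

```python
ADAPTOR_IMPORT = "import ascendspeed.megatron_adaptor"


def add_adaptor_import(source):
    """Insert the adaptor import right below 'import torch'.

    Returns the source unchanged if the import is already present.
    """
    lines = source.splitlines()
    if any(line.strip() == ADAPTOR_IMPORT for line in lines):
        return source
    for index, line in enumerate(lines):
        if line.strip() == "import torch":
            lines.insert(index + 1, ADAPTOR_IMPORT)
            break
    else:
        raise ValueError("no 'import torch' line found")
    result = "\n".join(lines)
    # Preserve a trailing newline if the original had one.
    return result + "\n" if source.endswith("\n") else result
```

To apply it, read pretrain_gpt.py into a string, pass it through add_adaptor_import, and write the result back. Running it a second time leaves the file unchanged.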

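Algorithms

TP recomputation communication optimization

  • Problem analysis: In most customer large-model training scenarios, enabling both recomputation and tensor parallelism (TP) is a required configuration. Recomputation saves memory, but it increases TP communication time by 50% and overall computation time by 30%~40%.

  • Motivation: Eliminate communication operators during recomputation and optimize how recomputation layers are partitioned, improving communication performance in large-model training.

  • Approach:

    • Recomputation communication optimization: With tensor parallelism enabled, an AllReduce operator is inserted at the end of the FFN in the forward layer, and its backward counterpart is Identity. Because recomputation only needs to recover the intermediate activations, the output of that trailing AllReduce is redundant. The trailing AllReduce can therefore be removed without affecting the intermediate computation or the subsequent backward pass, as shown in the figure below.

    • Backward communication overlap: With sequence parallelism enabled, a ReduceScatter is inserted at the end of the FFN in the forward layer and an AllGather is inserted in the backward pass. During recomputation the ReduceScatter can be removed directly, and the backward AllGather can be hidden behind the forward computation, as shown in the figure above.

    • Recomputation layer partitioning: As shown in the figure below, partitioning recomputation layers at the positions of the communication operators turns intra-layer communication into end-of-layer communication. Combined with the optimizations above, the communication time introduced by recomputation can be eliminated entirely, and end-to-end TP communication time can be reduced by 1/3.

  • Usage: Set --optimize-recomp-communication-level to 1 or 2. Level 1 optimizes communication for the MLP layers only; level 2 optimizes communication for both the MLP and ATTN layers.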
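
The sketch below illustrates two points from this section. First, it shows how the two documented levels map to layer types. The parser here is a stand-in written for illustration, not AscendSpeed's own argument definition, and treating an unset flag as "no optimization" is an assumption. Second, it checks the arithmetic behind the figures quoted above, under the assumption that the 50% increase is measured relative to TP communication time without recomputation: removing that extra communication takes the total from 1.5 back to 1.0, a reduction of 1/3.

```python
import argparse
from fractions import Fraction

# Mapping taken from the usage note above; the dict name is hypothetical.
LEVEL_TO_LAYERS = {1: ("MLP",), 2: ("MLP", "ATTN")}


def build_parser():
    """Stand-in parser that accepts the documented flag with values 1 or 2."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--optimize-recomp-communication-level",
        type=int,
        choices=(1, 2),
        default=None,  # assumption: unset means the optimization is off
    )
    return parser


def optimized_layers(argv):
    """Return the layer types whose recomputation communication is optimized."""
    args = build_parser().parse_args(argv)
    return LEVEL_TO_LAYERS.get(args.optimize_recomp_communication_level, ())


def tp_comm_reduction(recompute_overhead=Fraction(1, 2)):
    """Fraction of TP communication time saved when the extra
    communication caused by recomputation is removed entirely."""
    baseline = Fraction(1)
    with_recompute = baseline + recompute_overhead
    return (with_recompute - baseline) / with_recompute


if __name__ == "__main__":
    print(optimized_layers(["--optimize-recomp-communication-level", "2"]))
    print(tp_comm_reduction())
```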

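Notes

Security hardening

File permission control

  • Refer to Appendix A (file permission checklist) when designing and controlling permissions for each type of file.
  • The Linux umask value should be no lower than 027.
  • The Linux ASLR level should be 2 (the default).
  • Be sure to manage permissions on all files related to model training (datasets, configuration files, source code, checkpoints, and so on) to avoid risks such as malicious tampering or disruption of the workload. For example, group members and other users can be limited to read-only access.
  • Permissions of files generated by native Megatron and the torch framework are affected by the Linux umask setting. With umask set to 027, for example, directories and files default to 750 and 640 respectively, and you can tighten permissions further.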
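
The 750/640 figures follow from how a umask clears permission bits. The sketch below assumes the conventional requested modes of 777 for directories and 666 for regular files; programs that request other modes will end up with different results. default_modes is a hypothetical helper.

```python
def default_modes(umask):
    """Return (directory_mode, file_mode) produced by the given umask.

    Assumes directories are requested with 0o777 and files with 0o666.
    """
    directory_mode = 0o777 & ~umask
    file_mode = 0o666 & ~umask
    return directory_mode, file_mode


if __name__ == "__main__":
    directory_mode, file_mode = default_modes(0o027)
    print(oct(directory_mode), oct(file_mode))  # prints: 0o750 0o640
```

Within a Python process, the umask itself can be set with the standard os.umask(0o027) call before any files are created.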

Command execution

For security reasons, run every command as a non-root account whenever possible, following the principle of least privilege.

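Resource usage

Set the training configuration and prepare the dataset according to the resources available in your environment. If they do not match, for example when the dataset size exceeds host memory or NPU memory capacity, the native Megatron or PyTorch components exit directly and automatically release the resources they hold.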

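Datasets and index maps

On the first training run, native Megatron prints WARNING: could not find index map files and tries to create the index map files in the dataset directory so that training can continue. To support scenarios where multiple users share dataset files and index map files, the generated index map files default to permission 644, which means other users can access them. You can harden them by following Appendix A (file permission checklist).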
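
One way to apply that hardening is a small script that tightens the mode of the generated files. This is a sketch only: harden_files is a hypothetical helper, the *.npy pattern is an assumption about how the index map files are named, and 0o640 is the value listed for generated files in Appendix A.

```python
import os
from pathlib import Path


def harden_files(directory, pattern="*.npy", mode=0o640):
    """Set `mode` on every regular file in `directory` matching `pattern`.

    Returns the names of the files that were changed, in sorted order.
    """
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        if path.is_file():
            os.chmod(path, mode)
            changed.append(path.name)
    return changed
```

Calling harden_files on the dataset directory after the first run would move the index map files from 644 to 640. Do this only if the files are not meant to be shared with other users.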

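Communication

As the party with full control of the compute cluster, you must pay attention to the security of communication between cluster nodes, for example through careful network design and appropriate security measures. Deploying the compute cluster on an internal network is recommended, which avoids many of the security risks of the public internet.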

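Network ports

AscendSpeed does not open any ports on its own. For the ports opened by native PyTorch, refer to the official PyTorch documentation for configuration. For single-node training, opening ports globally is not recommended. For the detailed communication matrix, see Appendix B (communication matrix).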
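
Before choosing a master port, it can help to check that the candidate is in the permitted range and not already in use. The sketch below uses only the Python standard library. The function names are hypothetical, the 1024-65535 range is the one given in Appendix B, and the loopback address is an assumption suited to single-node checks.

```python
import socket


def is_valid_master_port(port):
    """True if the port lies in the 1024-65535 range given in Appendix B."""
    return 1024 <= port <= 65535


def port_is_free(port, host="127.0.0.1"):
    """True if a TCP socket can currently be bound to host:port."""
    if not is_valid_master_port(port):
        return False
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    for candidate in (6000, 29500):
        print(candidate, port_is_free(candidate))
```

A port that is free at check time can still be taken before training starts, so treat this as a quick sanity check rather than a guarantee.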

At run time, the underlying CANN caches compiled operator files in kernel_meta_* folders under the run directory to speed up subsequent training runs.

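Appendix

A: File permission checklist

You can use this checklist as a reference when hardening each type of file, according to your needs:

Type                                              Reference Linux permission   Remarks
Folders / directories                             750 (rwxr-x---)              Includes checkpoint save directories, dataset directories, installation directories, and so on
Dataset files                                     640 (rw-r-----)              Datasets here are public datasets that contain no private data or commercial assets. If the dataset directory or files must be shared, you may relax them to 755/644 as appropriate; note that other users (Others) can then read them
Files generated at run time                       640 (rw-r-----)              For example checkpoints and the npy files produced by dataset preprocessing
Non-executable program files                      440 (r--r-----)              Program files should generally not be modified; for development you may relax them to 640
Program directories / executable program files    550 (r-xr-x---)              Program directories and executables should generally not be modified; for development you may relax them to 750
Log files (archived)                              440 (r--r-----)
Log files (being written)                         640 (rw-r-----)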
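
The octal values and the rwx strings in the checklist are two spellings of the same nine permission bits. As a small sketch, the checklist can be held as data and the symbolic form derived from the octal one. The dictionary keys and the symbolic helper are hypothetical names chosen for illustration.

```python
# Reference modes copied from the checklist above.
PERMISSIONS = {
    "directory": 0o750,
    "dataset file": 0o640,
    "generated file": 0o640,
    "non-executable program file": 0o440,
    "program directory / executable": 0o550,
    "archived log": 0o440,
    "active log": 0o640,
}


def symbolic(mode):
    """Render the low nine permission bits as a string such as 'rwxr-x---'."""
    flags = "rwx" * 3
    return "".join(
        flag if mode & (0o400 >> position) else "-"
        for position, flag in enumerate(flags)
    )


if __name__ == "__main__":
    for name, mode in PERMISSIONS.items():
        print(f"{name}: {mode:o} ({symbolic(mode)})")
```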

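B: Communication matrix

Entry 1
  Source device: Compute device running the torch_npu process
  Source IP: Device IP address
  Source port: Assigned automatically by the operating system; the range is determined by the operating system (on Ubuntu, for example, by the file /proc/sys/net/ipv4/ip_local_port_range)
  Destination device: Compute device running the torch_npu process
  Destination IP: Device IP address
  Destination port (listening): When the test example scripts are not used, the default is 29500/29400. Users can call torch.distributed.launch and pass --master_port to choose any unused port between 1024 and 65535
  Protocol: TCP
  Port description: The source and destination ports are both used to send and receive data. For the static distributed scenario (backend=static) the default port is 29400; for the dynamic distributed scenario (backend=c10d) the default port is 29500
  Remarks: megatron_npu itself does not open any ports. This communication is controlled by the open-source software PyTorch; for configuration, see its official documentation: https://pytorch.org/docs/stable/distributed.html#launch-utility

Entry 2
  Source device: Compute device running the torch_npu process
  Source IP: Device IP address
  Source port: Assigned automatically by the operating system; the range is determined by the operating system (on Ubuntu, for example, by the file /proc/sys/net/ipv4/ip_local_port_range)
  Destination device: Compute device running the torch_npu process
  Destination IP: Device IP address
  Destination port (listening): When the pretrain_gpt_distributed* test example scripts are used, the scripts pass --master_port 6000 to torch.distributed.launch. Users can choose any unused port between 1024 and 65535
  Protocol: TCP
  Port description: Required by native PyTorch communication (via torchrun or torch.distributed.launch); used to send and receive data
  Remarks: Same port as in Entry 1. This entry notes that the test example scripts set PyTorch's master_port to 6000 by default

Entry 3
  Source device: Compute device running the torch_npu process
  Source IP: Device IP address
  Source port: Assigned automatically by the operating system; the range is determined by the operating system (on Ubuntu, for example, by the file /proc/sys/net/ipv4/ip_local_port_range)
  Destination device: Compute device running the torch_npu process
  Destination IP: Device IP address
  Destination port (listening): When the test_gpt_distributed* test example scripts are used, the scripts pass --master_port 60035 to torch.distributed.launch. Users can choose any unused port between 1024 and 65535
  Protocol: TCP
  Port description: Required by native PyTorch communication (via torchrun or torch.distributed.launch); used to send and receive data
  Remarks: Same port as in Entry 1. This entry notes that the test example scripts set PyTorch's master_port to 60035 by default

Entry 4
  Source device: Compute device running the torch_npu process
  Source IP: Device IP address
  Source port: See the official CANN documentation listed under Remarks
  Destination device: Compute device running the torch_npu process
  Destination IP: Device IP address
  Destination port (listening): See the official CANN documentation listed under Remarks
  Protocol: TCP
  Port description: See the official CANN documentation listed under Remarks
  Remarks: This communication is controlled entirely by the HCCL component. For the port range, see: https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/700alpha001/ref/envref/envref_07_0065.html CANN communication documentation: https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/700alpha001/ref/hcclapiref/hcclapi_07_0001.html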
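
The range from which the operating system picks source ports can be inspected on Linux by reading /proc/sys/net/ipv4/ip_local_port_range, which holds two whitespace-separated integers. The sketch below parses that format. Both function names are hypothetical, and the file exists only on Linux systems.

```python
def parse_port_range(text):
    """Parse the 'low high' content of ip_local_port_range into two ints."""
    low, high = (int(part) for part in text.split())
    return low, high


def read_local_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read and parse the ephemeral port range from procfs (Linux only)."""
    with open(path) as handle:
        return parse_port_range(handle.read())
```

If a chosen master port falls inside this range, the operating system may hand it out as a source port to another connection, so checking the two against each other can avoid occasional bind failures.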
