
hccl-controller

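Component introduction

  • A Controller tracks at least one type of Kubernetes resource. These objects have a spec field that represents the desired state. The Controller is responsible for keeping the current state of the resources it tracks close to that desired state.
  • The Controller Manager is the management and control center inside the cluster. It consists of multiple Controllers, each responsible for a different resource, which together manage all cluster resources such as nodes and Pods.
  • The Controller Manager mainly provides event dispatching. Each Controller only needs to register its Handlers to receive and process events.
  • Each specific resource type is maintained by a dedicated Controller that keeps it in the expected state.

Figure 1 Controller interaction process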
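
The dispatch-and-handler relationship described above can be illustrated with a small sketch. It is not code from this repository or from Kubernetes: `Event`, `Handler`, `Manager`, `Register`, and `Dispatch` are hypothetical names chosen for the example, and a real Controller Manager does far more than route events.

```go
package main

import "fmt"

// Event is a hypothetical change notification for one resource kind.
type Event struct {
	Kind string // for example "Pod" or "Job"
	Name string
}

// Handler processes an event for the resource kind it registered for.
type Handler func(Event) string

// Manager is a toy stand-in for a controller manager: it only routes events.
type Manager struct {
	handlers map[string][]Handler
}

// NewManager returns a Manager with no registered handlers.
func NewManager() *Manager {
	return &Manager{handlers: make(map[string][]Handler)}
}

// Register attaches a handler to one resource kind.
func (m *Manager) Register(kind string, h Handler) {
	m.handlers[kind] = append(m.handlers[kind], h)
}

// Dispatch delivers the event to every handler registered for its kind
// and returns what each handler reported.
func (m *Manager) Dispatch(e Event) []string {
	var out []string
	for _, h := range m.handlers[e.Kind] {
		out = append(out, h(e))
	}
	return out
}

func main() {
	m := NewManager()
	m.Register("Pod", func(e Event) string { return "pod handler saw " + e.Name })
	m.Register("Job", func(e Event) string { return "job handler saw " + e.Name })

	// Only the Pod handler runs for a Pod event.
	fmt.Println(m.Dispatch(Event{Kind: "Pod", Name: "train-0"}))
	// No handler is registered for Node, so nothing runs.
	fmt.Println(m.Dispatch(Event{Kind: "Node", Name: "node-1"}))
}
```

Keeping the routing table keyed by resource kind mirrors the idea that each resource type has its own Controller, while the manager itself stays unaware of what the handlers do.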

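1. HCCL-Controller overall workflow

HCCL-Controller is a Huawei-developed component for NPU training jobs. It uses the Kubernetes informer mechanism to continuously watch events of NPU training jobs and their Pods, reads each Pod's NPU information, and generates the corresponding ConfigMap. The ConfigMap contains the hccl.json configuration file that the NPU training job needs, which helps the job coordinate and schedule the underlying Ascend processors. The overall HCCL-Controller workflow is shown in Figure 2.

Figure 2 HCCL-Controller process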

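  1. Device-plugin periodically reports the DeviceID and health status of the node's Ascend 910 processors through the list-and-watch interface.

  2. The Scheduler receives the user's training job request and creates the Job and ConfigMap. The Volcano scheduler selects the node on which the Job is deployed.

  3. The Scheduler sends the Pod creation information to the Kubelet on the selected node.

  4. On the selected node, Device-plugin receives a device allocation request from the Kubelet and returns the DeviceID, Volume, environment variables, and related information to the Kubelet. The Kubelet then allocates the resources to the Pod.

  5. Device-plugin modifies the Pod's annotation field, writing the NIC IP address and DeviceID of the Ascend 910 processors allocated to the Pod into the Pod's annotation.

  6. HCCL-Controller continuously watches volcano jobs and Pods for changes. When a new Pod is created, HCCL-Controller reads the annotation value from the Pod. Once the information for all Pods of the volcano job has been collected, it updates the corresponding rings-config ConfigMap.

  7. The training task in the Pod's container keeps checking the status of the ConfigMap. Once the status is completed, the hccl.json file can be generated from the ConfigMap.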

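2. HCCL-Controller business rules

HCCL-Controller is dedicated to generating the hccl.json file for all Pods of a training job. It is intended only for K8s clusters of Atlas 800 training servers.

  • The training job, Pod, and ConfigMap must carry the label ring-controller.atlas: ascend-910. HCCL-Controller filters on this label to distinguish Ascend 910 scenarios from non-Ascend 910 scenarios.
  • Mapping between a volcano job and its ConfigMap: the ConfigMap name of the volume (ascend-910-config) in the volcano job.yaml is the ConfigMap that belongs to that volcano job.
  • hccl-controller continuously watches volcano jobs, Pods, and ConfigMaps for changes (each must carry the label described in the first rule above). The volcano job and ConfigMap of the same training job are associated through the volume (ascend-910-config). When a new Pod is created, hccl-controller reads the value of the Pod annotation (atlas.kubectl.kubernetes.io/ascend-910-configuration) and builds a cached data table for the volcano job. Once the information for all instances of the volcano job is complete, it updates the corresponding rings-config ConfigMap.
  • In the ConfigMap, the rings-config file name defaults to hccl.json, and the default mount path is "/user/serverid/devindex/config".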
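
The first two rules can be sketched as two small checks: a label filter and a lookup of the ConfigMap name through the ascend-910-config volume. This is an illustration under assumptions, not the component's code. The label and volume names come from the rules above, while the `Volume` struct, the function names, and the sample ConfigMap name `rings-config-demo` are hypothetical.

```go
package main

import "fmt"

const (
	labelKey   = "ring-controller.atlas" // label key from the rules above
	labelValue = "ascend-910"            // label value from the rules above
	volumeName = "ascend-910-config"     // volume that links a job to its ConfigMap
)

// Volume is a pared-down stand-in for a job volume that references a ConfigMap.
type Volume struct {
	Name          string
	ConfigMapName string
}

// IsAscend910 reports whether a resource's labels mark it as an Ascend 910 resource.
func IsAscend910(labels map[string]string) bool {
	return labels[labelKey] == labelValue
}

// ConfigMapFor returns the ConfigMap name referenced by the ascend-910-config volume.
func ConfigMapFor(volumes []Volume) (string, bool) {
	for _, v := range volumes {
		if v.Name == volumeName && v.ConfigMapName != "" {
			return v.ConfigMapName, true
		}
	}
	return "", false
}

func main() {
	labels := map[string]string{labelKey: labelValue}
	volumes := []Volume{
		{Name: "data", ConfigMapName: ""},
		{Name: volumeName, ConfigMapName: "rings-config-demo"}, // hypothetical name
	}

	if !IsAscend910(labels) {
		fmt.Println("ignored: not an Ascend 910 resource")
		return
	}
	if name, ok := ConfigMapFor(volumes); ok {
		fmt.Println("job uses ConfigMap:", name)
	}
}
```

A resource without the label, or a job without the ascend-910-config volume, is simply skipped in this sketch.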

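Building HCCL-Controller

  1. Pull the source code with git and switch to the sync-dev branch to obtain ascend-hccl-controller.

    Example: the source code is placed in the /home/test/ascend-hccl-controller directory.

  2. Run the following commands to enter the build directory and run the build script. The hccl-controller binary, the yaml file, and the Dockerfile are generated in the "output" directory.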

    cd /home/test/ascend-hccl-controller/build/

    chmod +x build.sh

    ./build.sh

  3. Run the following command to list the files generated in the output directory.

    ll /home/test/ascend-hccl-controller/output

    drwxr-xr-x 2 root root     4096 Jan 29 19:12 ./
    drwxr-xr-x 9 root root     4096 Jan 29 19:09 ../
    -r-------- 1 root root      498 Jan 29 19:09 Dockerfile
    -r-x------ 1 root root 35323904 Jan 29 19:09 hccl-controller
    -r-------- 1 root root     2374 Jan 29 19:12 hccl-controller-v3.0.0.yaml

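Component installation

  1. Follow "Cluster Scheduling User Guide > Installation and Deployment Guide > Installing Cluster Scheduling Components > Typical Installation Scenarios > Cluster Scheduling Scenario" in the MindX DL User Guide (https://www.hiascend.com/software/mindx-dl).

Notes

  1. This component is currently deployed as a container and authenticates through a ServiceAccount. With this method the ServiceAccount token is stored in plaintext. If you need it to be stored encrypted, modify the deployment yourself.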

Change log

Version   Release date   Description
v3.0.0    2022-12-30     Initial release

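Introduction

Huawei NPU collective communication configuration controller, which automatically generates the communication set configuration required by training jobs. To make sure you can obtain commercial support, use the source code of our official releases (the Tag description contains wording such as "matching version xx" or "patch version xx"). We also recommend that, when integrating, you send the relevant information (at least what you integrated, the version, and your contact details) to kangfuan2@huawei.com. We will strictly protect your personal information.

Primary language: Go. License: Apache-2.0.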

https://gitee.com/ascend/ascend-hccl-controller.git
git@gitee.com:ascend/ascend-hccl-controller.git