ApulisPlatform / model-gallery
README_en_US.md
banrieen committed on 2021-05-24 10:12 · update info
Model-Gallery Quick Guide

[TOC]

Model-Gallery is a developer community built on top of the Apulis AI Platform. It supports sharing models, algorithms, datasets, and other content.

English | 简体中文

Tip: MindSpore models use NPU logical IDs, while TensorFlow models use NPU physical IDs.

Create a Single-Device, Multi-Card Distributed Job

Code Development

*Screenshot: code-dev-single*

  1. Create a new development environment
  2. Open the Jupyter window after the environment status becomes Running
  3. Change directory to $HOME/<USERNAME>/<Code Storage Path>

```shell
cd ~
# If you can clone the model-gallery repository:
# git config --global credential.helper store
# git clone --depth 1 https://apulis-gitlab.apulis.cn/apulis/model-gallery.git
# Otherwise, unzip the uploaded model-gallery.zip, then:
cd ~/model-gallery/models/npu/testcase/distributed
bash test_distributed.sh
```

Model Training

*Screenshot: single-train*

  1. Select "Model Training" on the toolbar

  2. Create a new model training job

    • Startup File: model-gallery/models/npu/testcase/distributed/test_distributed.sh
    • Output Path: work_dirs/testcase
    • Training Dataset: mnist (/data/dataset/storage/mnist/) or another dataset you need
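The three form fields map onto a single launch command. A hypothetical sketch of that mapping; `build_launch_cmd` and its flag names are assumptions for illustration, not a real Apulis AI Platform API:

```python
# Hypothetical sketch: how the "Model Training" form fields might be
# assembled into a launch command. The function and flag names are
# assumptions, not the platform's actual code.
def build_launch_cmd(startup_file, output_path, dataset_path):
    return (f"bash {startup_file} "
            f"--output_path {output_path} --dataset {dataset_path}")

cmd = build_launch_cmd(
    "model-gallery/models/npu/testcase/distributed/test_distributed.sh",
    "work_dirs/testcase",
    "/data/dataset/storage/mnist/",
)
print(cmd)
```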

Create a Multi-Device Distributed Job

Code Development

  1. Create the distributed code

    *Screenshot: distributed-code*

  2. Open the master (ps-0) Jupyter window after the environment status becomes Running

  3. Change directory to $HOME/<USERNAME>/<Code Storage Path>

  4. An NPU distributed job requires running the start command on each worker node:

```shell
ssh worker-0
cd ~/model-gallery/models/npu/testcase/distributed/
bash test_distributed.sh

ssh worker-1
cd ~/model-gallery/models/npu/testcase/distributed/
bash test_distributed.sh
```
  • A GPU distributed job can run the horovod command on the master (ps-0) node:

```shell
# TensorFlow
# If the GPU node has no InfiniBand NIC, remove the "--network-interface ib0" option.
horovodrun --network-interface ib0 -np 4 -hostfile /job/hostfile python /examples/tensorflow2_keras_mnist.py
# -np 4 means 4 cards across 2 machines

# PyTorch
horovodrun --network-interface ib0 -np 2 -hostfile /job/hostfile python /examples/pytorch_mnist.py
# -np 2 means 2 cards on 1 machine

# MindSpore: a MindSpore distributed job cannot be distributed on the same node;
# this is a MindSpore framework bug.
cd ~/code/resnet_mindspore && bash run.sh
```
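The `-hostfile /job/hostfile` argument points horovod at a host list. A hypothetical hostfile for the two-machine, four-card run above might look as follows; the worker names are taken from the ssh steps earlier, and `slots=N` is horovod's standard per-host process count syntax:

```shell
# Write a hypothetical hostfile: two workers, two slots (cards) each,
# matching "-np 4" (2 machines x 2 cards).
cat > /tmp/hostfile <<'EOF'
worker-0 slots=2
worker-1 slots=2
EOF
cat /tmp/hostfile
```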

Distributed Training

  • For NPU models, enter only the startup commands in the Worker command textbox (preset models are recommended):

```shell
bash ~/model-gallery/models/npu/testcase/distributed/test_distributed.sh
```

*Screenshot: NPU Distributed*

  • Create a distributed GPU job (enter the startup commands in the master command textbox):

```shell
# TensorFlow
horovodrun --network-interface ib0 -np 4 -hostfile /job/hostfile python /examples/tensorflow2_keras_mnist.py

# PyTorch
horovodrun --network-interface ib0 -np 2 -hostfile /job/hostfile python /examples/pytorch_mnist.py

# MindSpore
cd ~/code/resnet_mindspore && bash run.sh
```

Model Adaptation (Advanced)

  • Create and open a distributed code development environment:

```shell
cd ~
git config --global credential.helper store
git clone --depth 1 https://apulis-gitlab.apulis.cn/apulis/model-gallery.git
cd ~/model-gallery
# If git pull reports a conflict:
git reset --hard
```
  • Use the TensorFlow/MindSpore template ~/model-gallery/models/npu/mindspore_train.sh:

```shell
cp ~/model-gallery/models/npu/mindspore_train.sh ~/model-gallery/models/npu/{your_model}/train.sh
```
  • Replace the startup file in train.sh, e.g. STARTFILE="train.py"

  • Update `args = parser.parse_args()` to `args, _ = parser.parse_known_args()` in train.py
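The reason for this change: the platform may append extra flags to the startup command, and `parse_args()` exits with an error on any flag the script did not declare, while `parse_known_args()` collects them instead. A minimal sketch (the flag names below are illustrative only):

```python
import argparse

# parse_known_args() tolerates flags the script did not declare,
# returning them in a second list instead of erroring out.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=50)

args, unknown = parser.parse_known_args(
    ["--batch_size", "32", "--platform_injected_flag", "1"]
)
print(args.batch_size, unknown)
```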

  • Move the training datasets under the directory /data/dataset/storage/:

```shell
cp -r mnist /data/dataset/storage/
```
  • Run the startup script and check the status and logs:

```shell
bash train.sh
```
  • Algorithm parameter configuration: gallery_config.json (the // comments below are explanatory only and are not valid JSON; remove them from a real config file)
```json
{
    "created_at": "2020-10-28 03:35:29",
    "updated_at": "2020-10-28 03:35:29",
    "version": "0.0.1",
    "status": "normal",
    //   Default is "normal"

    "platform": "AIArts",
    //   Options: AIArts, Avisualis, Segmentation...

    "models": [
        {
            "name": "LeNet_TensorFlow_GPU_scratch",
            "framework": "tensorflow",
            "model_name": "lenet",
            "description": "lenet-mxnet",
            "size": "20165368",
            //   Default unit is bytes

            "type": "CV/Classification",
            //   Options: CV/ObjectDetection, NLP/BERT, CV/Segmentation...

            "dataset": {
                "name": "mnist",
                "path": "mnist",
                //   Dataset directory name
                //   Real storage path on the NFS server: /data/dataset/storage/mnist
                //   Startup cmd with the dataset path: python train.py --dataset mnist
                "size": "123123",
                //   Dataset size unit: bytes
                "format": "TFRecord"
            },
            "params": {
                "batch_size": "50",
                "epochs": "10",
                "lr": "0.1"
            },
            //   Startup cmd with params: python train.py --batch_size 50 --optimizer sgd
            //   Param values use double quotes

            "engine": "apulistech/mxnet:2.0.0-gpu-py3",
            //   Other engines, e.g.:
            //   apulistech/tensorflow-nni-npu:1.15.0-20.2-arm
            //   apulistech/mindspore-nni-npu:1.1.1-20.2-arm

            "precision": "-",
            //   Model evaluation accuracy

            "output_path": "work_dirs/lenet_mxnet",
            //   work_dirs/{Model Name}
            //   Relative to /home/admin, i.e. /home/admin/work_dirs/lenet_mxnet
            //   Startup cmd with output_path: python train.py --output_path work_dirs/lenet_mxnet

            "startup_file": "train.py",
            //   Training startup files are named: train.py / train.sh or main/train.py
            //   Evaluation startup files are named: eval.py / eval.sh or main/eval.py

            "device_type": "nvidia_gpu_amd64",
            //   Options: nvidia_gpu_amd64, huawei_npu_amd64, huawei_npu_arm64

            "device_num": 1
            //   Whether multi-card or multi-machine training is supported
            //   NPU num options: 0, 1, 2, 4, 8
        }
    ]
}
```
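The config comments show each field being turned into a command-line flag (`python train.py --batch_size 50 ...`). A minimal sketch of that convention, using field names from the config above; the loader itself is an illustration, not the platform's actual code:

```python
import json

# Illustrative loader: build a startup command from a gallery_config.json
# model entry, following the "--{param} {value}" convention in the comments.
entry = json.loads("""
{
  "startup_file": "train.py",
  "output_path": "work_dirs/lenet_mxnet",
  "dataset": {"path": "mnist"},
  "params": {"batch_size": "50", "epochs": "10", "lr": "0.1"}
}
""")

flags = " ".join(f"--{k} {v}" for k, v in entry["params"].items())
cmd = (f"python {entry['startup_file']} "
       f"--dataset {entry['dataset']['path']} "
       f"--output_path {entry['output_path']} {flags}")
print(cmd)
```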
  • All object names, path names, and configuration parameter names uniformly use underscores as separators.

  • The model name format is {model name}_{framework name}_{version number}_{computing device}_{whether trained from scratch}, e.g. LeNet_TensorFlow_GPU_scratch.

  • Datasets and model files need to be placed on the storage server.
