ApulisPlatform / model-gallery
README_en_US.md
banrieen committed on 2021-05-24 10:12 · update info
Model-Gallery Quick Guide

[TOC]

Model-Gallery is a developer community built on top of the Apulis AI Platform. It supports sharing models, algorithms, datasets, and other content.

English | 简体中文

Tip: MindSpore models use NPU logical IDs, while TensorFlow models use NPU physical IDs.

Create a Single-Device, Multi-Card Distributed Job

Code Development

*Screenshot: code-dev-single*

  1. Create a new development environment
  2. Open the Jupyter window after the environment status becomes Running
  3. Change directory to $HOME/<USERNAME>/<Code Storage Path>

```shell
cd ~
# If you can clone the model-gallery repository:
# git config --global credential.helper store
# git clone --depth 1 https://apulis-gitlab.apulis.cn/apulis/model-gallery.git
# Otherwise, unzip the uploaded model-gallery.zip, then:
cd ~/model-gallery/models/npu/testcase/distributed
bash test_distributed.sh
```

Model Training

*Screenshot: single-train*

  1. Select "Model Training" on the toolbar

  2. Create a new model training job

    • Startup File: model-gallery/models/npu/testcase/distributed/test_distributed.sh
    • Output Path: work_dirs/testcase
    • Training Dataset: mnist (/data/dataset/storage/mnist/) or another dataset you need
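The three form fields map onto a single launch command. A hypothetical sketch of that mapping; `build_launch_cmd` and its flag names are assumptions for illustration, not a real Apulis AI Platform API:

```python
# Hypothetical sketch: how the "Model Training" form fields might be
# assembled into a launch command. The function and flag names are
# assumptions, not the platform's actual code.
def build_launch_cmd(startup_file, output_path, dataset_path):
    return (f"bash {startup_file} "
            f"--output_path {output_path} --dataset {dataset_path}")

cmd = build_launch_cmd(
    "model-gallery/models/npu/testcase/distributed/test_distributed.sh",
    "work_dirs/testcase",
    "/data/dataset/storage/mnist/",
)
print(cmd)
```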

Create a Multi-Device Distributed Job

Code Development

  1. Create the distributed code

    *Screenshot: distributed-code*

  2. Open the master (ps-0) Jupyter window after the environment status becomes Running

  3. Change directory to $HOME/<USERNAME>/<Code Storage Path>

  4. An NPU distributed job requires running the start command on each worker node:

```shell
ssh worker-0
cd ~/model-gallery/models/npu/testcase/distributed/
bash test_distributed.sh

ssh worker-1
cd ~/model-gallery/models/npu/testcase/distributed/
bash test_distributed.sh
```
  • A GPU distributed job can run the horovod command on the master (ps-0) node:

```shell
# TensorFlow
# If the GPU node has no InfiniBand NIC, remove the "--network-interface ib0" option.
horovodrun --network-interface ib0 -np 4 -hostfile /job/hostfile python /examples/tensorflow2_keras_mnist.py
# -np 4 means 4 cards across 2 machines

# PyTorch
horovodrun --network-interface ib0 -np 2 -hostfile /job/hostfile python /examples/pytorch_mnist.py
# -np 2 means 2 cards on 1 machine

# MindSpore: a MindSpore distributed job cannot be distributed on the same node;
# this is a MindSpore framework bug.
cd ~/code/resnet_mindspore && bash run.sh
```
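The `-hostfile /job/hostfile` argument points horovod at a host list. A hypothetical hostfile for the two-machine, four-card run above might look as follows; the worker names are taken from the ssh steps earlier, and `slots=N` is horovod's standard per-host process count syntax:

```shell
# Write a hypothetical hostfile: two workers, two slots (cards) each,
# matching "-np 4" (2 machines x 2 cards).
cat > /tmp/hostfile <<'EOF'
worker-0 slots=2
worker-1 slots=2
EOF
cat /tmp/hostfile
```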

Distributed Training

  • For NPU models, enter only the startup commands in the Worker command textbox (preset models are recommended):

```shell
bash ~/model-gallery/models/npu/testcase/distributed/test_distributed.sh
```

*Screenshot: NPU Distributed*

  • Create a distributed GPU job (enter the startup commands in the master command textbox):

```shell
# TensorFlow
horovodrun --network-interface ib0 -np 4 -hostfile /job/hostfile python /examples/tensorflow2_keras_mnist.py

# PyTorch
horovodrun --network-interface ib0 -np 2 -hostfile /job/hostfile python /examples/pytorch_mnist.py

# MindSpore
cd ~/code/resnet_mindspore && bash run.sh
```

Model Adaptation (Advanced)

  • Create and open a distributed code development environment:

```shell
cd ~
git config --global credential.helper store
git clone --depth 1 https://apulis-gitlab.apulis.cn/apulis/model-gallery.git
cd ~/model-gallery
# If git pull reports a conflict:
git reset --hard
```
  • Use the TensorFlow/MindSpore template ~/model-gallery/models/npu/mindspore_train.sh:

```shell
cp ~/model-gallery/models/npu/mindspore_train.sh ~/model-gallery/models/npu/{your_model}/train.sh
```
  • Replace the startup file in train.sh, e.g. STARTFILE="train.py"

  • Update `args = parser.parse_args()` to `args, _ = parser.parse_known_args()` in train.py
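The reason for this change: the platform may append extra flags to the startup command, and `parse_args()` exits with an error on any flag the script did not declare, while `parse_known_args()` collects them instead. A minimal sketch (the flag names below are illustrative only):

```python
import argparse

# parse_known_args() tolerates flags the script did not declare,
# returning them in a second list instead of erroring out.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=50)

args, unknown = parser.parse_known_args(
    ["--batch_size", "32", "--platform_injected_flag", "1"]
)
print(args.batch_size, unknown)
```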

  • Move the training datasets under the directory /data/dataset/storage/:

```shell
cp -r mnist /data/dataset/storage/
```
  • Run the startup script and check the status and logs:

```shell
bash train.sh
```
  • Algorithm parameter configuration: gallery_config.json (the // comments below are explanatory only and are not valid JSON; remove them from a real config file)
```json
{
    "created_at": "2020-10-28 03:35:29",
    "updated_at": "2020-10-28 03:35:29",
    "version": "0.0.1",
    "status": "normal",
    //   Default is "normal"

    "platform": "AIArts",
    //   Options: AIArts, Avisualis, Segmentation...

    "models": [
        {
            "name": "LeNet_TensorFlow_GPU_scratch",
            "framework": "tensorflow",
            "model_name": "lenet",
            "description": "lenet-mxnet",
            "size": "20165368",
            //   Default unit is bytes

            "type": "CV/Classification",
            //   Options: CV/ObjectDetection, NLP/BERT, CV/Segmentation...

            "dataset": {
                "name": "mnist",
                "path": "mnist",
                //   Dataset directory name
                //   Real storage path on the NFS server: /data/dataset/storage/mnist
                //   Startup cmd with the dataset path: python train.py --dataset mnist
                "size": "123123",
                //   Dataset size unit: bytes
                "format": "TFRecord"
            },
            "params": {
                "batch_size": "50",
                "epochs": "10",
                "lr": "0.1"
            },
            //   Startup cmd with params: python train.py --batch_size 50 --optimizer sgd
            //   Param values use double quotes

            "engine": "apulistech/mxnet:2.0.0-gpu-py3",
            //   Other engines, e.g.:
            //   apulistech/tensorflow-nni-npu:1.15.0-20.2-arm
            //   apulistech/mindspore-nni-npu:1.1.1-20.2-arm

            "precision": "-",
            //   Model evaluation accuracy

            "output_path": "work_dirs/lenet_mxnet",
            //   work_dirs/{Model Name}
            //   Relative to /home/admin, i.e. /home/admin/work_dirs/lenet_mxnet
            //   Startup cmd with output_path: python train.py --output_path work_dirs/lenet_mxnet

            "startup_file": "train.py",
            //   Training startup files are named: train.py / train.sh or main/train.py
            //   Evaluation startup files are named: eval.py / eval.sh or main/eval.py

            "device_type": "nvidia_gpu_amd64",
            //   Options: nvidia_gpu_amd64, huawei_npu_amd64, huawei_npu_arm64

            "device_num": 1
            //   Whether multi-card or multi-machine training is supported
            //   NPU num options: 0, 1, 2, 4, 8
        }
    ]
}
```
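The config comments show each field being turned into a command-line flag (`python train.py --batch_size 50 ...`). A minimal sketch of that convention, using field names from the config above; the loader itself is an illustration, not the platform's actual code:

```python
import json

# Illustrative loader: build a startup command from a gallery_config.json
# model entry, following the "--{param} {value}" convention in the comments.
entry = json.loads("""
{
  "startup_file": "train.py",
  "output_path": "work_dirs/lenet_mxnet",
  "dataset": {"path": "mnist"},
  "params": {"batch_size": "50", "epochs": "10", "lr": "0.1"}
}
""")

flags = " ".join(f"--{k} {v}" for k, v in entry["params"].items())
cmd = (f"python {entry['startup_file']} "
       f"--dataset {entry['dataset']['path']} "
       f"--output_path {entry['output_path']} {flags}")
print(cmd)
```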
  • All object names, path names, and configuration parameter names uniformly use underscores as separators.

  • The model name format is {model name}_{framework name}_{version number}_{computing device}_{whether trained from scratch}, e.g. LeNet_TensorFlow_GPU_scratch.

  • Datasets and model files need to be placed on the storage server.
