[TOC]
Model-Gallery is a developer community built on the Apulis AI Platform. It provides sharing of models, algorithms, datasets, and other content.
Tip: MindSpore models use the NPU logical ID, while TensorFlow models use the NPU physical ID.
$HOME/<USERNAME>/<Code Storage Path>
cd ~
# If you can clone the model-gallery repository
# git config --global credential.helper store
# git clone --depth 1 https://apulis-gitlab.apulis.cn/apulis/model-gallery.git
# otherwise, unzip the uploaded model-gallery.zip
cd ~/model-gallery/models/npu/testcase/distributed
bash test_distributed.sh
Select "Model Training" on the toolbar.
Create a new model training job.
Startup file: model-gallery/models/npu/testcase/distributed/test_distributed.sh
Output path: work_dirs/testcase
Dataset path: /data/dataset/storage/mnist/ (or another dataset you need)
Then create the distributed code.
After the environment status becomes Running, open the master (ps-0) Jupyter window.
Change directory to $HOME/<USERNAME>/<Code Storage Path>
An NPU distributed job requires running the start command on each worker node.
ssh worker-0
cd ~/model-gallery/models/npu/testcase/distributed/
bash test_distributed.sh
ssh worker-1
cd ~/model-gallery/models/npu/testcase/distributed/
bash test_distributed.sh
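The per-worker launch steps above can also be scripted. A minimal sketch, assuming the worker names and paths shown in the ssh commands above:

```python
import subprocess

# Worker nodes that must each run the start command (names taken from the
# ssh steps above).
WORKERS = ["worker-0", "worker-1"]
START_CMD = ("cd ~/model-gallery/models/npu/testcase/distributed/ "
             "&& bash test_distributed.sh")

def build_launch(worker):
    """Build the ssh invocation that runs the start command on one worker."""
    return ["ssh", worker, START_CMD]

def launch_all(runner=subprocess.Popen):
    """Start the command on every worker concurrently, then wait for all."""
    procs = [runner(build_launch(w)) for w in WORKERS]
    return [p.wait() for p in procs]
```

Calling `launch_all()` starts the test on both workers in parallel instead of ssh-ing into each node by hand.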
# If the GPU node has no InfiniBand NIC, remove the --network-interface ib0 parameter
horovodrun --network-interface ib0 -np 4 -hostfile /job/hostfile python /examples/tensorflow2_keras_mnist.py
# -np 4 : 2 machines, 4 cards in total (2 per machine)
# pytorch
horovodrun --network-interface ib0 -np 2 -hostfile /job/hostfile python /examples/pytorch_mnist.py
# -np 2 : 1 machine, 2 cards
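Horovod reads the host list from the file passed via `-hostfile`; each line names one host and its slot (card) count, and `-np` must not exceed the total slots. A sketch of `/job/hostfile` for the 2-machine, 2-cards-each case (the hostnames here are assumptions):

```
worker-0 slots=2
worker-1 slots=2
```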
# A MindSpore distributed job cannot place multiple processes on the same node; this is a MindSpore framework bug.
cd ~/code/resnet_mindspore && bash run.sh
bash ~/model-gallery/models/npu/testcase/distributed/test_distributed.sh
# tensorflow
horovodrun --network-interface ib0 -np 4 -hostfile /job/hostfile python /examples/tensorflow2_keras_mnist.py
# pytorch
horovodrun --network-interface ib0 -np 2 -hostfile /job/hostfile python /examples/pytorch_mnist.py
# mindspore
cd ~/code/resnet_mindspore && bash run.sh
cd ~
git config --global credential.helper store
git clone --depth 1 https://apulis-gitlab.apulis.cn/apulis/model-gallery.git
cd ~/model-gallery
# If git pull produces a conflict, discard local changes:
git reset --hard
~/model-gallery/models/npu/mindspore_train.sh
cp ~/model-gallery/models/npu/mindspore_train.sh ~/model-gallery/models/npu/{your_model}/train.sh
Replace the startup file in train.sh, e.g. STARTFILE="train.py"
In train.py, change args = parser.parse_args()
to args, _ = parser.parse_known_args()
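The platform may append CLI flags that train.py does not declare; `parse_known_args()` tolerates them where `parse_args()` would exit with an error. A minimal sketch (the `--some_platform_flag` name is a made-up example):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=50)
parser.add_argument("--lr", type=float, default=0.1)

# parse_args() would abort on flags the script does not declare;
# parse_known_args() returns the unrecognized ones separately instead.
args, unknown = parser.parse_known_args(
    ["--batch_size", "32", "--some_platform_flag", "1"]
)
print(args.batch_size, unknown)  # 32 ['--some_platform_flag', '1']
```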
Move the training datasets under the directory /data/dataset/storage/
cp -r mnist /data/dataset/storage/
bash train.sh
{
    "created_at": "2020-10-28 03:35:29",
    "updated_at": "2020-10-28 03:35:29",
    "version": "0.0.1",
    "status": "normal",
    // Default is normal
    "platform": "AIArts",
    // Options: AIArts, Avisualis, Segmentation...
    "models": [
        {
            "name": "LeNet_TensorFlow_GPU_scratch",
            "framework": "tensorflow",
            "model_name": "lenet",
            "description": "lenet-mxnet",
            "size": "20165368",
            // Default unit: bytes
            "type": "CV/Classification",
            // Options: CV/ObjectDetection, NLP/BERT, CV/Segmentation...
            "dataset": {
                "name": "mnist",
                "path": "mnist",
                // Dataset directory name
                // Real storage path on the NFS server: /data/dataset/storage/mnist
                // The startup cmd with the dataset path: python train.py --dataset mnist
                "size": "123123",
                // Dataset size unit: bytes
                "format": "TFRecord"
            },
            "params": {
                "batch_size": "50",
                "epochs": "10",
                "lr": "0.1"
            },
            // The startup cmd with params: python train.py --batch_size 50 --optimizer sgd
            // Param values must be double-quoted strings
            "engine": "apulistech/mxnet:2.0.0-gpu-py3",
            // apulistech/tensorflow-nni-npu:1.15.0-20.2-arm
            // apulistech/mindspore-nni-npu:1.1.1-20.2-arm
            "precision": "-",
            // Model evaluation accuracy
            "output_path": "work_dirs/lenet_mxnet",
            // work_dirs/{Model Name}
            // Resolves to /home/admin/work_dirs/lenet_mxnet
            // The startup cmd with output_path: python train.py --output_path work_dirs/lenet_mxnet
            "startup_file": "train.py",
            // All training startup files are named train.py / train.sh or main/train.py
            // All evaluation startup files are named eval.py / eval.sh or main/eval.py
            "device_type": "nvidia_gpu_amd64",
            // Options: nvidia_gpu_amd64, huawei_npu_amd64, huawei_npu_arm64
            "device_num": 1
            // Number of devices to use; >1 enables multi-card or multi-machine training
            // NPU num options: 0, 1, 2, 4, 8
        }
    ]
}
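The metadata above uses //-style comments, which standard JSON parsers reject. A minimal loader sketch that strips whole-line comments before decoding; the required-key list is an assumption drawn from the fields shown above:

```python
import json
import re

# Keys every model entry above provides; treated here as required (assumption).
REQUIRED_KEYS = ("name", "framework", "engine", "startup_file",
                 "device_type", "device_num")

def load_model_meta(text):
    """Drop the whole-line // comments used above, then parse as JSON."""
    stripped = re.sub(r"^\s*//.*$", "", text, flags=re.MULTILINE)
    meta = json.loads(stripped)
    for model in meta.get("models", []):
        missing = [k for k in REQUIRED_KEYS if k not in model]
        if missing:
            raise ValueError("model entry missing keys: %s" % missing)
    return meta
```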
All object names, path names, and configuration parameter names uniformly use underscore-separated (snake_case) naming.
The model name format is {model name}_{framework name}_{version number}_{computing device}_{whether to train from scratch}, for example: LeNet_TensorFlow_GPU_scratch
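That naming rule can be checked mechanically. A hedged sketch; tokens beyond the single LeNet_TensorFlow_GPU_scratch example are assumptions (device may be GPU or NPU, and "pretrained" is a guessed counterpart to "scratch"):

```python
import re

# Underscore-separated alphanumeric parts, ending in device and
# from-scratch tokens; version part is optional, matching the example.
NAME_RE = re.compile(
    r"^[A-Za-z0-9]+(_[A-Za-z0-9.]+)*_(GPU|NPU)_(scratch|pretrained)$")

def is_valid_model_name(name):
    return bool(NAME_RE.match(name))
```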
Datasets and model files must be placed on the storage server.