[MS][NET][ssd_mobilenetv1_fpn/retinanet][pynative][ascend 8p] Memory not enough

Status: DONE
Type: Bug-Report
Created: 2021-09-27 20:06

Environment

  • Hardware Environment(Ascend/GPU/CPU):

Uncomment only one /device <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/device ascend

  • Software Environment:
    -- MindSpore version (source or binary):commit_id:58619b2bb
    -- Python version (e.g., Python 3.7.5):
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    -- run package: http://10.90.67.50/productrepo/HiAI/HISI_C79/20210916

Related testcase

test_ms_model_zoo_ssd_mobilenet_v1_fpn_coco_train_infer_8p_pynative.py
test_ms_retinanet_coco2017_check_loss_8p_0001_pynative.py

Steps to reproduce the issue

  1. get code from models
  2. sh run_distribute_train.sh

Describe the current behavior

Running the networks in PyNative mode fails with an out-of-memory error.

Describe the expected behavior

The networks train successfully.

Related log / screenshot

[ERROR] ME(43089:281473480737920,MainProcess):2021-09-27-11:18:14.364.973 [mindspore/dataset/engine/datasets.py:2616] Uncaught exception:
Traceback (most recent call last):
  File "train.py", line 188, in <module>
    train_net()
  File "/home/zjc/workspace/solution_test/remaining/test_scripts/mindspore/net/ssd_mobilenet_v1_fpn/network/test_ms_model_zoo_ssd_mobilenet_v1_fpn_coco_train_infer_8p_pynative/LOG0/src/model_utils/moxing_adapter.py", line 104, in wrapped_func
    run_func(*args, **kwargs)
  File "train.py", line 185, in train_net
    model.train(config.epoch_size, dataset, callbacks=callback, dataset_sink_mode=dataset_sink_mode)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 718, in train
    sink_size=sink_size)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 502, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 564, in _train_dataset_sink_process
    outputs = self._train_network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
    output = self.run_construct(cast_inputs, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 79, in construct
    return self.network(*outputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
    output = self.run_construct(cast_inputs, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/zjc/workspace/solution_test/remaining/test_scripts/mindspore/net/ssd_mobilenet_v1_fpn/network/test_ms_model_zoo_ssd_mobilenet_v1_fpn_coco_train_infer_8p_pynative/LOG0/src/ssd.py", line 521, in construct
    grads = self.grad(self.network, weights)(*args, sens)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 77, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 378, in after_grad
    out = _pynative_executor(fn, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 438, in __call__
    return self._executor(obj, args)
RuntimeError: mindspore/ccsrc/runtime/device/ascend/ascend_memory_pool.cc:139 free_mem_size] graph dynamic mem offset [0] less than or equal to device mem pool offset [0]!

#



[ERROR] ME(91012:281473025606784,MainProcess):2021-09-27-11:07:26.253.967 [mindspore/dataset/engine/datasets.py:2616] Uncaught exception:
Traceback (most recent call last):
  File "train.py", line 192, in <module>
    main()
  File "/home/zjc/workspace/solution_test/remaining/test_scripts/mindspore/net/retinanet/network/test_ms_retinanet_coco2017_check_loss_8p_0001_pynative/LOG0/src/model_utils/moxing_adapter.py", line 113, in wrapped_func
    run_func(*args, **kwargs)
  File "train.py", line 185, in main
    model.train(10, dataset, callbacks=cb, dataset_sink_mode=True, sink_size=229)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 718, in train
    sink_size=sink_size)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 502, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 564, in _train_dataset_sink_process
    outputs = self._train_network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
    output = self.run_construct(cast_inputs, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 79, in construct
    return self.network(*outputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
    output = self.run_construct(cast_inputs, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/zjc/workspace/solution_test/remaining/test_scripts/mindspore/net/retinanet/network/test_ms_retinanet_coco2017_check_loss_8p_0001_pynative/LOG0/src/retinanet.py", line 315, in construct
    grads = self.grad(self.network, weights)(*args, sens)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 77, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 378, in after_grad
    out = _pynative_executor(fn, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 438, in __call__
    return self._executor(obj, args)
RuntimeError: mindspore/ccsrc/runtime/device/ascend/ascend_memory_pool.cc:53 CalMemBlockAllocSize] Memory not enough: current free memory size[536870912] is smaller than required size[704972288], dynamic offset [0] memory pool offset[31675383808])
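
The byte counts in the retinanet message are easier to read in binary units. The sketch below parses the message quoted above and converts the figures. The regular expression and the helper name `parse_sizes` are illustrative only, and reading "memory pool offset" as the memory the pool has already claimed is an assumption, not something the log states.

```python
import re

# Message text copied from the RuntimeError above (file/line prefix omitted).
MESSAGE = (
    "Memory not enough: current free memory size[536870912] is smaller than "
    "required size[704972288], dynamic offset [0] memory pool offset[31675383808])"
)

MIB = 1024 ** 2
GIB = 1024 ** 3


def parse_sizes(message):
    """Extract the bracketed byte counts from the allocator message (hypothetical helper)."""
    pattern = (
        r"free memory size\[(\d+)\].*required size\[(\d+)\].*"
        r"memory pool offset\[(\d+)\]"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("unrecognised message format")
    free, required, pool_offset = (int(group) for group in match.groups())
    return {"free": free, "required": required, "pool_offset": pool_offset}


sizes = parse_sizes(MESSAGE)
shortfall = sizes["required"] - sizes["free"]

print(f"free:        {sizes['free'] / MIB:.1f} MiB")
print(f"required:    {sizes['required'] / MIB:.1f} MiB")
print(f"shortfall:   {shortfall / MIB:.1f} MiB")
print(f"pool offset: {sizes['pool_offset'] / GIB:.1f} GiB")
```

The request is for about 672.3 MiB while 512 MiB is free, a shortfall of about 160.3 MiB, with the pool offset at 29.5 GiB.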

Special notes for this issue

The ssd_mobilenetv1_fpn and retinanet networks fail with an out-of-memory error when run in PyNative mode.

Comments (4)

zhongjicheng created the Bug-Report
zhongjicheng set the assignee to zjun
zhongjicheng set the linked repository to MindSpore/models
zhongjicheng set the planned start date to 2021-09-27
zhongjicheng set the planned due date to 2021-09-30
zhongjicheng set the milestone to B-SIG-ModelZoo
zhongjicheng set the priority to Major
zhongjicheng added the attr/function label
zhongjicheng added the stage/func-debug label
zhongjicheng added the sig/modelzoo label
zhongjicheng added the kind/bug label
xiangjiawei007 edited the description
chujinjin added the kind/discussion (deleted) label

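In the current PyNative mode, device memory fragmentation is fairly large. The memory pool splits memory into 1 GB blocks on GPU and into blocks that are multiples of 256 MB on Ascend, which leaves fragments between blocks. For example, on GPU, measured with transform at batch size 96, the fragmentation is about 3 GB.
A requirement in the r1.6 release tracks this problem:
[STABLE][PyNative] Device memory optimization in PyNative mode
https://e.gitee.com/mind_spore/dashboard?issue=I4CGW1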
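
How block granularity can strand memory between blocks may be easier to see in a toy model. The sketch below is not MindSpore's allocator: it assumes equal-sized requests, bump allocation into the free tail of a block, and a hypothetical device budget. The function name and all numbers are made up for illustration.

```python
import math


def count_successful_allocs(request_mb, device_mb, block_mb):
    """Toy model: count equal-sized requests served before the device budget runs out."""
    blocks = []    # each block is [capacity_mb, used_mb]
    reserved = 0   # device memory already handed to the pool
    served = 0
    while True:
        # First try to place the request in the free tail of an existing block.
        for block in blocks:
            if block[0] - block[1] >= request_mb:
                block[1] += request_mb
                break
        else:
            # No block has room: reserve a new one, rounded up to the block granularity.
            capacity = math.ceil(request_mb / block_mb) * block_mb
            if reserved + capacity > device_mb:
                return served  # the device cannot supply another block
            blocks.append([capacity, request_mb])
            reserved += capacity
        served += 1


# 200 MB requests on a hypothetical 1024 MB device.
small_blocks = count_successful_allocs(request_mb=200, device_mb=1024, block_mb=256)
one_big_block = count_successful_allocs(request_mb=200, device_mb=1024, block_mb=1024)
print(small_blocks, one_big_block)
```

With 256 MB blocks each block holds one 200 MB request and strands 56 MB, so four requests fit. With one 1024 MB block the same device serves five.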

chujinjin changed the task type from Bug-Report to RFC
chujinjin added the ccb/rfc label
chujinjin changed the milestone from B-SIG-Executor-PYNATIVE to unset
chujinjin set the milestone to IT-2021Q4-用户编程

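In the first step, memory still runs out during the backward pass. In theory, once mempool_block_size="31GB" is set there is no longer any fragmentation between blocks, and the forward-pass memory has already been released while the backward pass executes. Other measures are needed to reduce this network's runtime memory, such as operator fusion and in-place operations.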
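
For reference, a minimal sketch of how the mempool_block_size setting mentioned above might be applied, assuming a MindSpore version whose `context.set_context` accepts `mempool_block_size` as a string such as "31GB". The helper names `mempool_block_size_arg` and `configure_pynative` are hypothetical and are not taken from the ModelZoo scripts or from the fix PR.

```python
def mempool_block_size_arg(size_gb):
    """Format a whole number of gigabytes as the string form used in this issue, e.g. "31GB"."""
    if not isinstance(size_gb, int) or size_gb < 1:
        raise ValueError("size_gb must be a positive integer")
    return f"{size_gb}GB"


def configure_pynative(size_gb=31, device_target="Ascend"):
    """Switch to PyNative mode with a single large memory-pool block (hypothetical helper)."""
    # Imported lazily so the formatting helper above works without MindSpore installed.
    from mindspore import context

    context.set_context(
        mode=context.PYNATIVE_MODE,
        device_target=device_target,
        mempool_block_size=mempool_block_size_arg(size_gb),
    )


print(mempool_block_size_arg(31))
```

A training script would call `configure_pynative()` once, before building the network and the dataset.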

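Root Cause:
The mempool_block_size interface, which optimizes memory use in PyNative mode, had not been applied in the ModelZoo scripts.

Fix Solution:
Set mempool_block_size in the scripts to optimize memory use in PyNative mode.

Fix PR:
!1815:Fix pynative memory not enough

Test Suggestion:
Test with the ModelZoo scripts dated 2022/1/13 or later.

Self-test Report:
The networks run normally and no out-of-memory error occurs.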

retinanet

[jenkins@10-90-66-64 network]$ tail -f test_ms_retinanet_coco2017_check_loss_8p_0001/LOG2/log.txt
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.512.657 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.561.553 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.610.623 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.659.304 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.709.414 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.853.585 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.955.518 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:05.592.20 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
Create dataset done!
Start train retinanet, the first epoch will be slower because of the graph compilation.
[TRACE] TDT(112009,python):2022-01-12-10:42:17.702.605 [status:Running] [log.cpp:154]Channel "40c2aa0e-7351-11ec-ba45-4cf55bcfc84a": Send Sample Files,[tensor_data_deliver.cpp:279:Send]131493
epoch: 1 step: 1, loss is 3381.634521484375
lr:[0.000001]
epoch: 1 step: 2, loss is 2178.119140625
lr:[0.000001]
epoch: 1 step: 3, loss is 2283.59326171875
lr:[0.000001]
epoch: 1 step: 4, loss is 2815.9267578125

ssd

[jenkins@10-90-66-64 network]$ tail -f test_ms_model_zoo_ssd_mobilenet_v1_fpn_coco_train_infer_8p/LOG2/log.txt
(9600, 4)
(2400, 4)
(600, 4)
(150, 4)
Start create dataset!
[WARNING] ME(69520:281473094759552,MainProcess):2022-01-12-10:28:29.121.799 [mindspore/train/serialization.py:627] For 'load_param_into_net', remove parameter network.feature_extractor.mobilenet_v1.network.0.0.weight's prefix name: network., continue to load it to net parameter feature_extractor.mobilenet_v1.network.0.0.weight.
Create dataset done! dataset size is 458
In sink mode, one epoch return a loss.
Start train SSD, the first epoch will be slower because of the graph compilation.
[TRACE] TDT(69520,python):2022-01-12-10:28:36.453.636 [status:Running] [log.cpp:154]Channel "5661bb68-734f-11ec-971e-4cf55bcfc84a": Send Sample Files,[tensor_data_deliver.cpp:279:Send]78033
epoch: 1 step: 1, loss is 9643.1533203125
epoch: 1 step: 2, loss is 10651.611328125
epoch: 1 step: 3, loss is 11576.017578125
epoch: 1 step: 4, loss is 13060.048828125
epoch: 1 step: 5, loss is 9230.03125

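zjun changed the milestone from IT-2021Q4-用户编程 to unset
zjun removed the ccb/rfc label
zjun removed the kind/discussion (deleted) label
zjun set the milestone to B-SolutionTest
zjun changed the task type from RFC to Bug-Report
zjun changed the task status from TODO to VALIDATION
zjun added zjun as a collaborator
zjun changed the assignee from zjun to zhongjicheng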

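Regression version: daily build of 2022-01-11
Build time: 2022-01-11
Regression steps: follow the reproduction steps in this issue
Basic functionality: the problem is resolved
retinanet
[screenshot not preserved]
ssd_mobilenet_v1_fpn
[screenshot not preserved]
Test conclusion: regression passed
Regression tester: zhongjicheng
Regression date: 2022-01-12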

zhongjicheng changed the task status from VALIDATION to DONE
