name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
/device ascend
test_ms_model_zoo_ssd_mobilenet_v1_fpn_coco_train_infer_8p_pynative.py
test_ms_retinanet_coco2017_check_loss_8p_0001_pynative.py
The networks fail with an out-of-memory error when run in PyNative mode.
Expected result: the networks train successfully.
[ERROR] ME(43089:281473480737920,MainProcess):2021-09-27-11:18:14.364.973 [mindspore/dataset/engine/datasets.py:2616] Uncaught exception:
Traceback (most recent call last):
File "train.py", line 188, in <module>
train_net()
File "/home/zjc/workspace/solution_test/remaining/test_scripts/mindspore/net/ssd_mobilenet_v1_fpn/network/test_ms_model_zoo_ssd_mobilenet_v1_fpn_coco_train_infer_8p_pynative/LOG0/src/model_utils/moxing_adapter.py", line 104, in wrapped_func
run_func(*args, **kwargs)
File "train.py", line 185, in train_net
model.train(config.epoch_size, dataset, callbacks=callback, dataset_sink_mode=dataset_sink_mode)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 718, in train
sink_size=sink_size)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 502, in _train
self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 564, in _train_dataset_sink_process
outputs = self._train_network(*inputs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
raise err
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
output = self.run_construct(cast_inputs, kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 79, in construct
return self.network(*outputs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
raise err
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
output = self.run_construct(cast_inputs, kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/home/zjc/workspace/solution_test/remaining/test_scripts/mindspore/net/ssd_mobilenet_v1_fpn/network/test_ms_model_zoo_ssd_mobilenet_v1_fpn_coco_train_infer_8p_pynative/LOG0/src/ssd.py", line 521, in construct
grads = self.grad(self.network, weights)(*args, sens)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 77, in wrapper
results = fn(*arg, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 378, in after_grad
out = _pynative_executor(fn, *args, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 438, in __call__
return self._executor(obj, args)
RuntimeError: mindspore/ccsrc/runtime/device/ascend/ascend_memory_pool.cc:139 free_mem_size] graph dynamic mem offset [0] less than or equal to device mem pool offset [0]!
[ERROR] ME(91012:281473025606784,MainProcess):2021-09-27-11:07:26.253.967 [mindspore/dataset/engine/datasets.py:2616] Uncaught exception:
Traceback (most recent call last):
File "train.py", line 192, in <module>
main()
File "/home/zjc/workspace/solution_test/remaining/test_scripts/mindspore/net/retinanet/network/test_ms_retinanet_coco2017_check_loss_8p_0001_pynative/LOG0/src/model_utils/moxing_adapter.py", line 113, in wrapped_func
run_func(*args, **kwargs)
File "train.py", line 185, in main
model.train(10, dataset, callbacks=cb, dataset_sink_mode=True, sink_size=229)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 718, in train
sink_size=sink_size)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 502, in _train
self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 564, in _train_dataset_sink_process
outputs = self._train_network(*inputs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
raise err
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
output = self.run_construct(cast_inputs, kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 79, in construct
return self.network(*outputs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
raise err
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
output = self.run_construct(cast_inputs, kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/home/zjc/workspace/solution_test/remaining/test_scripts/mindspore/net/retinanet/network/test_ms_retinanet_coco2017_check_loss_8p_0001_pynative/LOG0/src/retinanet.py", line 315, in construct
grads = self.grad(self.network, weights)(*args, sens)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 77, in wrapper
results = fn(*arg, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 378, in after_grad
out = _pynative_executor(fn, *args, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 438, in __call__
return self._executor(obj, args)
RuntimeError: mindspore/ccsrc/runtime/device/ascend/ascend_memory_pool.cc:53 CalMemBlockAllocSize] Memory not enough: current free memory size[536870912] is smaller than required size[704972288], dynamic offset [0] memory pool offset[31675383808])
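To make the byte counts in the error message above easier to read, they can be converted to binary units. The snippet below is only an illustrative sketch: the regular expression and the helper name `parse_mem_error` are assumptions made for this example and are not part of MindSpore.

```python
import re

# RuntimeError text from the retinanet log above (source path prefix omitted).
MESSAGE = (
    "Memory not enough: current free memory size[536870912] is smaller than "
    "required size[704972288], dynamic offset [0] memory pool offset[31675383808])"
)

MIB = 1024 ** 2
GIB = 1024 ** 3


def parse_mem_error(message):
    """Extract the three byte counts from the message (hypothetical helper)."""
    pattern = (
        r"free memory size\[(\d+)\].*required size\[(\d+)\]"
        r".*memory pool offset\[(\d+)\]"
    )
    match = re.search(pattern, message)
    if match is None:
        raise ValueError("message does not match the expected layout")
    free, required, pool_offset = (int(group) for group in match.groups())
    return {"free": free, "required": required, "pool_offset": pool_offset}


sizes = parse_mem_error(MESSAGE)
print(f"free:        {sizes['free'] / MIB:.1f} MiB")
print(f"required:    {sizes['required'] / MIB:.1f} MiB")
print(f"pool offset: {sizes['pool_offset'] / GIB:.1f} GiB")
```

Read this way, the failing request is roughly 672 MiB while only 512 MiB is still free, and the memory pool offset already stands at 29.5 GiB.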
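The ssd_mobilenetv1_fpn and retinanet networks fail with an out-of-memory error when run in PyNative mode.
In the current PyNative mode, device memory fragmentation is significant. The memory pool carves memory into 1 GB blocks on GPU and into blocks that are multiples of 256 MB on Ascend, which leaves unusable fragments between blocks. For example, on GPU, with the transform network at batch size 96, the measured fragmentation is about 3 GB.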
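The block-granular allocation described above can be illustrated with a little arithmetic. This is a toy model, not MindSpore's actual allocator: it assumes each request is served from the first existing block with enough room, and otherwise a new block is opened whose size is rounded up to a multiple of 256 MB. Apart from the 704972288-byte request taken from the log, the request sizes are made up.

```python
MIB = 1024 ** 2
BLOCK_UNIT = 256 * MIB  # Ascend block granularity mentioned in the text


def round_up(size, unit=BLOCK_UNIT):
    """Round size up to the next multiple of unit."""
    return -(-size // unit) * unit


def simulate(requests, unit=BLOCK_UNIT):
    """Toy first-fit model of a block-based memory pool.

    Returns (bytes reserved in blocks, bytes left unused in block tails).
    """
    free_tails = []  # free bytes remaining at the end of each block
    reserved = 0
    for size in requests:
        for i, tail in enumerate(free_tails):
            if tail >= size:
                free_tails[i] = tail - size
                break
        else:
            # No existing block has room: open a new, rounded-up block.
            block = round_up(size, unit)
            reserved += block
            free_tails.append(block - size)
    return reserved, sum(free_tails)


# 704972288 is the request size from the log; the others are made-up examples.
requests = [704972288, 300 * MIB, 200 * MIB]
reserved, unused = simulate(requests)
print(f"reserved {reserved / MIB:.0f} MiB, unused tails {unused / MIB:.1f} MiB")
```

In this model the unused tails are scattered across blocks, so a later request larger than any single tail cannot use them even though their total may be large enough.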
A requirement in the r1.6 release tracks this problem:
[STABLE][PyNative] Device memory optimization in PyNative mode
https://e.gitee.com/mind_spore/dashboard?issue=I4CGW1
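In the first step, memory is still insufficient during the backward pass. In theory, once mempool_block_size="31GB" is set, there is no longer any fragmentation between blocks, and the forward-pass memory has already been released by the time the backward pass runs. Other measures are needed to reduce this network's runtime memory, such as operator fusion and in-place operations.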
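Root Cause:
The mempool_block_size interface, which optimizes memory usage in PyNative mode, had not been applied in the ModelZoo scripts.
Fix Solution:
Add the mempool_block_size setting to optimize memory usage in PyNative mode.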
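A minimal sketch of how such a setting might be applied in a training script. `mempool_block_size` is an option of `mindspore.context.set_context` (MindSpore 1.6 and later, documented format "xxGB"); the value "31GB" is the one quoted earlier in this thread. The helper names `build_context_kwargs` and `configure_pynative` are hypothetical, this sketch accepts only whole-number sizes, and the actual change made to the ModelZoo scripts may differ.

```python
import re


def build_context_kwargs(block_size="31GB", device_target="Ascend"):
    """Build keyword arguments for context.set_context (hypothetical helper)."""
    # Assumption of this sketch: only whole-number "<N>GB" values are accepted.
    if not re.fullmatch(r"[1-9]\d*GB", block_size):
        raise ValueError(f"mempool_block_size must look like '31GB', got {block_size!r}")
    return {"device_target": device_target, "mempool_block_size": block_size}


def configure_pynative(block_size="31GB", device_target="Ascend"):
    """Switch to PyNative mode with an enlarged memory pool block.

    Requires MindSpore 1.6 or later and a matching device; call it once
    before the network is built.
    """
    # Imported lazily so build_context_kwargs can be used without MindSpore.
    from mindspore import context

    context.set_context(mode=context.PYNATIVE_MODE,
                        **build_context_kwargs(block_size, device_target))


# In train.py this would be called before creating the network:
#     configure_pynative("31GB")
print(build_context_kwargs())
```

Keeping the argument construction separate from the MindSpore call makes the value easy to check on a machine without an Ascend device.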
Fix PR:
!1815:Fix pynative memory not enough
Test Suggestion:
Test with the ModelZoo scripts dated 2022/1/13 or later (inclusive).
Self-test Report:
Both networks run normally; no out-of-memory error occurred.
[jenkins@10-90-66-64 network]$ tail -f test_ms_retinanet_coco2017_check_loss_8p_0001/LOG2/log.txt
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.512.657 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.561.553 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.610.623 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.659.304 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.709.414 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.853.585 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:04.955.518 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
[WARNING] ME(112009:281473375462528,MainProcess):2022-01-12-10:42:05.592.20 [mindspore/common/_decorator.py:33] 'TensorAdd' is deprecated from version 1.1 and will be removed in a future version, use 'Add' instead.
Create dataset done!
Start train retinanet, the first epoch will be slower because of the graph compilation.
[TRACE] TDT(112009,python):2022-01-12-10:42:17.702.605 [status:Running] [log.cpp:154]Channel "40c2aa0e-7351-11ec-ba45-4cf55bcfc84a": Send Sample Files,[tensor_data_deliver.cpp:279:Send]131493
epoch: 1 step: 1, loss is 3381.634521484375
lr:[0.000001]
epoch: 1 step: 2, loss is 2178.119140625
lr:[0.000001]
epoch: 1 step: 3, loss is 2283.59326171875
lr:[0.000001]
epoch: 1 step: 4, loss is 2815.9267578125
[jenkins@10-90-66-64 network]$ tail -f test_ms_model_zoo_ssd_mobilenet_v1_fpn_coco_train_infer_8p/LOG2/log.txt
(9600, 4)
(2400, 4)
(600, 4)
(150, 4)
Start create dataset!
[WARNING] ME(69520:281473094759552,MainProcess):2022-01-12-10:28:29.121.799 [mindspore/train/serialization.py:627] For 'load_param_into_net', remove parameter network.feature_extractor.mobilenet_v1.network.0.0.weight's prefix name: network., continue to load it to net parameter feature_extractor.mobilenet_v1.network.0.0.weight.
Create dataset done! dataset size is 458
In sink mode, one epoch return a loss.
Start train SSD, the first epoch will be slower because of the graph compilation.
[TRACE] TDT(69520,python):2022-01-12-10:28:36.453.636 [status:Running] [log.cpp:154]Channel "5661bb68-734f-11ec-971e-4cf55bcfc84a": Send Sample Files,[tensor_data_deliver.cpp:279:Send]78033
epoch: 1 step: 1, loss is 9643.1533203125
epoch: 1 step: 2, loss is 10651.611328125
epoch: 1 step: 3, loss is 11576.017578125
epoch: 1 step: 4, loss is 13060.048828125
epoch: 1 step: 5, loss is 9230.03125
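Regression version: daily build of 2022-01-11
Build date: 2022-01-11
Regression steps: follow the reproduction steps in the issue
Basic functionality: the problem is resolved
retinanet
ssd_mobilenet_v1_fpn
Test conclusion: regression passed
Regression tester: zhongjicheng
Regression date: 2022-01-12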