name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
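googlenet network: when testing the single-card googlenet network on Ascend in PyNative mode with the profiler enabled via the conditional-start method, network training fails with RuntimeError: Getnext gets peek data from data queue failed: 5
Network path: https://gitee.com/mindspore/models/tree/master/research/cv/googlenet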
Hardware Environment (Ascend/GPU/CPU):
Please delete the backend not involved:
/device ascend910B3
Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx) :
-- Python version (e.g., Python 3.7.5) :
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
Failing version: r2.3.0.B210
Run package: Milan_C17/20240414
Execute Mode (Mandatory) (PyNative/Graph):
Please delete the mode not involved:
/mode graph
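Test case repository path: solution_test/cases/03subject_test/02usability/perf_tuning/profiler_pynative/test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001.py
1. Take the googlenet network from the MindSpore model_zoo, change the graph-mode setting in the training script to PyNative, enable the profiler using the conditional-start method, and run network training.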
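The test script itself is not shown in this issue, so the following is only a minimal sketch of what "conditional start" could look like: a training callback that starts the profiler at one step and stops it at another. The class name `ProfileStepWindow` and the step numbers are hypothetical. The profiler object is passed in so the step logic can be exercised without MindSpore installed.

```python
# Sketch only: class name and step window are illustrative assumptions,
# not taken from the failing test case.
try:
    from mindspore.train import Callback as CallbackBase
except ImportError:  # allows the step logic to run without MindSpore
    CallbackBase = object


class ProfileStepWindow(CallbackBase):
    """Start a profiler at start_step and stop/analyse it at stop_step."""

    def __init__(self, profiler, start_step, stop_step):
        super().__init__()
        self.profiler = profiler
        self.start_step = start_step
        self.stop_step = stop_step

    def on_train_step_begin(self, run_context):
        # cur_step_num is the global step counter kept by Model.train
        if run_context.original_args().cur_step_num == self.start_step:
            self.profiler.start()

    def on_train_step_end(self, run_context):
        if run_context.original_args().cur_step_num == self.stop_step:
            self.profiler.stop()
            self.profiler.analyse()


# Intended use with MindSpore installed (not executed here):
#   import mindspore as ms
#   ms.set_context(mode=ms.PYNATIVE_MODE)
#   profiler = ms.Profiler(start_profile=False)  # created, but not collecting yet
#   cbs.append(ProfileStepWindow(profiler, start_step=2, stop_step=5))
```

Creating the profiler with `start_profile=False` defers collection until `start()` is called, which is what makes the start conditional.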
Expected result: network inference succeeds.
epoch: 3 step: 15, loss is 2.325002908706665
epoch: 3 step: 16, loss is 2.076383113861084
epoch: 3 step: 17, loss is 2.0212104320526123
epoch: 3 step: 18, loss is 1.9688502550125122
epoch: 3 step: 19, loss is 1.938265323638916
epoch: 3 step: 20, loss is 2.209531545639038
Train epoch time: 2720.425 ms, per step time: 136.021 ms
[ERROR] DEVICE(3636114,fffd167cef20,python):2024-04-27-14:55:49.680.622 [mindspore/ccsrc/runtime/data_queue/data_queue_mgr.cc:303] RetryPeakItemFromDataQueue] Getnext gets peek data time out, that most likely caused by data processing being too slow
[CRITICAL] DEVICE(3636114,fffd167cef20,python):2024-04-27-14:55:49.686.048 [mindspore/ccsrc/runtime/data_queue/data_queue_mgr.cc:305] RetryPeakItemFromDataQueue] Getnext gets peek data from data queue failed: 5
Traceback (most recent call last):
File "/data/jenkins_workspace/TDT_deployment/solution_test/cases/03subject_test/02usability/perf_tuning/profiler_pynative/test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001/train.py", line 273, in <module>
run_train()
File "/data/jenkins_workspace/TDT_deployment/solution_test/cases/03subject_test/02usability/perf_tuning/profiler_pynative/test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001/model_utils/moxing_adapter.py", line 105, in wrapped_func
run_func(*args, **kwargs)
File "/data/jenkins_workspace/TDT_deployment/solution_test/cases/03subject_test/02usability/perf_tuning/profiler_pynative/test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001/train.py", line 267, in run_train
model.train(cfg.epoch_size, dataset, callbacks=cbs, dataset_sink_mode=True)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 1082, in train
self._train(epoch,
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 115, in wrapper
func(self, *args, **kwargs)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 636, in _train
self._train_dataset_sink_process(epoch, train_dataset, list_callback,
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 721, in _train_dataset_sink_process
outputs = train_network(*inputs)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 715, in __call__
raise err
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 711, in __call__
output = self._run_construct(args, kwargs)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 483, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/dataset_helper.py", line 108, in construct
outputs = self.get_next()
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/ops/primitive.py", line 392, in __call__
return _run_op(self, self.name, args)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/ops/primitive.py", line 1010, in _run_op
return _convert_stub(stub)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/common/_stub_tensor.py", line 202, in _convert_stub
elements = stub.get_elements()
RuntimeError: Getnext gets peek data from data queue failed: 5
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/data_queue/data_queue_mgr.cc:305 RetryPeakItemFromDataQueue
Assign to 郭志健.
Please assign maintainer to check this issue.
@zhongjicheng
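The problem persists with the profiler turned off; 志建 needs to take another look.
The primary error is a driver out-of-memory failure when the dataset sends data to the device side. At that point the dataset module exits and stops sending data, which finally surfaces as the GetNext timeout.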
epoch: 1 step: 1, loss is 2.3452377319335938
epoch: 1 step: 2, loss is 2.4508824348449707
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.728.921 [npu_driver.cc:3582]1701935 MemQueueEnQueueBuff:[drv api] halQueueEnQueueBuff failed: device_id=0, qid=6, timeout=-1, drvRetCode=6.
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.729.016 [api_c.cc:4213]1701935 rtMemQueueEnQueueBuff:ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.729.050 [error_message_manage.cc:53]1701935 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.729.067 [error_message_manage.cc:53]1701935 FuncErrorReason:rtMemQueueEnQueueBuff execute failed, reason=[driver error:out of memory]
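Because the GetNext timeout is only the last symptom, it can help to check whether a driver out-of-memory line appears earlier in the same log. The helper below is a rough sketch of that check. It relies only on the two message fragments quoted in this issue, and the function name and return labels are hypothetical.

```python
def classify_getnext_failure(log_text: str) -> str:
    """Report whether a driver OOM line precedes the GetNext failure."""
    saw_oom = False
    for line in log_text.splitlines():
        if "driver error:out of memory" in line:
            saw_oom = True
        elif "Getnext gets peek data" in line:
            # First GetNext complaint: decide based on what came before it.
            return "driver_oom_then_timeout" if saw_oom else "timeout_only"
    return "driver_oom" if saw_oom else "no_match"
```

A result of `timeout_only` would suggest looking at slow data processing instead, as the original ERROR message hints.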
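Following 郭博's suggestion, the workaround is to call ms.set_context(max_device_memory="xxGB") to reduce memory usage on the network side, so that the data-processing side has enough device memory left to feed data to the device normally.
wenli is helping to verify this.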
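The suggested workaround could be applied near the top of the training script, roughly as sketched below. The thread leaves the actual limit as "xxGB", so the 20 GB figure in the usage comment is purely a placeholder, and both helper names are hypothetical.

```python
def device_memory_limit(gigabytes: int) -> str:
    """Format a limit in the "xxGB" form used by max_device_memory."""
    if gigabytes <= 0:
        raise ValueError("memory limit must be a positive number of GB")
    return f"{gigabytes}GB"


def apply_memory_cap(gigabytes: int) -> bool:
    """Cap the network's device memory; returns False if MindSpore is absent."""
    try:
        import mindspore as ms
    except ImportError:
        return False
    # Must run before the network is built so the memory pool honors the cap.
    ms.set_context(max_device_memory=device_memory_limit(gigabytes))
    return True


# Hypothetical usage; choose the value for the actual device and dataset:
#   apply_memory_cap(20)
```

The right value depends on how much device memory the data queue needs, which this issue does not quantify.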
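CCB conclusion (郭琦, 郭志建, 雷伟): alexnet, googlenet, and other networks all have this problem. The following options are being considered:
1. Work around it in the test cases by setting ms.set_context(max_device_memory="xxGB").
2. A requirement is already planned for the 2.3 630 release: dynamic allocation of virtual memory, which resolves this problem (ask 明奇 for the requirement ticket number).
3. Adapt the modelzoo script for 910B. This issue is transferred to 赵婷 and will be regression-tested once that is done.
Open item: consider a more user-friendly default configuration for the 910A/910B scenarios. --郭琦/王禹程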
Regression version:
commit_id = '[sha1]:39ac2284,[branch]:(HEAD,origin/master,origin/HEAD,master)'
runpkg_version:Milan_C17/20240414
Regression steps: follow the reproduction steps in the issue
Basic functionality: test run passes
INFO 2024-05-05 14:38:49 - test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001 - base.py:teardown:140 - The base teardown is running
== 1 passed, 4 warnings in 182.38s (0:03:02) ==
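Test conclusion: the ticket did not follow the required process; the rca/, rct/, and ctl/ labels were not applied, so it is sent back.
Corrected.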