[ST][MS][Profiler enabled][910B3 1p] RuntimeError: Getnext gets peek data from data queue failed: 5

Status: DONE
Type: Bug-Report
Created: 2024-05-01 00:32

Describe the current behavior (Mandatory)

GoogLeNet, single card, Ascend environment, PyNative mode: with the profiler enabled through the on-condition method, network training fails with RuntimeError: Getnext gets peek data from data queue failed: 5
Network path: https://gitee.com/mindspore/models/tree/master/research/cv/googlenet

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend910B3

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    Failing version: r2.3.0.B210
    Run package: Milan_C17/20240414

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

Test case repository path: solution_test/cases/03subject_test/02usability/perf_tuning/profiler_pynative/test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001.py

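Steps to reproduce the issue (Mandatory)

1. Take the googlenet network from the MindSpore model_zoo, change the graph-mode setting in the training script to pynative, enable the profiler with the on-condition method, and run network training.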
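
For readers unfamiliar with the on-condition method, the step above can be sketched as a training callback that keeps the profiler idle until a chosen step. This is a minimal sketch, not the test case's actual code: the class name `ProfileBetweenSteps`, the step numbers, and the output path are assumptions made for illustration, and the `profiler` argument exists only so the logic can be exercised without an Ascend device.

```python
try:
    import mindspore as ms
    CallbackBase = ms.Callback
except ImportError:  # lets the sketch load on machines without MindSpore
    ms = None
    CallbackBase = object


class ProfileBetweenSteps(CallbackBase):
    """Hypothetical callback: profile only from start_step through stop_step."""

    def __init__(self, start_step, stop_step, profiler=None):
        super().__init__()
        self.start_step = start_step
        self.stop_step = stop_step
        if profiler is None:
            # Requires MindSpore. start_profile=False keeps the profiler
            # idle until start() is called explicitly.
            profiler = ms.Profiler(start_profile=False,
                                   output_path="./profiler_data")
        self.profiler = profiler

    def on_train_step_begin(self, run_context):
        # Start collecting when the configured step begins.
        if run_context.original_args().cur_step_num == self.start_step:
            self.profiler.start()

    def on_train_step_end(self, run_context):
        # Stop collecting and parse the data when the configured step ends.
        if run_context.original_args().cur_step_num == self.stop_step:
            self.profiler.stop()
            self.profiler.analyse()
```

In a training script, an instance would be appended to the callback list passed to `model.train(...)`, for example `callbacks=cbs + [ProfileBetweenSteps(2, 3)]`.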

Describe the expected behavior (Mandatory)

The network runs successfully.

Related log / screenshot (Mandatory)

epoch: 3 step: 15, loss is 2.325002908706665
epoch: 3 step: 16, loss is 2.076383113861084
epoch: 3 step: 17, loss is 2.0212104320526123
epoch: 3 step: 18, loss is 1.9688502550125122
epoch: 3 step: 19, loss is 1.938265323638916
epoch: 3 step: 20, loss is 2.209531545639038
Train epoch time: 2720.425 ms, per step time: 136.021 ms
[ERROR] DEVICE(3636114,fffd167cef20,python):2024-04-27-14:55:49.680.622 [mindspore/ccsrc/runtime/data_queue/data_queue_mgr.cc:303] RetryPeakItemFromDataQueue] Getnext gets peek data time out, that most likely caused by data processing being too slow
[CRITICAL] DEVICE(3636114,fffd167cef20,python):2024-04-27-14:55:49.686.048 [mindspore/ccsrc/runtime/data_queue/data_queue_mgr.cc:305] RetryPeakItemFromDataQueue] Getnext gets peek data from data queue failed: 5
Traceback (most recent call last):
  File "/data/jenkins_workspace/TDT_deployment/solution_test/cases/03subject_test/02usability/perf_tuning/profiler_pynative/test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001/train.py", line 273, in <module>
    run_train()
  File "/data/jenkins_workspace/TDT_deployment/solution_test/cases/03subject_test/02usability/perf_tuning/profiler_pynative/test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001/model_utils/moxing_adapter.py", line 105, in wrapped_func
    run_func(*args, **kwargs)
  File "/data/jenkins_workspace/TDT_deployment/solution_test/cases/03subject_test/02usability/perf_tuning/profiler_pynative/test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001/train.py", line 267, in run_train
    model.train(cfg.epoch_size, dataset, callbacks=cbs, dataset_sink_mode=True)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 1082, in train
    self._train(epoch,
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 115, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 636, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback,
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 721, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 715, in __call__
    raise err
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 711, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 483, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/dataset_helper.py", line 108, in construct
    outputs = self.get_next()
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/ops/primitive.py", line 392, in __call__
    return _run_op(self, self.name, args)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/ops/primitive.py", line 1010, in _run_op
    return _convert_stub(stub)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/common/_stub_tensor.py", line 202, in _convert_stub
    elements = stub.get_elements()
RuntimeError: Getnext gets peek data from data queue failed: 5

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/data_queue/data_queue_mgr.cc:305 RetryPeakItemFromDataQueue

Special notes for this issue (Optional)

Assign to Guo Zhijian.

Comments (12)

zhongjicheng created the Bug-Report
zhongjicheng added the labels sig/minddata, attr/function, stage/func-debug, kind/bug, device/ascend, v2.3.0, v2.3.0.rc2

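Please assign maintainer to check this issue.
@zhongjicheng

Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
  1. If you hit a dynamic graph (PyNative) problem, set set_context(pynative_synchronize=True) to see the error stack and help locate it.
  2. For model accuracy tuning problems, see the tuning guide on the official website.
  3. If you are reporting a framework bug, make sure the issue provides the necessary information: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and how to launch the code that reproduces the error.
  4. If you have already located the root cause, you are welcome to submit a PR to the MindSpore open source community; we will review it as soon as possible.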
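fangwenyi changed the assignee from mudongrui to zangqx
fangwenyi added collaborator mudongrui
fangwenyi set the milestone to B-SIG-Visualization

The problem persists with the profiler disabled; Zhijian needs to take another look.

zangqx added collaborator zangqx
zangqx changed the assignee from zangqx to guozhijian
zangqx changed the milestone from B-SIG-Visualization to B-SIG-Data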

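The primary error is a driver out-of-memory failure raised while the dataset sends data to the device side. At that point the dataset module exits and stops sending data, which finally shows up as a get_next timeout.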

epoch: 1 step: 1, loss is 2.3452377319335938
epoch: 1 step: 2, loss is 2.4508824348449707
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.728.921 [npu_driver.cc:3582]1701935 MemQueueEnQueueBuff:[drv api] halQueueEnQueueBuff failed: device_id=0, qid=6, timeout=-1, drvRetCode=6.
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.729.016 [api_c.cc:4213]1701935 rtMemQueueEnQueueBuff:ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.729.050 [error_message_manage.cc:53]1701935 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.729.067 [error_message_manage.cc:53]1701935 FuncErrorReason:rtMemQueueEnQueueBuff execute failed, reason=[driver error:out of memory]
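
The triage above (treat the earliest error as primary, the GetNext timeout as a symptom) can be sketched as a small log scan. This is only an illustration: the helper names `first_error` and `is_driver_oom` are hypothetical, the regular expression assumes the `[LEVEL] MODULE(pid,...)` prefix seen in this thread, and the sample is copied from the excerpt above.

```python
import re

# Assumed prefix shape, e.g. "[ERROR] RUNTIME(1700982,python):..."
ERROR_PREFIX = re.compile(r"^\[(ERROR|CRITICAL)\]\s+(\w+)\(")


def first_error(log_text):
    """Return (level, module, line) for the earliest ERROR/CRITICAL line, or None."""
    for raw in log_text.splitlines():
        line = raw.strip()
        match = ERROR_PREFIX.match(line)
        if match:
            return match.group(1), match.group(2), line
    return None


def is_driver_oom(line):
    """True if the line carries the driver out-of-memory reason string."""
    return "driver error:out of memory" in line


SAMPLE = """\
epoch: 1 step: 1, loss is 2.3452377319335938
epoch: 1 step: 2, loss is 2.4508824348449707
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.728.921 [npu_driver.cc:3582]1701935 MemQueueEnQueueBuff:[drv api] halQueueEnQueueBuff failed: device_id=0, qid=6, timeout=-1, drvRetCode=6.
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.729.016 [api_c.cc:4213]1701935 rtMemQueueEnQueueBuff:ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
"""

level, module, line = first_error(SAMPLE)
print(level, module)
print(any(is_driver_oom(l) for l in SAMPLE.splitlines()))
```

Here the earliest error comes from the RUNTIME module rather than from the data queue manager, which matches the reading that the queue timeout is a downstream effect.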

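Following Dr. Guo's suggestion, the workaround is ms.set_context(max_device_memory="xxGB"), which reduces the memory taken by the network side so that the data-processing side has enough device memory to keep feeding data to the device.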
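
A minimal sketch of that workaround follows. The helper name `cap_device_memory` is hypothetical, and the 56 GB value is the one a later comment in this thread reports as working on 910B3; it is not a general recommendation. The MindSpore import is guarded so the snippet also loads on a machine without MindSpore, in which case it only formats the value.

```python
def cap_device_memory(limit_gb):
    """Format limit_gb as "xxGB" and, if MindSpore is importable, apply it."""
    if limit_gb <= 0:
        raise ValueError("limit_gb must be positive")
    value = f"{limit_gb}GB"
    try:
        import mindspore as ms
    except ImportError:
        return value  # MindSpore not installed: nothing to configure
    # Caps the device memory MindSpore may take for the network,
    # leaving the remainder for the data queue.
    ms.set_context(max_device_memory=value)
    return value


print(cap_device_memory(56))
```

The call belongs near the top of the training script, before the network and dataset are built.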

wenli is helping to verify.

guozhijian changed the task status from TODO to VALIDATION

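CCB conclusion (Guo Qi, Guo Zhijian, Lei Wei): alexnet, googlenet, and other networks all show this problem. The following options are being considered:
1. Test cases work around it by setting ms.set_context(max_device_memory="xxGB").
2. A requirement is already planned for the 2.3 630 release: dynamic allocation of virtual memory, which resolves this problem (ask Mingqi for the requirement ticket number).
3. Adapt the modelzoo script for 910B. This issue is transferred to Zhao Ting and will be regression-tested once handled.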

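leiwei2 changed the task status from VALIDATION to WIP
leiwei2 changed the assignee from guozhijian to zhaoting
leiwei2 added collaborator guozhijian
leiwei2 changed the milestone from B-SIG-Data to B-SIG-ModelZoo
fangwenyi changed the milestone from B-SIG-ModelZoo to B-SIG-Kit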

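@leiwei2 On 910B3, setting 56 GB lets the run pass.

Open item: consider a more user-friendly default configuration for the 910A/910B scenarios. --Guo Qi / Wang Yucheng

Root cause

Data processing on 910B needs more device memory.

Fix

The network script has been modified according to the CCB conclusion; PR: https://gitee.com/mindspore/models/pulls/5407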

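i-robot added the label gitee
zhaoting changed the task status from WIP to VALIDATION
zhaoting added collaborator zhaoting
zhaoting changed the assignee from zhaoting to zhongjicheng
zhaoting changed the milestone from B-SIG-Kit to B-SolutionTest
zhongjicheng changed the assignee from zhongjicheng to wenli

Regression version: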
commit_id = '[sha1]:39ac2284,[branch]:(HEAD,origin/master,origin/HEAD,master)'
runpkg_version:Milan_C17/20240414
Regression steps: follow the reproduction steps in this issue
Basic functionality: test run passes
INFO 2024-05-05 14:38:49 - test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001 - base.py:teardown:140 - The base teardown is running

== 1 passed, 4 warnings in 182.38s (0:03:02) ==

Test conclusion: the ticket workflow was not followed; the rca/, rct/, and ctl/ labels are missing. Returned.

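wenli changed the task status from VALIDATION to TODO
wenli changed the milestone from B-SolutionTest to B-SIG-Kit
wenli added collaborator wenli
wenli changed the assignee from wenli to zhaoting
wenli removed collaborator zhaoting
zhaoting added the labels rca/others, rct/oldrelease, ctl/solutiontest
zhaoting changed the milestone from B-SIG-Kit to B-SolutionTest
zhaoting added collaborator zhaoting
zhaoting changed the assignee from zhaoting to zhongjicheng
zhaoting removed collaborator zhongjicheng
fangwenyi changed the task status from TODO to VALIDATION
zhongjicheng changed the assignee from zhongjicheng to unassigned
zhongjicheng set the assignee to wenli
zhongjicheng removed collaborator wenli
wenli changed the task status from VALIDATION to DONE
fangwenyi removed the label v2.3.0.rc2
fangwenyi added the label master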
