name	about	labels
Bug Report	Use this template for reporting a bug	kind/bug

Describe the current behavior (Mandatory)

GoogLeNet network: when the single-card googlenet network is tested on Ascend in PyNative mode with the profiler enabled through the on-condition method, training fails with RuntimeError: Getnext gets peek data from data queue failed: 5
Network path: https://gitee.com/mindspore/models/tree/master/research/cv/googlenet

Environment (Mandatory)

Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend910B3

Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx) :
-- Python version (e.g., Python 3.7.5) :
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
Failing version: r2.3.0.B210
Run package: Milan_C17/20240414
Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

Test case repository path: solution_test/cases/03subject_test/02usability/perf_tuning/profiler_pynative/test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001.py

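Steps to reproduce the issue (Mandatory)

1. Take the googlenet network from the MindSpore model_zoo, set the graph-mode parameter in the training script to PyNative, then enable the profiler through the on-condition method and run network training.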
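
For readers unfamiliar with enabling the profiler on condition, the idea is to start profiling only at a chosen step and stop it at a later one. The sketch below shows just that gating logic in plain Python. It is not the test case's script, which this issue does not include. The names StepWindowGate, RecordingProfiler and simulate are hypothetical. In a real MindSpore script the same logic would usually sit in a training callback around a profiler created with start_profile=False, and the stand-in object here only mimics the start(), stop() and analyse() calls.

```python
class StepWindowGate:
    """Start a profiler when start_step begins and stop it when stop_step ends."""

    def __init__(self, profiler, start_step, stop_step):
        if start_step > stop_step:
            raise ValueError("start_step must not exceed stop_step")
        self.profiler = profiler
        self.start_step = start_step
        self.stop_step = stop_step

    def on_step_begin(self, step):
        # Profiling is switched on only when the chosen step is reached.
        if step == self.start_step:
            self.profiler.start()

    def on_step_end(self, step):
        # Stop collecting and analyse the data once the window is over.
        if step == self.stop_step:
            self.profiler.stop()
            self.profiler.analyse()


class RecordingProfiler:
    """Hypothetical stand-in that records calls instead of profiling."""

    def __init__(self):
        self.calls = []

    def start(self):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")

    def analyse(self):
        self.calls.append("analyse")


def simulate(gate, num_steps):
    """Drive the gate through steps 1..num_steps and return the recorded calls."""
    for step in range(1, num_steps + 1):
        gate.on_step_begin(step)
        gate.on_step_end(step)
    return gate.profiler.calls
```

Calling simulate(StepWindowGate(RecordingProfiler(), 2, 4), 5) drives five steps and shows that the profiler is touched only inside the step 2 to step 4 window.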

Describe the expected behavior (Mandatory)

Network inference succeeds.

Related log / screenshot (Mandatory)

epoch: 3 step: 15, loss is 2.325002908706665
epoch: 3 step: 16, loss is 2.076383113861084
epoch: 3 step: 17, loss is 2.0212104320526123
epoch: 3 step: 18, loss is 1.9688502550125122
epoch: 3 step: 19, loss is 1.938265323638916
epoch: 3 step: 20, loss is 2.209531545639038
Train epoch time: 2720.425 ms, per step time: 136.021 ms
[ERROR] DEVICE(3636114,fffd167cef20,python):2024-04-27-14:55:49.680.622 [mindspore/ccsrc/runtime/data_queue/data_queue_mgr.cc:303] RetryPeakItemFromDataQueue] Getnext gets peek data time out, that most likely caused by data processing being too slow
[CRITICAL] DEVICE(3636114,fffd167cef20,python):2024-04-27-14:55:49.686.048 [mindspore/ccsrc/runtime/data_queue/data_queue_mgr.cc:305] RetryPeakItemFromDataQueue] Getnext gets peek data from data queue failed: 5
Traceback (most recent call last):
  File "/data/jenkins_workspace/TDT_deployment/solution_test/cases/03subject_test/02usability/perf_tuning/profiler_pynative/test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001/train.py", line 273, in <module>
    run_train()
  File "/data/jenkins_workspace/TDT_deployment/solution_test/cases/03subject_test/02usability/perf_tuning/profiler_pynative/test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001/model_utils/moxing_adapter.py", line 105, in wrapped_func
    run_func(*args, **kwargs)
  File "/data/jenkins_workspace/TDT_deployment/solution_test/cases/03subject_test/02usability/perf_tuning/profiler_pynative/test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001/train.py", line 267, in run_train
    model.train(cfg.epoch_size, dataset, callbacks=cbs, dataset_sink_mode=True)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 1082, in train
    self._train(epoch,
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 115, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 636, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback,
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 721, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 715, in __call__
    raise err
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 711, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 483, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/dataset_helper.py", line 108, in construct
    outputs = self.get_next()
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/ops/primitive.py", line 392, in __call__
    return _run_op(self, self.name, args)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/ops/primitive.py", line 1010, in _run_op
    return _convert_stub(stub)
  File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/common/_stub_tensor.py", line 202, in _convert_stub
    elements = stub.get_elements()
RuntimeError: Getnext gets peek data from data queue failed: 5

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/data_queue/data_queue_mgr.cc:305 RetryPeakItemFromDataQueue

Special notes for this issue (Optional)

Assign to Guo Zhijian.

Please assign maintainer to check this issue.
@zhongjicheng

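Thank you for your question. You can comment //mindspore-assistant to get help faster:

If you are new to MindSpore, you may find the answer in the tutorials.
If you are an experienced PyTorch user, you may need:

If you hit a dynamic-graph (PyNative) problem, you can set set_context(pynative_synchronize=True) to view the error stack and help locate it.
For model accuracy tuning problems, refer to the tuning guide on the official website.
If you are reporting a framework bug, please confirm that the issue provides the information needed for diagnosis: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, an official link to the training code, and how to launch the code that reproduces the error.
If you have already identified the root cause, you are welcome to submit a PR and take part in the MindSpore open-source community; we will review it as soon as possible.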

The problem persists with the profiler disabled; Zhijian needs to take another look.

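The primary error is a driver out-of-memory failure raised when the dataset module sends data to the device. The dataset module then exits and stops sending data, which finally surfaces as a GetNext timeout.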

epoch: 1 step: 1, loss is 2.3452377319335938
epoch: 1 step: 2, loss is 2.4508824348449707
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.728.921 [npu_driver.cc:3582]1701935 MemQueueEnQueueBuff:[drv api] halQueueEnQueueBuff failed: device_id=0, qid=6, timeout=-1, drvRetCode=6.
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.729.016 [api_c.cc:4213]1701935 rtMemQueueEnQueueBuff:ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.729.050 [error_message_manage.cc:53]1701935 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.729.067 [error_message_manage.cc:53]1701935 FuncErrorReason:rtMemQueueEnQueueBuff execute failed, reason=[driver error:out of memory]
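
The triage rule used above (a driver out-of-memory error in the RUNTIME log outranks the later GetNext timeout) can be sketched as a small log scanner. This is only an illustration written against the log lines quoted in this issue. The regular expression, the function names first_error and diagnose, and the returned labels are assumptions, not part of MindSpore or the Ascend toolchain.

```python
import re

# Matches lines such as:
# [ERROR] RUNTIME(1700982,python):2024-05-04-14:40:57.728.921 [api_c.cc:4213]...
LOG_RE = re.compile(
    r"^\[(?P<level>ERROR|CRITICAL)\]\s+(?P<module>\w+)\([^)]*\):"
    r"(?P<ts>\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}\.\d{3})\s+(?P<msg>.*)$"
)


def parse(lines):
    """Return (timestamp, module, message) for every ERROR/CRITICAL line."""
    records = []
    for line in lines:
        match = LOG_RE.match(line.strip())
        if match:
            records.append((match["ts"], match["module"], match["msg"]))
    return records


def first_error(lines):
    """Return the earliest error record, or None if no line matches.

    The timestamps are fixed-width, so string order equals time order.
    """
    records = parse(lines)
    return min(records) if records else None


def diagnose(lines):
    """Label a log: driver OOM takes priority over the GetNext timeout."""
    records = parse(lines)
    if any(mod == "RUNTIME" and "out of memory" in msg for _, mod, msg in records):
        return "driver_oom"
    if any("Getnext gets peek data" in msg for _, _, msg in records):
        return "getnext_timeout"
    return "unknown"
```

Run on a combined log, diagnose reports the driver error even when the GetNext failure is also present, which matches the analysis in this comment.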

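Following Dr. Guo's suggestion, the workaround is to call ms.set_context(max_device_memory="xxGB") to reduce the device memory used by the network, so that the data processing side has enough device memory to feed data to the device normally.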

wenli is helping to verify this.

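CCB conclusion (Guo Qi, Guo Zhijian, Lei Wei): alexnet, googlenet, and other networks all have this problem. The following options are being considered:
1. Work around it in the test cases by setting ms.set_context(max_device_memory="xxGB").
2. The 2.3 630 release already has a planned requirement, dynamic allocation of virtual memory, that resolves this problem (ask Mingqi for the requirement ticket number).
3. Adapt the modelzoo script for 910B; this issue is transferred to Zhao Ting to handle, followed by regression.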

@leiwei2 On 910B3 the run passes with the limit set to 56 GB.
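
A minimal sketch of applying the workaround with the value reported here, assuming the cap is set once at the top of the training script before the network is built. The helper names format_max_device_memory and cap_network_memory are hypothetical. Only ms.set_context(max_device_memory=...) comes from this thread, and the right value for other devices or networks is not known from this issue.

```python
def format_max_device_memory(gigabytes):
    """Return the "xxGB" string form used for max_device_memory."""
    if isinstance(gigabytes, bool) or not isinstance(gigabytes, (int, float)):
        raise TypeError("gigabytes must be a number")
    if gigabytes <= 0:
        raise ValueError("gigabytes must be positive")
    # Print whole numbers without a trailing ".0" (56.0 -> "56GB").
    if float(gigabytes).is_integer():
        return f"{int(gigabytes)}GB"
    return f"{gigabytes}GB"


def cap_network_memory(gigabytes):
    """Cap the device memory used by the network so the data queue keeps some.

    MindSpore is imported here so the formatting helper works without it.
    """
    import mindspore as ms

    ms.set_context(max_device_memory=format_max_device_memory(gigabytes))


# Example for the 910B3 case in this issue (requires MindSpore on the host):
# cap_network_memory(56)
```

Keeping the string formatting separate from the set_context call makes the value easy to check in a test case without touching the device.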

Open item: consider a more user-friendly default configuration for the 910A/910B scenarios. --Guo Qi / Wang Yucheng

Root cause

On 910B, data processing requires more device memory.

Resolution

The network script has been modified per the CCB conclusion; PR: https://gitee.com/mindspore/models/pulls/5407

Regression version:
commit_id = '[sha1]:39ac2284,[branch]:(HEAD,origin/master,origin/HEAD,master)'
runpkg_version:Milan_C17/20240414
Regression steps: follow the reproduction steps in this issue
Basic functionality: the test run passes
INFO 2024-05-05 14:38:49 - test_ms_mi_profiler_pynative_googlenet_1p_on_condition_abnormal_0001 - base.py:teardown:140 - The base teardown is running

== 1 passed, 4 warnings in 182.38s (0:03:02) ==

Test conclusion: the ticket does not follow the process; the rca/, rct/, and ctl/ labels are missing. Returned.

Corrected.

[ST][MS][Profiler-enabled scenario][910B3 1p]RuntimeError: Getnext gets peek data from data queue failed: 5
