name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
Swin-Transformer model fails to train on the ImageNet-1k dataset
Hardware environment (Ascend/GPU/CPU): Ascend 910, device Ascend
https://www.hiascend.com/zh/software/modelzoo/models/detail/C/b7a78a7b9aa8956ae90c8700fa59fa0e/1
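1. Follow the instructions at this link to train: https://www.hiascend.com/zh/software/modelzoo/models/detail/C/b7a78a7b9aa8956ae90c8700fa59fa0e/1
2. Training on a small 5-class dataset succeeds.
3. Training on the ImageNet-1k dataset fails with the log below.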
[ERROR] MD(129997,fffebbfff1e0,python):2023-01-01-23:06:33.491.920 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:217] InterruptMaster] Task is terminated with err msg(more detail in info level log):Unexpected error. Invalid file found: ../dataset/imagenet/train/n01440764/.ipynb_checkpoints, should be file, but got directory.
Line of code : 275
File : /home/jenkins/agent-working-dir/workspace/Compile_Ascend_ARM_CentOS/mindspore/mindspore/ccsrc/minddata/dataset/core/tensor.cc
[CRITICAL] GE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.323.724 [mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/runtime_model.cc:242] Run] Call rt api rtStreamSynchronize failed, ret: 507011
[WARNING] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.327.791 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:694] GetDumpPath] The environment variable 'MS_OM_PATH' is not set, the files of node dump will save to the process local path, as ./rank_id/node_dump/...
[ERROR] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.327.929 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:712] DumpTaskExceptionInfo] Task fail infos task_id: 2, stream_id: 30, tid: 130797, device_id: 0, retcode: 507011 ( model execute failed)
[ERROR] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.346.249 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:721] DumpTaskExceptionInfo] Dump node (Default/GetNext-op483) task error input/output data to: ./rank_0/node_dump
The function call stack:
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/dataset_helper.py(95)/ outputs = self.get_next()/
[WARNING] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.346.300 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:728] DumpTaskExceptionInfo] GetNext error may be caused by slow data processing (bigger than 20s / batch) or transfer data to device error.
[WARNING] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.346.314 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:730] DumpTaskExceptionInfo] Suggestion:
[WARNING] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.346.326 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:731] DumpTaskExceptionInfo] 1) Set the parameter dataset_sink_mode=False of model.train(...) or model.eval(...) and try again.
[WARNING] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.346.337 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:733] DumpTaskExceptionInfo] 2) Reduce the batch_size in data processing and try again.
[WARNING] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.346.350 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:734] DumpTaskExceptionInfo] 3) You can create iterator by interface create_dict_iterator() of dataset class to independently verify the performance of data processing without training. Refer to the link for data processing optimization suggestions: https://www.mindspore.cn/docs/programming_guide/zh-CN/r1.6/optimize_data_processing.html
[ERROR] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.702.445 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_context.cc:647] LaunchGraph] run task error!
[ERROR] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.702.583 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_context.cc:660] ReportErrorMessage] Ascend error occurred, error message:
E39999: Inner Error!
E39999 Aicpu kernel execute failed, device_id=0, stream_id=30, task_id=2.[FUNC:PrintAicpuErrorInfo][FILE:task.cc][LINE:733]
Aicpu kernel execute failed, device_id=0, stream_id=30, task_id=2, fault op_name=[FUNC:GetError][FILE:stream.cc][LINE:741]
rtStreamSynchronize execute failed, reason=[the model stream execute failed][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:45]
[CRITICAL] RUNTIME_FRAMEWORK(129997,ffff9d88a780,python):2023-01-01-23:07:38.703.012 [mindspore/ccsrc/runtime/graph_scheduler/graph_scheduler.cc:619] Run] Launch graph failed, graph id: 13
Traceback (most recent call last):
File "../train.py", line 93, in <module>
main()
File "../train.py", line 84, in main
dataset_sink_mode=True)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 906, in train
sink_size=sink_size)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 87, in wrapper
func(self, *args, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 548, in _train
self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 628, in _train_dataset_sink_process
outputs = train_network(*inputs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 586, in __call__
out = self.compile_and_run(*args)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 989, in compile_and_run
return _cell_graph_executor(self, *new_inputs, phase=self.phase)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1085, in __call__
return self.run(obj, *args, phase=phase)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1110, in run
return self._exec_pip(obj, *args, phase=phase_real)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 90, in wrapper
results = fn(*arg, **kwargs)
File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1092, in _exec_pip
return self._graph_executor(args, phase)
RuntimeError: mindspore/ccsrc/runtime/graph_scheduler/graph_scheduler.cc:619 Run] Launch graph failed, graph id: 13
[WARNING] MD(129997,ffff9d88a780,python):2023-01-01-23:07:38.718.099 [mindspore/ccsrc/minddata/dataset/engine/datasetops/device_queue_op.cc:73] ~DeviceQueueOp] preprocess_batch: 6097; batch_queue: 16, 16, 16, 16, 16, 16, 16, 16, 16, 16; push_start_time: 2023-01-01-23:06:30.500.976, 2023-01-01-23:06:30.850.740, 2023-01-01-23:06:31.198.416, 2023-01-01-23:06:31.550.675, 2023-01-01-23:06:31.897.591, 2023-01-01-23:06:32.250.443, 2023-01-01-23:06:32.601.576, 2023-01-01-23:06:32.952.194, 2023-01-01-23:06:33.300.537, 2023-01-01-23:06:33.651.053; push_end_time: 2023-01-01-23:06:30.841.669, 2023-01-01-23:06:31.191.669, 2023-01-01-23:06:31.541.634, 2023-01-01-23:06:31.891.079, 2023-01-01-23:06:32.241.735, 2023-01-01-23:06:32.592.227, 2023-01-01-23:06:32.942.152, 2023-01-01-23:06:33.291.636, 2023-01-01-23:06:33.642.452, 2023-01-01-23:06:33.991.755.
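The first error in the log above reports that `../dataset/imagenet/train/n01440764/.ipynb_checkpoints` is a directory where the data loader expected a file, and the later GetNext failure is plausibly a consequence of that. The following is a minimal sketch for checking the dataset tree, assuming the `root/<class>/<image>` layout implied by that path. The helper names `find_nested_dirs` and `remove_checkpoint_dirs` are hypothetical and are not part of MindSpore.

```python
import shutil
from pathlib import Path


def find_nested_dirs(dataset_root):
    """Return directories sitting inside class folders (root/<class>/<dir>).

    Assumes an image-folder layout where each class folder should
    contain only image files.
    """
    root = Path(dataset_root)
    nested = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for entry in sorted(class_dir.iterdir()):
            if entry.is_dir():
                nested.append(entry)
    return nested


def remove_checkpoint_dirs(dataset_root, dry_run=True):
    """Find every .ipynb_checkpoints directory below dataset_root.

    With dry_run=True (the default) nothing is deleted; the matching
    paths are only returned so they can be reviewed first.
    """
    root = Path(dataset_root)
    found = sorted(p for p in root.rglob(".ipynb_checkpoints") if p.is_dir())
    if not dry_run:
        for path in found:
            # A parent directory may already have been removed.
            if path.exists():
                shutil.rmtree(path)
    return found


# Example usage (path taken from the error message in the log):
#   print(find_nested_dirs("../dataset/imagenet/train"))
#   print(remove_checkpoint_dirs("../dataset/imagenet/train", dry_run=True))
```

Running with `dry_run=True` first shows what would be deleted; the `.ipynb_checkpoints` folders are typically created by Jupyter when a directory is browsed or edited from a notebook, so they may reappear if the dataset is opened that way again.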
//mindspore-assistant
Please assign a maintainer to check this issue.
@fangwenyi @chengxiaoli
Please add labels (comp or sig). Labeled Pull Requests are routed directly to the responsible owner, so they get reviewed sooner. More labels are listed at https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md
For example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
You can additionally mark the type of this PR, for example as a bug fix or a feature request:
//kind/bug or //kind/feature
You now know how to add labels with commands, so go ahead and add them in the comments below.