MindSpore / models

Swin-Transformer training on the ImageNet-1k dataset fails

REJECTED
Bug-Report
Created on 2023-01-03 19:56

Describe the current behavior (Mandatory)

Training the Swin-Transformer model on the ImageNet-1k dataset fails.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU): Ascend 910

/device ascend

  • Software Environment (Mandatory):
    -- MindSpore version: 1.7.0
    -- Python version: 3.7.5
    -- OS platform and distribution: ModelArts Notebook
    -- Image: tensorflow1.15-mindspore1.7.0-cann5.1.0-euler2.8-aarch64
    -- Flavor: 8 × Ascend 910, 192 CPU cores, 720 GiB memory

Related testcase (Mandatory)

https://www.hiascend.com/zh/software/modelzoo/models/detail/C/b7a78a7b9aa8956ae90c8700fa59fa0e/1

Steps to reproduce the issue (Mandatory)

1. Follow the instructions at this link to train: https://www.hiascend.com/zh/software/modelzoo/models/detail/C/b7a78a7b9aa8956ae90c8700fa59fa0e/1
2. Training on a small 5-class dataset succeeds.
3. Training on the ImageNet-1k dataset fails.

Describe the expected behavior (Mandatory)

Related log / screenshot (Mandatory)

[ERROR] MD(129997,fffebbfff1e0,python):2023-01-01-23:06:33.491.920 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:217] InterruptMaster] Task is terminated with err msg(more detail in info level log):Unexpected error. Invalid file found: ../dataset/imagenet/train/n01440764/.ipynb_checkpoints, should be file, but got directory.
Line of code : 275
File         : /home/jenkins/agent-working-dir/workspace/Compile_Ascend_ARM_CentOS/mindspore/mindspore/ccsrc/minddata/dataset/core/tensor.cc

[CRITICAL] GE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.323.724 [mindspore/ccsrc/plugin/device/ascend/hal/device/ge_runtime/runtime_model.cc:242] Run] Call rt api rtStreamSynchronize failed, ret: 507011
[WARNING] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.327.791 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:694] GetDumpPath] The environment variable 'MS_OM_PATH' is not set, the files of node dump will save to the process local path, as ./rank_id/node_dump/...
[ERROR] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.327.929 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:712] DumpTaskExceptionInfo] Task fail infos task_id: 2, stream_id: 30, tid: 130797, device_id: 0, retcode: 507011 ( model execute failed)
[ERROR] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.346.249 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:721] DumpTaskExceptionInfo] Dump node (Default/GetNext-op483) task error input/output data to: ./rank_0/node_dump
The function call stack:
In file /home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/dataset_helper.py(95)/        outputs = self.get_next()/

[WARNING] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.346.300 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:728] DumpTaskExceptionInfo] GetNext error may be caused by slow data processing (bigger than 20s / batch) or transfer data to device error.
[WARNING] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.346.314 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:730] DumpTaskExceptionInfo] Suggestion: 
[WARNING] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.346.326 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:731] DumpTaskExceptionInfo]     1) Set the parameter dataset_sink_mode=False of model.train(...) or model.eval(...) and try again.
[WARNING] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.346.337 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:733] DumpTaskExceptionInfo]     2) Reduce the batch_size in data processing and try again.
[WARNING] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.346.350 [mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:734] DumpTaskExceptionInfo]     3) You can create iterator by interface create_dict_iterator() of dataset class to independently verify the performance of data processing without training. Refer to the link for data processing optimization suggestions: https://www.mindspore.cn/docs/programming_guide/zh-CN/r1.6/optimize_data_processing.html

[ERROR] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.702.445 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_context.cc:647] LaunchGraph] run task error!
[ERROR] DEVICE(129997,ffff477fe1e0,python):2023-01-01-23:07:36.702.583 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_device_context.cc:660] ReportErrorMessage] Ascend error occurred, error message:
E39999: Inner Error!
E39999  Aicpu kernel execute failed, device_id=0, stream_id=30, task_id=2.[FUNC:PrintAicpuErrorInfo][FILE:task.cc][LINE:733]
        Aicpu kernel execute failed, device_id=0, stream_id=30, task_id=2, fault op_name=[FUNC:GetError][FILE:stream.cc][LINE:741]
        rtStreamSynchronize execute failed, reason=[the model stream execute failed][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:45]

[CRITICAL] RUNTIME_FRAMEWORK(129997,ffff9d88a780,python):2023-01-01-23:07:38.703.012 [mindspore/ccsrc/runtime/graph_scheduler/graph_scheduler.cc:619] Run] Launch graph failed, graph id: 13
Traceback (most recent call last):
  File "../train.py", line 93, in <module>
    main()
  File "../train.py", line 84, in main
    dataset_sink_mode=True)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 906, in train
    sink_size=sink_size)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 87, in wrapper
    func(self, *args, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 548, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/train/model.py", line 628, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 586, in __call__
    out = self.compile_and_run(*args)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/nn/cell.py", line 989, in compile_and_run
    return _cell_graph_executor(self, *new_inputs, phase=self.phase)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1085, in __call__
    return self.run(obj, *args, phase=phase)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1110, in run
    return self._exec_pip(obj, *args, phase=phase_real)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 90, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/anaconda3/envs/MindSpore/lib/python3.7/site-packages/mindspore/common/api.py", line 1092, in _exec_pip
    return self._graph_executor(args, phase)
RuntimeError: mindspore/ccsrc/runtime/graph_scheduler/graph_scheduler.cc:619 Run] Launch graph failed, graph id: 13
[WARNING] MD(129997,ffff9d88a780,python):2023-01-01-23:07:38.718.099 [mindspore/ccsrc/minddata/dataset/engine/datasetops/device_queue_op.cc:73] ~DeviceQueueOp] preprocess_batch: 6097; batch_queue: 16, 16, 16, 16, 16, 16, 16, 16, 16, 16; push_start_time: 2023-01-01-23:06:30.500.976, 2023-01-01-23:06:30.850.740, 2023-01-01-23:06:31.198.416, 2023-01-01-23:06:31.550.675, 2023-01-01-23:06:31.897.591, 2023-01-01-23:06:32.250.443, 2023-01-01-23:06:32.601.576, 2023-01-01-23:06:32.952.194, 2023-01-01-23:06:33.300.537, 2023-01-01-23:06:33.651.053; push_end_time: 2023-01-01-23:06:30.841.669, 2023-01-01-23:06:31.191.669, 2023-01-01-23:06:31.541.634, 2023-01-01-23:06:31.891.079, 2023-01-01-23:06:32.241.735, 2023-01-01-23:06:32.592.227, 2023-01-01-23:06:32.942.152, 2023-01-01-23:06:33.291.636, 2023-01-01-23:06:33.642.452, 2023-01-01-23:06:33.991.755.
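
The first error in the log above reports that `../dataset/imagenet/train/n01440764/.ipynb_checkpoints` is a directory where a file was expected. Directories with this name are normally left behind by Jupyter, which fits the ModelArts Notebook environment, although the log itself does not say how it got there. Below is a minimal sketch for locating, and optionally deleting, such directories before training. It assumes the layout `<train dir>/<class dir>/<image files>`. The helper names `find_stray_dirs` and `remove_stray_dirs` are hypothetical and are not part of MindSpore.

```python
import shutil
from pathlib import Path


def find_stray_dirs(dataset_root):
    """Return directories nested inside class folders, sorted by path.

    Assumes the layout <dataset_root>/<class_name>/<image files>, so any
    directory found directly inside a class folder is unexpected.
    """
    root = Path(dataset_root)
    stray = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        stray.extend(sorted(p for p in class_dir.iterdir() if p.is_dir()))
    return stray


def remove_stray_dirs(dataset_root, dry_run=True):
    """List stray directories and delete them unless dry_run is True."""
    stray = find_stray_dirs(dataset_root)
    if not dry_run:
        for path in stray:
            shutil.rmtree(path)  # recursively deletes the stray directory
    return stray
```

Calling `remove_stray_dirs("../dataset/imagenet/train")` with the default `dry_run=True` only lists the offending paths. Passing `dry_run=False` deletes them, so it is worth reviewing the dry-run output first. Whether this alone resolves the later `GetNext` failure is not confirmed by the issue.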

//mindspore-assistant

Comments (3)

crazy_apple created the Bug-Report

Please assign maintainer to check this issue.
@fangwenyi @chengxiaoli

Please add labels (comp or sig), also you can visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
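To get the code reviewed as soon as possible, please add a component (comp) or special interest group (sig) label to the Pull Request. Labeled PRs are pushed directly to the responsible person for review.
Taking a component-related code submission as an example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
In addition, you can mark the type of this PR, for example as a bugfix or a feature request:
//kind/bug or //kind/feature
Congratulations, you have now learned how to add labels with commands. Go ahead and add labels in the comments below!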

fangwenyi added the mindspore-assistant label
fangwenyi changed the task status from TODO to ACCEPTED
fangwenyi changed the task status from ACCEPTED to REJECTED
fangwenyi set the assignee to fangwenyi
fangwenyi set the associated project to MindSpore Issue Assistant
