
MindSpore / models


[Bug]: [910B][MS] ResNet50 training: the profiler collects performance data successfully, but analysis fails

DONE
Created on 2023-08-14 21:38

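Problem description

[910B][MS] When training the ResNet50 model, the profiler collects performance data successfully, but the analysis step fails.

Environment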

  • Hardware Environment (Ascend/GPU/CPU):
    Ascend Training Solution 23.0.RC3.B012
    CANN 6.3.RC3.B030
    Ascend HDK 23.0.RC2.2.B030

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 2.0.0): MindSpore 2.1.0.B130
    -- Python version (e.g., Python 3.7.5): Python 3.7.5
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04): openEuler 22.03
    -- GCC/Compiler version (if compiled from source): 10.3.1

Related test case

Train_MS_Resnet50_Perf_010

Steps to reproduce

1. Modify train.py to add profiler = ms.Profiler(output_path='./profiler_data') and profiler.analyse().
2. Start training: bash run_distribute_train.sh [RANK_TABLE_FILE] [DATASET_PATH] [CONFIG_PATH]
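The change in step 1 can be sketched as follows. This is a minimal illustration, not the actual train.py from the models repository: the helper `run_with_profiler` and its parameters `train_fn` and `profiler_cls` are hypothetical names introduced here, and only `ms.Profiler(output_path=...)` and `profiler.analyse()` come from the issue itself.

```python
def run_with_profiler(train_fn, output_path="./profiler_data", profiler_cls=None):
    """Run train_fn with profiling enabled, then analyse the collected data.

    train_fn and profiler_cls are hypothetical parameters for this sketch.
    """
    if profiler_cls is None:
        # Imported lazily so the helper can be exercised without MindSpore installed.
        import mindspore as ms
        profiler_cls = ms.Profiler
    # Create the profiler before training so that collection covers the run.
    profiler = profiler_cls(output_path=output_path)
    train_fn()
    # Parse the collected data; this is the call that raised in this issue.
    profiler.analyse()
    return profiler
```

In train.py this would correspond to creating the profiler before the training call and invoking analyse() after it returns.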

Expected result

The profiler collects data normally and the analysis completes normally.

Logs / screenshots

2023-08-14 19:34:56,479:INFO:epoch: [1/1] loss: 6.905463, epoch time: 246.382 s, per step time: 24638.249 ms
2023-08-14 19:34:57,568:INFO:If run eval and enable_cache Remember to shut down the cache server via "cache_admin --stop"
Mon 14 Aug 2023 19:37:28 [INFO] [MSVP] [51469] msprof_common.py: Start analyzing data in "/home/ywx1249490/profiler/profiler/PROF_000001_20230814193039475_FJFCBNFLBOOKKNRA/host" ...
Mon 14 Aug 2023 19:37:28 [INFO] [MSVP] [51469] msprof_common.py: It may take few minutes, please be patient ...
Mon 14 Aug 2023 19:37:34 [INFO] [MSVP] [51469] msprof_common.py: Analysis data in "/home/ywx1249490/profiler/profiler/PROF_000001_20230814193039475_FJFCBNFLBOOKKNRA/host" finished.
Mon 14 Aug 2023 19:37:34 [INFO] [MSVP] [51469] msprof_common.py: Start analyzing data in "/home/ywx1249490/profiler/profiler/PROF_000001_20230814193039475_FJFCBNFLBOOKKNRA/device_0" ...
Mon 14 Aug 2023 19:37:34 [INFO] [MSVP] [51469] msprof_common.py: It may take few minutes, please be patient ...
Mon 14 Aug 2023 19:37:35 [INFO] [MSVP] [51469] msprof_common.py: Analysis data in "/home/ywx1249490/profiler/profiler/PROF_000001_20230814193039475_FJFCBNFLBOOKKNRA/device_0" finished.
Traceback (most recent call last):
  File "train.py", line 238, in <module>
    train_net()
  File "/home/ywx1249490/models/official/cv/ResNet/scripts/train_parallel0/src/model_utils/moxing_adapter.py", line 104, in wrapped_func
    run_func(*args, **kwargs)
  File "train.py", line 234, in train_net
    profiler.analyse()
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 579, in analyse
    self._ascend_analyse()
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 970, in _ascend_analyse
    self._ascend_graph_analyse()
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 1194, in _ascend_graph_analyse
    op_summary, op_statistic, steptrace = _ascend_graph_msprof_analyse(source_path)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 277, in _ascend_graph_msprof_analyse
    df_op_summary, df_op_statistic, df_step_trace = msprof_analyser.parse()
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/profiler/parser/ascend_msprof_generator.py", line 101, in parse
    self._read_steptrace()
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/profiler/parser/ascend_msprof_generator.py", line 181, in _read_steptrace
    self.steptrace = np.array(steptrace, dtype=steptrace_dt)
ValueError: could not assign tuple of length 15 to structure with 51 fields.
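The final ValueError comes from NumPy: a row tuple is being assigned to a structured dtype that has a different number of fields. The same class of failure can be reproduced in isolation with the sketch below. It assumes a hypothetical three-field dtype standing in for the 51-field one; the field names and the helper `build_steptrace` are illustrative and are not taken from MindSpore.

```python
import numpy as np

# Hypothetical three-field schema standing in for the 51-field steptrace dtype.
steptrace_dt = np.dtype([("iteration_id", "i8"), ("start_us", "f8"), ("end_us", "f8")])


def build_steptrace(rows):
    """Return (array, None) on success, or (None, message) on a field-count mismatch."""
    try:
        return np.array(rows, dtype=steptrace_dt), None
    except ValueError as err:
        return None, str(err)


# A row with exactly three values matches the schema.
ok_array, ok_err = build_steptrace([(1, 0.0, 12.5)])
# A row with only two values is one short of the schema and is rejected.
bad_array, bad_err = build_steptrace([(1, 0.0)])
print(ok_err, bad_err)
```

In the issue, each parsed row carried 15 values while the dtype expected 51, which suggests that the CSV layout and the parser's fixed schema had diverged.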

Remarks

Investigated by: 籍家荣

Comments (4)

yangliu created the task
yangliu added the kind/bug label

Please assign a maintainer to check this issue.
@fangwenyi @chengxiaoli @Shawny

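Thank you for your feedback. You can comment //mindspore-assistant to get help faster; for more labels, see the label list.

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
    Typical differences from PyTorch / PyTorch and MindSpore API mapping table
  3. If you hit a dynamic graph issue, you can set mindspore.set_context(pynative_synchronize=True) to view the error stack and help locate the problem.
  4. For model accuracy tuning, see the tuning guide on the official website.
  5. If you are reporting a framework bug, please confirm that the issue provides the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and the command used to launch the code that reproduces the error.
  6. If you have already identified the root cause, you are welcome to submit a PR to the MindSpore open-source community; we will review it as soon as possible.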

Root cause

The step_trace.csv file reported by msprof has a complex format, and the MindSpore parser needs to be adapted to it.
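One general way to make a CSV parser tolerant of layout changes is to derive the field list from the file's own header row instead of a fixed schema. The sketch below illustrates that idea only; it is not the actual fix in ascend_msprof_generator.py, and the function `read_step_trace` and the column names in the sample are hypothetical.

```python
import csv
import io


def read_step_trace(csv_text):
    """Parse CSV text into (header, rows), with rows as dicts keyed by the header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = []
    for line_no, values in enumerate(reader, start=2):
        # Report a mismatch with its line number instead of failing later
        # with an opaque field-count error.
        if len(values) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(values)}"
            )
        rows.append(dict(zip(header, values)))
    return header, rows


# Hypothetical sample; the real step_trace.csv columns are not shown in this issue.
sample = "Iteration ID,FP Start,BP End\n1,100,250\n2,300,480\n"
header, rows = read_step_trace(sample)
print(header, rows)
```

Because the schema comes from the header, a file with more or fewer columns still parses, and a malformed row is reported with its line number.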

Solution

This issue has been resolved. The latest version of https://gitee.com/mindspore/mindspore/blob/master/mindspore/python/mindspore/profiler/parser/ascend_msprof_generator.py
can parse the file without raising an error.

Collection and analysis both succeeded (two screenshots were attached to the original comment).

yangliu changed the task status from TODO to DONE
