[Question]: Running the VGG16 model on MindSpore 2.0 raises a ProfilerFileNotFound exception

REJECTED
Created 2023-07-10 13:25

Please describe your question

0 Problem description

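With MindSpore 2.0.0 on an inference server 800 (9010) with Ascend 910B, the Vgg16 network trains normally (both single-node and distributed). Once the Profiler is enabled, however, a ProfilerFileNotFound exception is raised (the Framework file cannot be found):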
Traceback (most recent call last):
File "train.py", line 252, in <module>
profiler.analyse()
File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 363, in analyse
self._ascend_analyse()
File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 750, in _ascend_analyse
self._ascend_graph_analyse()
File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 935, in _ascend_graph_analyse
source_path)
File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/profiler/profiling.py", line 863, in _ascend_graph_op_analyse
framework_parser.parse()
File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/profiler/parser/framework_parser.py", line 279, in parse
framework_path_dict = self._search_file(self._profiling_path)
File "/usr/local/python3.7.5/lib/python3.7/site-packages/mindspore/profiler/parser/framework_parser.py", line 336, in _search_file
raise ProfilerFileNotFoundException('Framework')
mindspore.profiler.common.exceptions.exceptions.ProfilerFileNotFoundException: [ProfilerFileNotFoundException] code: 50546084, msg: The file not found.

1 Background

1.1 Versions and environment

MindSpore 2.0.0
CANN 6.3.RC1
The firmware (6.3.0.1.241) and driver (23.0.RC1) meet the requirements of the official tutorial.
The environment variables are as follows:
LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe/op_tiling:/usr/local/openmpi-4.0.3/lib:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/python3.7.5/lib:
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:
TOOLCHAIN_HOME=/usr/local/Ascend/ascend-toolkit/latest/toolkit
SSH_CONNECTION=10.108.233.159 61512 10.137.48.111 22
LESSCLOSE=/usr/bin/lesspipe %s %s
LANG=en_US.UTF-8
DISPLAY=localhost:10.0
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/..
XDG_SESSION_ID=1
TBE_IMPL_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe
USER=root
PWD=/root/fyz/demo/vgg16
HOME=/root
SSH_CLIENT=10.108.233.159 61512 22
https_proxy=http://10.137.48.111:3128
XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop
http_proxy=http://10.137.48.111:3128
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ASCEND_PLUGIN_HOME=/usr/local/Ascend/tfplugin/latest
GLOG_v=2
SSH_TTY=/dev/pts/0
MAIL=/var/mail/root
TERM=xterm
SHELL=/bin/bash
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
SHLVL=1
LANGUAGE=en_US:en
PYTHONPATH=/usr/local/Ascend/ascend-toolkit/latest/opp/op_impl/built-in/ai_core/tbe:/usr/local/Ascend/tfplugin/latest/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe:
OMPI_ALLOW_RUN_AS_ROOT=1
LOGNAME=root
XDG_RUNTIME_DIR=/run/user/0
PATH=/usr/local/Ascend/ascend-toolkit/latest/ccec_compiler/bin/:/usr/local/openmpi-4.0.3/bin:/usr/local/Ascend/ascend-toolkit/latest/bin:/usr/local/Ascend/ascend-toolkit/latest/compiler/ccec_compiler/bin:/usr/local/python3.7.5/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/sbin:/usr/local/bin
PS1=[\033[1;31m]\u[\033[1;35m]@[\033[1;34m]\h[\033[1;32m][\w][\033[1;33m]# [\033[0m]
ftp_proxy=http://10.137.48.111:3128
LESSOPEN=| /usr/bin/lesspipe %s
_=/usr/bin/env
OLDPWD=/root/fyz/demo

1.2 Code changes

The Vgg16 program was pulled from the official models repository at https://gitee.com/mindspore/models/tree/master/official/cv/VGG/vgg16#https://gitee.com/link?target=http%3A%2F%2Fwww.cs.toronto.edu%2F~kriz%2Fcifar.html
The following Profiler statements were added to train.py:
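from mindspore import Profiler  # import assumed; not shown in the original report
profiler = Profiler()  # start collecting profiling data
...  # training code, elided in the original report
profiler.analyse()  # parse the collected data once training has finished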

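1.3 Running the program

In both single-node single-card and single-node multi-card runs, checking the Ascend status with the npu-smi info command shows that HBM-Usage is almost full, Power rises and fluctuates, AICore % stays at 0, and the python process is visible on the corresponding Ascend processor.

Training proceeds normally during the run and completes. The error occurs after training has finished.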

1.4 Attempts so far

1.4.1 Running set_env.sh

The problem persists.
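As a complementary check after sourcing set_env.sh, the sketch below reports whether a few path-valued variables are set and point to existing directories. It is a minimal sketch: the variable names are copied from the environment dump in section 1.1, the report does not say which of them the profiler depends on, and the function name check_path_vars is hypothetical.

```python
import os

# Variable names copied from the environment dump in section 1.1; which of
# them the profiler actually relies on is not stated in the report.
ASCEND_PATH_VARS = (
    "ASCEND_TOOLKIT_HOME",
    "ASCEND_HOME_PATH",
    "ASCEND_OPP_PATH",
    "ASCEND_AICPU_PATH",
    "TBE_IMPL_PATH",
)


def check_path_vars(names=ASCEND_PATH_VARS, env=None):
    """Map each variable name to 'unset', 'missing directory', or 'ok'."""
    env = os.environ if env is None else env
    report = {}
    for name in names:
        value = env.get(name)
        if not value:
            # Variable is absent or set to an empty string.
            report[name] = "unset"
        elif not os.path.isdir(value):
            # Variable is set but does not point to an existing directory.
            report[name] = "missing directory"
        else:
            report[name] = "ok"
    return report
```

Calling check_path_vars() with no arguments inspects the current process environment; an "ok" result only shows that the directory exists, not that its contents match the installed CANN version.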

1.4.2 Inspecting the raw log files

No framework-related files were generated. Only two kinds of files were produced: hwts and ts_track.
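The manual inspection above can be repeated with a small script. The sketch below walks a profiling output directory and counts files per category. It assumes that the category names from this report (framework, hwts, ts_track) appear as case-insensitive substrings of the raw file names, which is a guess about the naming scheme rather than documented behaviour; both function names are hypothetical.

```python
from pathlib import Path

# Categories taken from the report. Matching them as substrings of the file
# name is an assumption about how the profiler names its raw files.
CATEGORIES = ("framework", "hwts", "ts_track")


def count_profiler_files(profiling_dir, categories=CATEGORIES):
    """Count files under profiling_dir whose names contain each category."""
    counts = {category: 0 for category in categories}
    for path in Path(profiling_dir).rglob("*"):
        if not path.is_file():
            continue
        name = path.name.lower()
        for category in categories:
            if category in name:
                counts[category] += 1
    return counts


def missing_categories(profiling_dir, categories=CATEGORIES):
    """Return the categories for which no file was found."""
    counts = count_profiler_files(profiling_dir, categories)
    return [category for category in categories if counts[category] == 0]
```

For the situation described here, missing_categories would be expected to return only the framework entry when pointed at the profiler's output directory, whose location is not given in the report.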

How should this be resolved?

Comments (3)

yzfang created the task
yzfang added the mindspore-assistant label

Please assign maintainer to check this issue.
@fangwenyi @chengxiaoli @Shawny

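Please add labels (comp or sig). You can also visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
To get your code reviewed as soon as possible, please add a component (comp) or special interest group (sig) label to your Pull Request. Labeled PRs are pushed directly to the responsible owner for review.
For example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
You can additionally mark the type of the PR, for example as a bugfix or a feature request:
//kind/bug or //kind/feature
Congratulations, you have now learned how to add labels with commands. Go ahead and add labels in the comments below!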

Hello, for 910B requests please go through the requirements channel. The models repository supports 910A machines by default.

Shawny changed the task type from Task to Question
Shawny changed the task status from TODO to REJECTED
