[MS][NET][pinns][pynative][gpu 1p]network train failed

DONE
Bug-Report
Created: 2021-09-27 20:20

Environment

  • Hardware Environment(Ascend/GPU/CPU):

Uncomment only one /device <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/device gpu

  • Software Environment:
    -- MindSpore version (source or binary):commit_id:58619b2bb
    -- Python version (e.g., Python 3.7.5):
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

Related testcase

test_ms_pinns_navier_stokes_pynative_train_check_fps.py

Steps to reproduce the issue

  1. get code from models
  2. python train.py

Describe the current behavior

The network fails to run in PyNative mode.

Describe the expected behavior

The network trains successfully.

Related log / screenshot

Traceback (most recent call last):
  File "train.py", line 52, in <module>
    train_navier(**conf)
  File "/data/zjc/workspace/solution_test/remaining/test_scripts/mindspore/net/pinns/pinns_navier_stokes_pynative/src/NavierStokes/train_ns.py", line 108, in train_navier
    callbacks=[LossMonitor(loss_print_num), ckpoint, TimeMonitor(1), eval_cb], dataset_sink_mode=True)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 718, in train
    sink_size=sink_size)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 502, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 564, in _train_dataset_sink_process
    outputs = self._train_network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
    output = self.run_construct(cast_inputs, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 79, in construct
    return self.network(*outputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
    output = self.run_construct(cast_inputs, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py", line 353, in construct
    loss = self.network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
    output = self.run_construct(cast_inputs, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py", line 110, in construct
    out = self._backbone(data)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
    output = self.run_construct(cast_inputs, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/data/zjc/workspace/solution_test/remaining/test_scripts/mindspore/net/pinns/pinns_navier_stokes_pynative/src/NavierStokes/net.py", line 224, in construct
    d_p = self.dp(x, y, t)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 433, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 430, in __call__
    output = self.run_construct(cast_inputs, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 352, in run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/data/zjc/workspace/solution_test/remaining/test_scripts/mindspore/net/pinns/pinns_navier_stokes_pynative/src/NavierStokes/net.py", line 138, in construct
    return self.grad(self.net)(x, y, t, (sens_1, sens_2))
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 77, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 376, in after_grad
    self._pynative_forward_run(grad_, args, kwargs, fn)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 355, in _pynative_forward_run
    fn(*args, **new_kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 440, in __call__
    _pynative_executor.end_graph(self, output, *inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 376, in end_graph
    self._executor.end_graph(obj, output, *args, *(kwargs.values()))
RuntimeError: _Map_base::at

Special notes for this issue

The PINNs network fails to run in PyNative mode in a GPU environment.

Comments (3)

zhongjicheng created the Bug-Report
zhongjicheng set the assignee to chujinjin
zhongjicheng set the associated repository to MindSpore/models
zhongjicheng set the planned start date to 2021-09-27
zhongjicheng set the planned due date to 2021-09-30
zhongjicheng set the milestone to B-SIG-ModelZoo
zhongjicheng set the priority to Major
zhongjicheng added the attr/function label
zhongjicheng added the stage/func-debug label
zhongjicheng added the sig/modelzoo label
zhongjicheng added the kind/bug label
zhongjicheng added collaborator anzhengqi
xiangjiawei007 edited the description
chujinjin added collaborator chujinjin
chujinjin changed the assignee from chujinjin to JoyLvliang

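The second-order gradient logic still has a problem when caching is in effect. When the same network runs a first-order pass and then a second-order pass, the cache left by the first-order pass interferes with the second-order logic.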

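chujinjin added the 1.5.1 (deleted) label
JoyLvliang changed the task status from TODO to WIP
JoyLvliang removed the v1.5.1 (deleted) label
chujinjin added the v1.5.1 label
JoyLvliang added collaborator JoyLvliang
JoyLvliang changed the assignee from JoyLvliang to zjun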

#Appearance & Root Cause
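The PyNative cache has a problem with third-order and higher gradients. In the second step, the third-order computation retrieved the second-order cache entry, which corrupted execution.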

#Fix Solution
Add a distinguishing identifier to the cache of each gradient order so that the correct entry is retrieved.
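
As a rough illustration of the idea (not MindSpore's actual cache code, which lives in the PyNative executor and is not shown in this issue), the toy sketch below contrasts a cache keyed only by the cell with one keyed by the cell and the gradient order. The names `GradCache`, `get_or_build`, and `run_step` are hypothetical and exist only for this example.

```python
# Hypothetical illustration only; this is NOT MindSpore's implementation.
# It shows why a cache shared across gradient orders can return the wrong
# entry, and how adding the order to the key avoids the collision.


class GradCache:
    """Toy cache that stores one 'gradient graph' string per key."""

    def __init__(self, key_includes_order):
        self.key_includes_order = key_includes_order
        self._store = {}

    def _key(self, cell_id, order):
        # The fix described above corresponds to including the order here.
        return (cell_id, order) if self.key_includes_order else cell_id

    def get_or_build(self, cell_id, order):
        key = self._key(cell_id, order)
        if key not in self._store:
            # Stand-in for building the gradient graph of this order.
            self._store[key] = f"grad-graph(order={order})"
        return self._store[key]


def run_step(cache, cell_id, max_order):
    """Request the gradient graph for orders 1..max_order of one cell."""
    return [cache.get_or_build(cell_id, order)
            for order in range(1, max_order + 1)]


buggy = GradCache(key_includes_order=False)
fixed = GradCache(key_includes_order=True)

buggy_result = run_step(buggy, "net", 3)
fixed_result = run_step(fixed, "net", 3)

print(buggy_result)  # every order reuses the first cached entry
print(fixed_result)  # each order gets its own entry
```

In the toy version, the unkeyed cache hands the first-order entry to every higher order, while the keyed cache keeps the orders separate. The real failure differs in detail (it appeared on the second step and surfaced as a failed map lookup), but the collision mechanism is the same kind of problem.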

Related PR
https://e.gitee.com/mind_spore/repos/mindspore/mindspore/pulls/26072

zjun added collaborator zjun
zjun changed the assignee from zjun to zhongjicheng
zjun changed the task status from WIP to VALIDATION

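Regression version: 2021-11-8 daily build
Build time: 2021-11-8
Regression steps: follow the reproduction steps in the issue
Basic functionality: the issue is resolved
Test conclusion: regression passed
Regression tester: zhongjicheng
Regression date: 2021-11-11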

zhongjicheng changed the task status from VALIDATION to DONE
