Error when using an nn operator provided on the MindSpore official site; requesting help with analysis

TODO
Bug-Report
Created on 2024-05-20 15:20

Describe the current behavior (Mandatory):

When using nn.Conv3d, the following error is returned:
[ERROR] GE_ADPT(2061,ffff951bd020,python):2024-05-20-12:04:21.733.064 [mindspore/ccsrc/transform/graph_ir/graph_runner.cc:441] CompileGraph] Call GE CompileGraph Failed, ret is: 1343225857
Traceback (most recent call last):
File "/home/docker/code/SolarWind/train_gaussian.py", line 168, in
model.train(
File "/home/docker/miniconda3/envs/ms-2.2.11/lib/python3.9/site-packages/mindspore/train/model.py", line 1068, in train
self._train(epoch,
File "/home/docker/miniconda3/envs/ms-2.2.11/lib/python3.9/site-packages/mindspore/train/model.py", line 114, in wrapper
func(self, *args, **kwargs)
File "/home/docker/miniconda3/envs/ms-2.2.11/lib/python3.9/site-packages/mindspore/train/model.py", line 617, in _train
self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
File "/home/docker/miniconda3/envs/ms-2.2.11/lib/python3.9/site-packages/mindspore/train/model.py", line 919, in _train_process
outputs = self._train_network(*next_element)
File "/home/docker/miniconda3/envs/ms-2.2.11/lib/python3.9/site-packages/mindspore/nn/cell.py", line 680, in call
out = self.compile_and_run(*args, **kwargs)
File "/home/docker/miniconda3/envs/ms-2.2.11/lib/python3.9/site-packages/mindspore/nn/cell.py", line 1020, in compile_and_run
self.compile(*args, **kwargs)
File "/home/docker/miniconda3/envs/ms-2.2.11/lib/python3.9/site-packages/mindspore/nn/cell.py", line 997, in compile
_cell_graph_executor.compile(self, phase=self.phase,
File "/home/docker/miniconda3/envs/ms-2.2.11/lib/python3.9/site-packages/mindspore/common/api.py", line 1547, in compile
result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Compile graph kernel_graph_4 failed.


  • Ascend Error Message:

E60108: In op[MatMulV2], [The supported format_out list is ['ND', 'NC1HWC0', 'FRACTAL_NZ'], while the current format_out is NCDHW.]
TraceBack (most recent call last):
Failed to compile Op [recompute_Default/network-WithLossCell/_backbone-SwinTransformer3D/enc_layers-CellList/0-EncoderStage/enc_blocks-CellList/0-Block/mlp-MLP/fc2-Dense/MatMul-op256,[recompute_Default/network-WithLossCell/_backbone-SwinTransformer3D/enc_layers-CellList/0-EncoderStage/enc_blocks-CellList/0-Block/mlp-MLP/fc2-Dense/Cast-op3319,recompute_Default/network-WithLossCell/_backbone-SwinTransformer3D/enc_layers-CellList/0-EncoderStage/enc_blocks-CellList/0-Block/mlp-MLP/fc2-Dense/Cast-op3319,recompute_Default/network-WithLossCell/_backbone-SwinTransformer3D/enc_layers-CellList/0-EncoderStage/enc_blocks-CellList/0-Block/mlp-MLP/fc2-Dense/BiasAdd-op259,recompute_Default/network-WithLossCell/_backbone-SwinTransformer3D/enc_layers-CellList/0-EncoderStage/enc_blocks-CellList/0-Block/mlp-MLP/fc2-Dense/MatMul-op256]]. (oppath: [Compile /usr/local/Ascend/ascend-toolkit/7.0.1/opp/built-in/op_impl/ai_core/tbe/impl/mat_mul.py failed with errormsg/stack: File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/utils/errormgr/error_manager_util.py", line 69, in raise_runtime_error_cube
raise RuntimeError(args_dict, *msgs)
RuntimeError: ({'errCode': 'E60108', 'op_name': 'MatMulV2', 'reason': "The supported format_out list is ['ND', 'NC1HWC0', 'FRACTAL_NZ'], while the current format_out is NCDHW."}, "In op[MatMulV2], [The supported format_out list is ['ND', 'NC1HWC0', 'FRACTAL_NZ'], while the current format_out is NCDHW.]")
], optype: [MatMulV2])
[SubGraphOpt][Compile][ProcFailedCompTask] Thread[281442461184288] recompile single op[recompute_Default/network-WithLossCell/_backbone-SwinTransformer3D/enc_layers-CellList/0-EncoderStage/enc_blocks-CellList/0-Block/mlp-MLP/fc2-Dense/MatMul-op256] failed[FUNC:ProcessAllFailedCompileTasks][FILE:tbe_op_store_adapter.cc][LINE:954]
[SubGraphOpt][Compile][ParalCompOp] Thread[281442461184288] process fail task failed[FUNC:ParallelCompileOp][FILE:tbe_op_store_adapter.cc][LINE:1001]
[SubGraphOpt][Compile][CompOpOnly] CompileOp failed.[FUNC:CompileOpOnly][FILE:op_compiler.cc][LINE:1127]
[GraphOpt][FusedGraph][RunCompile] Failed to compile graph with compiler Normal mode Op Compiler[FUNC:SubGraphCompile][FILE:fe_graph_optimizer.cc][LINE:1292]
Call OptimizeFusedGraph failed, ret:-1, engine_name:AIcoreEngine, graph_name:partition3_rank60_new_sub_graph521[FUNC:OptimizeSubGraph][FILE:graph_optimize.cc][LINE:131]
Failed to compile Op [recompute_Default/network-WithLossCell/_backbone-SwinTransformer3D/enc_layers-CellList/0-EncoderStage/enc_blocks-CellList/0-Block/mlp-MLP/fc2-Dense/MatMul-op286,[recompute_Default/network-WithLossCell/_backbone-SwinTransformer3D/enc_layers-CellList/0-EncoderStage/enc_blocks-CellList/0-Block/mlp-MLP/fc2-Dense/Cast-op3317,recompute_Default/network-WithLossCell/_backbone-SwinTransformer3D/enc_layers-CellList/0-EncoderStage/enc_blocks-CellList/0-Block/mlp-MLP/fc2-Dense/Cast-op3317,recompute_Default/network-WithLossCell/_backbone-SwinTransformer3D/enc_layers-CellList/0-EncoderStage/enc_blocks-CellList/0-Block/mlp-MLP/fc2-Dense/BiasAdd-op289,recompute_Default/network-WithLossCell/_backbone-SwinTransformer3D/enc_layers-CellList/0-EncoderStage/enc_blocks-CellList/0-Block/mlp-MLP/fc2-Dense/MatMul-op286]]. (oppath: [Compile /usr/local/Ascend/ascend-toolkit/7.0.1/opp/built-in/op_impl/ai_core/tbe/impl/mat_mul.py failed with errormsg/stack: File "/usr/local/Ascend/ascend-toolkit/latest/python/site-packages/tbe/common/utils/errormgr/error_manager_util.py", line 69, in raise_runtime_error_cube
raise RuntimeError(args_dict, *msgs)
Based on the information above, I tested the network layer by layer and block by block, and eventually narrowed the error down to the Conv3d provided by the nn module; the error above appears whenever it is used. Please help analyze this behavior (a minimal isolation sketch follows below).
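As a starting point for reproduction, here is a minimal sketch that isolates the nn.Conv3d call; the batch size, channel counts, kernel size, and input shape are assumptions for illustration only and are not taken from the actual training script.

import numpy as np
import mindspore as ms
from mindspore import nn, Tensor

ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")  # assumption: same mode/backend as the report

# Assumed toy NCDHW input of shape (1, 4, 8, 64, 64).
conv3d = nn.Conv3d(in_channels=4, out_channels=96,
                   kernel_size=(2, 4, 4), stride=(2, 4, 4), pad_mode="same")
x = Tensor(np.random.randn(1, 4, 8, 64, 64).astype(np.float32))
y = conv3d(x)
print(y.shape)  # expected: (1, 96, 4, 16, 16)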

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Atlas800 9000T a2

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx): 2.2.11
    -- Python version (e.g., Python 3.7.5): Python 3.9.18
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04 LTS

  • Execute Mode (Mandatory) (PyNative/Graph):

GRAPH_MODE

Related testcase (Mandatory):

# Imports added so the snippet is self-contained.
import math

from mindspore import nn
from mindspore import ops as P


class PatchEmbed(nn.Cell):
  def __init__(
    self,
    in_channels: int,
    patch_size: tuple[int, int, int],
    patch_norm: bool,
    embed_dim: int,
    pre_patch_embed_paddings: tuple[int, int, int],
    patches_resolution: tuple[int, int, int],
  ) -> None:
    super(PatchEmbed, self).__init__()

    self.embed_dim = embed_dim
    self.num_patches = math.prod(patches_resolution)

    self.dhw_paddings: list[tuple[int, int]] = []
    for padding in pre_patch_embed_paddings:
      ahead = math.floor(padding / 2)
      behind = padding - ahead
      self.dhw_paddings.append((ahead, behind))
    self.pre_patch_embed_pad = P.Pad(
      paddings=(((0, 0), (0, 0)) + tuple(self.dhw_paddings))
    )
    self.proj = nn.Conv3d(
      in_channels=in_channels,
      out_channels=embed_dim,
      kernel_size=patch_size,
      stride=patch_size,  # type: ignore
      pad_mode="same",
    )

    self.norm = nn.LayerNorm((embed_dim,)) if patch_norm else nn.Identity()

    self.reshape = P.Reshape()
    self.transpose = P.Transpose()

  def construct(self, x):  # type: ignore
    """construct function.

    Args:
        x (Tensor): shape = (batch_size, in_channels, *input_size)

    Returns:
        Tensor: (batch_size, num_patches, embed_dim)
    """
    # last x shape = (batch_size, in_channels, *input_size)
    embed_dim = self.embed_dim
    num_patches = self.num_patches
    x = self.pre_patch_embed_pad(x)
    # last x shape = (batch_size, in_channels, *pre_patch_embed_resolution)
    x = self.proj(x)
    # last x shape = (batch_size, embed_dim, *patches_resolution)
    batch_size = x.shape[0]  # type: ignore
    x = self.reshape(x, (batch_size, embed_dim, num_patches))
    # last x shape = (batch_size, embed_dim, num_patches)
    x = self.transpose(x, (0, 2, 1))
    # last x shape = (batch_size, num_patches, embed_dim)
    x = self.norm(x)
    # last x shape = (batch_size, num_patches, embed_dim)
    return x
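For context, a hedged usage sketch of the PatchEmbed cell above; every concrete value (channels, patch size, embed_dim, paddings, input shape) is an assumption chosen for illustration and is not taken from the actual model configuration:

import numpy as np
import mindspore as ms
from mindspore import Tensor

ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")  # assumption: same mode/backend as the report

patch_embed = PatchEmbed(
  in_channels=4,                      # assumed input channels
  patch_size=(2, 4, 4),               # assumed (D, H, W) patch size
  patch_norm=True,
  embed_dim=96,
  pre_patch_embed_paddings=(0, 0, 0),
  patches_resolution=(4, 16, 16),     # 8/2, 64/4, 64/4 for the assumed input below
)
x = Tensor(np.random.randn(1, 4, 8, 64, 64).astype(np.float32))  # (N, C, D, H, W)
out = patch_embed(x)
print(out.shape)  # expected: (1, 1024, 96) = (batch_size, num_patches, embed_dim)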

Comments (3)

baifeng Yao created the Bug-Report

Please assign a maintainer to check this issue.
@fangwenyi @chengxiaoli @Shawny

Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials
  2. If you are an experienced PyTorch user, you may need:
  1. If you run into a PyNative (dynamic graph) issue, you can set set_context(pynative_synchronize=True) to get the error stack and help locate the problem (see the sketch after this list)
  2. For model accuracy tuning issues, refer to the tuning guide on the official site
  3. If you are reporting a framework bug, please make sure the issue includes the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, an official link to the training code, and how to launch code that reproduces the error
  4. If you have already found the root cause, you are welcome to submit a PR to the MindSpore open-source community; we will review it as soon as possible
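A minimal sketch of the debugging tip mentioned in item 1 above; the pynative_synchronize flag comes from that comment, while the device target is an assumption for illustration:

import mindspore as ms

# Run in PyNative mode with synchronous kernel execution so the Python
# stack trace points at the operator that actually failed.
ms.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend",
               pynative_synchronize=True)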
Shawny set the assignee to hedongdong
Shawny set the associated project to MindSpore Issue Assistant
Shawny set the planned start date to 2024-05-27
Shawny set the planned due date to 2024-06-27
Shawny added the mindspore-assistant label
Shawny added the sig/ops label

Hello, since there has been no reply on this issue, we will close it later. If you still have questions, please provide the specific details and change the issue status to WIP, and we will continue to follow up. Thank you.
