MindSpore / mindspore

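[ST][MS][Large-cluster special topic] ERROR logs are printed in the single-card simulated compilation log of the mixtral network on 910B; compilation and training are not affected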

DONE
Bug-Report
Created on 2024-05-17 11:29
name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

ERROR logs are printed in the single-card simulated compilation log of the mixtral network on 910B; compilation and training are not affected.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) : master_20240510061514_c6a1400a90
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative
/mode graph

Related testcase (Mandatory)

mixtral_ascend910b_mixtral_8x7b_4096_64_False_8_000

Steps to reproduce the issue (Mandatory)

  1. Get the code from mindformers.
  2. Set the parameter values:
    RANK_ID: 13
    RANK_SIZE: 64
    batch_size: 1
    compile_cache: true
    data_parallel: 2
    device_id: 2
    enable_parallel_optimizer: true
    expert_num: 8
    expert_parallel: 2
    fine_grain_interleave: false
    micro_batch_interleave_num: 1
    micro_batch_num: 184
    mode: 8x7b
    model_parallel: 8
    param_init_type: bfloat16
    pipeline_stage: 4
    seq_length: 4096
    use_seq_parallel: true
  3. Set ENABLE_CELL_REUSE=1.
  4. Start network training.
  5. Verify that the network compiles successfully and check whether the log contains any errors.
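
Before reproducing, it may help to sanity-check the parallel layout. The sketch below is not part of the original report; it assumes the common convention that data_parallel × model_parallel × pipeline_stage should equal the number of devices (RANK_SIZE), and the helper names are hypothetical.

```python
# Values copied from the reproduction parameters listed above.
params = {
    "RANK_ID": 13,
    "RANK_SIZE": 64,
    "data_parallel": 2,
    "model_parallel": 8,
    "pipeline_stage": 4,
}


def expected_device_count(p):
    """Device count implied by the parallel strategy (assumed convention)."""
    return p["data_parallel"] * p["model_parallel"] * p["pipeline_stage"]


def check_layout(p):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if expected_device_count(p) != p["RANK_SIZE"]:
        problems.append("dp * mp * pp does not match RANK_SIZE")
    if not 0 <= p["RANK_ID"] < p["RANK_SIZE"]:
        problems.append("RANK_ID is outside [0, RANK_SIZE)")
    return problems


print(expected_device_count(params))  # 2 * 8 * 4 = 64
print(check_layout(params))           # [] for the values in this issue
```

For the values in this issue the product is 64, which matches RANK_SIZE, so the ERROR logs are unlikely to come from an inconsistent parallel layout.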

Describe the expected behavior (Mandatory)

The network's simulated compilation succeeds and the log contains no ERROR entries.

Related log / screenshot (Mandatory)

[ERROR] RUNTIME(530932,python):2024-05-10-19:15:16.464.367 [npu_driver.cc:1100]530932 DevMemAllocHugePageManaged:[drv api] halMemAlloc failed:size=209715200(Byte), type=16, moduleId=3, drvFlag=216172782147486722, drvRetCode=6, device_id=2!
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:16.508.810 [npu_driver.cc:1165]530932 DevMemAllocManaged:[drv api] halMemAlloc failed:size=209715200(Byte), type=16, moduleId=3, drvFlag=216172782147355650, drvRetCode=6, device_id=2!
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:16.508.852 [logger.cc:575]530932 DevMalloc:Device malloc failed, size=209715200(Byte), type=16.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:16.508.894 [api_c.cc:1148]530932 rtMalloc:ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:16.508.917 [error_message_manage.cc:53]530932 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:16.508.933 [error_message_manage.cc:53]530932 FuncErrorReason:rtMalloc execute failed, reason=[driver error:out of memory]
[ERROR] HCCL(530932,python):2024-05-10-19:15:16.509.077 [adapter_rts.cc:525][530932][Malloc][Mem]errNo[0x000000000500000f] rtMalloc failed, return[207001], para: devPtrAddr[(nil)], size[209715200].
[ERROR] HCCL(530932,python):2024-05-10-19:15:16.509.092 [mem_device.cc:45][530932][DeviceMem][Alloc]rt_malloc error, ret[15], size[209715200]
[ERROR] HCCL(530932,python):2024-05-10-19:15:16.509.104 [ccl_buffer_manager.cc:36][530932][CCLBufferManager][CreateCCLbuffer]Create ccl buffer size[209715200] fail,please check environmental variable HCCL_BUFFSIZE.
[ERROR] HCCL(530932,python):2024-05-10-19:15:16.509.115 [ccl_buffer_manager.cc:73][530932]call trace: hcclRet -> 2
[ERROR] HCCL(530932,python):2024-05-10-19:15:16.509.128 [hccl_comm.cc:100][530932]call trace: hcclRet -> 2
[ERROR] HCCL(530932,python):2024-05-10-19:15:16.509.140 [op_base.cc:467][530932][Init][CommRootInfo]errNo[0x0000000005000002] hcclComm init error
[ERROR] HCCL(530932,python):2024-05-10-19:15:16.516.745 [op_base.cc:481][530932][HCCL_TRACE]HcclCommInitRootInfo failed, return[0x0000000005000002], rankNum[1], rank[0], rootInfo identifier[10.50.160.227%enp189s0f0_60002_2_1715339716162725], server[10.50.160.227%enp189s0f0], logicDevId[2]
[ERROR] HCCL(530932,python):2024-05-10-19:15:16.516.956 [op_base.cc:1352][530932][HcclCommDestroy] comm is not exist, comm=0xaaab11c3afe0, group=10.50.160.227%enp189s0f0_60002_2_1715339716162725, deviceLogicId=2
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:16.590.024 [npu_driver.cc:1100]530932 DevMemAllocHugePageManaged:[drv api] halMemAlloc failed:size=2097152(Byte), type=2, moduleId=7, drvFlag=504403158265644034, drvRetCode=6, device_id=2!
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:16.594.793 [engine.cc:4203]530932 ProcLogicCqReport:Task run failed, device_id=2, stream_id=7, task_id=0, sqe_type=3(place holder), errType=0x20(sq sw status error), sqSwStatus=0x4
[ERROR] DRV(530932,python):2024-05-10-19:15:26.716.417 [ascend][curpid: 530932, 534538][drv][queuemng][QueueIoctl 170]Ioctl failed. (cmd=40685102; error=0; ret=16)
[ERROR] DRV(530932,python):2024-05-10-19:15:26.716.498 [ascend][curpid: 530932, 534538][drv][queuemng][QueueSubmitEventSync 905]Ioctl failed. (cmd=1080578306; ret=16; event_id=28; gid=20; tid=0; timeout=5000ms; subevent_id=73).
[ERROR] DRV(530932,python):2024-05-10-19:15:26.716.513 [ascend][curpid: 530932, 534538][drv][queuemng][QueueSendQueueEventSyncTimeout 1091]Submit event failed. (ret=16; devId=2; qid=6)
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:26.716.587 [npu_driver.cc:3582]534538 MemQueueEnQueueBuff:report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:26.716.598 [npu_driver.cc:3582]534538 MemQueueEnQueueBuff:[drv api] halQueueEnQueueBuff failed: device_id=2, qid=6, timeout=-1, drvRetCode=16.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:26.716.713 [api_c.cc:4213]534538 rtMemQueueEnQueueBuff:ErrCode=507012, desc=[report timeout], InnerCode=0x711000c
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:26.716.725 [error_message_manage.cc:53]534538 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:26.716.738 [error_message_manage.cc:53]534538 FuncErrorReason:rtMemQueueEnQueueBuff execute failed, reason=[report timeout]
[ERROR] ASCENDCL(530932,python):2024-05-10-19:15:26.716.779 [tensor_data_transfer.cpp:534]534538 acltdtSendTensorV2: Fail to execute acltdtSendTensor, device is 2, name is 0608d89c-0ebe-11ef-b8af-b04fa64774d7
[ERROR] MD(530932,fffbe77eefa0,python):2024-05-10-19:15:26.717.730 [mindspore/ccsrc/minddata/dataset/util/task_manager.cc:227] InterruptMaster] MindSpore dataset is terminated with err msg: Exception thrown from dataset pipeline. Refer to 'Dataset Pipeline Error Message'. Tdt Send data failed. The details refer to 'Ascend Error Message'.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.590.440 [task_info.cc:324]530932 DoCompleteSuccess:device_id=2, retCode=0x4, [illegal param].
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.590.512 [task_info.cc:297]530932 PrintErrorInfoCommon:Task execute failed, device_id=2, stream_id=7, task_id=0, flip_num=0, task_type=87.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.590.613 [stream.cc:1509]530932 GetError:Stream Synchronize failed, stream_id=7, retCode=0x4, [illegal param].
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.590.625 [stream.cc:1512]530932 GetError:report error module_type=7, module_name=EE9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.590.632 [stream.cc:1512]530932 GetError:Task execute failed, device_id=2, stream_id=7, task_id=0, flip_num=0, task_type=87.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.590.698 [api_impl.cc:4685]530932 SyncGetDevMsg:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.590.708 [api_impl.cc:4685]530932 SyncGetDevMsg:Failed to synchronize stream, retCode=0x7150004.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.591.174 [api_impl.cc:4704]530932 GetDevErrMsg:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.591.188 [api_impl.cc:4704]530932 GetDevErrMsg:Sync get device msg failed, retCode=0x7150004.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.591.224 [api_impl.cc:4748]530932 GetDevMsg:Failed to GetDeviceErrMsg, retCode=0x7150004.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.591.234 [logger.cc:1564]530932 GetDevMsg:GetDeviceMsg failed, getMsgType=0.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.591.261 [api_c.cc:4090]530932 rtGetDevMsg:ErrCode=507001, desc=[tsfw param illegal], InnerCode=0x7150004
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.591.270 [error_message_manage.cc:53]530932 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.591.281 [error_message_manage.cc:53]530932 FuncErrorReason:rtGetDevMsg execute failed, reason=[tsfw param illegal]
[ERROR] DEVICE(530932,ffffb7c41020,python):2024-05-10-19:15:27.591.470 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_communication_group.cc:63] Initialize] HcclCommInitRootInfo failed. #umsg#Ascend Error Message:#umsg#EL0004: 2024-05-10-19:15:16.464.157 Failed to allocate memory.
[ERROR] DISTRIBUTED(530932,ffffb7c41020,python):2024-05-10-19:15:27.591.556 [mindspore/ccsrc/distributed/collective/collective_manager.cc:279] CreateCommunicationGroup] Failed to create comm group on device side for 16-10571707653870470303
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.890.914 [npu_driver.cc:1100]530932 DevMemAllocHugePageManaged:[drv api] halMemAlloc failed:size=209715200(Byte), type=16, moduleId=3, drvFlag=216172782147486722, drvRetCode=6, device_id=2!
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.928.079 [npu_driver.cc:1165]530932 DevMemAllocManaged:[drv api] halMemAlloc failed:size=209715200(Byte), type=16, moduleId=3, drvFlag=216172782147355650, drvRetCode=6, device_id=2!
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.928.161 [logger.cc:575]530932 DevMalloc:Device malloc failed, size=209715200(Byte), type=16.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.928.203 [api_c.cc:1148]530932 rtMalloc:ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.928.215 [error_message_manage.cc:53]530932 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:27.928.228 [error_message_manage.cc:53]530932 FuncErrorReason:rtMalloc execute failed, reason=[driver error:out of memory]
[ERROR] HCCL(530932,python):2024-05-10-19:15:27.928.309 [adapter_rts.cc:525][530932][Malloc][Mem]errNo[0x000000000500000f] rtMalloc failed, return[207001], para: devPtrAddr[(nil)], size[209715200].
[ERROR] HCCL(530932,python):2024-05-10-19:15:27.928.321 [mem_device.cc:45][530932][DeviceMem][Alloc]rt_malloc error, ret[15], size[209715200]
[ERROR] HCCL(530932,python):2024-05-10-19:15:27.928.331 [ccl_buffer_manager.cc:36][530932][CCLBufferManager][CreateCCLbuffer]Create ccl buffer size[209715200] fail,please check environmental variable HCCL_BUFFSIZE.
[ERROR] HCCL(530932,python):2024-05-10-19:15:27.928.340 [ccl_buffer_manager.cc:73][530932]call trace: hcclRet -> 2
[ERROR] HCCL(530932,python):2024-05-10-19:15:27.928.349 [hccl_comm.cc:100][530932]call trace: hcclRet -> 2
[ERROR] HCCL(530932,python):2024-05-10-19:15:27.928.358 [op_base.cc:467][530932][Init][CommRootInfo]errNo[0x0000000005000002] hcclComm init error
[ERROR] HCCL(530932,python):2024-05-10-19:15:27.937.628 [op_base.cc:481][530932][HCCL_TRACE]HcclCommInitRootInfo failed, return[0x0000000005000002], rankNum[1], rank[0], rootInfo identifier[10.50.160.227%enp189s0f0_60002_2_1715339727592653], server[10.50.160.227%enp189s0f0], logicDevId[2]
[ERROR] HCCL(530932,python):2024-05-10-19:15:27.937.685 [op_base.cc:1352][530932][HcclCommDestroy] comm is not exist, comm=0xaaab11c3afe0, group=10.50.160.227%enp189s0f0_60002_2_1715339727592653, deviceLogicId=2
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:28.010.005 [npu_driver.cc:1100]530932 DevMemAllocHugePageManaged:[drv api] halMemAlloc failed:size=2097152(Byte), type=2, moduleId=7, drvFlag=504403158265644034, drvRetCode=6, device_id=2!
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:28.010.887 [engine.cc:4203]530932 ProcLogicCqReport:Task run failed, device_id=2, stream_id=8, task_id=0, sqe_type=3(place holder), errType=0x20(sq sw status error), sqSwStatus=0x4
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.371.885 [task_info.cc:324]530932 DoCompleteSuccess:device_id=2, retCode=0x4, [illegal param].
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.371.969 [task_info.cc:297]530932 PrintErrorInfoCommon:Task execute failed, device_id=2, stream_id=8, task_id=0, flip_num=0, task_type=87.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.372.103 [stream.cc:1509]530932 GetError:Stream Synchronize failed, stream_id=8, retCode=0x4, [illegal param].
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.372.115 [stream.cc:1512]530932 GetError:report error module_type=7, module_name=EE9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.372.125 [stream.cc:1512]530932 GetError:Task execute failed, device_id=2, stream_id=8, task_id=0, flip_num=0, task_type=87.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.372.259 [api_impl.cc:4685]530932 SyncGetDevMsg:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.372.272 [api_impl.cc:4685]530932 SyncGetDevMsg:Failed to synchronize stream, retCode=0x7150004.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.372.737 [api_impl.cc:4704]530932 GetDevErrMsg:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.372.751 [api_impl.cc:4704]530932 GetDevErrMsg:Sync get device msg failed, retCode=0x7150004.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.372.787 [api_impl.cc:4748]530932 GetDevMsg:Failed to GetDeviceErrMsg, retCode=0x7150004.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.372.798 [logger.cc:1564]530932 GetDevMsg:GetDeviceMsg failed, getMsgType=0.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.372.829 [api_c.cc:4090]530932 rtGetDevMsg:ErrCode=507001, desc=[tsfw param illegal], InnerCode=0x7150004
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.372.838 [error_message_manage.cc:53]530932 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.372.850 [error_message_manage.cc:53]530932 FuncErrorReason:rtGetDevMsg execute failed, reason=[tsfw param illegal]
[ERROR] DEVICE(530932,ffffb7c41020,python):2024-05-10-19:15:38.372.986 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_communication_group.cc:63] Initialize] HcclCommInitRootInfo failed. #umsg#Ascend Error Message:#umsg#EL0004: 2024-05-10-19:15:27.890.856 Failed to allocate memory.
[ERROR] DISTRIBUTED(530932,ffffb7c41020,python):2024-05-10-19:15:38.373.027 [mindspore/ccsrc/distributed/collective/collective_manager.cc:279] CreateCommunicationGroup] Failed to create comm group on device side for 8-16980809411888771835
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.670.872 [npu_driver.cc:1100]530932 DevMemAllocHugePageManaged:[drv api] halMemAlloc failed:size=209715200(Byte), type=16, moduleId=3, drvFlag=216172782147486722, drvRetCode=6, device_id=2!
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.708.481 [npu_driver.cc:1165]530932 DevMemAllocManaged:[drv api] halMemAlloc failed:size=209715200(Byte), type=16, moduleId=3, drvFlag=216172782147355650, drvRetCode=6, device_id=2!
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.708.507 [logger.cc:575]530932 DevMalloc:Device malloc failed, size=209715200(Byte), type=16.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.708.534 [api_c.cc:1148]530932 rtMalloc:ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.708.545 [error_message_manage.cc:53]530932 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.708.558 [error_message_manage.cc:53]530932 FuncErrorReason:rtMalloc execute failed, reason=[driver error:out of memory]
[ERROR] HCCL(530932,python):2024-05-10-19:15:38.708.637 [adapter_rts.cc:525][530932][Malloc][Mem]errNo[0x000000000500000f] rtMalloc failed, return[207001], para: devPtrAddr[(nil)], size[209715200].
[ERROR] HCCL(530932,python):2024-05-10-19:15:38.708.649 [mem_device.cc:45][530932][DeviceMem][Alloc]rt_malloc error, ret[15], size[209715200]
[ERROR] HCCL(530932,python):2024-05-10-19:15:38.708.676 [ccl_buffer_manager.cc:36][530932][CCLBufferManager][CreateCCLbuffer]Create ccl buffer size[209715200] fail,please check environmental variable HCCL_BUFFSIZE.
[ERROR] HCCL(530932,python):2024-05-10-19:15:38.708.683 [ccl_buffer_manager.cc:73][530932]call trace: hcclRet -> 2
[ERROR] HCCL(530932,python):2024-05-10-19:15:38.708.692 [hccl_comm.cc:100][530932]call trace: hcclRet -> 2
[ERROR] HCCL(530932,python):2024-05-10-19:15:38.708.701 [op_base.cc:467][530932][Init][CommRootInfo]errNo[0x0000000005000002] hcclComm init error
[ERROR] HCCL(530932,python):2024-05-10-19:15:38.717.651 [op_base.cc:481][530932][HCCL_TRACE]HcclCommInitRootInfo failed, return[0x0000000005000002], rankNum[1], rank[0], rootInfo identifier[10.50.160.227%enp189s0f0_60002_2_1715339738374070], server[10.50.160.227%enp189s0f0], logicDevId[2]
[ERROR] HCCL(530932,python):2024-05-10-19:15:38.717.699 [op_base.cc:1352][530932][HcclCommDestroy] comm is not exist, comm=0xaaab11c3afe0, group=10.50.160.227%enp189s0f0_60002_2_1715339738374070, deviceLogicId=2
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.789.888 [npu_driver.cc:1100]530932 DevMemAllocHugePageManaged:[drv api] halMemAlloc failed:size=2097152(Byte), type=2, moduleId=7, drvFlag=504403158265644034, drvRetCode=6, device_id=2!
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:38.790.641 [engine.cc:4203]530932 ProcLogicCqReport:Task run failed, device_id=2, stream_id=9, task_id=0, sqe_type=3(place holder), errType=0x20(sq sw status error), sqSwStatus=0x4
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.405.870 [task_info.cc:324]530932 DoCompleteSuccess:device_id=2, retCode=0x4, [illegal param].
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.405.953 [task_info.cc:297]530932 PrintErrorInfoCommon:Task execute failed, device_id=2, stream_id=9, task_id=0, flip_num=0, task_type=87.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.406.059 [stream.cc:1509]530932 GetError:Stream Synchronize failed, stream_id=9, retCode=0x4, [illegal param].
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.406.070 [stream.cc:1512]530932 GetError:report error module_type=7, module_name=EE9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.406.079 [stream.cc:1512]530932 GetError:Task execute failed, device_id=2, stream_id=9, task_id=0, flip_num=0, task_type=87.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.406.159 [api_impl.cc:4685]530932 SyncGetDevMsg:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.406.170 [api_impl.cc:4685]530932 SyncGetDevMsg:Failed to synchronize stream, retCode=0x7150004.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.406.651 [api_impl.cc:4704]530932 GetDevErrMsg:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.406.690 [api_impl.cc:4704]530932 GetDevErrMsg:Sync get device msg failed, retCode=0x7150004.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.406.728 [api_impl.cc:4748]530932 GetDevMsg:Failed to GetDeviceErrMsg, retCode=0x7150004.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.406.738 [logger.cc:1564]530932 GetDevMsg:GetDeviceMsg failed, getMsgType=0.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.406.763 [api_c.cc:4090]530932 rtGetDevMsg:ErrCode=507001, desc=[tsfw param illegal], InnerCode=0x7150004
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.406.773 [error_message_manage.cc:53]530932 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.406.784 [error_message_manage.cc:53]530932 FuncErrorReason:rtGetDevMsg execute failed, reason=[tsfw param illegal]
[ERROR] DEVICE(530932,ffffb7c41020,python):2024-05-10-19:15:49.406.923 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_communication_group.cc:63] Initialize] HcclCommInitRootInfo failed. #umsg#Ascend Error Message:#umsg#EL0004: 2024-05-10-19:15:38.670.823 Failed to allocate memory.
[ERROR] DISTRIBUTED(530932,ffffb7c41020,python):2024-05-10-19:15:49.406.974 [mindspore/ccsrc/distributed/collective/collective_manager.cc:279] CreateCommunicationGroup] Failed to create comm group on device side for 2-12028586519724327479
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.706.894 [npu_driver.cc:1100]530932 DevMemAllocHugePageManaged:[drv api] halMemAlloc failed:size=209715200(Byte), type=16, moduleId=3, drvFlag=216172782147486722, drvRetCode=6, device_id=2!
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.744.256 [npu_driver.cc:1165]530932 DevMemAllocManaged:[drv api] halMemAlloc failed:size=209715200(Byte), type=16, moduleId=3, drvFlag=216172782147355650, drvRetCode=6, device_id=2!
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.744.279 [logger.cc:575]530932 DevMalloc:Device malloc failed, size=209715200(Byte), type=16.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.744.299 [api_c.cc:1148]530932 rtMalloc:ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.744.309 [error_message_manage.cc:53]530932 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.744.321 [error_message_manage.cc:53]530932 FuncErrorReason:rtMalloc execute failed, reason=[driver error:out of memory]
[ERROR] HCCL(530932,python):2024-05-10-19:15:49.744.393 [adapter_rts.cc:525][530932][Malloc][Mem]errNo[0x000000000500000f] rtMalloc failed, return[207001], para: devPtrAddr[(nil)], size[209715200].
[ERROR] HCCL(530932,python):2024-05-10-19:15:49.744.405 [mem_device.cc:45][530932][DeviceMem][Alloc]rt_malloc error, ret[15], size[209715200]
[ERROR] HCCL(530932,python):2024-05-10-19:15:49.744.438 [ccl_buffer_manager.cc:36][530932][CCLBufferManager][CreateCCLbuffer]Create ccl buffer size[209715200] fail,please check environmental variable HCCL_BUFFSIZE.
[ERROR] HCCL(530932,python):2024-05-10-19:15:49.744.446 [ccl_buffer_manager.cc:73][530932]call trace: hcclRet -> 2
[ERROR] HCCL(530932,python):2024-05-10-19:15:49.744.453 [hccl_comm.cc:100][530932]call trace: hcclRet -> 2
[ERROR] HCCL(530932,python):2024-05-10-19:15:49.744.463 [op_base.cc:467][530932][Init][CommRootInfo]errNo[0x0000000005000002] hcclComm init error
[ERROR] HCCL(530932,python):2024-05-10-19:15:49.753.594 [op_base.cc:481][530932][HCCL_TRACE]HcclCommInitRootInfo failed, return[0x0000000005000002], rankNum[1], rank[0], rootInfo identifier[10.50.160.227%enp189s0f0_60002_2_1715339749408092], server[10.50.160.227%enp189s0f0], logicDevId[2]
[ERROR] HCCL(530932,python):2024-05-10-19:15:49.753.644 [op_base.cc:1352][530932][HcclCommDestroy] comm is not exist, comm=0xaaab11c3afe0, group=10.50.160.227%enp189s0f0_60002_2_1715339749408092, deviceLogicId=2
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.825.903 [npu_driver.cc:1100]530932 DevMemAllocHugePageManaged:[drv api] halMemAlloc failed:size=2097152(Byte), type=2, moduleId=7, drvFlag=504403158265644034, drvRetCode=6, device_id=2!
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:49.826.720 [engine.cc:4203]530932 ProcLogicCqReport:Task run failed, device_id=2, stream_id=10, task_id=0, sqe_type=3(place holder), errType=0x20(sq sw status error), sqSwStatus=0x4
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.386.249 [task_info.cc:324]530932 DoCompleteSuccess:device_id=2, retCode=0x4, [illegal param].
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.386.349 [task_info.cc:297]530932 PrintErrorInfoCommon:Task execute failed, device_id=2, stream_id=10, task_id=0, flip_num=0, task_type=87.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.386.458 [stream.cc:1509]530932 GetError:Stream Synchronize failed, stream_id=10, retCode=0x4, [illegal param].
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.386.471 [stream.cc:1512]530932 GetError:report error module_type=7, module_name=EE9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.386.478 [stream.cc:1512]530932 GetError:Task execute failed, device_id=2, stream_id=10, task_id=0, flip_num=0, task_type=87.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.386.554 [api_impl.cc:4685]530932 SyncGetDevMsg:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.386.565 [api_impl.cc:4685]530932 SyncGetDevMsg:Failed to synchronize stream, retCode=0x7150004.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.387.027 [api_impl.cc:4704]530932 GetDevErrMsg:report error module_type=0, module_name=EE9999
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.387.065 [api_impl.cc:4704]530932 GetDevErrMsg:Sync get device msg failed, retCode=0x7150004.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.387.103 [api_impl.cc:4748]530932 GetDevMsg:Failed to GetDeviceErrMsg, retCode=0x7150004.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.387.114 [logger.cc:1564]530932 GetDevMsg:GetDeviceMsg failed, getMsgType=0.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.387.138 [api_c.cc:4090]530932 rtGetDevMsg:ErrCode=507001, desc=[tsfw param illegal], InnerCode=0x7150004
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.387.147 [error_message_manage.cc:53]530932 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:59.387.159 [error_message_manage.cc:53]530932 FuncErrorReason:rtGetDevMsg execute failed, reason=[tsfw param illegal]
[ERROR] DEVICE(530932,ffffb7c41020,python):2024-05-10-19:15:59.387.300 [mindspore/ccsrc/plugin/device/ascend/hal/hardware/ascend_communication_group.cc:63] Initialize] HcclCommInitRootInfo failed. #umsg#Ascend Error Message:#umsg#EL0004: 2024-05-10-19:15:49.706.839 Failed to allocate memory.
[ERROR] DISTRIBUTED(530932,ffffb7c41020,python):2024-05-10-19:15:59.387.351 [mindspore/ccsrc/distributed/collective/collective_manager.cc:279] CreateCommunicationGroup] Failed to create comm group on device side for 4-9179217086650072765
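
The log above repeats the same failure pattern four times, once per communication group. As a reading aid that is not part of the original report, the following sketch counts ERROR lines per module and converts the failed `halMemAlloc` sizes to MiB. The three sample lines are copied from the log; the regular expressions and the `summarize` helper are assumptions about the line layout, not an official log parser.

```python
import re
from collections import Counter

# Three lines copied verbatim from the log above.
SAMPLE = """\
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:16.464.367 [npu_driver.cc:1100]530932 DevMemAllocHugePageManaged:[drv api] halMemAlloc failed:size=209715200(Byte), type=16, moduleId=3, drvFlag=216172782147486722, drvRetCode=6, device_id=2!
[ERROR] HCCL(530932,python):2024-05-10-19:15:16.509.104 [ccl_buffer_manager.cc:36][530932][CCLBufferManager][CreateCCLbuffer]Create ccl buffer size[209715200] fail,please check environmental variable HCCL_BUFFSIZE.
[ERROR] RUNTIME(530932,python):2024-05-10-19:15:16.590.024 [npu_driver.cc:1100]530932 DevMemAllocHugePageManaged:[drv api] halMemAlloc failed:size=2097152(Byte), type=2, moduleId=7, drvFlag=504403158265644034, drvRetCode=6, device_id=2!
"""

MODULE_RE = re.compile(r"^\[ERROR\] (\w+)\(")
SIZE_RE = re.compile(r"halMemAlloc failed:size=(\d+)\(Byte\)")


def summarize(log_text):
    """Count ERROR lines per module and failed allocation sizes (in MiB)."""
    modules = Counter()
    sizes_mib = Counter()
    for line in log_text.splitlines():
        module = MODULE_RE.match(line)
        if module is None:
            continue  # not an ERROR line in the expected layout
        modules[module.group(1)] += 1
        size = SIZE_RE.search(line)
        if size is not None:
            sizes_mib[int(size.group(1)) // (1024 * 1024)] += 1
    return modules, sizes_mib


modules, sizes_mib = summarize(SAMPLE)
print(dict(modules))    # {'RUNTIME': 2, 'HCCL': 1}
print(dict(sizes_mib))  # {200: 1, 2: 1}
```

The 209715200-byte request is exactly 200 MiB, and the HCCL message itself points at the HCCL_BUFFSIZE environment variable, so the first failing allocation is the HCCL communication buffer.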

Special notes for this issue (Optional)

Assign to 刘崇鸣.

Comments (4)

baimz created the Bug-Report
baimz added the kind/bug label
baimz added the attr/function label
baimz added the stage/coding label
baimz added the master label
baimz added the sig/parallel label

Please assign a maintainer to check this issue.
@baimz

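Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
  3. If you run into a dynamic-graph (PyNative) problem, you can set set_context(pynative_synchronize=True) to view the error stack and help locate it.
  4. For model accuracy tuning problems, refer to the tuning guide on the official website.
  5. If you are reporting a framework bug, please confirm that the issue provides the necessary information for locating it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and the launch command that reproduces the error.
  6. If you have already identified the root cause, you are welcome to submit a PR to the MindSpore open-source community; we will review it as soon as possible.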
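duanjiali added collaborator duanjiali
duanjiali changed the assignee from duanjiali to 刘崇鸣
fangwenyi added the device/ascend label
刘崇鸣 changed the task status from TODO to WIP
刘崇鸣 changed the task status from WIP to VALIDATION
刘崇鸣 added collaborator 刘崇鸣
刘崇鸣 changed the assignee from 刘崇鸣 to baimz
刘崇鸣 changed the milestone from B-SIG-Parallel to B-SolutionTest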

Appearance & Root Cause
max_device_memory is configured too large, so the remaining device memory cannot satisfy what HCCL needs.

Fix Solution
Reduce max_device_memory; it is recommended to reserve at least 4 GB of device memory.
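
The 4 GB recommendation can be expressed as a simple headroom check. The sketch below is illustrative only: the helper names are hypothetical, and `assumed_total_gb = 64.0` is an assumed device memory size chosen for the example, not a value stated in this issue. Substitute the actual capacity of the device in use.

```python
import re


def parse_gb(value):
    """Parse a setting such as '57GB' into a float number of gigabytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*GB\s*", value)
    if match is None:
        raise ValueError(f"unsupported memory value: {value!r}")
    return float(match.group(1))


def leaves_enough_headroom(max_device_memory, total_gb, reserve_gb=4.0):
    """True if total_gb minus max_device_memory leaves at least reserve_gb."""
    return total_gb - parse_gb(max_device_memory) >= reserve_gb


# Assumption for illustration only: a device with 64 GB of memory.
assumed_total_gb = 64.0
print(leaves_enough_headroom("57GB", assumed_total_gb))  # True: 7 GB remain
print(leaves_enough_headroom("62GB", assumed_total_gb))  # False: only 2 GB remain
```

Under that assumed capacity, the 57GB setting used in the self-test below would leave 7 GB for HCCL buffers and other allocations outside the framework's memory pool.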

Relation PR:

Selftest Result:

(zyq) [root@localhost mixtral_ascend910b_mixtral_8x7b_4096_64_0013_000]# cat /mnt/disk1/zyq/mindspore/build/package/mindspore/.commit_id
__commit_id__ = ''[sha1]:d78e5044,[branch]:(HEAD->master_master,upstream/master)''
(zyq) [root@localhost mixtral_ascend910b_mixtral_8x7b_4096_64_0013_000]# cat run_sim.sh
export ENABLE_CELL_REUSE=1
export HCCL_CONNECT_TIMEOUT=7200
export HCCL_EXEC_TIMEOUT=5400
export MS_ENABLE_NUMA=1
export MS_ENABLE_REF_MODE=1
export MS_SIMULATION_LEVEL=1
export RANK_SIZE=64
export RANK_ID=13

python run_mindformer.py --config=./research/mixtral/finetune_mixtral-8x7b.yaml --use_parallel=True --run_mode="train" > sim_run.log

(zyq) [root@localhost mixtral_ascend910b_mixtral_8x7b_4096_64_0013_000]# cat research/mixtral/finetune_mixtral-8x7b.yaml | grep max_device_memory
  max_device_memory: 57GB

Self-test passed with no errors.

Self-test Report & DT Review
Additional ST/UT needed: No

刘崇鸣 added the rca/others label
刘崇鸣 added the rct/cann label
刘崇鸣 added the ctl/solutiontest label

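Regression version:
master_20240520061517_659b25360be
Regression steps: follow the steps in this issue.
Test conclusion: with max_device_memory set to 56GB, no ERROR messages are printed in the log; regression passed.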

low cond: False, loss_scale: unavailable
2024-05-22 14:40:45,615 - mindformers[mindformers/core/callback/callback.py:319] - INFO - { Epoch:[  1/  1], step:[  576/  812], loss: 0.000, per_step_time: 313ms, lr: 0.0, overflow cond: False, loss_scale: unavailable
2024-05-22 14:40:46,244 - mindformers[mindformers/core/callback/callback.py:319] - INFO - { Epoch:[  1/  1], step:[  578/  812], loss: 0.000, per_step_time: 310ms, lr: 0.0, overflow cond: False, loss_scale: unavailable
$ grep -a 'ERROR' train.log


Regression tester: 白梦真
Regression date: 2024.05.22

baimz changed the task status from VALIDATION to DONE
