name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
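Fine-tuning the baichuan2_13b network on a 910B1 environment fails with an out-of-memory error; the same run succeeds on a B3 environment.
Model repository: https://gitee.com/mindspore/mindformers/blob/dev/research/baichuan2/finetune_baichuan2_13b.yaml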
Hardware environment (Ascend/GPU/CPU): Ascend
/device ascend
CANN version: Milan_C18/20240517
MindSpore version: master_108dc9c05
MindFormers version: dev_5b4a973b2425d
Execution mode (PyNative/Graph):
/mode pynative
/mode graph
Test case repository:
MindFormers_Test/cases/baichuan2/13b/train/
Test case:
test_mf_baichuan2_13b_train_belle_8p_0001/
Expected result: the network trains successfully.
Traceback (most recent call last):
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/tools/cloud_adapter/cloud_monitor.py", line 34, in wrapper
result = run_func(*args, **kwargs)
File "/home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/baichuan2/13b/train/test_mf_baichuan2_13b_train_belle_16p_0001/research/./baichuan2/run_baichuan2.py", line 139, in main
build_context(config)
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 47, in build_context
local_rank, device_num = init_context(use_parallel=config.use_parallel,
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 122, in init_context
raise RuntimeError("Notice: if you are trying to run with a single device, please set "
RuntimeError: Notice: if you are trying to run with a single device, please set use_parallel=False. If not, please check the error message above.
Traceback (most recent call last):
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindformers/core/context/build_context.py", line 120, in init_context
init()
File "/home/miniconda3/envs/large_model_39/lib/python3.9/site-packages/mindspore/communication/management.py", line 188, in init
init_hccl()
RuntimeError: Ascend kernel runtime initialization failed. The details refer to 'Ascend Error Message'.
----------------------------------------------------
- Framework Error Message:
----------------------------------------------------
Malloc device memory failed, size[64424509440], ret[207001], Device 7 Available HBM size:65464696832 free size:65157582848 may be other processes occupying this card, check as: ps -ef|grep python
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
EL0004: 2024-05-22-22:47:53.613.358 Failed to allocate memory.
Possible Cause: Available memory is insufficient.
Solution: Close applications not in use.
TraceBack (most recent call last):
rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:357 Init
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_memory_adapter.cc:293 MallocFromRts
Please route this issue to 张森镇.
Please assign maintainer to check this issue.
@zhangjie18
Root cause: max_device_memory was set too high, leaving insufficient device memory for HCCL.
Fix: reduce max_device_memory from 60GB to 59GB.
Fixed; see PR: https://gitee.com/mindspore/mindformers/pulls/3151
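The effect of the fix can be checked with a little arithmetic on the numbers in the framework error message. The sketch below is only an illustration: `parse_sizes` and `headroom_mib` are hypothetical helper names, and it assumes that whatever memory is not claimed by max_device_memory is what remains for HCCL and other runtime allocations. The issue does not state how much HCCL actually needs.

```python
import re

GIB = 1024 ** 3
MIB = 1024 ** 2

# Framework error line copied from the log in this issue.
LOG_LINE = (
    "Malloc device memory failed, size[64424509440], ret[207001], "
    "Device 7 Available HBM size:65464696832 free size:65157582848"
)


def parse_sizes(line):
    """Return (requested_bytes, free_bytes) from the malloc failure line."""
    requested = int(re.search(r"size\[(\d+)\]", line).group(1))
    free = int(re.search(r"free size:(\d+)", line).group(1))
    return requested, free


def headroom_mib(free_bytes, max_device_memory_gib):
    """MiB left on the device after reserving max_device_memory (assumed model)."""
    return (free_bytes - max_device_memory_gib * GIB) / MIB


requested, free = parse_sizes(LOG_LINE)

# The failed request is exactly the configured 60GB pool, counted in GiB.
print("requested GiB:", requested / GIB)

# Compare the memory left over under the old and new settings.
for setting_gib in (60, 59):
    print(f"max_device_memory={setting_gib}GB leaves "
          f"{headroom_mib(free, setting_gib):.0f} MiB")
```

Under these assumptions, the 60GB setting leaves roughly 699 MiB of the reported free memory, while 59GB leaves roughly 1723 MiB, which is consistent with the stated root cause.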