GEMM compute time and result alignment issues

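I. Symptoms (with error log context):
Testing aclnnGemm/aclnnMatmul matrix multiplication across different shapes shows that the elapsed time spikes for some shapes, and that for some shapes the results differ from those computed by torch.
[Image]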

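II. Software versions:
-- CANN version: Ascend-cann-toolkit_8.0.RC1.alpha003
-- Card: 910B2
[Image]
III. Test steps:
Construct inputs of different shapes and call the corresponding aclnn functions to compute the result.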

        // ======== GEMM
        aclScalar* alpha = nullptr;
        float alphaValue = 1.0f;
        alpha = aclCreateScalar(&alphaValue, aclDataType::ACL_FLOAT);
        aclScalar* beta = nullptr;
        float betaValue = 0.0f;
        beta = aclCreateScalar(&betaValue, aclDataType::ACL_FLOAT);

        int64_t transA = 0;
        int64_t transB = 0;

        aclrtSynchronizeStream(stream);
        ret = aclnnGemmGetWorkspaceSize(input1_acl, input2_acl, output_acl, alphaValue, betaValue, transA, transB, output_acl, cubeMathType, &workspaceSize, &executor);
        void* workspaceAddr = nullptr;  // stays null when no workspace is required
        if(workspaceSize>0) workspaceAddr     = ASCEND_MALLOC(workspaceSize);
        ret = aclnnGemm(workspaceAddr, workspaceSize, executor, stream);

        aclrtSynchronizeStream(stream);
        beg = now2ms();

        for(int i = 0; i < 100; i++){
            ret = aclnnGemmGetWorkspaceSize(input1_acl, input2_acl, output_acl, alphaValue, betaValue, transA, transB, output_acl, cubeMathType, &workspaceSize, &executor);
            // The workspace is allocated inside the timed loop and is not freed here.
            void* workspaceAddr = nullptr;
            if(workspaceSize>0) workspaceAddr     = ASCEND_MALLOC(workspaceSize);
            ret = aclnnGemm(workspaceAddr, workspaceSize, executor, stream);
        }


        aclrtSynchronizeStream(stream);
        printf("======= [%d %d %d] ========\n", m,n,k);
        printf("finish [%d %d %d], time: %d\n",m,n,k, now2ms()-beg);

Hello, we handle issues for the samples repository. Could you tell us which sample you are running?

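It follows the aclnn README: I wrote it myself based on the official aclnn GEMM documentation. The inputs are generated with torch on the CPU; the program then reads them and calls aclnnGemmGetWorkspaceSize.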

It is based on this document: https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/700alpha003/infacldevg/aclcppdevg/aclcppdevg_000063.html
The shapes are shown in the screenshot; the three values are m, n, and k.

The for (100) loop runs the computation 100 times to measure the elapsed time.
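To tell a genuine per-shape slowdown from a single slow iteration, it can help to record each iteration separately instead of only the total for 100 runs. The following is a minimal, framework-agnostic sketch; `launch` and `synchronize` are hypothetical callables standing in for the aclnnGemm launch and aclrtSynchronizeStream, and `time_op` is an illustrative name, not part of any Ascend API. Synchronizing after every launch adds launch latency to each sample, so the numbers are not directly comparable to the batched loop above.

```python
import statistics
import time


def time_op(launch, synchronize, warmup=1, iters=100):
    """Time `launch` per iteration and return summary statistics in ms.

    `launch` and `synchronize` are placeholders (assumptions) for the
    operator launch and the stream synchronization of the real program.
    """
    # Warm-up runs are excluded from the measurement.
    for _ in range(warmup):
        launch()
    synchronize()

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        launch()
        synchronize()  # wait for the device so each sample covers one run
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "iters": len(samples_ms),
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "max_ms": max(samples_ms),
    }
```

A large gap between `median_ms` and `max_ms` would point to occasional outliers, while a high median for one shape would point to a consistently slow kernel for that shape.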

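The inputs are constructed with torch.randn(shape, dtype=torch.float16) and saved in float16 format.
The reference output is generated by first converting the inputs to float32, computing the result with torch on the CPU, and then converting it to float16 before saving.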
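Because the reference is rounded to float16 at the end, bit-exact agreement with a device result is not necessarily expected: float16 carries 11 significant bits, so a single rounding can introduce a relative error of up to about 2^-11 (roughly 4.9e-4). The sketch below uses only the standard library to emulate float16 rounding and to compare two flat sequences with a combined absolute and relative tolerance. The names `to_fp16` and `compare` and the default tolerances are assumptions for illustration, not values taken from the CANN documentation.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def compare(actual, expected, rtol=1e-3, atol=1e-3):
    """Return (max_abs_err, max_rel_err, mismatches) for two flat sequences.

    An element counts as a mismatch when
    |actual - expected| > atol + rtol * |expected|.
    The default tolerances are illustrative assumptions.
    """
    max_abs = 0.0
    max_rel = 0.0
    mismatches = 0
    for a, e in zip(actual, expected):
        abs_err = abs(a - e)
        max_abs = max(max_abs, abs_err)
        if e != 0.0:
            max_rel = max(max_rel, abs_err / abs(e))
        if abs_err > atol + rtol * abs(e):
            mismatches += 1
    return max_abs, max_rel, mismatches


# Example: compare float16-rounded values against their float64 originals.
originals = [0.1, 35.7, -1200.3]
rounded = [to_fp16(v) for v in originals]
print(compare(rounded, originals))
```

Reporting the maximum errors together with the mismatch count makes it easier to judge whether a diff is within float16 rounding or indicates a real computation error.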

Hello, I have forwarded this to the owner of the relevant operator.
Could you share the input and output data?

Here is the code I use to construct the input and output; please take a look.

import os, sys
import numpy as np
import torch

class MkDataHelper:
    def __init__(self, data_dir="/workspace"):
        self.data_dir = data_dir
    
    def is_cuda(self):
        return os.path.exists("/bin/nvidia-smi")

    def mkdata(self, subdir, name, shape, dtype=torch.float16):
        outdir = os.path.join(self.data_dir, subdir)
        os.makedirs(outdir, exist_ok=True)
        fpath = os.path.join(outdir, name+".npy")
        data = torch.randn(shape, dtype=torch.float16)
        np.save(fpath, data.to(dtype).numpy())

        if self.is_cuda():
            return data.to("cuda")
        else:
            data = data.to(torch.float32)
        return data

    def gen_out_with_func(self, func, inputs):
        output = func(inputs)
        return output

    def savedata(self, subdir, name, data):
        outdir = os.path.join(self.data_dir, subdir)
        os.makedirs(outdir, exist_ok=True)
        fpath = os.path.join(outdir, name+".npy")
        np.save(fpath, data.detach().cpu().to(torch.float16).numpy())
    
    def loaddata(self, subdir, name):
        outdir = os.path.join(self.data_dir, subdir)
        fpath = os.path.join(outdir, name+".npy")
        return np.load(fpath)

helper = MkDataHelper()
TNAME="gemm_test"
def make_data(shape):
    m,n,k = shape  # input1 is [m, n] and input2 is [n, k], so n is the reduction dimension

    test_info = helper.mkdata(TNAME, "test_info", [1])
    helper.savedata(TNAME, "test_info", test_info)
    
    input1 = helper.mkdata(TNAME, "input1", [m,n])
    input2 = helper.mkdata(TNAME, "input2", [n,k])
    
    def func(inputs):
        output = torch.matmul(inputs[0], inputs[1])
        output = output.to(torch.float16)
        return output

    def funcadd(inputs):
        output = torch.add(inputs[0], inputs[1])
        output = output.to(torch.float16)
        return output
    
    output = helper.gen_out_with_func(func, [input1, input2])
    helper.savedata(TNAME, "output", output)
    # print("input shape", input1.shape, input2.shape)
    # print("output shape", output.shape)


if __name__ == "__main__":
    idx = 4
    if len(sys.argv) > 1:
        idx = int(sys.argv[1])
    
    def gen_gemm(idx):
        if type(idx) in (tuple, list):
            shape = idx
        else:
            shapes = [[1280, 2, 320], [1280, 2, 1280], [320, 2, 1280], [640, 2, 1280], [640, 4160, 640], [1920, 4160, 640], [1280, 154, 2048], [1280, 32, 2048], [5120, 4160, 640], [640, 4160, 2560], [1280, 1040, 1280], [3840, 1040, 1280], [2560, 154, 2048], [2560, 32, 2048], [10240, 1040, 1280], [1280, 1040, 5120], [1536, 8320, 512], [512, 8320, 512]]
            print("total shape cnt: ", len(shapes))
            shape = shapes[idx]
        make_data(shape)

    gen_gemm(idx)
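When comparing elapsed time across the shapes listed in the script, normalizing by the amount of arithmetic can make outliers easier to spot. The sketch below assumes the usual count of 2*m*n*k floating-point operations for a matrix product (one multiply and one add per term); the product of the three dimensions is the same regardless of which one is the reduction dimension. The function names are illustrative.

```python
def gemm_flops(m, n, k):
    """Floating-point operations for one matrix product with dims m, n, k."""
    return 2 * m * n * k


def tflops(shape, elapsed_ms):
    """Achieved throughput in TFLOP/s for one GEMM that took `elapsed_ms`."""
    m, n, k = shape
    return gemm_flops(m, n, k) / (elapsed_ms * 1e-3) / 1e12


# Example with a made-up timing of 1 ms (not a measured value).
print(tflops([10240, 1040, 1280], 1.0))
```

A shape whose throughput is far below that of neighbouring shapes of similar size is a candidate for the latency spike described in the issue.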

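Hello, we could not reproduce this problem in our environment, so we would like to confirm the following:
1. Change the iteration count here to 1, then collect plog and profiling data, and provide the plog log and the profiling files.
[Image]

2. For the case with the accuracy error, could you provide the saved input and output data files (bin or npy)?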

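Notes:
1. How to collect plog logs:
Set the environment variables export ASCEND_SLOG_PRINT_TO_STDOUT=1;export ASCEND_GLOBAL_LOG_LEVEL=0 (this is the debug level; when collecting performance data, set the log level back to 3, the error level). Append >>error.log to the end of the command to redirect all on-screen output to error.log, then upload error.log for analysis.
2. For how to collect profiling data, see https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC2alpha001/devguide/moddevg/tfonlineinfer1/tf1onlineinfer_26_0008.html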
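The plog collection step in note 1 can also be scripted. The sketch below sets the two environment variables named above for a child process only and appends its output to a log file. The helper name `run_with_plog` is hypothetical, and capturing stderr together with stdout is a choice made here; the shell form `>>error.log` redirects stdout only.

```python
import os
import subprocess
import sys


def run_with_plog(cmd, log_path="error.log", log_level="0"):
    """Run `cmd` with Ascend logs printed to stdout and appended to a file.

    log_level "0" is the debug level and "3" the error level, as stated
    in the thread. Returns the exit code of the command.
    """
    env = dict(os.environ)
    env["ASCEND_SLOG_PRINT_TO_STDOUT"] = "1"
    env["ASCEND_GLOBAL_LOG_LEVEL"] = log_level
    # Open in append mode to mirror the `>>` redirection.
    with open(log_path, "ab") as log:
        result = subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py <test binary> [args...]
    sys.exit(run_with_plog(sys.argv[1:]))
```

Setting the variables per process keeps the debug log level from leaking into a later performance run, where the thread recommends level 3.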

How should I provide the data to you? Also, could you share the code that ran successfully on your side, so that I can verify it here?

I have put the data at https://gitee.com/kiokana/debug-data, in the gemm_test directory.

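The profile has been added too, although it was collected using acl.json; the plog has also been added, in the gemm_test directory.
Also, would it be possible to exchange another contact method? I do not see messages here very promptly.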

{
    "profiler": {
        "switch": "on",
        "output": "output"
    }
}
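For repeated runs, the acl.json file shown above can be generated from a script so that the output directory is easy to change. This is a minimal sketch that writes exactly the two fields used in this thread; the helper name `write_acl_json` is hypothetical, and any other profiler fields are out of scope here.

```python
import json


def write_acl_json(path="acl.json", output_dir="output"):
    """Write a profiling config with the same fields as the one in the thread."""
    config = {
        "profiler": {
            "switch": "on",        # enable profiling collection
            "output": output_dir,  # directory for the profiling results
        }
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)
    return config
```

The resulting file is the configuration the application passes at initialization, which is assumed here to be the same path the original test program already uses.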

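Also, I tried the MatMulInvocationNeo example with modified m, n, k, and the result also looks incorrect. I would still appreciate it if you could provide the code that runs successfully on your side, so that I can check whether my environment is the problem.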
M = 10240
N = 1040
K = 1280
[Image]

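For the MatMulInvocationNeo problem, could you open a separate issue? We will ask the colleagues responsible for the relevant operator to look at it together.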
