| Remarks | Modified on | Modified by |
| Delivered version | 2025-06-05 16:09:38 [current version] | 罗定丰 |
| Initial version | 2025-05-31 19:50:53 | 罗定丰 |
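This procedure tests the single-precision (FP32) floating-point performance of the CPU and GPU on CentOS 7.9.
All commands in this document are run as the root user.
Use the root user's home directory, /root, as the working directory for the whole procedure.
CentOS 7.9 has reached end of life and its official repositories are no longer maintained, so the yum repository must be switched to a mirror in mainland China. This procedure uses the Alibaba Cloud mirror as an example: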
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
Download the CUDA installer package from the NVIDIA website:
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-rhel7-12-4-local-12.4.0_550.54.14-1.x86_64.rpm
Install the local repository package:
rpm -i cuda-repo-rhel7-12-4-local-12.4.0_550.54.14-1.x86_64.rpm
Clear the yum cache:
yum clean all
Install the CUDA toolkit and the build tools:
yum -y install cuda-toolkit-12-4
yum install -y gcc gcc-c++ make numactl
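Before continuing, it can help to confirm that the CUDA tools are reachable from the shell. The following is a minimal sketch, assuming the toolkit landed in the default prefix /usr/local/cuda-12.4 (the same path passed to make later in this procedure). The helper name have_cmd is hypothetical.

```shell
# Assumption: the toolkit was installed to the default prefix /usr/local/cuda-12.4.
CUDA_HOME=/usr/local/cuda-12.4
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# have_cmd is a hypothetical helper: succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether the CUDA compiler and the driver utility are reachable.
for tool in nvcc nvidia-smi; do
    if have_cmd "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: not found"
    fi
done
```

nvidia-smi is shipped with the NVIDIA driver rather than with the toolkit, so a "not found" result for it usually means the driver still needs to be installed before the GPU test can run.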
Install Intel MKL (oneMKL):
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/dc93af13-2b3f-40c3-a41b-2bc05a707a80/intel-onemkl-2025.1.0.803_offline.sh
sh intel-onemkl-2025.1.0.803_offline.sh
Install Intel MPI:
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/6b6e395e-8f38-4da3-913d-90a2bcf41028/intel-mpi-2021.15.0.495_offline.sh
sh intel-mpi-2021.15.0.495_offline.sh
Create the sgemm_test.c source file and paste in the code below:
nano ~/sgemm_test.c
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

int main(void) {
    const int N = 16384;                  /* matrix dimension (N x N) */
    const int runs = 60;                  /* number of timed SGEMM calls */
    const size_t count = (size_t)N * N;   /* elements per matrix */

    /* 64-byte aligned buffers; each matrix takes 1 GiB at N = 16384 */
    float *A = (float*)mkl_malloc(count * sizeof(float), 64);
    float *B = (float*)mkl_malloc(count * sizeof(float), 64);
    float *C = (float*)mkl_malloc(count * sizeof(float), 64);
    if (!A || !B || !C) {
        fprintf(stderr, "mkl_malloc failed\n");
        mkl_free(A); mkl_free(B); mkl_free(C);
        return 1;
    }

    for (size_t i = 0; i < count; i++) {
        A[i] = 1.0f;
        B[i] = 1.0f;
        C[i] = 0.0f;
    }

    double max_gflops = 0.0;
    double sum_gflops = 0.0;
    for (int r = 0; r < runs; r++) {
        double start = dsecnd();
        /* C = 1.0 * A * B + 0.0 * C */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, 1.0f, A, N, B, N, 0.0f, C, N);
        double end = dsecnd();
        double time = end - start;
        /* An N x N x N SGEMM performs 2 * N^3 floating-point operations */
        double gflops = 2.0 * N * N * N / (time * 1e9);
        printf("Run %d: SGEMM Time: %.5f sec, Performance: %.2f GFLOPS\n", r+1, time, gflops);
        if (gflops > max_gflops) max_gflops = gflops;
        sum_gflops += gflops;
    }

    printf("Peak Performance: %.2f GFLOPS\n", max_gflops);
    printf("Average Performance: %.2f GFLOPS\n", sum_gflops / runs);

    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}
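The figure the program reports is 2 * N^3 / (time * 1e9): multiplying two N x N matrices takes N^3 multiply-add pairs, counted as two floating-point operations each. As a quick cross-check, the same arithmetic can be reproduced in the shell. This is only a sketch; the helper name sgemm_gflops is hypothetical.

```shell
# sgemm_gflops N SECONDS   (hypothetical helper name)
# Prints 2 * N^3 / (SECONDS * 1e9) with two decimals, the same formula
# that sgemm_test.c uses for each run.
sgemm_gflops() {
    LC_ALL=C awk -v n="$1" -v t="$2" \
        'BEGIN { printf "%.2f\n", 2.0 * n * n * n / (t * 1e9) }'
}

# Example: N = 16384 completing in exactly 1.0 s
sgemm_gflops 16384 1.0    # prints 8796.09
```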
Load the oneAPI environment (this sets MKLROOT), then compile the source code:
source /opt/intel/oneapi/setvars.sh
cd ~
gcc -O3 -std=c99 -march=native -I${MKLROOT}/include sgemm_test.c -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_gnu_thread -fopenmp -lpthread -lm -ldl -o sgemm_test
Download and build gpu-burn:
git clone https://github.com/wilicc/gpu-burn.git
cd gpu-burn
make CUDAPATH=/usr/local/cuda-12.4
Run the CPU test:
cd ~
numactl --interleave=all ./sgemm_test
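Because the binary is linked against mkl_gnu_thread, MKL's threading goes through GNU OpenMP, so the thread count can be controlled with OMP_NUM_THREADS. On hosts with Hyper-Threading enabled, one thread per physical core is a common starting point, although the best value may differ per machine. The sketch below assumes the "Core(s) per socket" and "Socket(s)" fields printed by lscpu; the helper name physical_cores is hypothetical.

```shell
# physical_cores (hypothetical helper): reads `lscpu` output on stdin and
# prints sockets * cores-per-socket (0 if the fields are missing).
physical_cores() {
    LC_ALL=C awk -F: '
        /Core\(s\) per socket/ { cores = $2 + 0 }
        /Socket\(s\)/          { sockets = $2 + 0 }
        END { print cores * sockets }
    '
}

# Use one OpenMP thread per physical core when lscpu is available.
if command -v lscpu >/dev/null 2>&1; then
    n=$(LC_ALL=C lscpu | physical_cores)
    if [ "${n:-0}" -gt 0 ]; then
        export OMP_NUM_THREADS="$n"
        echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
    fi
fi
```

With the variable exported, the CPU test is started exactly as before with numactl --interleave=all ./sgemm_test.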
The output should look similar to the following:
Run 1: SGEMM Time: 1.31447 sec, Performance: 6691.76 GFLOPS
...
Run 60: SGEMM Time: 1.03889 sec, Performance: 8466.86 GFLOPS
Peak Performance: 8500.90 GFLOPS
Average Performance: 8429.91 GFLOPS
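In the sample above the first run is noticeably slower than the last, so it can be useful to save the output and look at the spread across runs. The following sketch assumes the "Run N: ... Performance: X GFLOPS" line format printed by sgemm_test.c; the helper name summarize_sgemm and the file name sgemm.log are hypothetical.

```shell
# summarize_sgemm (hypothetical helper): reads sgemm_test output on stdin and
# prints the number of runs plus the min / max / mean GFLOPS of the "Run N:" lines.
summarize_sgemm() {
    LC_ALL=C awk '
        /^Run [0-9]+:/ {
            # The GFLOPS value is the second-to-last field:
            # "... Performance: 8466.86 GFLOPS"
            g = $(NF - 1) + 0
            if (n == 0 || g < min) min = g
            if (n == 0 || g > max) max = g
            sum += g
            n++
        }
        END {
            if (n > 0)
                printf "runs=%d min=%.2f max=%.2f mean=%.2f\n", n, min, max, sum / n
        }
    '
}

# Typical use (sgemm.log is an assumed file name):
#   numactl --interleave=all ./sgemm_test | tee sgemm.log
#   summarize_sgemm < sgemm.log
```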
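Run the GPU test:
cd ~/gpu-burn
./gpu_burn -i 0 60s   # -i 0 tests only GPU 0, for 60 seconds
./gpu_burn 60s        # tests all GPUs together, for 60 seconds
./gpu_burn -i 0 60s continuously displays the single-precision floating-point performance of GPU 0 in real time;
./gpu_burn 60s continuously displays the single-precision floating-point performance of all GPUs in real time.
The figures shown are those reported by the test tool itself.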
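Since the figures scroll by in real time, capturing them to a file makes it easier to record the highest value seen. The sketch below is based on an assumption about the output format, namely that the progress text contains fragments such as "(12345 Gflop/s)"; check this against the actual output of the gpu-burn build in use. The helper name max_gflops_from_log and the file name gpu0.log are hypothetical.

```shell
# max_gflops_from_log (hypothetical helper): reads captured gpu_burn output on
# stdin and prints the largest number directly followed by " Gflop/s".
# Assumption: progress lines contain text such as "(12345 Gflop/s)".
max_gflops_from_log() {
    # Progress updates may be separated by carriage returns, so split on them first.
    tr '\r' '\n' |
        LC_ALL=C grep -Eo '[0-9]+(\.[0-9]+)? Gflop/s' |
        LC_ALL=C awk '$1 + 0 > max { max = $1 + 0 } END { print max + 0 }'
}

# Typical use (gpu0.log is an assumed file name):
#   ./gpu_burn -i 0 60s | tee gpu0.log
#   max_gflops_from_log < gpu0.log
```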
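By following the procedure above, you can measure the machine's single-precision floating-point performance on CentOS 7 and check whether it matches the theoretical value.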
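The theoretical value is commonly estimated as sockets x cores per socket x clock (GHz) x FP32 operations per cycle per core for a CPU, and as CUDA cores x clock (GHz) x 2 for a GPU (one fused multiply-add per core per cycle). The sketch below only does this arithmetic. All helper names and sample inputs are hypothetical, and the per-cycle figure depends on the microarchitecture (for example 32 for AVX2 with two FMA units, 64 for AVX-512 with two FMA units), so take the real numbers from the vendor's specifications.

```shell
# All helper names and sample inputs below are hypothetical.

# cpu_peak_gflops SOCKETS CORES_PER_SOCKET GHZ FLOPS_PER_CYCLE
cpu_peak_gflops() {
    LC_ALL=C awk -v s="$1" -v c="$2" -v f="$3" -v w="$4" \
        'BEGIN { printf "%.2f\n", s * c * f * w }'
}

# gpu_peak_gflops CUDA_CORES GHZ   (2 FLOPs per core per cycle: one FMA)
gpu_peak_gflops() {
    LC_ALL=C awk -v c="$1" -v f="$2" 'BEGIN { printf "%.2f\n", c * f * 2 }'
}

# efficiency_pct MEASURED_GFLOPS PEAK_GFLOPS
efficiency_pct() {
    LC_ALL=C awk -v m="$1" -v p="$2" 'BEGIN { printf "%.1f\n", 100 * m / p }'
}

# Example with made-up hardware: 2 sockets x 32 cores at 2.5 GHz, 64 FLOPs/cycle.
# The measured value is a placeholder; substitute the average from your own run.
peak=$(cpu_peak_gflops 2 32 2.5 64)
echo "CPU peak: $peak GFLOPS"
echo "Efficiency: $(efficiency_pct 8000 "$peak") %"
```

Sustained clocks under heavy vector load are often lower than the nominal frequency, so a measured result somewhat below the computed peak is expected.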