| Note | Modified | Modified by |
| Delivered version | 2025-06-05 16:09:38 [current version] | 罗定丰 |
| Initial version | 2025-05-31 19:50:53 | 罗定丰 |
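This procedure tests the floating-point performance of the CPU and GPU on CentOS 7.9.
All commands in this document are run as the root user.
The root home directory, /root, is the base directory for the whole procedure.
CentOS 7.9 has reached end of life upstream, so the default yum repositories must be replaced with a mirror in China. This procedure uses the Aliyun mirror: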
wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo
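Install dkms and the kernel headers (dkms comes from EPEL, whose repository package is named epel-release):
yum install epel-release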
yum install dkms kernel-devel
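The NVIDIA installer builds its kernel module against the headers of the running kernel, so the installed kernel-devel package should match the output of uname -r. The following is a minimal sketch of that check; the function name kernel_devel_matches is hypothetical and not part of the original procedure.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the list of installed kernel-devel
# packages (read from stdin, one per line, as printed by `rpm -q kernel-devel`)
# contains the package for the given kernel release.
kernel_devel_matches() {
    release="$1"
    grep -qxF "kernel-devel-${release}"
}

# Usage on the target machine (only runs when rpm is available).
if command -v rpm >/dev/null 2>&1; then
    if rpm -q kernel-devel | kernel_devel_matches "$(uname -r)"; then
        echo "kernel-devel matches the running kernel"
    else
        echo "kernel-devel does not match $(uname -r)"
    fi
fi
```

If the versions differ, updating the kernel and rebooting, or installing the kernel-devel package for the running release, are the usual remedies.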
Download the driver from the official NVIDIA site and install it:
Download link: https://us.download.nvidia.com/XFree86/Linux-x86_64/570.153.02/NVIDIA-Linux-x86_64-570.153.02.run
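Disable the nouveau driver by editing the module blacklist:
nano /lib/modprobe.d/dist-blacklist.conf
Find the line containing nvidia (typically "blacklist nvidiafb"), comment it out with #, and append the following two lines at the end of the file: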
blacklist nouveau
options nouveau modeset=0
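After saving the file, run dracut --force to regenerate the initramfs, then reboot.
After the reboot, run init 3 to switch to text mode and continue with the steps below.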
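Before running the NVIDIA installer it helps to confirm that the blacklist took effect. The sketch below checks the loaded-module list for nouveau; the helper name nouveau_loaded is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads a module list in /proc/modules format from stdin
# and succeeds if the nouveau module is present.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Usage on the target machine after the reboot.
if [ -r /proc/modules ]; then
    if nouveau_loaded < /proc/modules; then
        echo "nouveau is still loaded; recheck the blacklist and regenerate the initramfs"
    else
        echo "nouveau is not loaded"
    fi
fi
```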
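Make the installer executable:
chmod +x NVIDIA-Linux-x86_64-570.153.02.run
Run it:
./NVIDIA-Linux-x86_64-570.153.02.run
Accept the default settings throughout. When the installation finishes, run init 5 to switch back to graphical mode; a reboot is usually not required.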
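To confirm that the driver is active, the version reported by nvidia-smi can be compared with the installed version. This is a sketch under the assumption that every GPU should report 570.153.02; the helper name driver_version_ok is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads one driver version per line from stdin (one line
# per GPU) and succeeds only if at least one line is present and every line
# equals the expected version.
driver_version_ok() {
    awk -v want="$1" '
        NF == 0 { next }
        { seen++; if (($1 "") != (want "")) bad++ }
        END { exit !(seen > 0 && bad == 0) }'
}

# Usage on the target machine (skipped when nvidia-smi is not installed).
if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi --query-gpu=driver_version --format=csv,noheader | driver_version_ok 570.153.02; then
        echo "driver 570.153.02 is active on all GPUs"
    else
        echo "unexpected driver version reported by nvidia-smi"
    fi
fi
```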
Download the CUDA installation package from the NVIDIA site:
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-rhel7-12-4-local-12.4.0_550.54.14-1.x86_64.rpm
Install the local repository package:
sudo rpm -i cuda-repo-rhel7-12-4-local-12.4.0_550.54.14-1.x86_64.rpm
Clean the yum cache:
sudo yum clean all
Install the toolkit and the build tools:
sudo yum -y install cuda-toolkit-12-4
sudo yum install -y gcc gcc-c++ make cmake
sudo yum groupinstall -y "Development Tools"
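The toolkit binaries are not on PATH by default. The sketch below adds them for the current shell, assuming the toolkit is reachable through the default /usr/local/cuda symlink (an assumption, since the original does not state the install prefix); the CUDA_HOME variable name is a common convention rather than a requirement.

```shell
#!/bin/sh
# Assumption: the toolkit is installed under the default /usr/local/cuda symlink.
CUDA_HOME=/usr/local/cuda
export CUDA_HOME
export PATH="${CUDA_HOME}/bin:${PATH}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Confirm the compiler is reachable (prints a notice when the toolkit is absent).
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found under ${CUDA_HOME}/bin"
fi
```

Appending the three export lines to ~/.bashrc makes the setting persistent across logins.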
Download and install Intel MKL:
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/dc93af13-2b3f-40c3-a41b-2bc05a707a80/intel-onemkl-2025.1.0.803_offline.sh
sh intel-onemkl-2025.1.0.803_offline.sh
Download and install Intel MPI:
wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/6b6e395e-8f38-4da3-913d-90a2bcf41028/intel-mpi-2021.15.0.495_offline.sh
sh intel-mpi-2021.15.0.495_offline.sh
Set the environment variables and make the OpenMP runtime library visible to the linker:
echo 'source /opt/intel/oneapi/setvars.sh' >> ~/.bashrc
source ~/.bashrc
ln -s /opt/intel/oneapi/compiler/latest/lib/libiomp5.so /lib/libiomp5.so
Download and extract the HPL source code:
wget http://www.netlib.org/benchmark/hpl/hpl-2.3.tar.gz
tar -xzf hpl-2.3.tar.gz
cd hpl-2.3
Create the Make.test file with the following content:
nano ~/hpl-2.3/Make.test
#
# -- High Performance Computing Linpack Benchmark (HPL)
# HPL - 2.3 - December 2, 2018
# Antoine P. Petitet
# University of Tennessee, Knoxville
# Innovative Computing Laboratory
# (C) Copyright 2000-2008 All Rights Reserved
#
# -- Copyright notice and Licensing terms:
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
#
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions, and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
#
# 3. All advertising materials mentioning features or use of this
# software must display the following acknowledgement:
# This product includes software developed at the University of
# Tennessee, Knoxville, Innovative Computing Laboratory.
#
# 4. The name of the University, the name of the Laboratory, or the
# names of its contributors may not be used to endorse or promote
# products derived from this software without specific written
# permission.
#
# -- Disclaimer:
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY
# OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# ######################################################################
#
# ----------------------------------------------------------------------
# - shell --------------------------------------------------------------
# ----------------------------------------------------------------------
#
SHELL = /bin/sh
#
CD = cd
CP = cp
LN_S = ln -s
MKDIR = mkdir
RM = /bin/rm -f
TOUCH = touch
#
# ----------------------------------------------------------------------
# - Platform identifier ------------------------------------------------
# ----------------------------------------------------------------------
#
ARCH = test
#
# ----------------------------------------------------------------------
# - HPL Directory Structure / HPL library ------------------------------
# ----------------------------------------------------------------------
#
TOPdir = /root/hpl-2.3
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
#
HPLlib = $(LIBdir)/libhpl.a
#
# ----------------------------------------------------------------------
# - Message Passing library (MPI) --------------------------------------
# ----------------------------------------------------------------------
# MPinc tells the C compiler where to find the Message Passing library
# header files, MPlib is defined to be the name of the library to be
# used. The variable MPdir is only used for defining MPinc and MPlib.
#
MPdir = /opt/intel/oneapi/mpi/latest
MPinc = -I$(MPdir)/include
MPlib = -lmpi
#
# ----------------------------------------------------------------------
# - Linear Algebra library (BLAS or VSIPL) -----------------------------
# ----------------------------------------------------------------------
# LAinc tells the C compiler where to find the Linear Algebra library
# header files, LAlib is defined to be the name of the library to be
# used. The variable LAdir is only used for defining LAinc and LAlib.
#
LAdir = /opt/intel/oneapi/mkl/latest/lib
LAinc = -I/opt/intel/oneapi/mkl/latest/include
LAlib = -Wl,--start-group $(LAdir)/libmkl_intel_lp64.a $(LAdir)/libmkl_intel_thread.a $(LAdir)/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
#
# ----------------------------------------------------------------------
# - F77 / C interface --------------------------------------------------
# ----------------------------------------------------------------------
# You can skip this section if and only if you are not planning to use
# a BLAS library featuring a Fortran 77 interface. Otherwise, it is
# necessary to fill out the F2CDEFS variable with the appropriate
# options. **One and only one** option should be chosen in **each** of
# the 3 following categories:
#
# 1) name space (How C calls a Fortran 77 routine)
#
# -DAdd_ : all lower case and a suffixed underscore (Suns,
# Intel, ...), [default]
# -DNoChange : all lower case (IBM RS6000),
# -DUpCase : all upper case (Cray),
# -DAdd__ : the FORTRAN compiler in use is f2c.
#
# 2) C and Fortran 77 integer mapping
#
# -DF77_INTEGER=int : Fortran 77 INTEGER is a C int, [default]
# -DF77_INTEGER=long : Fortran 77 INTEGER is a C long,
# -DF77_INTEGER=short : Fortran 77 INTEGER is a C short.
#
# 3) Fortran 77 string handling
#
# -DStringSunStyle : The string address is passed at the string loca-
# tion on the stack, and the string length is then
# passed as an F77_INTEGER after all explicit
# stack arguments, [default]
# -DStringStructPtr : The address of a structure is passed by a
# Fortran 77 string, and the structure is of the
# form: struct {char *cp; F77_INTEGER len;},
# -DStringStructVal : A structure is passed by value for each Fortran
# 77 string, and the structure is of the form:
# struct {char *cp; F77_INTEGER len;},
# -DStringCrayStyle : Special option for Cray machines, which uses
# Cray fcd (fortran character descriptor) for
# interoperation.
#
F2CDEFS =
#
# ----------------------------------------------------------------------
# - HPL includes / libraries / specifics -------------------------------
# ----------------------------------------------------------------------
#
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
#
# - Compile time options -----------------------------------------------
#
# -DHPL_COPY_L force the copy of the panel L before bcast;
# -DHPL_CALL_CBLAS call the cblas interface;
# -DHPL_CALL_VSIPL call the vsip library;
# -DHPL_DETAILED_TIMING enable detailed timers;
#
# By default HPL will:
# *) not copy L before broadcast,
# *) call the BLAS Fortran 77 interface,
# *) not display detailed timing information.
#
HPL_OPTS = -DHPL_CALL_CBLAS
#
# ----------------------------------------------------------------------
#
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
#
# ----------------------------------------------------------------------
# - Compilers / linkers - Optimization flags ---------------------------
# ----------------------------------------------------------------------
#
CC = /opt/intel/oneapi/mpi/latest/bin/mpicc -lpthread
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
#
# On some platforms, it is necessary to use the Fortran linker to find
# the Fortran internals used in the BLAS library.
#
LINKER = /opt/intel/oneapi/mpi/latest/bin/mpicc -lpthread
LINKFLAGS = $(CCFLAGS)
#
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
#
# ----------------------------------------------------------------------
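Build HPL. Because ARCH is set to test, the binary and its input file are generated under bin/test:
cd ~/hpl-2.3
make arch=test
Edit the hpl-2.3/bin/test/HPL.dat file:
nano ~/hpl-2.3/bin/test/HPL.dat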
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
1 number of problems sizes
50000 Ns (problem size)
1 number of NBs
192 NBs (block size)
1 number of process grids
8 Ps
8 Qs
16.0 threshold
1 number of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 number of recursive stopping criterium
4 NBMINs (>= 1)
1 number of panels in recursion
2 NDIVs
1 number of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 number of broadcast
0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 number of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
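The problem size Ns = 50000 corresponds to a matrix of 50000 × 50000 double-precision values, about 20 GB. A common rule of thumb (an assumption here, not stated in the original) is to choose N so that the matrix fills roughly 80% of total memory and to round N down to a multiple of NB. The helper name hpl_suggest_n below is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: suggest an HPL problem size N.
#   $1 = total memory in GiB, $2 = block size NB, $3 = memory fraction (default 0.8)
# The N x N matrix of 8-byte doubles should fit in fraction * memory, and N is
# rounded down to a multiple of NB.
hpl_suggest_n() {
    awk -v gib="$1" -v nb="$2" -v frac="${3:-0.8}" 'BEGIN {
        n = sqrt(frac * gib * 1024 * 1024 * 1024 / 8)
        printf "%d\n", int(n / nb) * nb
    }'
}

hpl_suggest_n 64 192   # prints 82752
```

The 64 GiB figure is only an illustration; substitute the memory actually installed across all nodes taking part in the run.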
Run the benchmark. The process count passed to mpirun matches the process grid P × Q = 8 × 8 = 64:
cd ~/hpl-2.3/bin/test
mpirun -np 64 ./xhpl
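Since the process count has to agree with the Ps and Qs values in HPL.dat, it can be derived from the file rather than typed by hand. The sketch below assumes a single process grid, with one value per line followed by the Ps or Qs label; the helper name hpl_grid_size is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print P * Q for a single process grid in an HPL.dat
# file, locating the values by the "Ps" and "Qs" labels in the second column.
hpl_grid_size() {
    awk '$2 == "Ps" { p = $1 } $2 == "Qs" { q = $1 } END { print p * q }' "$1"
}

# Usage: print the matching mpirun command for the HPL.dat in the current directory.
if [ -f HPL.dat ]; then
    np=$(hpl_grid_size HPL.dat)
    echo "mpirun -np ${np} ./xhpl"
fi
```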
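If the environment is configured correctly, the output contains a result block of the following form:
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR00L2L4       50000   192     8     8             123.45            123.456e+00
Convert the Gflops value to FLOPS by multiplying by 10⁹; for example, 123.456 GFLOPS = 123.456 × 10⁹ FLOPS.
Compare the result with the theoretical peak (2.15 TFLOPS = 2150 GFLOPS).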
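The comparison with the theoretical peak can be scripted. The sketch below extracts the Gflops column from a result line and expresses it as a percentage of the 2150 GFLOPS peak quoted above; the helper names hpl_gflops and hpl_efficiency are hypothetical, and the input line is the sample result line from this document, not a measurement.

```shell
#!/bin/sh
# Hypothetical helper: extract the Gflops column from HPL result lines
# (7-field lines whose first field starts with WR or WC) read from stdin.
hpl_gflops() {
    awk '$1 ~ /^W[RC]/ && NF == 7 { printf "%.3f\n", $7 }'
}

# Hypothetical helper: efficiency in percent = measured / theoretical * 100.
hpl_efficiency() {
    awk -v got="$1" -v peak="$2" 'BEGIN { printf "%.2f\n", got / peak * 100 }'
}

measured=$(echo 'WR00L2L4 50000 192 8 8 123.45 123.456e+00' | hpl_gflops)
echo "measured:   ${measured} GFLOPS"
echo "efficiency: $(hpl_efficiency "$measured" 2150) %"
```

On a real run, pipe the xhpl output (or the HPL.out file, if file output is configured) into hpl_gflops instead of the sample line.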
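For the GPU test, first build and run deviceQuery to confirm that all GPUs are detected. The paths below assume that the CUDA samples are available under /usr/local/cuda/samples; the CUDA 12.x toolkit packages do not include them, so they have to be obtained separately from NVIDIA's cuda-samples repository.
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
mkdir build
cd build
cmake ..
make
./deviceQuery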
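Then build and run matrixMul with 4096 × 4096 matrices:
cd /usr/local/cuda/samples/0_Introduction/matrixMul
mkdir build
cd build
cmake ..
make
./matrixMul -wA=4096 -hA=4096 -wB=4096 -hB=4096
Record the reported performance (GFLOPS) and convert it to FLOPS.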
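The reported figure can be cross-checked by hand: multiplying an hA × wA matrix by a wA × wB matrix takes 2 × hA × wA × wB floating-point operations, and dividing that count by the time per multiplication gives the rate. The sketch below works this through for the 4096 × 4096 case; the helper names are hypothetical and the 10 ms timing is illustrative, not a measurement.

```shell
#!/bin/sh
# Hypothetical helper: floating-point operations for one matrix product
# C(hA x wB) = A(hA x wA) * B(wA x wB): one multiply and one add per term.
matmul_flops() {
    echo $(( 2 * $1 * $2 * $3 ))
}

# Hypothetical helper: GFLOPS from an operation count and a time in milliseconds.
gflops_from_msec() {
    awk -v ops="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", ops / (ms * 1e6) }'
}

ops=$(matmul_flops 4096 4096 4096)
echo "operations per multiplication: ${ops}"   # 137438953472
# 10 ms is an illustrative timing, not a measurement.
echo "GFLOPS at 10 ms: $(gflops_from_msec "$ops" 10)"
```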
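Theoretical peak of a single GPU = number of CUDA cores × clock frequency × 2 (a fused multiply-add counts as 2 floating-point operations per core per cycle) = 6912 × 1.41 × 10⁹ × 2 ≈ 1.95 × 10¹³ FLOPS ≈ 19.5 TFLOPS
Theoretical peak of 8 GPUs = 8 × 19.5 TFLOPS = 156 TFLOPS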
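The same arithmetic can be reproduced for other GPU models or counts. This is a minimal sketch that assumes each core completes one fused multiply-add (2 floating-point operations) per cycle; the helper name gpu_peak_tflops is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: theoretical peak in TFLOPS.
#   $1 = CUDA cores per GPU, $2 = clock in GHz, $3 = number of GPUs (default 1)
# Each core is assumed to complete one fused multiply-add (2 FLOPs) per cycle.
gpu_peak_tflops() {
    awk -v cores="$1" -v ghz="$2" -v n="${3:-1}" \
        'BEGIN { printf "%.2f\n", cores * ghz * 2 * n / 1000 }'
}

gpu_peak_tflops 6912 1.41      # prints 19.49
gpu_peak_tflops 6912 1.41 8    # prints 155.93
```

The unrounded results, 19.49 and 155.93 TFLOPS, agree with the 19.5 and 156 TFLOPS figures above, which round the single-GPU value before multiplying by 8.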
Following this procedure, the floating-point performance of the machine can be measured on CentOS 7.9 and checked against the theoretical values.