Can Tensor Cores Benefit Memory-Bound Kernels? (NO!)

These are my study notes on Can Tensor Cores Benefit Memory-Bound Kernels? (NO!) https://dl.acm.org/doi/pdf/10.1145/3725798.3725803

This paper makes a somewhat surprising claim: tensor cores bring little benefit to memory-bound kernels/operators! The authors back this up with an elegant theoretical analysis plus experimental validation. Understanding this paper is a great way to appreciate how useful the roofline model is.

GPU Arch

image.png

Basic knowledge

  1. Machine Balance
    Defined as the hardware's peak compute performance P divided by its memory bandwidth B:
    β = P / B

  2. Roofline Model
    Operational intensity I is defined as a program's computational work W divided by its memory traffic Q:
    I = W / Q

Combining the two, the model calculates attainable performance P_attain as:

P_attain = min(P, B × I)

This formula exposes where a program's performance bottleneck lies. While operational intensity (I) characterizes a kernel's computational density, machine balance (β) represents the hardware's ratio of computational capability to memory bandwidth.

A kernel is compute bound if I > β
A kernel is memory bound if I < β
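
To make these definitions concrete, here is a minimal sketch of the roofline computation (plain C++ host code; the A100 FP64 figures of 9.7 TFLOP/s on CUDA cores and 1555 GB/s of HBM2 bandwidth are my own example inputs, not values taken from the paper):

```cpp
#include <algorithm>
#include <cstdio>

// Roofline model: attainable performance is capped either by peak
// compute P or by bandwidth times operational intensity, B * I.
double attainable(double P, double B, double I) {
    return std::min(P, B * I);
}

int main() {
    double P = 9.7e12;    // peak FP64 compute on CUDA cores, FLOP/s (A100)
    double B = 1.555e12;  // memory bandwidth, byte/s (A100 HBM2)
    double beta = P / B;  // machine balance, FLOP/byte (~6.2)

    double I = 5.0 / 8.0; // e.g. the 2d5pt stencil in double precision
    printf("machine balance beta = %.2f FLOP/byte\n", beta);
    printf("attainable perf      = %.3f TFLOP/s\n", attainable(P, B, I) / 1e12);
    printf("kernel is %s bound\n", I < beta ? "memory" : "compute");
    return 0;
}
```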

Figure 2 below shows the rooflines of different hardware, together with where workloads such as SpMV, GEMV, and stencil land on them.
image.png

Workloads

  1. Scale: multiply a matrix/vector by a scalar

B = k A

Each element requires one load, one store, and one multiply.
W_scale = 1, Q_scale = 2 × D (D is the element size in bytes; for double, D = 8)
In double precision, I_scale = W/Q = 1/16, which is why scale is commonly used to measure memory bandwidth.
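
As a reference point, here is a minimal CUDA sketch of the scale kernel (my own illustration, not code from the paper): one load, one store, and one multiply per element, matching I_scale = 1/16 in double precision.

```cuda
// B = k * A elementwise: one 8-byte load, one 8-byte store, and one
// multiply per element -> I = 1 / 16 FLOP/byte in FP64.
__global__ void scale(const double* __restrict__ A,
                      double* __restrict__ B,
                      double k, size_t n) {
    // Grid-stride loop so any launch configuration covers all n elements.
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += (size_t)gridDim.x * blockDim.x) {
        B[i] = k * A[i];
    }
}
```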

  2. SpMV
    Matrix-vector multiplication:
    y = Ax, where A ∈ R^(m×n), x is a vector of length n, and y is a vector of length m

For dense matrix-vector multiplication (GEMV):
W_GEMV = m × n × 2
Q_GEMV = (m × n + m + n) × D
I_GEMV = W / Q ≈ 2 / D = 1/4 for double precision
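
A minimal dense GEMV kernel in CUDA (an illustrative sketch, not cuBLAS): one thread per row of a row-major A, which makes the W and Q counts above easy to see.

```cuda
// y = A * x for dense, row-major A (m x n). One thread per row.
// W = 2*m*n FLOPs; Q ~= (m*n + m + n) * D bytes -> I ~= 2/D.
__global__ void gemv(int m, int n,
                     const double* __restrict__ A,
                     const double* __restrict__ x,
                     double* __restrict__ y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m) {
        double sum = 0.0;
        for (int col = 0; col < n; ++col)
            sum += A[row * n + col] * x[col];  // 1 mul + 1 add per element
        y[row] = sum;
    }
}
```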

For sparse matrix-vector multiplication (SpMV), the matrix A can be compressed with the CSR format:

image.png

As the figure shows, under CSR each nonzero costs a D-byte value plus a 4-byte column index, so the operational intensity drops to 2/(D+4) = 1/6 for double precision.
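
A minimal scalar CSR SpMV in CUDA (again my own sketch, not DASP or cuSPARSE): per nonzero it loads an 8-byte value plus a 4-byte column index for 2 FLOPs, which is exactly the 2/(D+4) = 1/6 intensity above.

```cuda
// y = A * x with A stored in CSR. One thread per row (scalar CSR kernel).
// Per nonzero: load val (8 B) + col index (4 B), 1 mul + 1 add
// -> I = 2 / (8 + 4) = 1/6 FLOP/byte in FP64.
__global__ void spmv_csr(int m,
                         const int* __restrict__ row_ptr,
                         const int* __restrict__ col_idx,
                         const double* __restrict__ val,
                         const double* __restrict__ x,
                         double* __restrict__ y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];
        y[row] = sum;
    }
}
```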

  3. Stencil (grid computations)
    2D stencil:
    v(i, j) = Σ_(p,q) w_(p,q) × u(i+p, j+q)

Per grid point: Q = 2 × D (one load and one store, assuming neighbor loads are served by the cache), W = 2 × S, so I = S/D

For 2d5pt, S = 5, so I = 5/8 in double precision (~10 FLOPs against 16 bytes of traffic).
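
And a plain CUDA sketch of the 2d5pt stencil (illustrative only): assuming the cache captures neighbor reuse, each interior point costs one load of u and one store of v (Q = 2D) for roughly 2S FLOPs.

```cuda
// 2D 5-point stencil on an ny x nx grid (row-major):
// v(i,j) = wc*u(i,j) + wn*u(i-1,j) + ws*u(i+1,j) + ww*u(i,j-1) + we*u(i,j+1)
// With neighbor loads hitting in cache, traffic is ~2*D bytes per point
// for 5 multiplies + 4 adds -> I = S/D = 5/8 in FP64.
__global__ void stencil_2d5pt(const double* __restrict__ u,
                              double* __restrict__ v, int nx, int ny,
                              double wc, double wn, double ws,
                              double ww, double we) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    if (i >= 1 && i < ny - 1 && j >= 1 && j < nx - 1) {
        v[i * nx + j] = wc * u[i * nx + j]
                      + wn * u[(i - 1) * nx + j]
                      + ws * u[(i + 1) * nx + j]
                      + ww * u[i * nx + j - 1]
                      + we * u[i * nx + j + 1];
    }
}
```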

Theoretical Analysis

Time for compute:

T_cmp = W / P

Time for memory:

T_mem = Q / B

So

T_mem / T_cmp = (Q/B) / (W/P) = β / I

For a memory-bound kernel (I < β):
T_mem > T_cmp. For instance, with β ≈ 6.2 FLOP/byte (A100 FP64 CUDA cores) and the 2d5pt stencil's I = 5/8, memory time exceeds compute time by roughly 10×.
image.png

So we can compare the fully overlapped case against the non-overlapped case.

Fully overlapped

When memory transfers and computation fully overlap:

T = max(T_cmp, T_mem, T_others)

Since for a memory-bound kernel T_mem > T_cmp,

T = max(T_mem, T_others)

so accelerating the computation is useless here.

Non-overlapped

Even with no overlap at all, the speedup from tensor cores is small.

Calculating the speedup:

image.png

I personally think there is a subtle issue here: once you switch to tensor cores, the machine balance itself changes. The derivation substitutes T_mem via T_mem / T_cmp = (Q/B) / (W/P) = β/I, so the β here should be the CUDA-core balance. (On an A100, for example, the FP64 balance roughly doubles, from ~6.2 to ~12.5 FLOP/byte, if you count the tensor-core peak instead.)

And it seems this applies not just to tensor cores: the same formula bounds any compute-side speedup for a memory-bound kernel?!

So the speedup is very likely below 2×: even if tensor cores made compute free, the non-overlapped speedup would be (T_mem + T_cmp) / T_mem = 1 + I/β < 2, since I < β for a memory-bound kernel.

Our analysis covers the two extremes of memory-computation overlap. Real-world kernels typically exhibit partial overlap, resulting in speedups between 1× and 1.33× for double precision. Performance differences beyond that would require memory access optimizations, which, we argue, function equally when applied to tensor and CUDA cores since both access data through the register file (As Figure 1 shows).
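
To see where the 1.33× figure comes from, here is a small worked computation (a sketch under my own assumption that FP64 tensor cores have 2× the CUDA-core FP64 peak, as on the A100: 19.5 vs 9.7 TFLOP/s, so switching to tensor cores at best halves T_cmp):

```cpp
#include <cstdio>

int main() {
    // Non-overlapped execution: T = T_mem + T_cmp.
    // Assumption (mine): FP64 tensor cores double the CUDA-core peak,
    // so the tensor-core version runs with T_cmp / 2.
    const double t_mem = 1.0;
    const double ratios[] = {0.1, 0.5, 1.0};  // r = T_cmp / T_mem = I / beta
    for (double r : ratios) {
        double t_cmp = r * t_mem;
        double speedup = (t_mem + t_cmp) / (t_mem + t_cmp / 2.0);
        printf("I/beta = %.1f -> speedup = %.2fx\n", r, speedup);
    }
    // At the memory-bound boundary (r -> 1): speedup -> 2 / 1.5 = 1.33x.
    return 0;
}
```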

Experiments

So are tensor cores really a poor fit for memory-bound kernels?

Unfortunately, yes.

image.png

SCALE: Figure 6 reveals consistent, though modest, performance degradation when using tensor cores compared to CUDA cores. Given that the computational time difference is negligible, this performance gap likely arises from suboptimal memory access patterns associated with tensor core usage on current GPU architectures.

SpMV: Figure 7 demonstrates that for datasets exceeding the L2 cache size, cuSPARSE (CUDA core) outperforms DASP (tensor core) on average.

Stencils: Figure 8 shows that equivalently optimized tensor core implementations generally underperform their CUDA core counterparts.

Summary: While our goal was to verify the theoretical bounds, empirical evaluation reveals that tensor core implementations usually underperform their CUDA core counterparts.

image.png

Conclusion

Through systematic theoretical and empirical analysis, we demonstrate that leveraging the tensor cores for computation in memory-bound kernels fails to deliver sound performance benefits. Our theoretical analysis establishes an upper bound of 1.33× speedup for double-precision memory-bound kernels, while empirical results across SCALE, SpMV, and stencil show that tensor core implementations usually underperform their CUDA core counterparts. While these findings may temper expectations for using tensor cores in memory-bound kernels, efforts leveraging tensor cores still provide valuable insights for the design and utilization of matrix processing units in broader contexts.