<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0"><channel><title>RaspCat Home</title><link>https://b1ankcat.github.io</link><description>live for life</description><copyright>RaspCat Home</copyright><docs>http://www.rssboard.org/rss-specification</docs><generator>python-feedgen</generator><image><url>https://github.githubassets.com/favicons/favicon.svg</url><title>avatar</title><link>https://b1ankcat.github.io</link></image><lastBuildDate>Sun, 10 May 2026 16:41:01 +0000</lastBuildDate><managingEditor>RaspCat Home</managingEditor><ttl>60</ttl><webMaster>RaspCat Home</webMaster><item><title>Implementing Flash Attention v1 from Scratch in CUDA</title><link>https://b1ankcat.github.io/post/CUDA%20-cong-ling-shi-xian-%20Flash%20Attention%20v1.html</link><description>TODO.</description><guid isPermaLink="true">https://b1ankcat.github.io/post/CUDA%20-cong-ling-shi-xian-%20Flash%20Attention%20v1.html</guid><pubDate>Sun, 10 May 2026 10:50:30 +0000</pubDate></item><item><title>A Deep Dive into Matrix Multiplication Optimization on GPU Architectures</title><link>https://b1ankcat.github.io/post/GPU-jia-gou-xia-de-ju-zhen-cheng-fa-shen-ru-you-hua.html</link><description>### Matrix setup

1. Measured on an NVIDIA V100S-PCIE-32GB
2. Row-major storage
3. Shape: 4096 x 4096
4. Total size 128 MB (64 MB each for A and B), far larger than shared memory
5. A is an m x k matrix, B is k x n, and the product C is m x n
6. Peak throughput: 15.7 TFLOPS for FP32, 125 TFLOPS for FP16

---
### 0. Naive matmul kernel implementation

```cu
constexpr int BLOCK_SIZE = 32;

__global__ void NaiveGemmKernel(const float* A, const float* B, float* C, int M, int N, int K) {
    // One thread per output element: x indexes the row (m), y the column (n)
    uint32_t row = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t col = blockIdx.y * blockDim.y + threadIdx.y;

    if (row &lt; M &amp;&amp; col &lt; N) {
        float tmp = 0.0f;
        for (uint32_t i = 0; i &lt; K; i++) {
            tmp += A[row * K + i] * B[i * N + col];
        }

        C[row * N + col] = tmp;
    }
}

void LaunchNaiveGemm(const float* A, const float* B, float* C, int M, int N, int K, cudaStream_t stream) {
    const dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    // Ceiling division so edge tiles that only partially cover the matrix still get a block
    const dim3 grid((M + BLOCK_SIZE - 1) / BLOCK_SIZE, (N + BLOCK_SIZE - 1) / BLOCK_SIZE);
    NaiveGemmKernel&lt;&lt;&lt;grid, block, 0, stream&gt;&gt;&gt;(A, B, C, M, N, K);
    CUDA_CHECK(cudaPeekAtLastError());
}

```

&amp;emsp;&amp;emsp;Note that here row m is mapped to the block/thread x dimension and column n to the y dimension, so the kernel parallelizes over m and n and each thread only has to loop over k. Also, x is the fastest-varying component of threadIdx: y increments only after x has run through blockDim.x - 1, so for this 2D parallelization the threads are ordered (0,0),(1,0),...,(m,0),(0,1),(1,1),...,(m,1),....</description><guid isPermaLink="true">https://b1ankcat.github.io/post/GPU-jia-gou-xia-de-ju-zhen-cheng-fa-shen-ru-you-hua.html</guid><pubDate>Thu, 16 Apr 2026 11:55:45 +0000</pubDate></item><item><title>A Deep Dive into Matrix Multiplication Optimization on CPU Architectures</title><link>https://b1ankcat.github.io/post/CPU-jia-gou-xia-de-ju-zhen-cheng-fa-shen-ru-you-hua.html</link><description>### **This post is a write-up of the CppCon talk "Matrix Multiplication Deep Dive", along with some of my own commentary**

---
### Matrix setup

1. std::vector&lt;double&gt;, aligned to the 64-byte CPU cache line
2. Row-major storage
3. Shape: 2880 x 2880
4. Each matrix is 63 MB, larger than the L3 cache (6 MB)
5. The peak-FLOPS formula:

$$
\text{FLOPS} = \text{cores} \times \frac{\text{cycles}}{\text{seconds}} \times \frac{\text{FLOPs}}{\text{cycles}}
$$

---
### 0. Naive C++ implementation

```c++
void matmul_naive(const Matrix&lt;double&gt;&amp; A, const Matrix&lt;double&gt;&amp; B, Matrix&lt;double&gt;&amp; C)
{
    auto M = A.row();
    auto K = A.col();
    auto N = B.col();
    for (int i = 0; i &lt; M; ++i){
        for (int j = 0; j &lt; N; ++j){
            for (int k = 0; k &lt; K; ++k){
                C(i, j) += A(i, k) * B(k, j);  // innermost loop: a row of A times a column of B yields one element of C
            }
        }
    }
}
```

&lt;img width='897' height='236' alt='Image' src='https://github.com/user-attachments/assets/9040bb6f-5244-4755-aa70-d9b6be699953' /&gt;

---
### 1. 改变循环顺序

```c++
void matmul_change_order(const Matrix&lt;double&gt;&amp; A, const Matrix&lt;double&gt;&amp; B, Matrix&lt;double&gt;&amp; C)
{
    auto M = A.row();
    auto K = A.col();
    auto N = B.col();
    for (int i = 0; i &lt; M; ++i){
        for (int k = 0; k &lt; K; ++k){
            for (int j = 0; j &lt; N; ++j){
                C(i, j) += A(i, k) * B(k, j);  // innermost loop: one element of A times a row of B accumulates into a row of C
            }
        }
    }
}
```

In the **i-j-k** order, the innermost loop's accesses to **A(i, k)** are contiguous, but each successive access to **B(k, j)** is strided by a full row of **N** elements; the **i-k-j** order makes the innermost accesses to **B(k, j)** and **C(i, j)** contiguous as well.</description><guid isPermaLink="true">https://b1ankcat.github.io/post/CPU-jia-gou-xia-de-ju-zhen-cheng-fa-shen-ru-you-hua.html</guid><pubDate>Sun, 29 Mar 2026 02:12:49 +0000</pubDate></item></channel></rss>