LLM - Haibin's blog

GQA, MHA, MQA, MLA

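Su Jianlin's blog and YouTube offer better introductions. GQA (Grouped Query Attention) is an optimized variant of the attention mechanism, used mainly to improve the computational and memory efficiency of large language models (LLMs) while preserving model quality as far as possible. It serves as an optimization of Multi-Head Attention (MHA), particularly in Transformer models. Below, I will use plain, easy-to-follow language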

Machine Learning
Lai, Haibin
2025-08-30
375 views
0 comments

FAST25 Mooncake Group Meeting

Group meeting recording: "Group Meeting: FAST25-Mooncake Discussion" https://www.bilibili.com/video/BV1ZkgUz5E5n/?share_source=copy_web&vd_source=72eac555730ba7e7a64f9fa1d7f2b2d4 Study notes: "[RG 25 Spring] Mooncake" https://www.bilibili.c

Paper Reading
Lai, Haibin
2025-08-02
127 views
0 comments

SC25 gLLM

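gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling. Several parallelism methods try to eliminate pipeline bubbles. LLM inference currently suffers from two kinds of imbalance. Inter-stage imbalance: inter-stage dependency, where a stage cannot begin comput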

Paper Reading
Lai, Haibin
2025-07-23
97 views
0 comments

MoE-Sys Article Notes

MoE Survey: withinmiaov/A-Survey-on-Mixture-of-Experts-in-LLMs: The official GitHub page for the survey paper "A Survey on Mixture of Experts in Large Language Models". Understanding Mixture of Experts in One Article

High-Performance Computing
Lai, Haibin
2025-05-20
642 views
0 comments

LLM Parameter Estimation

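Deriving parameter-count estimates for large models. 1. Why estimate the parameter count? Large models (such as BERT, GPT, and LLaMA) typically have parameter counts ranging from hundreds of millions to trillions. Estimating the parameter count helps with: Hardware requirement assessment: the parameter count drives memory and compute needs. Model scale comparison: the parameter count reflects complexity and potential capability. Design optimization: when resources are limited, the architecture can be adjusted to balance performance and efficiency. The parameter count is determined by the model's components (layers, weight matrices, biases, and so on); the derivation below uses the Transformer architecture as an example. 2. Tran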

High-Performance Computing
Lai, Haibin
2025-05-12
214 views
0 comments

LLM on CPU: Walkthrough of the Inference Flow in the Python Source Code

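Analyses of other frameworks. vLLM framework analysis: "Source Code Analysis of the High-Speed LLM Inference Framework vLLM / vLLM Source Code Analysis" - Zhihu; "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention" | vLLM Blog. llama.cpp: "llama.cpp Source Code Walkthrough: Inference Flow Overview" - Zhihu; "Complete beginner's tutorial: using llama.cpp locally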

Framework Analysis
Lai, Haibin
2025-04-18
819 views
0 comments