Paper Reading - Haibin's blog

ISCA25 Neoscope: How Resilient Is My SoC to Workload Churn?

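How should future hardware cope with constantly evolving software? https://dl.acm.org/doi/pdf/10.1145/3695053.3731014 This post covers the ISCA 2025 paper "Neoscope: How Resilient Is My SoC to Workload Churn?", which sets out to answer a very systems- and architecture-oriented question: when software and workloads keep evolving (churn), an SoC design, over its entire life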

Paper Reading
Haibin
2026-02-01
94 Views
0 Comments

ATC25 Colocating ML Inference and Training with Fast GPU Memory Handover

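Today yf presents an ATC25 paper from IPADS: Colocating ML Inference and Training with Fast GPU Memory Handover. Quick take: once again a large engineering effort of the kind IPADS is known for, spanning TVM + vLLM + NCCL + PyTorch. Everyone asked a lot of questions at the group meeting. https://ipads.se.sjtu.edu.cn/_media/publications/si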

Paper Reading
Haibin
2026-01-15
173 Views
0 Comments

STOC81 I/O Complexity: The Red-Blue Pebble Game

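STOC81 I/O Complexity: The Red-Blue Pebble Game. This is a theoretical computer science paper, but it poses a very interesting question: just as we have time complexity, can we define an I/O complexity that measures the minimum number of I/O operations a program must perform? Paper link: https://www.eecs.harvard.edu/~htk/publication/1981-stoc-hong-kung.pdf Com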

Paper Reading
Haibin
2026-01-09
187 Views
0 Comments

In-depth analysis: RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference

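I used to read papers with an LLM, but later found that in the same 20 minutes I actually learn less that way than by reading carefully myself and asking for help on the key questions. Can KVCache make use of RAG techniques? The idea of this paper is: can we "build KVCache as a Vector Storage System." In long-context settings, the KVCache often exceeds GPU memory, so the excess has to be stored in CPU memory. That is slow (CPU-GPU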

Paper Reading
Haibin
2026-01-08
302 Views
0 Comments

You and your research | Richard W. Hamming

You and Your Research https://gwern.net/doc/science/1986-hamming Great work is something else than mere brains. Brains are measured in various ways. In mathematics, theoretical physics, astrophysics, typically brain

Paper Reading
Haibin
2025-12-30
151 Views
0 Comments

I Fix PMUs on the CPU: Can We Trust Profiling Results?

Can We Trust Profiling Results? Understanding and Fixing the Inaccuracy in Modern Profilers https://par.nsf.gov/servlets/purl/10122098 After last reading the blog post "Where Do Interrupts Happen?" (my Chinese write-up: https://www.haibi

Paper Reading
Haibin
2025-11-11
392 Views
0 Comments

AI Compiler Group Meeting

A 109-page slide deck, from TVM to Mirage: a 90-minute AI Compiler 101 introduction. Slides and videos: https://drive.google.com/drive/folders/1eKcHZKMpix31EcioiNCf16AzLIHkvGyy?usp=sharing

Paper Reading
Haibin
2025-11-11
301 Views
0 Comments

Can Tensor Cores Benefit Memory-Bound Kernels? (NO!)

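This post studies Can Tensor Cores Benefit Memory-Bound Kernels? (NO!) https://dl.acm.org/doi/pdf/10.1145/3725798.3725803 The paper makes a somewhat surprising claim: Tensor Cores are not very effective on memory-bound kernels/operators! It supports this with strong theoretical analysis plus experimental validation. Understanding this paper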

Paper Reading
Haibin
2025-11-02
302 Views
0 Comments

NSDI26: Can we use MLFQ in LLM Serving?

This paper was on arXiv for 2 years before it was accepted to NSDI26. Maybe we can see the differences between the 2023 and 2026 versions. Paper link: https://arxiv.org/pdf/2305.05920 Main idea: Can we use MLFQ

Paper Reading
Haibin
2025-10-21
718 Views
0 Comments

GridFTP: SC25 Test of Time Award

How do you move massive data from server to client? How do you let multiple users around the world use the compute machine? This technology was invented not in cloud computing, but in grid computing. And th

Paper Reading
Haibin
2025-10-10
371 Views
0 Comments

EuroSys24 Orion – GPU Kernel Scheduling for ML Inference

Paper: Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications. GitHub: eth-easl/orion: An interference-aware scheduler for fine-grained GPU sharing. Abstract: GPUs are critical for maximiz

Paper Reading
Haibin
2025-10-10
747 Views
0 Comments

FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training

FALCON: Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training Fail-slows, or stragglers, are common but largely unheeded problems in large-scale hybrid-parallel training that

Paper Reading
Haibin
2025-09-30
507 Views
0 Comments

ICML25 RocketKV – KV Cache Compression

kaixin li. GitHub repo: NVlabs/RocketKV: RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression. To learn about LLM KV cache compression: October2001/Awesome-KV-Cache-

Paper Reading
Haibin
2025-09-16
1301 Views
0 Comments

ICPP24 System Memory Management in the Grace Hopper GPU

Paper link: Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper. "The Impact of the NVIDIA Grace Hopper and NVLink Fusion Architectures on Heterogeneous Parallel Computing Optimization", an article by William on Zhihu: https://zhuanlan.zhihu.com/p/1911971133923

Paper Reading
Haibin
2025-08-30
822 Views
0 Comments

NSDI23 Transparent GPU Sharing in Container Clouds for Deep Learning Workloads

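This paper presents TGS (Transparent GPU Sharing), a system that provides transparent GPU sharing at the OS layer for deep learning (DL) training workloads in container clouds, in order to improve GPU utilization and reduce job completion time. Link: https://www.usenix.org/conference/nsdi23/presentation/wu 1. Background and motivation. Container clouds and DL training: containers (such as Docker) in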

Paper Reading
Haibin
2025-08-29
360 Views
0 Comments

ATC24 Power-aware Deep Learning Model Serving with μ-Serve

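Power-aware Deep Learning Model Serving with μ-Serve. This is a paper published at USENIX ATC '24 in 2024, titled "Power-aware Deep Learning Model Serving with μ-Serve", by authors from the University of Illinois Urbana-Champaign and IBM Research. The paper focuses on, in deep learning (DL) model serving (i.e., inference), the power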

Paper Reading
Haibin
2025-08-26
346 Views
0 Comments

OSDI25 PipeThreader

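PipeThreader: Software-Defined Pipelining for Efficient DNN Execution. AlpaServe: a brief summary. Background: deep learning models keep getting larger, and the memory of a single GPU is no longer enough. Online serving of multiple models must guarantee low latency and high throughput, but request volume sometimes spikes suddenly, and traditional approaches are inefficient. Core ideas. Model parallelism: split one model into several parts and place them on multiple GPUs. Statistical multiplexing: when one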

Paper Reading
Haibin
2025-08-21
412 Views
0 Comments

OSDI25 XSched

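Scheduling for XPUs: implementing preemptive scheduling on XPUs. Preemptive scheduling on CPUs: preemptive scheduling is an operating-system scheduling policy whose core idea is that when a higher-priority or more urgent task needs to run, the operating system can immediately interrupt the currently running task and "grab" the CPU for that higher-priority task. XPU: FPGA, NPU, GPU. Many tasks now run on XPUs, but there seems to be no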

Paper Reading
Haibin
2025-08-12
420 Views
0 Comments

SIGCOMM07 How to read a paper

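How to read a paper | ACM SIGCOMM Computer Communication Review. I did not expect such a remarkable paper to actually exist: a paper about how to read papers. Great, I will use your method to read your paper. Professor S. Keshav wrote this paper to share his years of experience reading papers, namely the "three-pass" method. Its key idea is that when you pick up a paper, you should not read it straight through from start to finish, but instead read it in three passes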

Paper Reading
Haibin
2025-08-06
535 Views
1 Comment

July 2025 Paper: Attention on Hardware

Link: SystolicAttention: Fusing FlashAttention within a Single Systolic Array. This paper proposes FSA (Full Systolic Attention), a new architecture for accelerating FlashAttention in Transformer models. It targets existing systolic-array-based accelerators which, when executing FlashAtten

Paper Reading
Haibin
2025-08-06
429 Views
0 Comments