What I am most curious about is how this extreme parallelism is actually done. The technical report *Serving Large Language Models on Huawei CloudMatrix384* runs inference for DeepSeek R1 671B on a single machine with 384 nodes, using three optimizations. Optimization 1: a peer-to-peer architecture that disaggregates LLM inference into prefill, decode, and caching. Optimization 2: large-scale expert parallelism.
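To make optimization 1 concrete, here is a minimal sketch of disaggregated serving, assuming a toy in-memory KV-cache tier; the names KVCacheStore, PrefillWorker, and DecodeWorker are mine, and nothing here reflects the actual CloudMatrix-Infer implementation. The point is only that prefill (compute-bound, whole prompt at once) and decode (memory-bound, token by token) become separate roles that hand state off through the caching tier instead of living in one engine.

```python
# Minimal sketch of disaggregated LLM serving: a prefill role, a decode role,
# and a shared KV-cache tier that both roles reach directly (the peer-to-peer idea).
# All names here are illustrative, not taken from the CloudMatrix-Infer system.
from dataclasses import dataclass, field


@dataclass
class KVCacheStore:
    """Stand-in for the shared caching tier; real systems use pooled memory / RDMA."""
    blocks: dict[str, list[int]] = field(default_factory=dict)

    def put(self, request_id: str, kv: list[int]) -> None:
        self.blocks[request_id] = kv

    def get(self, request_id: str) -> list[int]:
        return self.blocks[request_id]


class PrefillWorker:
    """Runs the compute-bound pass over the whole prompt and publishes the KV cache."""

    def __init__(self, store: KVCacheStore):
        self.store = store

    def prefill(self, request_id: str, prompt_tokens: list[int]) -> None:
        kv_cache = list(prompt_tokens)        # placeholder for real attention KV states
        self.store.put(request_id, kv_cache)  # hand off via the cache tier, not via decode


class DecodeWorker:
    """Runs the memory-bound, token-by-token generation against the cached state."""

    def __init__(self, store: KVCacheStore):
        self.store = store

    def decode(self, request_id: str, max_new_tokens: int) -> list[int]:
        kv_cache = self.store.get(request_id)
        generated: list[int] = []
        for _ in range(max_new_tokens):
            next_token = (sum(kv_cache) + len(generated)) % 100  # dummy "model" step
            generated.append(next_token)
            kv_cache.append(next_token)       # decode keeps extending the same KV cache
        return generated


if __name__ == "__main__":
    store = KVCacheStore()
    PrefillWorker(store).prefill("req-1", [11, 42, 7])
    print(DecodeWorker(store).decode("req-1", max_new_tokens=4))
```

The design-choice payoff is that each role can be scaled and batched on its own terms, since the two phases have very different compute and memory profiles.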
Will inference engines become the operating system of a new era? RG-1210 PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, 2406.06282.【RG 24 Fall】PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
This keynote comes from Fail at Scale: Reliability in the face of rapid change (Queue, Vol. 13, No. 8). One of Facebook's cultural values is embracing failure.
Scalability! But at what COST? Paper overview: hotos15-paper-mcsherry.pdf. This paper raises an important question: in graph computing, we have to ask whether being scalable really makes a system effective. Even when the algorithmic logic (e.g., PageRank's iterative formula) looks identical, the way a distributed system implements it (communication, synchronization, data partitioning, language overhead) introduces a large amount of extra work, so performance can end up below that of a single-threaded implementation.
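For contrast, this is roughly what the "competent single-threaded implementation" side of the COST argument looks like: a plain PageRank loop over an edge list. The paper's actual baselines are written in Rust and use compact graph encodings; the Python sketch below only shows how little machinery is involved before any distribution cost is paid.

```python
# Minimal single-threaded PageRank over an edge list, in the spirit of the
# paper's single-core baselines (theirs are in Rust; this only shows the shape).

def pagerank(edges: list[tuple[int, int]], num_nodes: int,
             iters: int = 20, damping: float = 0.85) -> list[float]:
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iters):
        contrib = [0.0] * num_nodes
        for src, dst in edges:                 # one sequential sweep over the edges
            contrib[dst] += rank[src] / out_degree[src]
        rank = [(1 - damping) / num_nodes + damping * c for c in contrib]
    return rank


if __name__ == "__main__":
    # Tiny example graph: 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1
    print(pagerank([(0, 1), (1, 2), (2, 0), (2, 1)], num_nodes=3))
```

The COST of a scalable system is then the number of cores it needs before it beats this kind of baseline; the paper reports that for several published graph systems this number is in the hundreds of cores, or effectively unbounded.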
Index link: https://www.haibinlaiblog.top/index.php/sc-2024-passage/ Parallel Program Analysis and Code Optimization. MCFuser: High-performance and Rapid-fusion of Memory-bound Compute-intensive Operators
RisGraph: A Real-Time Streaming System for Evolving Graphs to Support Sub-millisecond Per-update Analysis at Millions Ops/s. The goal is both low latency and high throughput. Batching can deliver high throughput, but much of the per-update information disappears and the real-time requirement is not met.
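As a toy illustration of the per-update model (not RisGraph's algorithms or data structures), the sketch below maintains connected components under edge insertions with a union-find and answers, after every single update, whether the result changed. Batching the same edges would raise throughput but discard exactly these per-update answers.

```python
# Minimal sketch of per-update (streaming) analysis on an evolving graph:
# incremental connected components via union-find, updated one edge at a time.
# Insertion-only, and purely illustrative of the per-update model.

class IncrementalConnectivity:
    def __init__(self, num_nodes: int):
        self.parent = list(range(num_nodes))

    def _find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add_edge(self, u: int, v: int) -> bool:
        """Apply one update and immediately report whether the result changed."""
        ru, rv = self._find(u), self._find(v)
        if ru == rv:
            return False          # connectivity unchanged by this update
        self.parent[ru] = rv
        return True               # two components merged by this single update


if __name__ == "__main__":
    cc = IncrementalConnectivity(num_nodes=5)
    for edge in [(0, 1), (1, 2), (3, 4), (2, 3)]:
        print(f"edge {edge} -> result changed: {cc.add_edge(*edge)}")
```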
Index link: https://www.haibinlaiblog.top/index.php/sc-2024-passage/ ChatBLAS: The First AI-Generated and Portable BLAS Library, a BLAS library written with GPT.
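For scale, the routines in question are small, well-specified kernels. Below is a plain reference sketch of BLAS Level-1 saxpy (y := alpha*x + y) with the classic strided interface, written by hand here just to show the kind of contract an AI-generated BLAS has to match; it is not ChatBLAS output.

```python
# Reference sketch of the BLAS Level-1 saxpy routine (y := alpha * x + y),
# following the classic Fortran-style interface with strides incx/incy.
# Plain Python reference for illustration only, not code generated by ChatBLAS.

def saxpy(n: int, alpha: float, x: list[float], incx: int,
          y: list[float], incy: int) -> None:
    """Update y in place: y[i*incy] += alpha * x[i*incx]. Assumes positive strides."""
    ix, iy = 0, 0
    for _ in range(n):
        y[iy] += alpha * x[ix]
        ix += incx
        iy += incy


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    y = [10.0, 20.0, 30.0]
    saxpy(n=3, alpha=0.5, x=x, incx=1, y=y, incy=1)
    print(y)  # [10.5, 21.0, 31.5]
```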