ICPP25 Conference story: Day 1
Congrats to all accepted papers!
Welcome Ceremony
Jack Dongarra, "An Overview of High Performance Computing and Responsibly Reckless Algorithms"
I have heard Jack Dongarra's talk many times, but it is still nice to record the speech.
The talk was mainly about his work on BLAS over the past few decades. He also introduced some current projects on mixed-precision linear algebra.
OK, this is a story I hadn't heard before...... becoming a high school teacher, haha!
LINPACK: a very good library, which I worked with at the ASC student supercomputer challenge.
Best Paper: Efficient Construction of Large Search Spaces for Auto-Tuning
Background
Hardware evolves every 2-3 years, while HPC software evolves over ~20 years.
So we should move from code written for a specific architecture to more flexible, tunable code.
Hypertuner for HPC software:
- Search space, time
- Search configurations, e.g., GPU x/y axes
SOTA & Problem
You can see the SOTA frameworks in the following chart.
Most of them use brute force, or the ATF chain-of-trees approach.
Time and effectiveness are the major requirements for tuning tools.
But currently there are three major problem formulations for auto-tuning / search-space construction:
- SAT: use boolean logic
- SMT: Satisfiability Modulo Theories
- CSP: Constraint Satisfaction Problems
What this paper does:
- Build a good CSP solver.
- Build an auto-tuning framework that uses the CSP solver as its core to optimize search-space construction.
- Integrate with Kernel Tuner to do local search and reduce the search space.
Problem Formation
Constraint-based auto-tuning treats the search for optimal performance parameters (e.g., tile sizes, thread counts, memory layouts) as a Constraint Satisfaction Problem (CSP).
- Variables: tunable parameters (e.g., block size, unroll factor, thread binding).
- Domains: the allowed values for each variable.
- Constraints: rules that define legal or efficient configurations, such as:
  - Shared memory usage must not exceed hardware limits.
  - Thread counts ≤ number of available cores.
  - Tile sizes must divide matrix dimensions.
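As a concrete illustration of this formulation, here is a minimal sketch using the python-constraint package; the parameter names, domains, and rules below are my own examples, not the paper's actual solver.

```python
from constraint import Problem  # pip install python-constraint

# Toy CSP for a GPU tuning space: variables are tunable parameters,
# constraints prune illegal configurations before any benchmarking runs.
problem = Problem()
problem.addVariable("block_size_x", [16, 32, 64, 128])
problem.addVariable("block_size_y", [1, 2, 4, 8])
problem.addVariable("tile_size", [1, 2, 4])

# Hardware-limit rule (assume at most 256 threads per block for this toy case)
problem.addConstraint(lambda x, y: x * y <= 256, ("block_size_x", "block_size_y"))
# Divisibility rule: tile size must divide the x block dimension
problem.addConstraint(lambda x, t: x % t == 0, ("block_size_x", "tile_size"))

solutions = problem.getSolutions()  # the constructed (pruned) search space
print(len(solutions), "valid configurations out of", 4 * 4 * 3)
```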
Constraint parsing
Design
The parser rewrites constraints using ASTs (Abstract Syntax Trees)
Optimizations
- Optimized Backtracking algorithm with improved variable ordering
- Pipeline: constraint -> [ parser -> search space -> optimized algorithm ]
- Search space construction: performance on real-world cases
- Uses C/Python
Experiment
search space
Time
Effectiveness
Conclusion
We get orders-of-magnitude speedups
Kernel Tuner
Kernel Tuner also has a paper at eScience in Chicago
How to use:
pip install kernel_tuner
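A minimal usage example, adapted from Kernel Tuner's well-known vector-add demo (the exact parameter values here are illustrative):

```python
import numpy as np
from kernel_tuner import tune_kernel

# CUDA kernel whose launch configuration we want to tune
kernel_string = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
"""
size = 1_000_000
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(a)
args = [c, a, b, np.int32(size)]

tune_params = {"block_size_x": [32, 64, 128, 256, 512]}  # the search space
results, env = tune_kernel("vector_add", kernel_string, size, args, tune_params)
```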
Paper Session 4: System Software
Macaw
TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks
Author: Ziji Shi
From Alibaba Cloud
What we do: use approximately 30 minutes to find the best tensor-parallelism strategy for an LLM (the search runs on CPU)
We present TAPAS, an automatic parallelism framework that efficiently discovers tensor parallel plans for large neural networks.
Problem
Auto-tuning for tensor-parallel DNNs
GPU memory can't keep up with the blooming tensor sizes in DNNs:
Parameters: 2941x
GPU memory: 5x
Current solution:
Tensor parallelism
-> Communication suffers from growing overhead: Broadcast + AllGather
Related Works
How to split the model: Megatron-LM
Good for Transformers, bad for others
Alpa (OSDI '22)
Two-level optimization can discover nested TP-PP strategies
Search time is large for big models
Challenge
- Heterogeneous model architectures: UNet, uneven architectures, FC layers
- The search space for TP is exponential: the decision space is too large, with complexity $O(3^N)$; manual strategies are labor-intensive and stiff (see the sketch after this list)
- An operator's split also depends on its neighbors: in a DAG, changing one split affects the neighboring parts of the DAG
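To see why the space is $O(3^N)$, here is a toy enumeration assuming each of $N$ operators independently picks one of three strategies (the strategy names are illustrative, not TAPAS's IR):

```python
from itertools import product

# Toy illustration of the O(3^N) blow-up in tensor-parallel planning.
STRATEGIES = ("split_row", "split_col", "replicate")

def enumerate_plans(n_ops):
    """All strategy assignments for n_ops operators: 3**n_ops plans."""
    return list(product(STRATEGIES, repeat=n_ops))

for n in (2, 4, 8, 12):
    print(n, "operators ->", len(enumerate_plans(n)), "plans")
# 2 -> 9, 4 -> 81, 8 -> 6561, 12 -> 531441: brute force quickly explodes
```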
Main Design
Design:
Abstraction -> graph pruning -> split the tensor
Graph Pre-processor -> Strategy Explorer -> Graph Reconstructor
Abstraction
network -> DAG
Tensor parallelism = graph partitioning
-> Graph IR: constrained search space
Split, Replicate, Communicate
So we can split ...
Graph-pruning
Key: Subgraph
Observation:
In many models we see similar layers/structures.
Examples:
- Linear layer
- Self-attention
They share many similar patterns.
Use a frequent-subgraph-search algorithm to find representative subgraphs: find the frequent subgraphs, then mine them (toy sketch below).
Tensor-parallel split/validator:
make sure the plan works and runs; the validator makes sure communication/copies are correct
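A toy sketch of the pruning idea (not TAPAS's actual algorithm): if a pattern of layers repeats many times, solve the tensor-parallel strategy once for a representative occurrence and reuse it for the rest.

```python
from collections import Counter

# Illustrative layer sequence; in a real model this comes from the DAG.
layers = ["linear", "attention", "linear", "attention", "linear", "attention", "norm"]

def frequent_patterns(seq, k=2, min_count=2):
    """Count all length-k windows and keep those appearing >= min_count times."""
    windows = Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {p: c for p, c in windows.items() if c >= min_count}

print(frequent_patterns(layers))
# {('linear', 'attention'): 3, ('attention', 'linear'): 2}
# -> tune ('linear', 'attention') once, reuse the plan for all 3 occurrences
```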
Cost-model evaluation
11K+ LOC on TensorFlow
Paper Session 1: AI in Computing
Cockatoo
ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks
Pisces: Towards Adaptive and Fair Congestion Control via Multi-Agent Meta-Reinforcement Learning
Congestion Control
Existing CCAs:
Rule-based CCAs
- Simple
- But: poor adaptability, manual tuning
Learning-based CCAs (including imitation learning)
- Fairness not guaranteed
Hybrid methods like Orca
- May break Cubic fairness
- Not aware of the workload, so may not be effective
Motivation
Existing CCAs have limited generalization and adaptability
Design
A learning-based CCA for federated learning
Uses a POMDP (a partially observable Markov decision process)
Framework
Learner: a two-level optimizer (sketch below)
- Inner loop: MAPPO
- Outer loop: meta-learning with the Reptile algorithm
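A minimal sketch of the Reptile outer loop as I understand it; `inner_update` stands in for a few MAPPO steps, and `ToyEnv` with its `policy_gradient` method is a hypothetical task, not the paper's environment.

```python
import random
import numpy as np

class ToyEnv:
    """Hypothetical task: quadratic objective centered at `target`."""
    def __init__(self, target):
        self.target = np.asarray(target, dtype=float)

    def policy_gradient(self, phi):
        return self.target - phi  # ascend toward this task's optimum

def inner_update(theta, env, steps=5, lr=0.1):
    """Stand-in for the inner-loop optimizer (MAPPO in the paper)."""
    phi = theta.copy()
    for _ in range(steps):
        phi += lr * env.policy_gradient(phi)
    return phi

def reptile(theta, envs, meta_lr=0.1, iterations=500):
    """Reptile outer loop: nudge meta-parameters toward task-adapted ones."""
    for _ in range(iterations):
        env = random.choice(envs)         # sample a task (network condition)
        phi = inner_update(theta, env)    # adapt to that task
        theta += meta_lr * (phi - theta)  # Reptile meta-update
    return theta

envs = [ToyEnv([1.0, 0.0]), ToyEnv([-1.0, 0.0]), ToyEnv([0.0, 2.0])]
theta = reptile(np.zeros(2), envs)
print(theta)  # ends up near a point that adapts quickly to all three tasks
```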
P3P-Fed: Peer-to-Peer Personalized Federated Learning with DHT-based Local Clustering
Onsite Paper
Aheon Lim
Master's student at Korea Aerospace University
What is Personalized FL?
Federated Learning (FL) enables the training of a single global model in a distributed manner while preserving the privacy of local data. In real-world FL environments, however, characterized by high heterogeneity in client data distributions (non-IID), the global model suffers from performance degradation. To address this challenge, Personalized Federated Learning (PFL) has been proposed to provide personalized models either by clustering clients or by decoupling the model into shared and personalized components.
FL
Challenge
Design
- DHT (Distributed Hash Table): keys are hashed into the ID space, e.g., "cat" -> 5D (sketch after this list)
- P2P local clustering algorithm
- Push-pull strategy
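A minimal sketch of DHT-style key placement; the 8-bit ID space and the function names are my own simplification (the "cat -> 5D" mapping on the slide presumably comes from a specific hash).

```python
import hashlib

ID_BITS = 8  # tiny ID space for illustration; real DHTs use 128/160-bit IDs

def node_id(key: str) -> int:
    """Hash a key into the DHT identifier space (hash, then truncate)."""
    return hashlib.sha1(key.encode()).digest()[0] % (2 ** ID_BITS)

def responsible_node(key: str, nodes: list) -> int:
    """Consistent-hashing rule: first node ID clockwise from the key's ID."""
    kid = node_id(key)
    return min((n for n in sorted(nodes) if n >= kid), default=min(nodes))

print(f"cat -> {node_id('cat'):02X}")               # some two-hex-digit ID
print(hex(responsible_node("cat", [0x10, 0x5D, 0xA0])))
```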
Experiment
use MPS
Paper Session 6: Performance
Carbon-Aware Workflow Scheduling with Fixed Mapping and Deadline Constraint
Large data and computing centers consume a significant share of the world’s energy consumption. A prominent subset of the workloads in such centers are workflows with interdependent tasks, usually represented as directed acyclic graphs (DAGs). **To reduce the carbon emissions resulting from executing such workflows in centers with a mixed (renewable and non-renewable) energy supply, it is advisable to move task executions to time intervals with sufficient green energy when possible.
To this end, we formalize the above problem as a scheduling problem with a given mapping and ordering of the tasks. We show that this problem can be solved in polynomial time in the uniprocessor case. For at least two processors, however, the problem becomes NP-hard. Hence, we propose a heuristic framework called CaWoSched that combines several greedy approaches with local search.** To assess the 16 heuristics resulting from different combinations, we also devise a simple baseline algorithm and an exact ILP-based solution. Our experimental results show that our heuristics provide significant savings in carbon emissions compared to the baseline.
SOTA
a paper from 2002
Intro
CaWoSched with efficient heuristics combining various greedy approaches with local search
Extensive simulations
DP
Model
The computation can be represented as a DAG.
We want to schedule workflows under this model with green (carbon-aware) scheduling.
Here they showed an example DAG program: e.g., Job1 runs on Machine1 (black); then the job on the left (red) can run on Machine3 (red). When data moves from Machine1 to Machine3, we incur a communication step (the diamond).
Objective: Find a schedule that minimizes carbon cost while meeting the deadline.
Problem 1: 1 CPU, N ordered tasks
Polynomial complexity
Solved with DP (toy sketch below)
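A toy DP sketch for the uniprocessor case under my own simplifying assumptions (unit-length tasks, per-slot carbon intensities); the paper's actual algorithm handles the general setting.

```python
# N ordered unit-length tasks, T time slots before the deadline,
# carbon[t] = carbon intensity of slot t. Pick an increasing assignment
# of tasks to slots that minimizes total carbon cost.
INF = float("inf")

def min_carbon(n_tasks, carbon):
    T = len(carbon)
    # dp[i][t] = min cost to finish the first i tasks within the first t slots
    dp = [[INF] * (T + 1) for _ in range(n_tasks + 1)]
    dp[0] = [0.0] * (T + 1)
    for i in range(1, n_tasks + 1):
        for t in range(i, T + 1):
            # either slot t-1 stays idle, or task i runs in slot t-1
            dp[i][t] = min(dp[i][t - 1], dp[i - 1][t - 1] + carbon[t - 1])
    return dp[n_tasks][T]

print(min_carbon(3, [5.0, 1.0, 4.0, 1.0, 2.0]))  # -> 4.0 (slots 1, 3, 4)
```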
Problem 2: N CPUs, N ordered tasks
NP-hard
Reduction from 3-Partition
Algorithm for the multiprocessor case
- Assign scores to tasks: A. deadline; B. green power budget
- Greedily try to find the optimum with respect to carbon cost (if at some point the green power is low, the task will start later)
Score?
A. For the deadline:
Slack
Pressure
B. For the CPU:
CPU slides XXXX
Greedy
First put every job onto the CPU workers
Local search
Exploit the remaining flexibility in the schedule
Try to move some jobs to greener time slots (toy sketch below)
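A toy local-search pass in the same toy model as the DP above (my illustration, not CaWoSched itself): move a task to a greener slot whenever the move keeps the task order feasible.

```python
def local_search(assign, carbon, max_rounds=10):
    """assign: list of slot indices, one per task, strictly increasing."""
    T = len(carbon)
    for _ in range(max_rounds):
        improved = False
        for i, slot in enumerate(assign):
            lo = assign[i - 1] + 1 if i > 0 else 0                    # after predecessor
            hi = assign[i + 1] - 1 if i + 1 < len(assign) else T - 1  # before successor
            best = min(range(lo, hi + 1), key=lambda t: carbon[t])    # greenest feasible slot
            if carbon[best] < carbon[slot]:
                assign[i] = best
                improved = True
        if not improved:
            break
    return assign

print(local_search([0, 1, 2], [5.0, 1.0, 4.0, 1.0, 2.0]))  # -> [0, 1, 3]
```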
Setup & Exp
All in simulations. 1D 1.5D 2D 3D
including local search
The cost ratio is the best cost found divided by the algorithm's cost
They achieve a 40% decrease in CO2 emissions
HHOTuner: Efficient Performance Tuning with Harris Hawks Optimization
Iowa State University
NSF-funded
For tuning over large search spaces
Autotuning: code + architecture + tuning = best performance
They use an evolutionary algorithm called Harris Hawks Optimization
This is a very very cooooool figure
Harris Hawks algorithm idea (simplified sketch after the list)
Main Ideas
- Exploration phase:
- Hawks search the environment (global exploration of the solution space).
- Two strategies: random perching or scanning for potential prey.
- Exploitation phase:
- Once prey is spotted, hawks begin encircling and besieging it (local exploitation).
- Several strategies are used: “soft siege,” “hard siege,” or random surprise pounces.
- Energy parameter (E):
- Represents the prey’s escaping energy, which decreases over time.
- When $|E| \ge 1$, exploration dominates.
- When $|E| < 1$, exploitation dominates.
- Surprise pounce mechanism:
- Mimics the unpredictable behavior of prey escaping.
- Introduces randomness to help the algorithm escape local optima.
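A heavily simplified, illustrative sketch of these two phases (not HHOTuner's implementation; HHO proper distinguishes several siege variants):

```python
import numpy as np

def hho_minimize(f, dim, n_hawks=10, iters=200, lo=-5.0, hi=5.0, seed=0):
    rng = np.random.default_rng(seed)
    hawks = rng.uniform(lo, hi, (n_hawks, dim))
    rabbit = min(hawks, key=f).copy()                # best solution ("prey") so far
    for t in range(iters):
        for i in range(n_hawks):
            E = 2 * rng.uniform(-1, 1) * (1 - t / iters)  # escaping energy decays
            if abs(E) >= 1:                          # exploration: random perch
                j = rng.integers(n_hawks)
                hawks[i] = hawks[j] - rng.random() * np.abs(hawks[j] - 2 * rng.random() * hawks[i])
            else:                                    # exploitation: (soft) besiege
                hawks[i] = rabbit - E * np.abs(rabbit - hawks[i])
            hawks[i] = np.clip(hawks[i], lo, hi)
            if f(hawks[i]) < f(rabbit):
                rabbit = hawks[i].copy()
    return rabbit

best = hho_minimize(lambda x: float(np.sum(x ** 2)), dim=3)
print(best)  # converges near the origin for this toy objective
```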
SOTA Benchmark
Bliss (2021): auto-tuning complex applications using a pool of diverse lightweight learning models.
OpenTuner (2014), from MIT CSAIL.
One GPU, Many Ranks: Enabling Performance and Energy-Efficient In-Transit Visualization via Resource Sharing
ANL
TL;DR from Haibin: they use MIG/MPS/time-sharing on the GPU to accelerate an HPC workflow.
In-transit visualization has become essential in high-performance computing (HPC) to reduce I/O overheads and enable real-time data analysis. However, as simulations grow in scale and complexity, these visualization tasks increasingly demand substantial computational resources, exacerbating energy consumption and limiting system scalability. As we show in this paper, a key bottleneck is the conventional one-rank-per-GPU allocation model, which leads to irregular GPU utilization and waste of hardware resources.
To tackle this challenge, we propose using GPU-sharing strategies to improve energy efficiency in in-transit visualization without compromising performance. We evaluate six distinct configurations built upon three NVIDIA GPU-sharing mechanisms: the default CUDA model with context switching between processes, Multi-Process Service (MPS), which enables dynamic context sharing, and Multi-Instance GPU (MIG), which provides hardware-level partitioning. Using the WarpX simulation code and Ascent visualization framework, our experiments on the Polaris supercomputer span multiple rendering techniques, node counts, and data sizes.
Results show that GPU-sharing strategies can improve the trade-off between performance and energy, represented by the energy-delay product (EDP) metric, by up to 81.7%. We also show that workload-aware strategy selection is essential to improve performance-energy efficiency: MIG-based configurations are more effective for lightweight and regular workloads, offering up to 64.5% energy savings, while MPS better handles GPU-intensive workloads, achieving up to 71.1% EDP improvement. Finally, we demonstrate that optimized sharing strategies can reduce the required compute nodes by up to 75%, freeing system resources for concurrent workloads.
WarpX, HACC, Nyx
-> I/O bottleneck: how to minimize it?
The GPU is underutilized.
GPU sharing: time-slicing, MPS, MIG
How many nodes do you really need?
What's the impact?
Test bed
Methodology
Just testing
Scaled each config from 1 to 10 nodes
Scheduling on the GPU: energy + performance
MPS can oversubscribe the GPU, which may be why it performs well.