ICPP25 Conference story: Day 1

Conference

Congrats to all accepted papers!

Welcome Ceremony

ff6af971a89cf92835dab605bfa9e1de.jpg

Jack Dongarra, "An Overview of High Performance Computing and Responsibly Reckless Algorithms"

I have heard Jack Dongarra's talk many times, but it is still nice to record the speech.

The talk was mainly about his work on BLAS over the past decades, and he also introduced some current projects on mixed-precision linear algebra.

9888105a363fb9d50dd0b6c32494c0e1.jpg

OK, this is a story I hadn't heard before ...... he was going to be a high school teacher, haha!
c24bb4c40165cafdedb92b691f69be31.jpg

LINPACK: a very good library, which I worked with at the ASC student challenge.

985536973fa169c944afd832ea22b16b.jpg

Best Paper: Efficient Construction of Large Search Spaces for Auto-Tuning

Background
Hardware evolves every 2-3 years, while HPC software evolves over ~20 years.

So we should move from code written for a specific architecture to more flexible, tunable code.

b1190a0bed7598c2d4c742dd997e1b68.jpg

Hypertuner for HPC software:
search space, time
search configs, e.g., GPU thread-block x/y dimensions

SOTA & Problem
You can see the state-of-the-art frameworks in the following chart.

Most of them use brute force, or ATF's chain-of-trees.

We take time and effectiveness as the major requirements for tuning tools.


But currently there are three major problem formulations for auto-tuning / search-space construction:

  1. SAT: Boolean satisfiability
  2. SMT: Satisfiability Modulo Theories
  3. CSP: Constraint Satisfaction Problems

So what we do in this paper:

We build a good CSP solver.

We build an auto-tuning framework that uses the CSP solver as its core, optimizing search space construction.

We build on Kernel Tuner to do local search in order to reduce the search space.
835929c29c593627ec612e1e657b80f6.jpg


Problem Formation

Constraint-based auto-tuning treats the search for optimal performance parameters (e.g., tile sizes, thread counts, memory layouts) as a Constraint Satisfaction Problem (CSP).

  • Variables: tunable parameters (e.g., block size, unroll factor, thread binding).
  • Domains: the allowed values for each variable.
  • Constraints: rules that define legal or efficient configurations, such as:

Shared memory usage must not exceed hardware limits.
Thread counts ≤ number of available cores.
Tile sizes must divide matrix dimensions.
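
To make this concrete, here is a minimal sketch of such a CSP using the python-constraint package; the parameter names, domains, and limits are illustrative, not taken from the paper:

```python
from constraint import Problem

# Toy auto-tuning search space expressed as a CSP (illustrative values).
problem = Problem()
problem.addVariable("block_x", [1, 2, 4, 8, 16, 32])
problem.addVariable("block_y", [1, 2, 4, 8, 16, 32])
problem.addVariable("tile", [1, 2, 4, 8, 16])

# Threads per block must not exceed a (hypothetical) hardware limit.
problem.addConstraint(lambda x, y: x * y <= 1024, ("block_x", "block_y"))
# Tile size must divide a (hypothetical) matrix dimension.
problem.addConstraint(lambda t: 4096 % t == 0, ("tile",))

# Enumerating all solutions constructs the legal search space.
configs = problem.getSolutions()
print(len(configs), "valid configurations")
```
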
Constraint parsing

Design

The parser rewrites constraints using ASTs (Abstract Syntax Trees).
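
As an illustration of the idea (not the paper's actual parser), Python's built-in ast module can turn a textual constraint into a tree whose variables and operators a solver can inspect:

```python
import ast

# Parse a textual constraint into an Abstract Syntax Tree.
expr = ast.parse("block_x * block_y <= 1024", mode="eval")

# Walk the tree to collect the tunable parameters the constraint touches.
params = {node.id for node in ast.walk(expr) if isinstance(node, ast.Name)}
print(params)  # {'block_x', 'block_y'}

# The compiled form can later be evaluated against candidate configurations.
check = compile(expr, "<constraint>", "eval")
print(eval(check, {"block_x": 16, "block_y": 32}))  # True
```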

Optimizations

  • Optimized backtracking algorithm with improved variable ordering (see the sketch after this list)
  • constraint -> [ parser - search space - optimized algorithm ]
  • Search space construction: performance (real-world)
  • Use C-Python
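
A minimal sketch of backtracking with a most-constrained-variable-first ordering (my own illustration of the general technique, not the paper's optimized implementation):

```python
def backtrack(domains, constraints, assignment=None):
    """Enumerate valid configurations by depth-first backtracking.

    domains: dict mapping variable -> list of values.
    constraints: list of (variables, predicate) pairs.
    """
    assignment = assignment or {}
    if len(assignment) == len(domains):
        yield dict(assignment)
        return
    # Variable ordering heuristic: pick the unassigned variable with the
    # smallest domain, so dead branches are pruned as early as possible.
    var = min((v for v in domains if v not in assignment),
              key=lambda v: len(domains[v]))
    for value in domains[var]:
        assignment[var] = value
        # Check only constraints whose variables are all assigned so far.
        ok = all(pred(*(assignment[v] for v in vs))
                 for vs, pred in constraints
                 if all(v in assignment for v in vs))
        if ok:
            yield from backtrack(domains, constraints, assignment)
        del assignment[var]

domains = {"block_x": [8, 16, 32], "block_y": [8, 16, 32]}
constraints = [(("block_x", "block_y"), lambda x, y: x * y <= 512)]
print(list(backtrack(domains, constraints)))
```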

Experiment

Search space
Time
Effectiveness

Conclusion
We get orders-of-magnitude speedups.
Kernel Tuner

It also has a paper at eScience in Chicago.

How to use:

pip install kernel_tuner
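
A minimal usage sketch, following the classic vector-add example from the Kernel Tuner documentation (sizes and tunable values are illustrative):

```python
import numpy as np
from kernel_tuner import tune_kernel

# block_size_x is a tunable parameter that Kernel Tuner substitutes
# into the kernel source before compiling and benchmarking it.
kernel_string = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
"""

size = 10_000_000
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(a)
n = np.int32(size)

tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}
results, env = tune_kernel("vector_add", kernel_string, size,
                           [c, a, b, n], tune_params)
```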

Paper Session 4: System Software

Macaw

TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks

Author: Ziji Shi

From Alibaba cloud

What we do: take approximately 30 minutes to find the best tensor-parallelism strategy for an LLM (the search itself runs on CPU).

We present TAPAS, an automatic parallelism framework that efficiently discovers tensor parallel plans for large neural networks.

Problem
Auto-tuning for tensor-parallel DNNs

GPU memory cannot keep pace with the blooming of tensors in DNNs:
Parameters: 2941x
GPU memory: 5x

Current solution:
Tensor parallelism

-> But communication suffers from growing overhead: Broadcast + AllGather

Related Works

How to split the model: Megatron-LM
Good for transformers, bad for other architectures

Alpa (OSDI '22)
Two-level optimization can discover nested TP-PP strategies
Search time is large for big models

Challenge

  1. Heterogeneous model architectures: UNet, uneven architectures
    FC layers

  2. The search space for TP is exponential
    The size of the decision space is too large
    Complexity: $O(3^N)$
    Manual strategies are labor-intensive and inflexible

  3. An operator's split also depends on its neighbors
    In a DAG, changing one operator's split affects its neighboring operators


Main Design

We Design

Abstraction -> graph pruning -> split the tensor
Graph Pre-processor -> Strategy Explorer -> Graph Reconstructor

Abstraction

network -> DAG

Tensor parallelism = graph partitioning

-> Graph IR: Constrained Search Space
Split, Replicate, Communicate

So we can split the graph along these per-operator decisions (see the sketch below).
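
To make the abstraction concrete, here is a toy sketch (my illustration, not TAPAS code) where each operator in the DAG picks one of three decisions, which is exactly why the naive search space is $O(3^N)$:

```python
from itertools import product

# Each operator chooses one tensor-parallel decision in the graph IR.
DECISIONS = ("split", "replicate", "communicate")

dag = {              # toy model: operator -> downstream operators
    "embed": ["attn"],
    "attn":  ["fc"],
    "fc":    [],
}

# Naive search space: 3^N plans over N operators.
plans = [dict(zip(dag, p)) for p in product(DECISIONS, repeat=len(dag))]
print(len(plans))    # 27 for this 3-operator toy graph

def cost(plan):
    """Hypothetical cost model: penalize mismatched decisions between
    neighbors, since an operator's split must agree with its neighbors'."""
    return sum(plan[u] != plan[v] for u in dag for v in dag[u])

print(min(plans, key=cost))
```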

Graph-pruning

Key: Subgraph

Observation:
In many models, we see similar layers/structures.
Examples:

  • Linear Layer
  • Self-attention

They share many similar patterns.

Use a frequent-subgraph-search algorithm to find the recurring subgraphs (find the frequent subgraphs, then mine them).

Tensor-parallel split / validator:
make sure the plan works and runs

The validator makes sure the communication/copy is correct

Cost-model evaluation

11K+ LOC on TensorFlow

Paper Session 1: AI in Computing

Cockatoo
ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

Pisces: Towards Adaptive and Fair Congestion Control via Multi-Agent Meta-Reinforcement Learning

Congestion Control

Existing CCAs:
Rule-Based CCAs

  • Simple
    But:
  • Limited adaptability
  • Manual tuning
  • Imitation learning

Learning-based CCAs
Fairness is not guaranteed
Hybrid methods like Orca may break Cubic fairness
They are workload-unaware, so they may not be effective

Motivation
Limited generalization and adaptability

Design
A learning-based CCA for federated learning

Uses a POMDP (Partially Observable Markov Decision Process, a kind of Markov process)

Framework

Learner
A two-level optimizer:
Inner loop: MAPPO
Outer loop: meta-learning with the Reptile algorithm (see the sketch below)
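
A minimal numpy sketch of the Reptile outer loop; the inner loop here is a toy gradient descent stand-in for MAPPO, and all names and hyperparameters are illustrative:

```python
import numpy as np

def inner_loop(theta, task, steps=5, lr=0.1):
    """Stand-in for MAPPO training on one task: a few gradient steps
    on a toy quadratic loss ||phi - task||^2."""
    phi = theta.copy()
    for _ in range(steps):
        phi -= lr * 2 * (phi - task)
    return phi

def reptile(tasks, dim=4, meta_steps=1000, eps=0.1):
    """Reptile meta-update: nudge theta toward the task-adapted weights."""
    theta = np.zeros(dim)
    for _ in range(meta_steps):
        task = tasks[np.random.randint(len(tasks))]
        phi = inner_loop(theta, task)
        theta += eps * (phi - theta)   # the Reptile outer-loop update
    return theta

tasks = [np.random.randn(4) for _ in range(8)]
print(reptile(tasks))   # ends up near the center of the task optima
```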

P3P-Fed: Peer-to-Peer Personalized Federated Learning with DHT-based Local Clustering

Onsite Paper

Aheon Lim
Master's student at Korea Aerospace University

What is Personalized FL?
Federated Learning (FL) enables the training of a single global model in a distributed manner while preserving the privacy of local data. In real-world FL environments, however, characterized by high heterogeneity in client data distributions (non-IID), the global model suffers from performance degradation. To address this challenge, Personalized Federated Learning (PFL) has been proposed to provide personalized models either by clustering clients or by decoupling the model into shared and personalized components.

FL

Challenge

Design

  1. DHT
    Distributed Hash Table

e.g., "cat" -> 5D: keys are hashed into the DHT's identifier space (see the sketch after this list)

  2. P2P
    Local clustering algorithm

Push - pull strategy
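
A toy sketch of the DHT lookup idea (hash a key into the identifier space, then route to the responsible node), which is presumably what the "cat -> 5D" example illustrates; the id width and node ids here are my own construction:

```python
import hashlib

def dht_key(value: str, bits: int = 8) -> int:
    """Hash a value into the DHT's identifier space."""
    return hashlib.sha1(value.encode()).digest()[0] % (1 << bits)

def responsible_node(key: int, node_ids: list) -> int:
    """Consistent-hashing style lookup: first node id >= key, wrapping around."""
    for nid in sorted(node_ids):
        if nid >= key:
            return nid
    return min(node_ids)

nodes = [0x20, 0x80, 0xC0, 0xE0]
key = dht_key("cat")
print(hex(key), "->", hex(responsible_node(key, nodes)))
```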

Experiment
use MPS


Paper Session 6: Performance

Carbon-Aware Workflow Scheduling with Fixed Mapping and Deadline Constraint

Large data and computing centers consume a significant share of the world’s energy consumption. A prominent subset of the workloads in such centers are workflows with interdependent tasks, usually represented as directed acyclic graphs (DAGs). To reduce the carbon emissions resulting from executing such workflows in centers with a mixed (renewable and non-renewable) energy supply, it is advisable to move task executions to time intervals with sufficient green energy when possible.

To this end, we formalize the above problem as a scheduling problem with a given mapping and ordering of the tasks. We show that this problem can be solved in polynomial time in the uniprocessor case. For at least two processors, however, the problem becomes NP-hard. Hence, we propose a heuristic framework called CaWoSched that combines several greedy approaches with local search. To assess the 16 heuristics resulting from different combinations, we also devise a simple baseline algorithm and an exact ILP-based solution. Our experimental results show that our heuristics provide significant savings in carbon emissions compared to the baseline.

SOTA
a paper from 2002

Intro
CaWoSched with efficient heuristics combining various greedy approaches with local search
Extensive simulations

DP

Model

The workflow can be represented as a DAG.

We want to schedule this problem under this model for green (carbon-aware) scheduling.

b5b6038cda4902267c77ba0cd0c05bec.jpg

Here we show an example DAG program. For example, we run Job 1 on Machine 1 (black); then the job on the left (red) can run on Machine 3 (red). When we move data from Machine 1 to Machine 3, we incur a communication step (the diamond).

09956686f13459dfa7610d2aa7160998.jpg

Objective: Find a schedule that minimizes carbon cost while meeting the deadline.

Problem 1: 1 CPU, N ordered tasks
Polynomial complexity
Solved with DP (see the sketch below)
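
A minimal DP sketch for the uniprocessor case, assuming discrete time slots with known carbon intensity, a fixed task order, and a hard deadline (all of this is my illustration, not the paper's exact formulation):

```python
from functools import lru_cache

def min_carbon(durations, carbon, deadline):
    """Minimum total carbon for N ordered tasks on one CPU.

    durations[i]: number of slots task i occupies (tasks run in order).
    carbon[t]: carbon intensity of time slot t.
    All tasks must finish by `deadline`.
    """
    n = len(durations)

    @lru_cache(maxsize=None)
    def best(i, t):
        if i == n:                                 # everything scheduled
            return 0.0
        if t + sum(durations[i:]) > deadline:
            return float("inf")                    # remaining tasks can't fit
        run_now = sum(carbon[t:t + durations[i]]) + best(i + 1, t + durations[i])
        wait = best(i, t + 1)                      # idle one slot for greener power
        return min(run_now, wait)

    return best(0, 0)

# Slots 3-5 have green power; the DP delays both tasks into them.
print(min_carbon([1, 2], carbon=[9, 9, 9, 1, 1, 1], deadline=6))  # 3.0
```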

Problem 2: N CPUs, N ordered tasks
NP-hard
Reduction from 3-Partition


Algorithm for the multiprocessor case

  1. Assign scores to tasks: A. deadline; B. green power budget
  2. Greedily optimize for carbon cost (if the green power is low at some point, the task starts later)

Score?

A. For the deadline:
Slack
Pressure

b88a259084bc7e4c3dd6af224fb882d7.jpg

B. For the CPU:
CPU slides XXXX

Greedy
First, put every job onto the CPU workers

Local search
Exploit the remaining flexibility in the schedule:
try to move some jobs to time slots with greener power (see the sketch below)
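
A sketch of what such a local-search phase could look like (hill climbing over start-time moves; the `moves`, `feasible`, and `carbon_cost` callables are assumed helpers, and this is not CaWoSched itself):

```python
def local_search(schedule, moves, feasible, carbon_cost):
    """Greedy hill climbing: accept any move that keeps the schedule
    feasible (deadline, ordering) and strictly lowers carbon cost.

    moves(schedule) yields candidate schedules, e.g. one job shifted
    to a different (greener) start time.
    """
    improved = True
    while improved:
        improved = False
        for candidate in moves(schedule):
            if feasible(candidate) and carbon_cost(candidate) < carbon_cost(schedule):
                schedule, improved = candidate, True
                break   # rescan moves from the improved schedule
    return schedule
```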

Setup & Experiments
All in simulation: 1D, 1.5D, 2D, 3D

Including local search
The cost ratio is the best cost found divided by the algorithm's cost

They achieve a 40% decrease in CO2 emissions

HHOTuner: Efficient Performance Tuning with Harris Hawks Optimization

Iowa State University
NSF

For large-space auto-tuning

Autotuning: Code+Arch+Tune = Best performance

They use an evolutionary algorithm called Harris Hawks Optimization.

This is a very very cooooool figure

05cc251ea84fab7bc67c86aab8d816de.jpg

Harris Hawks algorithm idea

Main Ideas

  1. Exploration phase:
    • Hawks search the environment (global exploration of the solution space).
    • Two strategies: random perching or scanning for potential prey.
  2. Exploitation phase:
    • Once prey is spotted, hawks begin encircling and besieging it (local exploitation).
    • Several strategies are used: “soft siege,” “hard siege,” or random surprise pounces.
  3. Energy parameter (E):
    • Represents the prey’s escaping energy, which decreases over time.
    • When $|E| \ge 1$, exploration dominates.
    • When $|E| < 1$, exploitation dominates.
  4. Surprise pounce mechanism:
    • Mimics the unpredictable behavior of prey escaping.
    • Introduces randomness to help the algorithm escape local optima.
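
A compact sketch of the HHO loop (simplified: only the random-perch exploration and the basic soft/hard sieges are shown; the surprise-pounce variants with rapid dives are omitted):

```python
import numpy as np

def hho(objective, dim, lo, hi, n_hawks=20, iters=200):
    """Simplified Harris Hawks Optimization for minimization."""
    hawks = np.random.uniform(lo, hi, (n_hawks, dim))
    rabbit = min(hawks, key=objective).copy()      # best solution = the prey

    for t in range(iters):
        for i in range(n_hawks):
            E0 = np.random.uniform(-1, 1)
            E = 2 * E0 * (1 - t / iters)           # escaping energy decays over time
            if abs(E) >= 1:                        # exploration: perch near a random hawk
                rand = hawks[np.random.randint(n_hawks)]
                hawks[i] = rand - np.random.rand() * np.abs(
                    rand - 2 * np.random.rand() * hawks[i])
            elif abs(E) >= 0.5:                    # soft siege
                J = 2 * (1 - np.random.rand())     # prey's random jump strength
                hawks[i] = (rabbit - hawks[i]) - E * np.abs(J * rabbit - hawks[i])
            else:                                  # hard siege
                hawks[i] = rabbit - E * np.abs(rabbit - hawks[i])
            hawks[i] = np.clip(hawks[i], lo, hi)
            if objective(hawks[i]) < objective(rabbit):
                rabbit = hawks[i].copy()
    return rabbit

# Toy objective: sphere function, minimum at the origin.
print(hho(lambda x: float(np.sum(x ** 2)), dim=3, lo=-5, hi=5))
```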


SOTA benchmarks
Bliss (2021): auto-tuning complex applications using a pool of diverse lightweight learning models
OpenTuner (2014), from MIT CSAIL

One GPU, Many Ranks: Enabling Performance and Energy-Efficient In-Transit Visualization via Resource Sharing

ANL

TL;DR from Haibin: they use MIG/MPS/time-slicing on the GPU to accelerate an HPC workflow.

In-transit visualization has become essential in high-performance computing (HPC) to reduce I/O overheads and enable real-time data analysis. However, as simulations grow in scale and complexity, these visualization tasks increasingly demand substantial computational resources, exacerbating energy consumption and limiting system scalability. As we show in this paper, a key bottleneck is the conventional one-rank-per-GPU allocation model, which leads to irregular GPU utilization and waste of hardware resources.

To tackle this challenge, we propose using GPU-sharing strategies to improve energy efficiency in in-transit visualization without compromising performance. We evaluate six distinct configurations built upon three NVIDIA GPU-sharing mechanisms: the default CUDA model with context switching between processes, Multi-Process Service (MPS), which enables dynamic context sharing, and Multi-Instance GPU (MIG), which provides hardware-level partitioning. Using the WarpX simulation code and Ascent visualization framework, our experiments on the Polaris supercomputer span multiple rendering techniques, node counts, and data sizes.

Results show that GPU-sharing strategies can improve the trade-off between performance and energy, represented by the energy-delay product (EDP) metric, by up to 81.7%. We also show that workload-aware strategy selection is essential to improve performance-energy efficiency: MIG-based configurations are more effective for lightweight and regular workloads, offering up to 64.5% energy savings, while MPS better handles GPU-intensive workloads, achieving up to 71.1% EDP improvement. Finally, we demonstrate that optimized sharing strategies can reduce the required compute nodes by up to 75%, freeing system resources for concurrent workloads.
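
For reference, the energy-delay product combines both objectives into a single metric: $\mathrm{EDP} = E \times T$, where $E$ is the energy consumed and $T$ the execution time; lower is better.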

WarpX, HACC, Nyx
-> I/O bottleneck

How do we minimize it?

The GPU is underutilized

  1. GPU Sharing: Time-Slice, MPS, MIG
    c894c14def0059a4e3426a17d41012da.jpg

  2. How many nodes do you really need?

  3. What's the impact?

Test bed
cde7eeb9dca41f21656832664f42ba8c.jpg

Methodology
Just testing: scaled each config from 1 to 10 nodes

Scheduling based on GPU energy + performance

MPS can oversubscribe the GPU, which may be why it performs well.