ICPP'25 Conference Story: Day 1

ICPP Conference

Congrats to all accepted papers!

Welcome Ceremony

185 attendees came to the conference!

d5891bb41c8b45089072dc82193c3961.jpg

There were 292 submissions, and 78 of them were accepted!

e81c385dc1f4745707594877c4e4fb96.jpg

The chair's welcome!

ff6af971a89cf92835dab605bfa9e1de.jpg

Jack Dongarra, "An Overview of High Performance Computing and Responsibly Reckless Algorithms"

I have heard Jack Dongarra's talk many times, but it is still nice to write down the speech.

The speech was mainly about his work on BLAS over the past decades. He also introduced some current projects on mixed-precision linear algebra.

9888105a363fb9d50dd0b6c32494c0e1.jpg

OK, this is a story I hadn't heard before ... becoming a high school teacher, haha!
a3635ea8191e7179272c4c1799eb1674.jpg

LINPACK: a very good library, which I worked with at the ASC student supercomputer challenge.
c1b4ced001f108ba60e0f009e7d2674b.jpg

Nowadays most of the investment in the high-performance computing ecosystem comes from Google, Microsoft, Apple, and NVIDIA.

b9f5cd45e6cb25561112d41b186d120a.jpg

Now the newest HPC platform: El Capitan (what, not Frontier!?)
1d0feed5b26990d68385038b56ce6cc9.jpg

And they are using new floating-point precisions!!

51115141cf3e1dd19395d02c2e62169c.jpg

And mixed precision can also benefit HPC jobs. Sometimes the GPUs are much faster at low precision!!!

3ed5b4cdc65b5bc756d7e0bcc4077aef.jpg

So some methods do the $O(n^3)$ work (like the LU factorization) in low precision, then refine the solution with $O(n^2)$ iterations to recover an accurate answer faster.
795e0eaff9df03abf471077f839ba76e.jpg

So this is HPL-MxP: do the LU factorization in low precision, then refine in high precision!
3e9f7adf00b0f021181cf86f41badfd4.jpg
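
To make the idea concrete, here is a minimal sketch of mixed-precision iterative refinement, assuming NumPy/SciPy (this illustrates the low+high precision pattern, not HPL-MxP itself):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, iters=5):
    # O(n^3) LU factorization done once in float32 (the "low precision" part)
    lu, piv = lu_factor(A.astype(np.float32))
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                   # O(n^2) residual in float64
        d = lu_solve((lu, piv), r.astype(np.float32))   # cheap correction step
        x += d.astype(np.float64)
    return x

A = np.random.rand(500, 500) + 500 * np.eye(500)  # well-conditioned test matrix
b = np.random.rand(500)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b))                  # residual shrinks toward fp64 accuracy
```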

The Top 10 HPC clusters! And we can see that the newer machines do better here!
0205fafda1c2bb2e49c03fa7aef4b1dd.jpg

Nowadays there are many new methods!

d6c92ec40e1eb764b48003801090c7df.jpg

Now there is a new method called the Ozaki scheme, and they are implementing it!

2caa361e2c4e7eb906a6cdeeea52218b.jpg

That's all!

445afa23f12b571b30c2397a7627edab.jpg

Best Paper: Efficient Construction of Large Search Spaces for Auto-Tuning

Background
Hardware evolves every 2-3 years, while HPC software lives for 20 years.

So we should move from code written for a specific architecture to more flexible, tunable code.

b1190a0bed7598c2d4c742dd997e1b68.jpg

So, if you go to ICPP and win the best paper award, you will dance in front of us!

40769cac879c79050dacbc53cd3ef94e.jpg

Hypertuners for HPC software explore a search space of configurations within a time budget.
A search configuration is, e.g., the GPU thread-block x/y dimensions.

SOTA & Problem
You can see the state-of-the-art frameworks in the following chart.

Most of them use brute force, or the ATF chain-of-trees approach.

Time and effectiveness are the major requirements for tuning tools.


But currently there are three major problem formulations for auto-tuning / search-space construction:

  1. SAT: use boolean logic
  2. SMT: Satisfiability Modulo Theories
  3. CSP: Constraint Satisfaction Problems

So what we do in this paper:

We build a good CSP solver.

We build an auto-tuning framework with the CSP solver at its core to optimize search-space construction.

We use Kernel Tuner to do local search in order to reduce the search space.
835929c29c593627ec612e1e657b80f6.jpg


Problem Formation

Constraint-based auto-tuning treats the search for optimal performance parameters (e.g., tile sizes, thread counts, memory layouts) as a Constraint Satisfaction Problem (CSP).

  • Variables: tunable parameters (e.g., block size, unroll factor, thread binding).
  • Domains: the allowed values for each variable.
  • Constraints: rules that define legal or efficient configurations, such as:

    • Shared memory usage must not exceed hardware limits.
    • Thread counts ≤ number of available cores.
    • Tile sizes must divide matrix dimensions.
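
To make the formulation concrete, here is a minimal sketch using the `python-constraint` package (an assumption for illustration; the paper builds its own, much faster solver, and the parameter names are made up):

```python
from constraint import Problem

problem = Problem()
# Variables and domains: tunable parameters and their allowed values
problem.addVariable("block_size_x", [16, 32, 64, 128, 256])
problem.addVariable("block_size_y", [1, 2, 4, 8])
problem.addVariable("tile_size", [1, 2, 4])

# Constraints: rules that define legal configurations
problem.addConstraint(lambda x, y: x * y <= 1024,        # thread-count limit
                      ("block_size_x", "block_size_y"))
problem.addConstraint(lambda x, t: (x * t) % 32 == 0,    # coalescing-style rule
                      ("block_size_x", "tile_size"))

solutions = problem.getSolutions()   # the constructed search space
print(len(solutions), "legal configurations")
```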
Constraint parsing

c6ea74d267bb1f276ec5f08287a219a5.jpg

1323f9c16135a9176018d9456c2836ce.jpg

Design

The parser rewrites constraints into ASTs (Abstract Syntax Trees).
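
As a rough illustration of the idea, Python's built-in `ast` module can parse a constraint string into a tree and compile it for fast evaluation (a sketch of the concept only, not the paper's parser):

```python
import ast

# A constraint over two tunable parameters, written as a plain string
constraint = "block_size_x * block_size_y <= 1024"

tree = ast.parse(constraint, mode="eval")
print(ast.dump(tree))                    # inspect the expression tree

code = compile(tree, "<constraint>", "eval")
config = {"block_size_x": 32, "block_size_y": 16}
print(eval(code, {}, config))            # True: this configuration is legal
```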

Optimizations

  • Optimized backtracking algorithm with improved variable ordering
  • Pipeline: constraint -> [ parser -> search space -> optimized algorithm ]
  • Search-space construction evaluated on real-world performance
  • Implemented with C/Python

Experiment

search space
cf8facbf93076bce8ef5d1cc7483b08a.jpg

Time
Effectiveness

Conclusion
We get orders-of-magnitude speedups.
Kernel Tuner

It also has a paper at eScience in Chicago.

How to use:

pip install kernel_tuner
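
For context, here is the kind of minimal tuning script Kernel Tuner's documentation starts from (a vector-add sketch; the block-size values are illustrative):

```python
import numpy as np
from kernel_tuner import tune_kernel

# A trivial CUDA kernel whose launch configuration we want to tune
kernel_string = """
__global__ void vector_add(float *c, const float *a, const float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
"""

size = 10_000_000
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(a)
n = np.int32(size)

# The search space: every block size that survives the constraints
tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}

results, env = tune_kernel("vector_add", kernel_string, size,
                           [c, a, b, n], tune_params)
```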

Paper Session 4: System Software

Macaw

TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks

Author: Ziji Shi

Ziji is an NUS PhD student who did a great job!

This work comes from his internship at Alibaba Cloud.

c9e90ec0d8209e2ded9a9d485516f4e3.jpg

What we do: take about 30 minutes to find the best tensor-parallel strategy for an LLM (the search itself runs on CPUs).

We present TAPAS, an automatic parallelism framework that efficiently discovers tensor parallel plans for large neural networks.

Problem
We want to do auto-tuning for tensor-parallel DNNs.

Nowadays GPU memory can't keep up with the blooming of tensors in DNNs:
parameters grew 2941x,
while GPU memory grew only 5x.

current solution:
Tensor Parallelism

9865c6c361ca1cabcfe6343f6bd18698.jpg

-> But communication suffers from growing overhead: Broadcast + AllGather

Related Work

How to split the model: Megatron-LM
Good for transformers, bad for other architectures

Alpa (OSDI'22)
Its two-level optimization can discover nested TP+PP strategies
But the search time is large for big models

Challenges

  1. Heterogeneous model architectures: UNet, uneven architectures
    e.g., FC layers

  2. The search space for TP is exponential
    the size of the decision space is too large
    complexity: $O(3^N)$
    manual planning is labor-intensive and inflexible

  3. An operator's split also depends on its neighbors
    in a DAG, changing one operator's split affects its neighbors


Main Design

The design pipeline:

Abstraction -> graph pruning -> split the tensors
Graph Pre-processor -> Strategy Explorer -> Graph Reconstructor

Abstraction

network -> DAG

Tensor parallelism = graph partitioning
8a6a0c0b599e9d9ac1656f82d77a062f.jpg

-> Graph IR: a constrained search space
Split, Replicate, Communicate

So we can split ...

1767efc83f24286ae6cc1eae3783e7d3.jpg
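
A minimal sketch of why this decision space is $O(3^N)$: each operator in the DAG independently picks one of three states, so exhaustive enumeration explodes (the operator names and cost model here are illustrative, not TAPAS's actual IR):

```python
from itertools import product

ops = ["embed", "attn_qkv", "attn_out", "mlp_up", "mlp_down"]   # N = 5
STATES = ("Split", "Replicate", "Communicate")

strategies = list(product(STATES, repeat=len(ops)))
print(len(strategies))                  # 3^5 = 243 candidate plans

def cost(plan):
    # Stand-in cost model: pretend communication states are expensive
    return sum(2.0 if s == "Communicate" else 1.0 for s in plan)

best = min(strategies, key=cost)
print(dict(zip(ops, best)))
```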

Graph-pruning

Key idea: subgraphs

Observation:
In many models we see similar layers/structures.
Examples:

  • Linear Layer
  • Self-attention

They share many similar patterns.

Use a frequent-subgraph-search algorithm to find the recurring subgraphs
(find the frequent subgraphs and mine them).

Then we can optimize each frequent subgraph once and apply the optimization across the whole graph structure (see the sketch below).
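
A minimal sketch of the deduplication idea, assuming structurally identical layers can share one plan (a toy signature hash, not TAPAS's mining algorithm):

```python
from collections import defaultdict

# Toy model: each layer is described by an (op type, shape) signature
layers = [
    ("linear", (1024, 4096)), ("gelu", (4096,)), ("linear", (4096, 1024)),
    ("linear", (1024, 4096)), ("gelu", (4096,)), ("linear", (4096, 1024)),
]

groups = defaultdict(list)
for idx, layer in enumerate(layers):
    groups[layer].append(idx)            # identical signature -> same group

# Tune one representative per group, then reuse its plan everywhere
for signature, members in groups.items():
    print(signature, "-> tune once, reuse for layers", members)
```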

Tensor-parallel split + validator:
make sure the plan works and runs.

The validator makes sure the communication/copies are correct.

Cost-model evaluation

11K+ LOC on TensorFlow

373f80b589afea8e232df37c99d7bf39.jpg

65a9ace856422cf0804ecd68e2355807.jpg

1407289807b07aa5f7952d761545bcfc.jpg

Paper Session 1: AI in Computing

Cockatoo
ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

Pisces: Towards Adaptive and Fair Congestion Control via Multi-Agent Meta-Reinforcement Learning

CCA: Congestion Control Algorithm

373110e3f3ae1875526da21ade402c74.jpg

Existing CCAs:
Rule-Based CCAs

  • Simple, but:
  • limited adaptability
  • manual tuning required
  • imitation learning

Learning-Based CCAs
Fairness is not guaranteed.
Hybrid methods like Orca may break Cubic's fairness.
They don't know the workload, so they may not be effective.

Motivation:
limited generalization and adaptability

0c934612c3d1992b746b63aa3c115cde.jpg

Design
A learning-based CCA for federated learning

Uses a POMDP (a partially observable Markov decision process)

Framework

Learner:
a two-level optimizer
inner loop: MAPPO (multi-agent PPO)
outer loop: meta-learning with the Reptile algorithm (see the sketch below)
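
A minimal sketch of the Reptile outer loop, assuming plain NumPy and SGD as a stand-in inner learner (the paper's inner loop is MAPPO):

```python
import numpy as np

def inner_train(theta, task, steps=10, lr=0.01):
    # Stand-in inner loop: a few SGD steps on one task's loss
    phi = theta.copy()
    for _ in range(steps):
        grad = 2 * (phi - task)          # gradient of ||phi - task||^2
        phi -= lr * grad
    return phi

theta = np.zeros(4)                                  # meta-parameters
tasks = [np.random.randn(4) for _ in range(20)]      # e.g., network conditions

epsilon = 0.1                                        # Reptile step size
for _ in range(1000):
    task = tasks[np.random.randint(len(tasks))]
    phi = inner_train(theta, task)
    theta += epsilon * (phi - theta)                 # move toward the adapted weights

print(theta)   # ends up at a point that adapts quickly to every task
```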

c5a69b9ac8576e76986e943ec5e230f1.jpg

P3P-Fed: Peer-to-Peer Personalized Federated Learning with DHT-based Local Clustering

Onsite Paper

Aheon Lim
Master's student at Korea Aerospace University

a177379b039f045f48eddd3806106e07.jpg

What is Personalized FL?
Federated Learning (FL) enables the training of a single global model in a distributed manner while preserving the privacy of local data. In real-world FL environments, however, characterized by high heterogeneity in client data distributions (non-IID), the global model suffers from performance degradation. To address this challenge, Personalized Federated Learning (PFL) has been proposed to provide personalized models either by clustering clients or by decoupling the model into shared and personalized components.

Current FL Problem: Overfitting
cb3a715daeebefd7080ba5977c0214f9.jpg

Challenge
d5417af272991b850f1fc0174005ccc2.jpg

Design

1c5568621fc1a5ca3e6a23cc68071689.jpg

  1. DHT
    Distributed Hash Table: keys are hashed into a small keyspace, e.g., "cat" -> 5D

  2. P2P
    Local clustering algorithm

Push-pull strategy (a DHT placement sketch follows)
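
A minimal sketch of DHT-style key placement on a hash ring, purely illustrative (not P3P-Fed's actual clustering protocol):

```python
import hashlib

def dht_key(value, bits=8):
    # Hash a value into a small keyspace (two hex digits, like "cat" -> 5D)
    digest = hashlib.sha1(value.encode()).hexdigest()
    return int(digest, 16) % (2 ** bits)

nodes = {dht_key(f"client-{i}"): f"client-{i}" for i in range(8)}

def responsible_node(key):
    # The first node clockwise from the key owns it
    ids = sorted(nodes)
    for node_id in ids:
        if node_id >= key:
            return nodes[node_id]
    return nodes[ids[0]]                 # wrap around the ring

k = dht_key("cat")
print(f"cat -> {k:02X}, stored on {responsible_node(k)}")
```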

Experiment
use MPS
0f6e5d95eb0efaebc024660b13c7cb29.jpg

ebedbb4af8dabafe1c5cc9b6840f60c9.jpg

Noon

9381078e59c1b0b4b04d619c1cc54747.jpg

ae0cb7eef28918a20f1fe722ce0b18c8.jpg

0a44da30ef39e9cdb21f5b35d6046094.jpg

Onsite Paper

Paper Session 6: Performance

Carbon-Aware Workflow Scheduling with Fixed Mapping and Deadline Constraint

Large data and computing centers consume a significant share of the world's energy consumption. A prominent subset of the workloads in such centers are workflows with interdependent tasks, usually represented as directed acyclic graphs (DAGs). To reduce the carbon emissions resulting from executing such workflows in centers with a mixed (renewable and non-renewable) energy supply, it is advisable to move task executions to time intervals with sufficient green energy when possible.

To this end, we formalize the above problem as a scheduling problem with a given mapping and ordering of the tasks. We show that this problem can be solved in polynomial time in the uniprocessor case. For at least two processors, however, the problem becomes NP-hard. Hence, we propose a heuristic framework called CaWoSched that combines several greedy approaches with local search. To assess the 16 heuristics resulting from different combinations, we also devise a simple baseline algorithm and an exact ILP-based solution. Our experimental results show that our heuristics provide significant savings in carbon emissions compared to the baseline.

SOTA
a paper from 2002

Intro
CaWoSched with efficient heuristics combining various greedy approaches with local search
Extensive simulations

Try to do Data Parallelism with scheduling

Model/Problem Formation

The Computing hierarchy can be represented as DAG

We want to schedule these workflows so that tasks run when green energy is available.

b5b6038cda4902267c77ba0cd0c05bec.jpg

Here we see an example DAG. For example, we run Job 1 on Machine 1 (black), then the job on the left (red) runs on Machine 3 (red). When data moves from Machine 1 to Machine 3, we get a communication (the diamond).

09956686f13459dfa7610d2aa7160998.jpg

Objective: Find a schedule that minimizes carbon cost while meeting the deadline.

50b8520c71654ed74fd0c0f0281cab6f.jpg

Problem 1: one processor, N ordered tasks
polynomial complexity
solved with dynamic programming (a DP sketch follows below)

Problem 2: multiple processors, N ordered tasks
NP-hard
reduction from 3-Partition
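
A minimal sketch of the uniprocessor DP under simplifying assumptions (discrete time slots with known carbon intensity, fixed task order and durations; illustrative, not CaWoSched's exact formulation):

```python
import math

durations = [2, 1, 3]                   # task lengths in slots, fixed order
carbon = [5, 1, 1, 4, 1, 1, 5, 5]       # carbon intensity per time slot
deadline = len(carbon)

def task_cost(start, dur):
    return sum(carbon[start:start + dur])

# dp[i][t] = min carbon to finish the first i tasks by the end of slot t
dp = [[math.inf] * (deadline + 1) for _ in range(len(durations) + 1)]
dp[0] = [0] * (deadline + 1)
for i, dur in enumerate(durations):
    for t in range(dur, deadline + 1):
        run_now = dp[i][t - dur] + task_cost(t - dur, dur)  # finish exactly at t
        dp[i + 1][t] = min(dp[i + 1][t - 1], run_now)       # or finish earlier

print(dp[len(durations)][deadline])     # minimum total carbon within the deadline
```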


Algorithm for the multiprocessor case

  1. Assign scores to tasks based on: A. the deadline; B. the green power budget
  2. Greedily optimize for carbon cost (if the green power is low at some point, the task is started later)

Scores?

A. For the deadline:
Slack
Pressure

b88a259084bc7e4c3dd6af224fb882d7.jpg

B. for CPU
CPU slides XXXX

Greedy:
first place all jobs onto the CPU workers.

Local search:
exploit the remaining flexibility in the schedule;
try to move some jobs to times with better green power.

Setup & Experiments
All in simulation: 1D, 1.5D, 2D, 3D.

Including local search.
The cost ratio is the best cost found divided by the algorithm's cost.

They achieve a 40% decrease in CO2 emissions.

HHOTuner: Efficient Performance Tuning with Harris Hawks Optimization

Iowa State University

NSF

An auto-tuning framework for tuning over large search spaces

Autotuning: Code+Arch+Tune = Best performance

They use an evolutionary algorithm called Harris Hawks Optimization.

This is a very very cooooool figure

05cc251ea84fab7bc67c86aab8d816de.jpg

Harris Hawks algorithm idea (a toy sketch follows the list below)

Main Ideas

  1. Exploration phase:
    • Hawks search the environment (global exploration of the solution space).
    • Two strategies: random perching or scanning for potential prey.
  2. Exploitation phase:
    • Once prey is spotted, hawks begin encircling and besieging it (local exploitation).
    • Several strategies are used: “soft siege,” “hard siege,” or random surprise pounces.
  3. Energy parameter (E):
    • Represents the prey’s escaping energy, which decreases over time.
    • When |E| ≥ 1, exploration dominates.
    • When |E| < 1, exploitation dominates.
  4. Surprise pounce mechanism:
    • Mimics the unpredictable behavior of prey escaping.
    • Introduces randomness to help the algorithm escape local optima.
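
A minimal, simplified sketch of the HHO loop on a toy objective, assuming NumPy and keeping only one siege strategy (not the full algorithm, and not HHOTuner itself):

```python
import numpy as np

def objective(x):
    return np.sum(x ** 2)                # toy fitness: minimize ||x||^2

rng = np.random.default_rng(0)
dim, n_hawks, T = 4, 20, 200
hawks = rng.uniform(-5, 5, (n_hawks, dim))

for t in range(T):
    fitness = np.array([objective(h) for h in hawks])
    rabbit = hawks[fitness.argmin()].copy()   # best solution so far = the "prey"
    for i in range(n_hawks):
        E0 = rng.uniform(-1, 1)
        E = 2 * E0 * (1 - t / T)              # escaping energy decays over time
        if abs(E) >= 1:
            # Exploration: perch relative to a random hawk
            j = rng.integers(n_hawks)
            hawks[i] = hawks[j] - rng.random() * abs(
                hawks[j] - 2 * rng.random() * hawks[i])
        else:
            # Exploitation ("soft siege"): encircle the rabbit
            jump = 2 * (1 - rng.random())
            hawks[i] = rabbit - E * abs(jump * rabbit - hawks[i])

print(objective(rabbit))                 # close to 0 after convergence
```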


SOTA Benchmark
Bliss (2021): auto-tuning complex applications using a pool of diverse lightweight learning models.
OpenTuner (2014), from MIT CSAIL.

One GPU, Many Ranks: Enabling Performance and Energy-Efficient In-Transit Visualization via Resource Sharing

ANL

TL;DR from Haibin: they use MIG/MPS/time-slicing on the GPU to accelerate an HPC workflow.

In-transit visualization has become essential in high-performance computing (HPC) to reduce I/O overheads and enable real-time data analysis. However, as simulations grow in scale and complexity, these visualization tasks increasingly demand substantial computational resources, exacerbating energy consumption and limiting system scalability. As we show in this paper, a key bottleneck is the conventional one-rank-per-GPU allocation model, which leads to irregular GPU utilization and waste of hardware resources.

To tackle this challenge, we propose using GPU-sharing strategies to improve energy efficiency in in-transit visualization without compromising performance. We evaluate six distinct configurations built upon three NVIDIA GPU-sharing mechanisms: the default CUDA model with context switching between processes, Multi-Process Service (MPS), which enables dynamic context sharing, and Multi-Instance GPU (MIG), which provides hardware-level partitioning. Using the WarpX simulation code and Ascent visualization framework, our experiments on the Polaris supercomputer span multiple rendering techniques, node counts, and data sizes.

Results show that GPU-sharing strategies can improve the trade-off between performance and energy, represented by the energy-delay product (EDP) metric, by up to 81.7%. We also show that workload-aware strategy selection is essential to improve performance-energy efficiency: MIG-based configurations are more effective for lightweight and regular workloads, offering up to 64.5% energy savings, while MPS better handles GPU-intensive workloads, achieving up to 71.1% EDP improvement. Finally, we demonstrate that optimized sharing strategies can reduce the required compute nodes by up to 75%, freeing system resources for concurrent workloads.
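
Since the headline metric is the energy-delay product, here is a tiny worked example of how EDP and its improvement are computed (the numbers are made up):

```python
# EDP = energy * runtime; lower is better
def edp(energy_joules, runtime_seconds):
    return energy_joules * runtime_seconds

baseline = edp(energy_joules=1200.0, runtime_seconds=100.0)   # one rank per GPU
shared   = edp(energy_joules=500.0,  runtime_seconds=90.0)    # GPU sharing

improvement = (baseline - shared) / baseline
print(f"EDP improvement: {improvement:.1%}")                  # 62.5%
```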

WarpX, HACC, Nyx
-> I/O bottleneck

How do we minimize it?

The GPU is underutilized.
5cfa8c877b3209e095045f0b29ee80e6.jpg

  1. GPU Sharing: Time-Slice, MPS, MIG
    c894c14def0059a4e3426a17d41012da.jpg

  2. How many nodes do you really need?

  3. What's the impact?

Test bed
cde7eeb9dca41f21656832664f42ba8c.jpg

Methodology
just measurement:
they scaled each configuration from 1 to 10 nodes.

Scheduling on the GPU for both energy and performance.

MPS can oversubscribe the GPU, which may be why it performs well.

A use case
fe2f389537999bef9faf603c5bad2010.jpg

61344ff37b1faeadaaef79a57fc6a3b1.jpg