ICPP25 Conference story: Day 1

Conference

Congrats to all accepted papers!

Welcome Ceremony

ff6af971a89cf92835dab605bfa9e1de.jpg

Jack Dongarra, "An Overview of High Performance Computing and Responsibly Reckless Algorithms"

I have heard Jack Dongarra's talk many times, but it is still nice to record the speech.

The talk was mainly about his work on BLAS over the past decades, and he also introduced some current projects on mixed-precision linear algebra.

9888105a363fb9d50dd0b6c32494c0e1.jpg

OK, this is a story I hadn't heard before ...... he was going to be a high school teacher, haha!
c24bb4c40165cafdedb92b691f69be31.jpg

LINPACK: a very good library, which I worked with at the ASC student challenge.

985536973fa169c944afd832ea22b16b.jpg

Best Paper: Efficient Construction of Large Search Spaces for Auto-Tuning

Background
Hardware evolves every 2-3 years, while HPC software evolves over ~20 years.

So we should move from code written for a specific architecture to more flexible, tunable code.

b1190a0bed7598c2d4c742dd997e1b68.jpg

Hypertuner for HPC software:
search space, time
search configs, e.g., GPU thread-block x/y dimensions

SOTA & Problem
You can see the state-of-the-art frameworks in the following chart.

Most of them use brute force, or ATF's chain-of-trees.

We take time and effectiveness as the major requirements for tuning tools.


But currently there are three major problem formulations for auto-tuning / search-space construction:

  1. SAT: Boolean satisfiability
  2. SMT: Satisfiability Modulo Theories
  3. CSP: Constraint Satisfaction Problems

So what we do in this paper:

We build a good CSP solver.

We build an auto-tuning framework that uses the CSP solver as its core, optimizing search space construction.

We build on Kernel Tuner to do local search in order to reduce the search space.
835929c29c593627ec612e1e657b80f6.jpg


Problem Formation

Constraint-based auto-tuning treats the search for optimal performance parameters (e.g., tile sizes, thread counts, memory layouts) as a Constraint Satisfaction Problem (CSP).

  • Variables: tunable parameters (e.g., block size, unroll factor, thread binding).
  • Domains: the allowed values for each variable.
  • Constraints: rules that define legal or efficient configurations, such as:

Shared memory usage must not exceed hardware limits.
Thread counts ≤ number of available cores.
Tile sizes must divide matrix dimensions.
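
To make this concrete, here is a minimal sketch of such a CSP using the python-constraint package; the parameter names, domains, and limits are illustrative, not taken from the paper:

```python
from constraint import Problem

# Toy auto-tuning search space expressed as a CSP (illustrative values).
problem = Problem()
problem.addVariable("block_x", [1, 2, 4, 8, 16, 32])
problem.addVariable("block_y", [1, 2, 4, 8, 16, 32])
problem.addVariable("tile", [1, 2, 4, 8, 16])

# Threads per block must not exceed a (hypothetical) hardware limit.
problem.addConstraint(lambda x, y: x * y <= 1024, ("block_x", "block_y"))
# Tile size must divide a (hypothetical) matrix dimension.
problem.addConstraint(lambda t: 4096 % t == 0, ("tile",))

# Enumerating all solutions constructs the legal search space.
configs = problem.getSolutions()
print(len(configs), "valid configurations")
```
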
Constraint parsing

Design

The parser rewrites constraints using ASTs (Abstract Syntax Trees).
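
As an illustration of the idea (not the paper's actual parser), Python's built-in ast module can turn a textual constraint into a tree whose variables and operators a solver can inspect:

```python
import ast

# Parse a textual constraint into an Abstract Syntax Tree.
expr = ast.parse("block_x * block_y <= 1024", mode="eval")

# Walk the tree to collect the tunable parameters the constraint touches.
params = {node.id for node in ast.walk(expr) if isinstance(node, ast.Name)}
print(params)  # {'block_x', 'block_y'}

# The compiled form can later be evaluated against candidate configurations.
check = compile(expr, "<constraint>", "eval")
print(eval(check, {"block_x": 16, "block_y": 32}))  # True
```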

Optimizations

  • Optimized backtracking algorithm with improved variable ordering (see the sketch after this list)
  • constraint -> [ parser - search space - optimized algorithm ]
  • Search space construction: performance (real-world)
  • Use C-Python
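
A minimal sketch of backtracking with a most-constrained-variable-first ordering (my own illustration of the general technique, not the paper's optimized implementation):

```python
def backtrack(domains, constraints, assignment=None):
    """Enumerate valid configurations by depth-first backtracking.

    domains: dict mapping variable -> list of values.
    constraints: list of (variables, predicate) pairs.
    """
    assignment = assignment or {}
    if len(assignment) == len(domains):
        yield dict(assignment)
        return
    # Variable ordering heuristic: pick the unassigned variable with the
    # smallest domain, so dead branches are pruned as early as possible.
    var = min((v for v in domains if v not in assignment),
              key=lambda v: len(domains[v]))
    for value in domains[var]:
        assignment[var] = value
        # Check only constraints whose variables are all assigned so far.
        ok = all(pred(*(assignment[v] for v in vs))
                 for vs, pred in constraints
                 if all(v in assignment for v in vs))
        if ok:
            yield from backtrack(domains, constraints, assignment)
        del assignment[var]

domains = {"block_x": [8, 16, 32], "block_y": [8, 16, 32]}
constraints = [(("block_x", "block_y"), lambda x, y: x * y <= 512)]
print(list(backtrack(domains, constraints)))
```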

Experiment

Search space
Time
Effectiveness

Conclusion
We get orders-of-magnitude speedups.
Kernel Tuner

It also has a paper at eScience in Chicago.

How to use:

pip install kernel_tuner
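
A minimal usage sketch, following the classic vector-add example from the Kernel Tuner documentation (sizes and tunable values are illustrative):

```python
import numpy as np
from kernel_tuner import tune_kernel

# block_size_x is a tunable parameter that Kernel Tuner substitutes
# into the kernel source before compiling and benchmarking it.
kernel_string = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
"""

size = 10_000_000
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(a)
n = np.int32(size)

tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}
results, env = tune_kernel("vector_add", kernel_string, size,
                           [c, a, b, n], tune_params)
```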

Paper Session 4: System Software

Macaw

TAPAS: Fast and Automatic Derivation of Tensor Parallel Strategies for Large Neural Networks

Author: Ziji Shi

From Alibaba cloud

What we do: take approximately 30 minutes to find the best tensor-parallelism strategy for an LLM (the search itself runs on CPU).

We present TAPAS, an automatic parallelism framework that efficiently discovers tensor parallel plans for large neural networks.

Problem
Auto-tuning for tensor-parallel DNNs

GPU memory cannot keep pace with the blooming of tensors in DNNs:
Parameters: 2941x
GPU memory: 5x

Current solution:
Tensor parallelism

-> But communication suffers from growing overhead: Broadcast + AllGather

Related Works

How to split the model: Megatron-LM
Good for transformers, bad for other architectures

Alpa (OSDI '22)
Two-level optimization can discover nested TP-PP strategies
Search time is large for big models

Challenge

  1. Heterogeneous model architectures: UNet, uneven architectures
    FC layers

  2. The search space for TP is exponential
    The size of the decision space is too large
    Complexity: $O(3^N)$
    Manual strategies are labor-intensive and inflexible

  3. An operator's split also depends on its neighbors
    In a DAG, changing one operator's split affects its neighboring operators


Main Design

We Design

Abstraction -> graph pruning -> split the tensor
Graph Pre-processor -> Strategy Explorer -> Graph Reconstructor

Abstraction

network -> DAG

Tensor parallelism = graph partitioning

-> Graph IR: Constrained Search Space
Split, Replicate, Communicate

So we can split the graph along these per-operator decisions (see the sketch below).
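
To make the abstraction concrete, here is a toy sketch (my illustration, not TAPAS code) where each operator in the DAG picks one of three decisions, which is exactly why the naive search space is $O(3^N)$:

```python
from itertools import product

# Each operator chooses one tensor-parallel decision in the graph IR.
DECISIONS = ("split", "replicate", "communicate")

dag = {              # toy model: operator -> downstream operators
    "embed": ["attn"],
    "attn":  ["fc"],
    "fc":    [],
}

# Naive search space: 3^N plans over N operators.
plans = [dict(zip(dag, p)) for p in product(DECISIONS, repeat=len(dag))]
print(len(plans))    # 27 for this 3-operator toy graph

def cost(plan):
    """Hypothetical cost model: penalize mismatched decisions between
    neighbors, since an operator's split must agree with its neighbors'."""
    return sum(plan[u] != plan[v] for u in dag for v in dag[u])

print(min(plans, key=cost))
```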

Graph-pruning

Key: Subgraph

Observation:
In many models, we see similar layers/structures.
Examples:

  • Linear Layer
  • Self-attention

They share many similar patterns.

Use a frequent-subgraph-search algorithm to find the recurring subgraphs (find the frequent subgraphs, then mine them).

Tensor-parallel split / validator:
make sure the plan works and runs

The validator makes sure the communication/copy is correct

Cost-model evaluation

11K+ LOC on TensorFlow

Paper Session 1: AI in Computing

Cockatoo
ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

Pisces: Towards Adaptive and Fair Congestion Control via Multi-Agent Meta-Reinforcement Learning

Congestion Control

Existing CCAs:
Rule-Based CCAs

  • Simple
    But:
  • Limited adaptability
  • Manual tuning
  • Imitation learning

Learning-based CCAs
Fairness is not guaranteed
Hybrid methods like Orca may break Cubic fairness
They are workload-unaware, so they may not be effective

Motivation
Limited generalization and adaptability

Design
A learning-based CCA for federated learning

Uses a POMDP (Partially Observable Markov Decision Process, a kind of Markov process)

Framework

Learner
A two-level optimizer:
Inner loop: MAPPO
Outer loop: meta-learning with the Reptile algorithm (see the sketch below)
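
A minimal numpy sketch of the Reptile outer loop; the inner loop here is a toy gradient descent stand-in for MAPPO, and all names and hyperparameters are illustrative:

```python
import numpy as np

def inner_loop(theta, task, steps=5, lr=0.1):
    """Stand-in for MAPPO training on one task: a few gradient steps
    on a toy quadratic loss ||phi - task||^2."""
    phi = theta.copy()
    for _ in range(steps):
        phi -= lr * 2 * (phi - task)
    return phi

def reptile(tasks, dim=4, meta_steps=1000, eps=0.1):
    """Reptile meta-update: nudge theta toward the task-adapted weights."""
    theta = np.zeros(dim)
    for _ in range(meta_steps):
        task = tasks[np.random.randint(len(tasks))]
        phi = inner_loop(theta, task)
        theta += eps * (phi - theta)   # the Reptile outer-loop update
    return theta

tasks = [np.random.randn(4) for _ in range(8)]
print(reptile(tasks))   # ends up near the center of the task optima
```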

P3P-Fed: Peer-to-Peer Personalized Federated Learning with DHT-based Local Clustering

Onsite Paper

Aheon Lim
Master's student at Korea Aerospace University

What is Personalized FL?
Federated Learning (FL) enables the training of a single global model in a distributed manner while preserving the privacy of local data. In real-world FL environments, however, characterized by high heterogeneity in client data distributions (non-IID), the global model suffers from performance degradation. To address this challenge, Personalized Federated Learning (PFL) has been proposed to provide personalized models either by clustering clients or by decoupling the model into shared and personalized components.

FL

Challenge

Design

  1. DHT
    Distributed Hash Table

e.g., "cat" -> 5D: keys are hashed into the DHT's identifier space (see the sketch after this list)

  2. P2P
    Local clustering algorithm

Push - pull strategy
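
A toy sketch of the DHT lookup idea (hash a key into the identifier space, then route to the responsible node), which is presumably what the "cat -> 5D" example illustrates; the id width and node ids here are my own construction:

```python
import hashlib

def dht_key(value: str, bits: int = 8) -> int:
    """Hash a value into the DHT's identifier space."""
    return hashlib.sha1(value.encode()).digest()[0] % (1 << bits)

def responsible_node(key: int, node_ids: list) -> int:
    """Consistent-hashing style lookup: first node id >= key, wrapping around."""
    for nid in sorted(node_ids):
        if nid >= key:
            return nid
    return min(node_ids)

nodes = [0x20, 0x80, 0xC0, 0xE0]
key = dht_key("cat")
print(hex(key), "->", hex(responsible_node(key, nodes)))
```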

Experiment
use MPS


Paper Session 6: Performance

Carbon-Aware Workflow Scheduling with Fixed Mapping and Deadline Constraint

Large data and computing centers consume a significant share of the world’s energy consumption. A prominent subset of the workloads in such centers are workflows with interdependent tasks, usually represented as directed acyclic graphs (DAGs). To reduce the carbon emissions resulting from executing such workflows in centers with a mixed (renewable and non-renewable) energy supply, it is advisable to move task executions to time intervals with sufficient green energy when possible.

To this end, we formalize the above problem as a scheduling problem with a given mapping and ordering of the tasks. We show that this problem can be solved in polynomial time in the uniprocessor case. For at least two processors, however, the problem becomes NP-hard. Hence, we propose a heuristic framework called CaWoSched that combines several greedy approaches with local search. To assess the 16 heuristics resulting from different combinations, we also devise a simple baseline algorithm and an exact ILP-based solution. Our experimental results show that our heuristics provide significant savings in carbon emissions compared to the baseline.

SOTA
a paper from 2002

Intro
CaWoSched with efficient heuristics combining various greedy approaches with local search
Extensive simulations

DP

Model

The workflow can be represented as a DAG.

We want to schedule this problem under this model for green (carbon-aware) scheduling.

b5b6038cda4902267c77ba0cd0c05bec.jpg

Here we show an example DAG program. For example, we run Job 1 on Machine 1 (black); then the job on the left (red) can run on Machine 3 (red). When we move data from Machine 1 to Machine 3, we incur a communication step (the diamond).

09956686f13459dfa7610d2aa7160998.jpg

Objective: Find a schedule that minimizes carbon cost while meeting the deadline.

Problem 1: 1 CPU, N ordered tasks
Polynomial complexity
Solved with DP (see the sketch below)
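
A minimal DP sketch for the uniprocessor case, assuming discrete time slots with known carbon intensity, a fixed task order, and a hard deadline (all of this is my illustration, not the paper's exact formulation):

```python
from functools import lru_cache

def min_carbon(durations, carbon, deadline):
    """Minimum total carbon for N ordered tasks on one CPU.

    durations[i]: number of slots task i occupies (tasks run in order).
    carbon[t]: carbon intensity of time slot t.
    All tasks must finish by `deadline`.
    """
    n = len(durations)

    @lru_cache(maxsize=None)
    def best(i, t):
        if i == n:                                 # everything scheduled
            return 0.0
        if t + sum(durations[i:]) > deadline:
            return float("inf")                    # remaining tasks can't fit
        run_now = sum(carbon[t:t + durations[i]]) + best(i + 1, t + durations[i])
        wait = best(i, t + 1)                      # idle one slot for greener power
        return min(run_now, wait)

    return best(0, 0)

# Slots 3-5 have green power; the DP delays both tasks into them.
print(min_carbon([1, 2], carbon=[9, 9, 9, 1, 1, 1], deadline=6))  # 3.0
```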

Problem 2: N CPUs, N ordered tasks
NP-hard
Reduction from 3-Partition


Algorithm for the multiprocessor case

  1. Assign scores to tasks: A. deadline; B. green power budget
  2. Greedily optimize for carbon cost (if the green power is low at some point, the task starts later)

Score?

A. For the deadline:
Slack
Pressure

b88a259084bc7e4c3dd6af224fb882d7.jpg

B. For the CPU:
CPU slides XXXX

Greedy
First, put every job onto the CPU workers

Local search
Exploit the remaining flexibility in the schedule:
try to move some jobs to time slots with greener power (see the sketch below)
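
A sketch of what such a local-search phase could look like (hill climbing over start-time moves; the `moves`, `feasible`, and `carbon_cost` callables are assumed helpers, and this is not CaWoSched itself):

```python
def local_search(schedule, moves, feasible, carbon_cost):
    """Greedy hill climbing: accept any move that keeps the schedule
    feasible (deadline, ordering) and strictly lowers carbon cost.

    moves(schedule) yields candidate schedules, e.g. one job shifted
    to a different (greener) start time.
    """
    improved = True
    while improved:
        improved = False
        for candidate in moves(schedule):
            if feasible(candidate) and carbon_cost(candidate) < carbon_cost(schedule):
                schedule, improved = candidate, True
                break   # rescan moves from the improved schedule
    return schedule
```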

Setup & Experiments
All in simulation: 1D, 1.5D, 2D, 3D

Including local search
The cost ratio is the best cost found divided by the algorithm's cost

They achieve a 40% decrease in CO2 emissions

HHOTuner: Efficient Performance Tuning with Harris Hawks Optimization

Iowa State University
NSF

For large-space auto-tuning

Autotuning: Code+Arch+Tune = Best performance

They use an evolutionary algorithm called Harris Hawks Optimization.

This is a very very cooooool figure

05cc251ea84fab7bc67c86aab8d816de.jpg

Harris Hawks algorithm idea

Main Ideas

  1. Exploration phase:
    • Hawks search the environment (global exploration of the solution space).
    • Two strategies: random perching or scanning for potential prey.
  2. Exploitation phase:
    • Once prey is spotted, hawks begin encircling and besieging it (local exploitation).
    • Several strategies are used: “soft siege,” “hard siege,” or random surprise pounces.
  3. Energy parameter (E):
    • Represents the prey’s escaping energy, which decreases over time.
    • When $|E| \ge 1$, exploration dominates.
    • When $|E| < 1$, exploitation dominates.
  4. Surprise pounce mechanism:
    • Mimics the unpredictable behavior of prey escaping.
    • Introduces randomness to help the algorithm escape local optima.
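
A compact sketch of the HHO loop (simplified: only the random-perch exploration and the basic soft/hard sieges are shown; the surprise-pounce variants with rapid dives are omitted):

```python
import numpy as np

def hho(objective, dim, lo, hi, n_hawks=20, iters=200):
    """Simplified Harris Hawks Optimization for minimization."""
    hawks = np.random.uniform(lo, hi, (n_hawks, dim))
    rabbit = min(hawks, key=objective).copy()      # best solution = the prey

    for t in range(iters):
        for i in range(n_hawks):
            E0 = np.random.uniform(-1, 1)
            E = 2 * E0 * (1 - t / iters)           # escaping energy decays over time
            if abs(E) >= 1:                        # exploration: perch near a random hawk
                rand = hawks[np.random.randint(n_hawks)]
                hawks[i] = rand - np.random.rand() * np.abs(
                    rand - 2 * np.random.rand() * hawks[i])
            elif abs(E) >= 0.5:                    # soft siege
                J = 2 * (1 - np.random.rand())     # prey's random jump strength
                hawks[i] = (rabbit - hawks[i]) - E * np.abs(J * rabbit - hawks[i])
            else:                                  # hard siege
                hawks[i] = rabbit - E * np.abs(rabbit - hawks[i])
            hawks[i] = np.clip(hawks[i], lo, hi)
            if objective(hawks[i]) < objective(rabbit):
                rabbit = hawks[i].copy()
    return rabbit

# Toy objective: sphere function, minimum at the origin.
print(hho(lambda x: float(np.sum(x ** 2)), dim=3, lo=-5, hi=5))
```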


SOTA benchmarks
Bliss (2021): auto-tuning complex applications using a pool of diverse lightweight learning models
OpenTuner (2014), from MIT CSAIL

One GPU, Many Ranks: Enabling Performance and Energy-Efficient In-Transit Visualization via Resource Sharing

ANL

TL;DR from Haibin: they use MIG/MPS/time-slicing on the GPU to accelerate an HPC workflow.

In-transit visualization has become essential in high-performance computing (HPC) to reduce I/O overheads and enable real-time data analysis. However, as simulations grow in scale and complexity, these visualization tasks increasingly demand substantial computational resources, exacerbating energy consumption and limiting system scalability. As we show in this paper, a key bottleneck is the conventional one-rank-per-GPU allocation model, which leads to irregular GPU utilization and waste of hardware resources.

To tackle this challenge, we propose using GPU-sharing strategies to improve energy efficiency in in-transit visualization without compromising performance. We evaluate six distinct configurations built upon three NVIDIA GPU-sharing mechanisms: the default CUDA model with context switching between processes, Multi-Process Service (MPS), which enables dynamic context sharing, and Multi-Instance GPU (MIG), which provides hardware-level partitioning. Using the WarpX simulation code and Ascent visualization framework, our experiments on the Polaris supercomputer span multiple rendering techniques, node counts, and data sizes.

Results show that GPU-sharing strategies can improve the trade-off between performance and energy, represented by the energy-delay product (EDP) metric, by up to 81.7%. We also show that workload-aware strategy selection is essential to improve performance-energy efficiency: MIG-based configurations are more effective for lightweight and regular workloads, offering up to 64.5% energy savings, while MPS better handles GPU-intensive workloads, achieving up to 71.1% EDP improvement. Finally, we demonstrate that optimized sharing strategies can reduce the required compute nodes by up to 75%, freeing system resources for concurrent workloads.
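
For reference, the energy-delay product combines both objectives into a single metric: $\mathrm{EDP} = E \times T$, where $E$ is the energy consumed and $T$ the execution time; lower is better.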

WarpX, HACC, Nyx
-> I/O bottleneck

How do we minimize it?

The GPU is underutilized

  1. GPU Sharing: Time-Slice, MPS, MIG
    c894c14def0059a4e3426a17d41012da.jpg

  2. How many nodes do you really need?

  3. What's the impact?

Test bed
cde7eeb9dca41f21656832664f42ba8c.jpg

Methodology
Just testing: scaled each config from 1 to 10 nodes

Scheduling based on GPU energy + performance

MPS can oversubscribe the GPU, which may be why it performs well.