ICPP25 Conference story: Day 2

16d4a3b80eb7a9b440a92e21425f314b.jpg

Anne Elster, "Parallel Computing and Geophysical Forecasting"

Professor Anne C. Elster, Norwegian Univ. of Science and Technology, Center for Geophysical Forecasting; University of Texas at Austin

Geophysical forecasting offers the opportunity to leverage some of the cutting-edge technologies from the oil and gas sector to improve, for instance, geohazard monitoring and forecasting sudden events along roads and railways. This also includes the use of new methods for monitoring and mapping life and geophysical events at sea and near the seabed. Modern seismic sensors and DAS (Distributed Acoustic Sensing) systems also generate vast datasets we will need both AI and parallel computing techniques to fully make use of. These tasks thus offer many interesting research challenges related to parallel and distributed computing over the next several years.

This talk will highlight some of the ongoing work my group and colleagues are involved in at The Center for Geophysical Forecasting at NTNU. This includes discussing some of our work related to utilizing AI and HPC techniques such as autotuning and combining real experimentation with modeling and vice versa, and how these can impact applications.

They are developing geophysical forecasting software using OpenMP and CUDA, and they accelerated the running time from 4 hours to 6 seconds (?).

Results from running their software:

da8faa30e085147901787717d2e015d4.jpg

234726e52270ce413547e75467e85ec0.jpg

Smoothed Particle Hydrodynamics

One of the main computational methods they use is SPH (Smoothed Particle Hydrodynamics).

How it works

  • Imagine each particle has a “smoothing kernel” (like a soft sphere around it).
  • Physical quantities (like density or pressure) at any point are computed by taking a weighted average of nearby particles inside this kernel.
  • The fluid equations (mass, momentum, energy conservation) are rewritten in terms of particle interactions.

So the particles move according to physics, and together they behave like a continuous fluid.
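The weighted average described above can be sketched in a few lines. This is a minimal 1D illustration, not the code shown in the talk: the cubic spline kernel, the function names, and the particle spacing are all assumptions chosen for the example.

```python
import math

def cubic_spline_kernel(r, h):
    """1D cubic spline smoothing kernel W(r, h); it is zero beyond 2h."""
    q = abs(r) / h
    sigma = 2.0 / (3.0 * h)  # 1D normalization so the kernel integrates to 1
    if q < 1.0:
        return sigma * (1.0 - 1.5 * q ** 2 + 0.75 * q ** 3)
    if q < 2.0:
        return sigma * 0.25 * (2.0 - q) ** 3
    return 0.0

def density(x, positions, masses, h):
    """Density at x: kernel-weighted sum over the masses of nearby particles."""
    return sum(m * cubic_spline_kernel(x - xj, h)
               for xj, m in zip(positions, masses))

# Hypothetical setup: 21 evenly spaced particles representing a fluid
# whose rest density is 1000 (mass per unit length in this 1D example).
dx = 0.1
rest_density = 1000.0
positions = [i * dx for i in range(21)]
masses = [rest_density * dx] * len(positions)
h = 1.3 * dx  # smoothing length, a common choice relative to the spacing

rho_mid = density(positions[10], positions, masses, h)
print(f"density at an interior particle: {rho_mid:.1f}")
```

In the interior the estimate stays close to the rest density. Near the ends of the particle row it drops, because part of the kernel has no neighbours to sum over.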

Full Waveform Inversion

9d74e6203f904d8bc0850449abad2ae3.jpg

Full Waveform Inversion (FWI) is a computational method in geophysics used to create very detailed images of the Earth’s subsurface.

  • Input: Seismic data (waves recorded after they travel through the Earth).
  • Output: A high-resolution model of underground properties (like velocity, density, etc.).

It’s widely used in oil & gas exploration, earthquake studies, and geothermal energy research.

How it works

  1. Start with a guess: Assume an initial Earth model (e.g., velocity distribution).
  2. Simulate waves: Use numerical methods (like finite differences) to compute how seismic waves would travel through this model.
  3. Compare with real data: Compare the simulated seismic waveforms with the actual recorded seismic signals.
  4. Update the model: Adjust the Earth model to reduce the difference (the misfit).
  5. Iterate: Repeat the process until simulated and real data match closely.
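The five steps above can be sketched as a toy loop. This is only an illustration under strong assumptions: a 1D medium with a single unknown velocity, a finite-difference gradient, and a simple step-halving update. Real FWI inverts for many parameters and typically uses adjoint-state gradients. All names and numbers here (`simulate`, `invert`, the grid, the 2000 m/s "true" velocity) are made up for the example.

```python
import math

def ricker(t, f0=10.0, t0=0.1):
    """Ricker wavelet used as the (assumed) seismic source."""
    a = (math.pi * f0 * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)

def simulate(c, nx=121, dx=10.0, nt=200, dt=0.002, src=60, rec=100):
    """Step 2: finite-difference 1D wave propagation; returns the receiver trace."""
    prev, curr = [0.0] * nx, [0.0] * nx
    r2 = (c * dt / dx) ** 2  # squared Courant number, must stay below 1
    trace = []
    for n in range(nt):
        nxt = [0.0] * nx  # end points stay 0 (fixed boundaries)
        for i in range(1, nx - 1):
            nxt[i] = (2.0 * curr[i] - prev[i]
                      + r2 * (curr[i + 1] - 2.0 * curr[i] + curr[i - 1]))
        nxt[src] += dt * dt * ricker(n * dt)  # inject the source
        prev, curr = curr, nxt
        trace.append(curr[rec])
    return trace

def misfit(c, observed):
    """Step 3: least-squares difference between simulated and recorded traces."""
    return 0.5 * sum((s - o) ** 2 for s, o in zip(simulate(c), observed))

def invert(observed, c0, iterations=12, step=40.0, min_step=0.5, delta=1.0):
    """Steps 4 and 5: move the velocity downhill until the step gets tiny."""
    c = c0  # step 1: the initial guess
    history = [misfit(c, observed)]
    for _ in range(iterations):
        if step < min_step:
            break
        # Finite-difference estimate of d(misfit)/dc.
        grad = (misfit(c + delta, observed) - misfit(c - delta, observed)) / (2.0 * delta)
        direction = -1.0 if grad > 0 else 1.0
        # Halve the step until the trial model actually lowers the misfit.
        while step >= min_step:
            trial = c + direction * step
            j = misfit(trial, observed)
            if j < history[-1]:
                c = trial
                history.append(j)
                break
            step *= 0.5
    return c, history

# Synthetic "recorded" data from a known velocity, then recover it.
observed = simulate(2000.0)
c_est, history = invert(observed, c0=1870.0)
print(f"estimated velocity: {c_est:.1f} m/s after {len(history) - 1} updates")
```

The starting guess is deliberately close to the true value. If it were far off, the simulated and recorded wavelets could be misaligned by more than half a period, and a local update like this one could settle on the wrong model.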

AI+HPC

That's fairly close to what Lucas works on!

0a3fffa5a36f14920fbeee5359b90d74.jpg

Industry Session: Max Chung, Supermicro

67d0906feca018b6b6f2bea983debf62.jpg

Max introduced what Supermicro sells.
ea64b6f9e5c9cbab9dac1c32cbe504c1.jpg

e552174f1e2aea69555c4b694a07cca4.jpg

eeaad663eab720160351bb584b95649c.jpg

What is their NVLink speed?
200 GB/s for RDMA

dedc833ed8ae6693d32b9421b9eb8919.jpg

CXL still lacks compelling use cases

aeb5b7fe05e3494583a8c8a237ebb25c.jpg

How do you know how much power you need?
Run HPL (High-Performance Linpack) to measure the power draw at full load

How do you control power on each node?
With hardware:
BMC: baseboard management controller
PDU: power distribution unit

break

Paper Session 7: Architecture

Toucan

Chairman: Taisuke Boku

LLaMCAT: Optimizing Large Language Model Inference with Cache Arbitration and Throttling 

d83368d6d765ba9dfe3dba2c45411027.jpg

We introduce LLaMCAT, a novel approach to optimize the LLC for LLM inference. LLaMCAT combines Miss Status Holding Register (MSHR)- and load balance-aware cache arbitration with thread throttling to address stringent bandwidth demands and minimize cache stalls in KV Cache access. We also propose a hybrid simulation framework integrating analytical models with cycle-level simulators via memory traces, balancing architecture detail and efficiency.

Problem

LLM decoding is memory bound.

They found that decoding makes good use of the LLC, but poor use of the first-level cache.

4fd3ad7b91630a17052d8839d3381144.jpg

Key pain point
Contention in the miss-handling architecture (the MSHRs) stalls even cache-hit traffic

LLC is shared among CPU cores / GPU cores.

GQA (grouped-query attention) reduces but does not eliminate memory-boundedness

3355149b7e74b45533891683eace1e4b.jpg

A cache has two kinds of bandwidth:
d91513d3c60bed12549b1717bb15d8fb.jpg

upstream bandwidth (SRAM bandwidth)
downstream bandwidth (more related to the miss handling throughput)
8caf9e7d079be7ea1e5c9e0d3ff1e240.jpg

Main Design
332ac9fe7ff09d3aba1a0410cb83c306.jpg

Contribution
91a3305d006d31b41b895920d16b0216.jpg

Power Capping of GPU Servers for Machine Learning Inference Optimization 

OSU Mayuan

a7beda5f960a218446258867c8f87c68.jpg

Data center analysis system
server lab

Scenario: AI workloads in the data center

Power capping technique
88bb06a5c564c70a8e138b4a69f0c039.jpg

Dynamic Voltage and Frequency Scaling (DVFS)
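A rough way to see why DVFS helps with power capping is the usual first-order model of dynamic CMOS power, P ≈ a·C·V²·f. The sketch below only evaluates that formula. It does not come from the paper, and the capacitance, voltage, and frequency values are placeholder assumptions rather than measurements of any real GPU.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """First-order dynamic power model: P = activity * C * V^2 * f (watts)."""
    return activity * capacitance * voltage ** 2 * frequency

# Placeholder operating points (assumed values, not measurements).
base = dynamic_power(capacitance=1e-9, voltage=1.0, frequency=2.0e9)
scaled = dynamic_power(capacitance=1e-9, voltage=0.8, frequency=1.6e9)

ratio = scaled / base
print(f"power after scaling V and f to 80%: {ratio:.1%} of baseline")
```

Because voltage enters squared, lowering voltage and frequency together cuts power much faster than it cuts clock speed, which is what makes the trade-off attractive when throughput is limited by memory rather than compute.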

Solution
40a07e5b9d4d97b7ad9bc5b40c1c575e.jpg

cdbde9e68476c7e21427e37c8bc9d89e.jpg

Noon

Good sessions I didn't get to attend:

Efficient Cross-Datacenter Congestion Control with Fast Control Loops

Paper Session 10: Programming Environment
Design and Optimization of GPU-Aware MPI Allreduce Using Direct Sendrecv Communication
Optimizing NumPy with SVE Acceleration on ARM Architectures
pyGinkgo: A Sparse Linear Algebra Operator Framework for Python

4572c9e1062e6fbfcf56a878509cff64.jpg

Afternoon

Online Paper Session 8
Zoom online session
SmartBlock: Adaptive Block Floating Point Quantization for Efficient DNN Acceleration
Design of Interposer Interconnection Network Based on High-Radix Interposer Routers
Automated FPGA Accelerator Generation Framework for Transformers with Dataflow Optimization

Paper Session 8: Multidisciplinary

Now comes my part!

Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision

From MIT

They profile GPU kernels for singular value computation across different GPUs.

ParaCOSM: A Parallel Framework for Continuous Subgraph Matching

39c9c1e99b88eb7f960b7191799f1761.jpg

Full video:

"ICPP25: When High-Frequency Trading Meets Graph Computing" https://www.bilibili.com/video/BV1njHhzbEAT/?share_source=copy_web&vd_source=72eac555730ba7e7a64f9fa1d7f2b2d4

Thievory: Graph Processing with Multi-GPU Memory Stealing

Paper Session 11: System Architecture and System Software
PTWalker: Cache-Efficient Random Walks via Alternating Dual-Subgraph Walker Updating
Accelerating an Electromagnetic Simulation via Memory-Constrained Task-Based Load Balancing
Leave No One Behind: Fair and Efficient Tiered Memory Management for Multi-Applications

Online Paper Session 9
Zoom online session
SINA: Accelerating Time Synchronization in Large-Scale Network Simulation Using In-Network Allreduce
Optimizing Direct Convolutions on High-Performance Multi-Core DSPs
SpeedSketch: An Ultra-Fast Sketch Generation and Delta Encoding Framework for Delta Compression

Online ICPP Paper session

Paper Session 9: AI in Computing & Performance

Toucan
Auto-Stencil: Performance-Driven Stencil Optimization with Hardware Feedback for LLMs
From Qinghua HPC-Science Team

ed8067cf760edec6864892347646bb25.jpg

Solving Extended Flexible Job Shop Scheduling Problems with Deep Reinforcement Learning
Architecture-Aware Models of AI Engines for High-Performance Matrix Matrix Multiplication

Onsite Paper

Paper Session 12: Algorithms & Performance

Macaw

Deadline-Aware Scheduling of Mixed-Criticality Tasks

Maxime focuses on scheduling work across different CPU nodes for AI-for-science software (ALCF).

2281578525aa5ea51d30667af080097c.jpg

They can already handle linear scheduling well with dynamic programming (DP), but on multiple CPUs the problem is NP-hard.
218a07d64a9705360dafa1e5794ab2a1.jpg

So Maxime uses a greedy algorithm:
9689c51f4d459a2831ce7d1cb535d01f.jpg
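To give a feel for what a greedy approach to multi-CPU deadline scheduling can look like, here is a generic sketch. It is not the algorithm from the paper: it takes tasks in earliest-deadline-first order and puts each one on whichever CPU becomes free first. The task tuples, the `greedy_schedule` name, and the example numbers are all hypothetical.

```python
import heapq

def greedy_schedule(tasks, num_cpus):
    """Assign tasks to CPUs greedily.

    tasks: list of (name, duration, deadline).
    Returns a list of (name, cpu, start, finish, met_deadline).
    """
    # Min-heap of (time the CPU becomes free, cpu id).
    free_at = [(0.0, cpu) for cpu in range(num_cpus)]
    heapq.heapify(free_at)
    schedule = []
    # Greedy choice 1: handle the most urgent deadline first.
    for name, duration, deadline in sorted(tasks, key=lambda t: t[2]):
        # Greedy choice 2: use the CPU that frees up earliest.
        start, cpu = heapq.heappop(free_at)
        finish = start + duration
        schedule.append((name, cpu, start, finish, finish <= deadline))
        heapq.heappush(free_at, (finish, cpu))
    return schedule

# Hypothetical workload: (name, duration, deadline)
tasks = [("a", 3, 4), ("b", 2, 3), ("c", 4, 8), ("d", 1, 9)]
schedule = greedy_schedule(tasks, num_cpus=2)
for name, cpu, start, finish, ok in schedule:
    print(f"{name}: cpu {cpu}, {start}-{finish}, deadline met: {ok}")
```

A heuristic like this runs in O(n log n) time but gives no optimality guarantee, which is the usual price for sidestepping an NP-hard problem.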

Fast Exact Diameter Computation of Sparse Graphs

533a8913d15842e52fb3b5487f5be11c.jpg

Joint Prediction and Matching for Computing Resource Exchange Platforms

Wonderful Night

f786b1d82c7eb803d57e53135c0e6a14.jpg

20e26aa76646feea694f8fd7db57dda5.jpg

ce671ed3f54de9698c9fd0cb4b92a0ab.jpg

9c5fbc4fb8ae769cdb217f375a47fde2.jpg

b1f9ee56787818e75816956f2a98b4b0.jpg

3881cf8d8f77455bc167823185df8b92.jpg

It was a great day and a great conference!!