Overview of Task-Based Parallelism Models and Their Techniques

Many task-based programming models have been developed to date; this section surveys several of them.

Charm++

Website: https://charmplusplus.org/applications/
Github: https://github.com/charmplusplus/charm
Tutorial: https://charm.readthedocs.io/en/latest/index.html

Developed by the Parallel Programming Laboratory (PPL) at UIUC

Actively maintained, with updates as recent as 2023

Widely used in HPC and scientific computing


Vector Load Balancing

https://charm.cs.illinois.edu/newPapers/23-01/Dissertation_Buch_Final.pdf

This doctoral dissertation proposes Vector Load Balancing as a solution to the increasingly complex load-imbalance problems in modern high-performance computing (HPC).

Traditional load balancing treats "load" as a scalar value, typically representing only the total execution time. In contrast, Vector Load Balancing redefines load as a vector $\vec{\ell} \in \mathbb{R}^d_{\ge 0}$, where each dimension represents a different performance characteristic or resource-consumption aspect of the execution.
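For instance (an illustrative encoding, not necessarily the dissertation's), a task's load could be recorded as $\vec{\ell} = (t_{\text{phase1}}, t_{\text{phase2}}, t_{\text{GPU}}, m_{\text{mem}})$, keeping per-phase times, accelerator time, and memory footprint in separate dimensions.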

When dealing with modern complex applications and systems, traditional scalar load balancing often fails to make optimal decisions due to information loss (dimensional compression):

  • Inability to Capture Phased Structures: Many programs consist of multiple sequentially executed phases, and the load distribution can vary significantly between phases. Scalar aggregation masks severe imbalances within individual phases.
  • Parallelism of Heterogeneous Resources: In systems using both CPUs and accelerators (e.g., GPUs) simultaneously, a single numerical value cannot accurately describe the concurrent loads on both types of resources.
  • Limitations of Non-Temporal Constraints: Constraints in non-temporal dimensions, such as memory footprint or network bandwidth, are difficult to represent in a scalar model.

This thesis addresses performance bottlenecks in the following specific scenarios through a vectorized model:

  • Phase-Based Applications:
    By storing the load of each phase in a different dimension of the vector, the load balancer can identify which objects are overloaded in specific phases. This enables more precise redistribution, preventing processors from experiencing significant idle wait times at each barrier.
  • Heterogeneous Systems Load:
    Handles tasks running concurrently on CPUs and GPUs. The vector model allows for the simultaneous optimization of utilization across multiple processing targets.
  • Constraint-Based Load Balancing:
    While optimizing execution time, it incorporates hard physical resource constraints like memory capacity as vector dimensions. This prevents load balancing strategies from causing node memory overflow (OOM) crashes.
  • Algorithm Scalability:
    Vectorization increases the dimensionality of the problem, inviting the "curse of dimensionality". The author addresses the computational-efficiency challenges of vector load balancing at scales of tens of thousands of processors by introducing optimization techniques such as spatial-partitioning trees (e.g., k-d trees), Pareto-frontier search, and hierarchical load balancing.

By upgrading the notion of load from a one-dimensional to a multi-dimensional quantity, vector load balancing lets the balancer analyze and optimize performance along several axes at once.
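As a minimal sketch of the concept (not Charm++'s load-balancer API; the greedy heuristic and all names below are illustrative assumptions), the following assigns tasks carrying $d$-dimensional load vectors to the processor that minimizes the worst per-dimension load:

```cpp
#include <array>
#include <algorithm>
#include <iostream>
#include <limits>
#include <numeric>
#include <vector>

// Illustrative greedy vector load balancing (not Charm++ code).
// Each task carries a d-dimensional load vector (e.g., per-phase time,
// GPU time, memory). A task goes to the processor that minimizes the
// worst per-dimension load after placement.

constexpr int D = 3;                       // number of load dimensions (assumed)
using LoadVec = std::array<double, D>;

static double maxDim(const LoadVec& v) {
    return *std::max_element(v.begin(), v.end());
}

std::vector<int> greedyVectorLB(const std::vector<LoadVec>& tasks, int nprocs) {
    std::vector<LoadVec> procLoad(nprocs, LoadVec{});  // zero-initialized
    std::vector<int> where(tasks.size());

    // Heaviest-first placement, sorted by each task's largest dimension.
    std::vector<std::size_t> order(tasks.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return maxDim(tasks[a]) > maxDim(tasks[b]);
    });

    for (std::size_t t : order) {
        int best = 0;
        double bestCost = std::numeric_limits<double>::max();
        for (int p = 0; p < nprocs; ++p) {
            double cost = 0;  // worst dimension on p if task t were added here
            for (int d = 0; d < D; ++d)
                cost = std::max(cost, procLoad[p][d] + tasks[t][d]);
            if (cost < bestCost) { bestCost = cost; best = p; }
        }
        where[t] = best;
        for (int d = 0; d < D; ++d) procLoad[best][d] += tasks[t][d];
    }
    return where;
}

int main() {
    // Loads: (phase-1 time, phase-2 time, memory in GB).
    std::vector<LoadVec> tasks = {{4, 1, 2}, {1, 4, 2}, {3, 3, 1}, {2, 2, 4}};
    for (int p : greedyVectorLB(tasks, 2)) std::cout << p << ' ';
    std::cout << '\n';
}
```

A scalar balancer would see only each task's total time and could co-locate two tasks that are heavy in the same phase; the per-dimension cost above avoids exactly that.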

TBB

doc: https://uxlfoundation.github.io/oneTBB/main/tbb_userguide/Task-Based_Programming.html
Github: https://github.com/uxlfoundation/oneTBB.git

When Task-Based Programming Is Inappropriate

Using the task scheduler is usually the best approach to threading for performance; however, there are cases when the task scheduler is not appropriate. The task scheduler is intended for high-performance algorithms composed of non-blocking tasks. It still works if the tasks rarely block. However, if threads block frequently, there is a performance loss, because a blocked thread is not working on any tasks. Blocking typically occurs while waiting for I/O or for mutexes held for long periods. If threads hold mutexes for long periods, your code is not likely to perform well anyway, no matter how many threads it has. If you have blocking tasks, it is best to run them on full-blown threads instead.

How Task Scheduler Works

The scheduler runs tasks in a way that tries to achieve several targets simultaneously:

  • Enable as many threads as possible, by creating enough work, to achieve actual parallelism
  • Preserve data locality to make single-thread execution more efficient
  • Minimize both memory demands and cross-thread communication to reduce overhead

To achieve these goals, a balance between depth-first and breadth-first execution strategies must be struck. Assuming the task graph is finite, depth-first execution is better for sequential execution because:

  • Strike when the cache is hot. The deepest tasks are the most recently created and therefore the hottest in the cache. Moreover, once such a task completes, the tasks that depend on it can continue executing; though not the hottest in the cache, they are still warmer than older tasks deeper in the deque.
  • Minimize space. Executing the shallowest task leads to breadth-first unfolding of the graph, creating an exponential number of nodes that co-exist simultaneously. In contrast, depth-first execution creates the same number of nodes, but only a linear number exist at the same time, since the rest sit below the current task on a stack of ready tasks.

When a thread participates in task evaluation, it repeatedly executes a task obtained by the first applicable rule from the following (approximate) ruleset:

  • Get the task returned by the previously executed task, if any.
  • Take a task from the bottom of its deque, if any.
  • Steal a task from the top of another randomly chosen deque. If the selected deque is empty, the thread tries again to execute this rule until it succeeds.

Task Scheduler Bypass

Scheduler bypass is an optimization where you directly specify the next task to run, instead of spawning it. According to the rules of execution described in How Task Scheduler Works, spawning a new task that the current thread will then execute means pushing the task onto the thread's own deque and immediately popping it back off (rule 2 above). Bypass skips this deque round-trip, saving the push/pop overhead and closing the window during which the freshly spawned task could be stolen by another thread, which would hurt locality.
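As a conceptual sketch only (this is not the oneTBB API; `Task`, `spawn`, and `run_all` below are toy stand-ins), the loop shows how a task that returns its successor lets the worker skip the deque round-trip:

```cpp
#include <deque>
#include <functional>
#include <iostream>
#include <memory>

// Toy, single-threaded sketch of a task loop with scheduler bypass.
// A Task's body returns the next task to run (or nullptr), which skips
// the deque round-trip that a plain spawn would incur.

struct Task {
    std::function<std::shared_ptr<Task>()> body;  // returns bypass task, if any
};

std::deque<std::shared_ptr<Task>> localDeque;

void spawn(std::shared_ptr<Task> t) { localDeque.push_back(std::move(t)); }

void run_all() {
    std::shared_ptr<Task> next;
    while (true) {
        std::shared_ptr<Task> t;
        if (next) {                        // Rule 1: task returned by previous task
            t = std::move(next);
        } else if (!localDeque.empty()) {  // Rule 2: bottom of own deque
            t = std::move(localDeque.back());
            localDeque.pop_back();
        } else {
            break;                         // Rule 3 (stealing) omitted: one thread
        }
        next = t->body();                  // bypass: run returned task immediately
    }
}

int main() {
    auto leaf = std::make_shared<Task>(Task{[] {
        std::cout << "leaf\n";
        return std::shared_ptr<Task>{};
    }});
    auto root = std::make_shared<Task>(Task{[leaf] {
        std::cout << "root -> bypass to leaf\n";
        return leaf;                       // scheduler bypass: no push/pop
    }});
    spawn(root);
    run_all();
}
```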

HPX

doc: https://hpx-docs.stellar-group.org/latest/html/index.html
Git: https://github.com/STEllAR-GROUP/hpx.git

Actively updated as of 2024
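
A minimal example of HPX's futures-based tasking (assuming an installed HPX; `hpx::async` and `hpx::future` are part of the public API documented at the link above):

```cpp
// Minimal HPX tasking example: hpx::async runs work on HPX lightweight threads.
#include <hpx/hpx_main.hpp>   // turns main() into an HPX entry point
#include <hpx/future.hpp>
#include <iostream>

int square(int x) { return x * x; }

int main() {
    hpx::future<int> f = hpx::async(square, 6);  // schedule asynchronously
    std::cout << f.get() << std::endl;           // prints 36
    return 0;
}
```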


Cilk

Git: https://github.com/OpenCilk/opencilk-project.git
Paper: https://www.cse.wustl.edu/~angelee/home_page/papers/opencilk.pdf
doc: https://www.opencilk.org/doc/users-guide/getting-started/

The OpenCilk infrastructure consists of three main components: a compiler designed to compile fork-join task-parallel code, an efficient work-stealing runtime scheduler, and a productivity-tool development framework based on compiler instrumentation designed for fork-join parallel computations. OpenCilk is modular — modifying one component for the most part does not necessitate modifications to the other components — and easy to extend — its construction naturally encourages code reuse. Despite being modular and easy to extend, OpenCilk produces high-performing code.
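
The canonical fork-join example in Cilk (the `cilk_spawn`/`cilk_sync` keywords are part of the language; build details follow the getting-started guide linked above):

```cpp
#include <cilk/cilk.h>
#include <cstdio>

// Fork-join Fibonacci: the classic Cilk example.
long fib(long n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1); // child strand may execute in parallel
    long y = fib(n - 2);            // continuation runs in the parent strand
    cilk_sync;                      // join: wait for the spawned child
    return x + y;
}

int main() {
    std::printf("fib(30) = %ld\n", fib(30));
    return 0;
}
```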

The Burdened-DAG Model

Related Paper:

  1. https://www.csd.uwo.ca/~mmorenom/CS433-CS9624/Resources/p145-he.pdf
  2. https://arxiv.org/pdf/2211.08800v2

Actual speedup is influenced not only by the intrinsic work and span of a computation, but also by the performance of the scheduling algorithm and the cost of migrating tasks to load-balance the computation across processor cores. This section discusses prior work on scheduling bounds and introduces the "burdened-dag" model for incorporating migration costs.

Before the burdened bound, there is a classical upper bound on the completion time of a DAG computation: Graham's bound. It states that for a DAG with total work $W$ and critical-path length $L$ executed on $m$ processors, the execution time $T$ satisfies:

$$T \leq \frac{W - L}{m} + L$$

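For example (illustrative numbers), with $W = 100$, $L = 10$, and $m = 8$, Graham's bound gives $T \leq \frac{100 - 10}{8} + 10 = 21.25$, compared with the ideal $W/m = 12.5$.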

In the ideal Directed Acyclic Graph model, only the task execution time (work $T_1$) and the critical-path length (span $T_\infty$) are considered. However, in real-world multi-core systems, task migration incurs additional overhead:

  • Explicit costs: Management overhead from the scheduler for task allocation and state maintenance.
  • Implicit costs: Data migration costs due to cache misses when a task moves from one core to another.
  • Return costs: Synchronization overhead incurred when a function returns and finds that its parent task has been "stolen" by another core.

The Burdened-DAG is an enhanced model over the ordinary DAG:

  • It embeds the aforementioned "burden" onto every continuation edge of the DAG.
  • This is equivalent to assuming that task stealing occurs at every potential parallel point (simulating the worst-case scenario).
  • The Burdened Span $\hat{T}_\infty$ refers to the length of the longest path found in this graph, which includes all burden weights (see the sketch below).
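
A small sketch of the idea (toy code, not from the papers; the DAG, weights, and constant burden below are assumptions): compute the longest path twice, once ignoring and once charging a fixed burden on every continuation edge:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// Toy computation of span vs. burdened span on a tiny fork-join DAG.
// Node weights model task costs; a fixed "burden" (an assumed constant
// steal/migration overhead) is charged on every continuation edge.

struct Edge { int to; bool continuation; };

int main() {
    // DAG: node 0 spawns node 1 and continues to node 2; both join at node 3.
    std::vector<double> work = {1.0, 5.0, 5.0, 1.0};
    std::vector<std::vector<Edge>> adj = {
        {{1, false}, {2, true}},   // 0 -> 1 (spawned child), 0 -> 2 (continuation)
        {{3, false}},              // 1 -> 3 (join)
        {{3, true}},               // 2 -> 3 (continuation into the sync)
        {}
    };
    const double burden = 2.0;     // assumed per-steal overhead

    // Longest path by DP; nodes 0..3 are already in topological order.
    auto longest = [&](bool burdened) {
        std::vector<double> dist(work.size(), 0.0);
        for (std::size_t u = 0; u < adj.size(); ++u) {
            dist[u] += work[u];
            for (const Edge& e : adj[u]) {
                double w = dist[u] + (burdened && e.continuation ? burden : 0.0);
                dist[e.to] = std::max(dist[e.to], w);
            }
        }
        return *std::max_element(dist.begin(), dist.end());
    };

    std::cout << "span          = " << longest(false) << "\n"; // 1 + 5 + 1 = 7
    std::cout << "burdened span = " << longest(true)  << "\n"; // 7 + 2*burden = 11
    return 0;
}
```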


Substituting the burdened span for the ordinary span in Graham's bound gives a more realistic estimate, $T_m \leq \frac{T_1 - \hat{T}_\infty}{m} + \hat{T}_\infty$, so the predicted speedup $T_1/T_m$ saturates at the burdened parallelism $T_1/\hat{T}_\infty$ rather than the ideal parallelism $T_1/T_\infty$.