HPC Performance Optimization Tips (Slurm-Based Workloads)
Overview
Performance in High Performance Computing (HPC) environments is defined by two key factors:
- Execution time of your application
- Queue wait time before your job starts running
Optimizing both is critical. A fast application that waits hours in the queue is still inefficient.
After ensuring correctness, performance optimization is the most important step in HPC usage.
Key Concepts
Efficient HPC usage depends on how well your workload matches the underlying hardware:
- Tasks sharing CPU cache → lower latency
- Memory-intensive workloads → benefit from NUMA awareness
- Poor placement → wasted CPU cycles and longer runtimes
Proper use of Slurm scheduler parameters allows you to control this behavior.
Scope
This guide applies to Linux batch jobs executed via Slurm.
1. Understand Your Application Type
Before tuning anything, determine whether your application is:
- Single-threaded?
- Multi-threaded (OpenMP, shared memory)?
- Distributed (MPI)?
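The answer determines which Slurm options you should use: threads are requested with --cpus-per-task, while distributed processes are requested with --ntasks.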
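If you are unsure which category a binary falls into, the shared libraries it links against can give a rough hint. The sketch below is a heuristic only: it assumes a dynamically linked program and the common runtime library names (libgomp, libiomp5, libomp for OpenMP; libmpi, libmpich for MPI). The function name classify_libs and the binary ./my_app are hypothetical. Statically linked programs and programs that use raw pthreads will show up as "unknown".

```shell
#!/bin/bash
# Classify a program from the shared libraries it links against.
# Reads `ldd` output on stdin and prints: mpi, openmp, or unknown.
# A hybrid MPI + OpenMP program is reported as "mpi".
classify_libs() {
    local libs
    libs="$(cat)"
    if printf '%s\n' "$libs" | grep -qE 'libmpi|libmpich'; then
        echo "mpi"
    elif printf '%s\n' "$libs" | grep -qE 'libgomp|libiomp5|libomp'; then
        echo "openmp"
    else
        echo "unknown"
    fi
}

# Usage (./my_app is a hypothetical binary):
#   ldd ./my_app | classify_libs
```

Treat the result as a starting point and confirm it against the application's documentation or build settings.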
2. Multi-threaded Workloads (Shared Memory)
If your application is multi-threaded:
Best Practice
Keep all threads within the same CPU socket to:
- Maximize cache usage
- Avoid NUMA penalties
- Reduce memory latency
Example: use 24 cores on a single CPU socket (this assumes the socket has at least 24 cores)
Command line:
--ntasks=1 --cpus-per-task=24
Batch script:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
Important Considerations
This request may increase queue wait time, because all 24 cores must be free on the same node at the same time.
Scaling Strategy
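If your application scales well, gradually increase the thread count:
--cpus-per-task=N
How Slurm places the cores depends on the site configuration:
- With block core allocation (for example, the CR_CORE_DEFAULT_DIST_BLOCK setting), Slurm fills one socket first and spills over to the next socket only if needed
- Otherwise, cores may be allocated cyclically across sockets, so check the actual placement before relying on it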
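A scaling test can be sketched as a loop that submits the same batch script with increasing thread counts. In the sketch below, job.sh is a hypothetical batch script and sweep_cpus is a hypothetical helper name. It prints the sbatch commands by default; set DRY_RUN=0 to actually submit. Options given on the sbatch command line take precedence over #SBATCH directives inside the script.

```shell
#!/bin/bash
# Print (DRY_RUN=1, the default) or submit (DRY_RUN=0) one job per thread count.
# Usage: sweep_cpus <batch script> <thread count>...
sweep_cpus() {
    local script="$1"
    shift
    local n
    for n in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "sbatch --ntasks=1 --cpus-per-task=${n} --job-name=scale_${n} ${script}"
        else
            sbatch --ntasks=1 --cpus-per-task="${n}" --job-name="scale_${n}" "${script}"
        fi
    done
}

# job.sh is a hypothetical batch script name.
sweep_cpus job.sh 1 2 4 8 16 24
```

Comparing the runtimes of these jobs shows where adding cores stops paying off for your application.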
Recommended Additions (Critical in Practice)
1. Explicit CPU Binding
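Avoid scheduler ambiguity by binding tasks to cores. --cpu-bind is an srun option, so pass it on the srun line that launches the application rather than as an #SBATCH directive:
srun --cpu-bind=cores <application>
To use only one hardware thread per core, add:
#SBATCH --hint=nomultithread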
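To check which CPUs a process was actually given, you can inspect its affinity from inside the job. The following is a minimal sketch for Linux; allowed_cpus is a hypothetical helper name. It reads the kernel's affinity list from /proc and uses nproc, which respects the affinity mask. Adding verbose to the srun binding option (srun --cpu-bind=verbose,cores) also makes srun report the binding it applied.

```shell
#!/bin/bash
# Print the CPU IDs the current process is allowed to run on (Linux only).
allowed_cpus() {
    awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status
}

echo "Allowed CPUs: $(allowed_cpus)"
echo "Usable CPU count: $(nproc)"
```

If the usable CPU count is smaller or larger than the value you requested with --cpus-per-task, the binding is not what you expected.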
2. Control Thread Count in Application
Set environment variables:
export OMP_NUM_THREADS=24
OMP_NUM_THREADS should match --cpus-per-task: more threads than allocated CPUs oversubscribes the cores, and fewer threads leaves allocated cores idle.
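One way to keep the two values in sync is to derive the thread count from the allocation instead of hard-coding it. The batch script below is a sketch under a few assumptions: the binary name ./my_openmp_app is hypothetical, the fallback of 1 thread applies only when the script runs outside Slurm, and the OMP_PLACES and OMP_PROC_BIND settings are a common starting point rather than a universal recommendation.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24

# Take the thread count from the allocation; fall back to 1 outside Slurm.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

# Standard OpenMP controls: one place per core, threads packed close together.
export OMP_PLACES=cores
export OMP_PROC_BIND=close

echo "OpenMP threads: ${OMP_NUM_THREADS}"

# Launch line (./my_openmp_app is a hypothetical binary name).
# In recent Slurm releases srun does not inherit --cpus-per-task from sbatch,
# so it is passed again here:
#   srun --cpus-per-task="${SLURM_CPUS_PER_TASK}" --cpu-bind=cores ./my_openmp_app
```

With this pattern, changing --cpus-per-task is the only edit needed when you rescale the job.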
3. Avoid Oversubscription
Wrong:
--ntasks=24 --cpus-per-task=24 # 576 logical CPUs requested
Correct (for OpenMP):
--ntasks=1 --cpus-per-task=24
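The size of a request is the product of the two options, which is easy to check before submitting. A small sketch (total_cpus is a hypothetical helper name):

```shell
#!/bin/bash
# Total CPUs requested = number of tasks x CPUs per task.
total_cpus() {
    echo $(( $1 * $2 ))
}

echo "OpenMP request (1 task x 24 CPUs):     $(total_cpus 1 24) CPUs"
echo "Mistaken request (24 tasks x 24 CPUs): $(total_cpus 24 24) CPUs"
```

The second line reports 576 CPUs, which is the oversubscribed request shown above.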
3. When NOT to Use This Approach
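Do NOT rely on --cpus-per-task alone if:
- Your workload is MPI-based
- Tasks are independent
- You need multi-node scaling
Instead, set the number of tasks (hybrid MPI + OpenMP codes combine this with --cpus-per-task for the threads of each rank):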
--ntasks=N
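The mapping from application type to request shape used in this guide can be summarized in a small helper. This is an illustrative sketch only: slurm_flags_for is a hypothetical function, and it covers just the three pure cases (serial, OpenMP, MPI), not hybrid codes.

```shell
#!/bin/bash
# Print the basic sbatch/srun flags for an application type.
# Usage: slurm_flags_for <serial|openmp|mpi> [N]
slurm_flags_for() {
    local kind="$1"
    local n="${2:-1}"
    case "$kind" in
        serial) echo "--ntasks=1 --cpus-per-task=1" ;;
        openmp) echo "--ntasks=1 --cpus-per-task=${n}" ;;
        mpi)    echo "--ntasks=${n} --cpus-per-task=1" ;;
        *)      echo "unknown application type: ${kind}" >&2; return 1 ;;
    esac
}

slurm_flags_for openmp 24
slurm_flags_for mpi 48
```

For MPI, N counts processes, which Slurm may spread across several nodes; for OpenMP, N counts threads, which must all fit on one node.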
4. Trade-off: Performance vs Queue Time
| Strategy | Result |
|---|---|
| Large CPU block | Faster execution, longer wait |
| Smaller allocation | Faster start, longer runtime |
Best practice:
- Benchmark both approaches
- Optimize total turnaround time, not just runtime
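Turnaround time can be computed from a job's submit, start, and end timestamps, which sacct can report (for example with --format=JobID,Submit,Start,End). The sketch below assumes GNU date for timestamp parsing; turnaround is a hypothetical helper name and the timestamps are made-up illustration values, not measurements.

```shell
#!/bin/bash
# Compute queue wait, runtime, and total turnaround in seconds.
# Usage: turnaround <submit> <start> <end>   (ISO timestamps, GNU date required)
turnaround() {
    local submit start end
    submit=$(date -u -d "$1" +%s) || return 1
    start=$(date -u -d "$2" +%s) || return 1
    end=$(date -u -d "$3" +%s) || return 1
    echo "wait=$((start - submit)) run=$((end - start)) total=$((end - submit))"
}

# Illustration values only: 2 hours in the queue, 30 minutes of runtime.
turnaround 2024-01-01T10:00:00 2024-01-01T12:00:00 2024-01-01T12:30:00
```

Running this for a large and a small allocation of the same workload shows which one finishes sooner overall.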