Lewis NDAMBIRI | Industrial Engineer & AI/Data Science Specialist

Dual Master's in Industrial Engineering & Computer Science | Optimization, Data, Automation

View My GitHub Profile

πŸš€ Parallel Romberg Integration: HPC Implementation

License: MIT Language: C MPI OpenMP

High-performance parallel implementation of Romberg numerical integration using MPI and hybrid MPI+OpenMP strategies

Developed as part of the High Performance Computing for Data Science (HPC4DS) course at the University of Trento (2025-2026).

Authors: Lewis Ndambiri & Mehrab Fajar


πŸ“‹ Table of Contents


🎯 Overview

This project implements parallel Romberg integration, a sophisticated numerical method combining the trapezoidal rule with Richardson extrapolation. We developed three implementations:

  1. Serial baseline – Pure sequential implementation for benchmarking
  2. Pure MPI – Distributed memory parallelization using message passing
  3. Hybrid MPI+OpenMP – Two-level parallelism exploiting distributed + shared memory

The implementations were tested on the University of Trento HPC cluster (6,092 CPU cores, Infiniband/Omnipath interconnect) with comprehensive scalability analysis.


✨ Key Features


πŸ† Performance Highlights

Strong Scaling (Fixed Problem Size)

Configuration Speedup Efficiency Problem Size
32 cores 31.77Γ— 99.3% Level 16 (65K intervals)
32 cores 21.70Γ— 67.8% Level 20 (1M intervals)
16 cores 18.12Γ— 113.2% Level 16 (super-linear!)

Key Findings


πŸ“š Algorithm Background

What is Romberg Integration?

Romberg integration is an extrapolation method that:

  1. Computes successive trapezoidal approximations with increasing refinement
  2. Applies Richardson extrapolation to eliminate low-order error terms
  3. Achieves higher-order accuracy with fewer function evaluations than basic methods

Mathematical formulation:

R[i][0] = Trapezoidal rule with n = 2^i intervals
R[i][j] = R[i][j-1] + (R[i][j-1] - R[i-1][j-1]) / (4^j - 1)

Test Problem

We integrate:

f(x) = sin(x) * exp(-xΒ²)  over [0, Ο€]

To simulate expensive HPC workloads (PDE solvers, physics simulations), each function evaluation is repeated 1 million times.


πŸ”§ Implementation Strategies

1️⃣ Pure MPI (Distributed Memory)

Parallelization approach:

// Block partitioning with balanced remainder distribution
int base = n / comm_sz;
int remainder = n % comm_sz;
int local_n = base + (my_rank < remainder);

Why MPI_Reduce instead of MPI_Allreduce?

2️⃣ Hybrid MPI+OpenMP

Two-level parallelism:

#pragma omp parallel for reduction(+: local_sum) schedule(static)
for (int i = 1; i < local_n; i++) {
    local_sum += f(local_a + i * h);
}

Hybrid configuration tested:

Finding: Pure MPI outperformed hybrid for this algorithm due to thread management overhead and NUMA effects.


πŸ”§ Project Archictecture

Romberg Integration Architecture



πŸ“ Project Structure

romberg-hpc4ds/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ romberg_serial.c          # Serial baseline
β”‚   β”œβ”€β”€ romberg_mpi.c              # Pure MPI implementation
β”‚   └── romberg_mpi_openmp.c       # Hybrid MPI+OpenMP version
β”‚
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ romberg_serial.pbs         # PBS job template (serial)
β”‚   β”œβ”€β”€ romberg_mpi.pbs            # PBS job template (MPI)
β”‚   β”œβ”€β”€ romberg_hybrid.pbs         # PBS job template (hybrid)
β”‚   └── run_experiments.sh         # Automated benchmarking suite
β”‚
β”œβ”€β”€ analysis/
β”‚   └── analyze_romberg_logs.py    # Performance visualization tools
β”‚
β”œβ”€β”€ results/
β”‚   β”œβ”€β”€ *.out                      # Job output files
β”‚   β”œβ”€β”€ *.csv                      # Performance data
β”‚   └── *.png                      # Generated plots
β”‚
β”œβ”€β”€ Makefile                       # Build automation
β”œβ”€β”€ README.md                      # This file
└── LICENSE                        # MIT License

πŸš€ Getting Started

Prerequisites

Installation

  1. Clone the repository:
    git clone https://github.com/lewisndambiri.io/romberg-hpc4ds.git
    cd romberg-hpc4ds
    
  2. Build all versions:
    make clean
    make all
    

This compiles:

  1. Verify compilation:
    ./romberg_serial 10      # Run serial with 10 Romberg levels
    mpiexec -n 4 ./romberg_mpi 10  # Run MPI with 4 processes
    

πŸ§ͺ Running Experiments

Local Testing (Single Machine)

Serial execution:

./romberg_serial 16

MPI parallel execution:

mpiexec -n 4 ./romberg_mpi 16

Hybrid execution:

export OMP_NUM_THREADS=2
mpiexec -n 4 ./romberg_mpi_openmp 16

Cluster Deployment (PBS)

Submit individual jobs:

qsub scripts/romberg_serial.pbs
qsub scripts/romberg_mpi.pbs
qsub scripts/romberg_hybrid.pbs

Run complete benchmark suite:

./scripts/run_experiments.sh

This automated script submits jobs for:

Monitor job status:

qstat -u $USER

πŸ“Š Results Visualization

Generate Performance Plots

After jobs complete, run the analysis script:

cd analysis/
python3 analyze_romberg_logs.py

This generates:

  1. Strong scaling speedup (L16 vs L20)
  2. Strong scaling efficiency (L16 vs L20)
  3. Weak scaling efficiency
  4. Hybrid scaling speedup & efficiency
  5. Placement strategy comparison
  6. Amdahl’s Law fit

Plots are saved to results/*.png with CSV data exported for further analysis.


πŸ“ˆ Performance Analysis

Strong Scaling Results

Strong Scaling L16 vs L20 Strong Efficiency L16 vs L20

Level 16 (65,536 intervals)

Cores Time (s) Speedup Efficiency
1 176.17 1.00 100.0%
2 74.07 2.38 118.9%
4 39.97 4.41 110.2%
8 21.22 8.30 103.8%
16 9.72 18.12 113.2%
32 5.55 31.77 99.3%

Analysis: Super-linear speedup up to 16 cores due to improved cache locality. Near-perfect efficiency at 32 cores.

Level 20 (1,048,576 intervals)

Cores Time (s) Speedup Efficiency
1 2482.75 1.00 100.0%
2 1250.90 1.98 99.2%
4 795.84 3.12 78.0%
8 389.62 6.37 79.7%
16 168.07 14.77 92.3%
32 114.42 21.70 67.8%
64 306.86 8.09 12.6%

Analysis: Efficiency degrades at 64 cores due to communication overhead exceeding computation benefits.

Weak Scaling Results

Weak Scaling

Cores Level Intervals Time (s) Efficiency
1 12 4,096 9.99 100.0%
2 13 8,192 11.51 43.4%
4 14 16,384 15.87 15.7%
8 15 32,768 10.61 11.8%
16 16 65,536 104.27 0.6%

Analysis: Weak scaling degrades because Richardson extrapolation overhead grows as O(LΒ²), and each level requires completing all previous levels.

Placement Strategy Impact

Placement Comparison

Configuration: p=4, Level 20

Placement Time (s) Speedup
pack 640.97 0.84
scatter 767.95 0.70
pack:excl 536.17 1.00
scatter:excl 658.30 0.81

Conclusion: Minimal impact (<2% variation) for communication-light algorithms. pack:excl slightly optimal due to eliminating resource contention.

Hybrid MPI+OpenMP Results

Hybrid Scaling

Total Cores MPI Ranks OMP Threads Time (s) Speedup Efficiency
4 2 2 1211.02 2.05 51.3%
8 4 2 1781.68 1.39 17.4%
16 8 2 305.26 8.13 50.8%

Conclusion: Hybrid model underperforms pure MPI due to thread management overhead and suboptimal load balancing between parallelism levels.

Amdahl’s Law Analysis

Amdahl Fit

Empirical fitting to Level 20 data estimates serial fraction f β‰ˆ 4.8%, yielding:

Maximum theoretical speedup = 1/0.048 β‰ˆ 20.8Γ—

Observed speedup at p=32 (21.7Γ—) slightly exceeds this due to cache effects. Performance degradation at p=64 confirms communication overhead dominance.


πŸ”¬ Technical Details

Computational Complexity

Communication Pattern

Load Balancing Strategy

Block partitioning with remainder distribution:

base = n / comm_sz;
remainder = n % comm_sz;

// First 'remainder' processes get base+1 intervals
// Remaining processes get base intervals
local_n = base + (my_rank < remainder ? 1 : 0);

Example: n=10 intervals, p=3 processes


πŸ› οΈ Future Work

Potential improvements and extensions:

  1. Adaptive quadrature: Focus computation on high-error regions
  2. Pipeline parallelism: Overlap extrapolation with next level’s computation
  3. GPU acceleration: Offload heavy integrand evaluations to accelerators
  4. Non-blocking communication: Use MPI_Iallreduce for overlapping computation/communication
  5. Dynamic load balancing: Adjust partitioning based on runtime performance
  6. Fault tolerance: Implement checkpointing for long-running jobs

πŸ™ Acknowledgments

This project was developed as part of the HPC4DS course (2025-2026) at the University of Trento, taught by Prof. Sandro Fiore.

Resources:

Special thanks to the DISI (Department of Information Engineering and Computer Science) system administrators for cluster support.


πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ“§ Contact

Authors:

Course Instructor:


Made with ❀️ at University of Trento