Dual Master's in Industrial Engineering & Computer Science | Optimization, Data, Automation
High-performance parallel implementation of Romberg numerical integration using MPI and hybrid MPI+OpenMP strategies
Developed as part of the High Performance Computing for Data Science (HPC4DS) course at the University of Trento (2025-2026).
Authors: Lewis Ndambiri & Mehrab Fajar
This project implements parallel Romberg integration, a sophisticated numerical method combining the trapezoidal rule with Richardson extrapolation. We developed three implementations:
The implementations were tested on the University of Trento HPC cluster (6,092 CPU cores, Infiniband/Omnipath interconnect) with comprehensive scalability analysis.
| Configuration | Speedup | Efficiency | Problem Size |
|---|---|---|---|
| 32 cores | 31.77Γ | 99.3% | Level 16 (65K intervals) |
| 32 cores | 21.70Γ | 67.8% | Level 20 (1M intervals) |
| 16 cores | 18.12Γ | 113.2% | Level 16 (super-linear!) |
Romberg integration is an extrapolation method that:
Mathematical formulation:
R[i][0] = Trapezoidal rule with n = 2^i intervals
R[i][j] = R[i][j-1] + (R[i][j-1] - R[i-1][j-1]) / (4^j - 1)
We integrate:
f(x) = sin(x) * exp(-xΒ²) over [0, Ο]
To simulate expensive HPC workloads (PDE solvers, physics simulations), each function evaluation is repeated 1 million times.
Parallelization approach:
MPI_Reduce per Romberg level// Block partitioning with balanced remainder distribution
int base = n / comm_sz;
int remainder = n % comm_sz;
int local_n = base + (my_rank < remainder);
Why MPI_Reduce instead of MPI_Allreduce?
Two-level parallelism:
#pragma omp parallel for reduction(+: local_sum) schedule(static)
for (int i = 1; i < local_n; i++) {
local_sum += f(local_a + i * h);
}
Hybrid configuration tested:
Finding: Pure MPI outperformed hybrid for this algorithm due to thread management overhead and NUMA effects.
romberg-hpc4ds/
βββ src/
β βββ romberg_serial.c # Serial baseline
β βββ romberg_mpi.c # Pure MPI implementation
β βββ romberg_mpi_openmp.c # Hybrid MPI+OpenMP version
β
βββ scripts/
β βββ romberg_serial.pbs # PBS job template (serial)
β βββ romberg_mpi.pbs # PBS job template (MPI)
β βββ romberg_hybrid.pbs # PBS job template (hybrid)
β βββ run_experiments.sh # Automated benchmarking suite
β
βββ analysis/
β βββ analyze_romberg_logs.py # Performance visualization tools
β
βββ results/
β βββ *.out # Job output files
β βββ *.csv # Performance data
β βββ *.png # Generated plots
β
βββ Makefile # Build automation
βββ README.md # This file
βββ LICENSE # MIT License
mpicc)git clone https://github.com/lewisndambiri.io/romberg-hpc4ds.git
cd romberg-hpc4ds
make clean
make all
This compiles:
romberg_serial β Serial baselineromberg_mpi β Pure MPI versionromberg_mpi_openmp β Hybrid version./romberg_serial 10 # Run serial with 10 Romberg levels
mpiexec -n 4 ./romberg_mpi 10 # Run MPI with 4 processes
Serial execution:
./romberg_serial 16
MPI parallel execution:
mpiexec -n 4 ./romberg_mpi 16
Hybrid execution:
export OMP_NUM_THREADS=2
mpiexec -n 4 ./romberg_mpi_openmp 16
Submit individual jobs:
qsub scripts/romberg_serial.pbs
qsub scripts/romberg_mpi.pbs
qsub scripts/romberg_hybrid.pbs
Run complete benchmark suite:
./scripts/run_experiments.sh
This automated script submits jobs for:
Monitor job status:
qstat -u $USER
After jobs complete, run the analysis script:
cd analysis/
python3 analyze_romberg_logs.py
This generates:
Plots are saved to results/*.png with CSV data exported for further analysis.

| Cores | Time (s) | Speedup | Efficiency |
|---|---|---|---|
| 1 | 176.17 | 1.00 | 100.0% |
| 2 | 74.07 | 2.38 | 118.9% |
| 4 | 39.97 | 4.41 | 110.2% |
| 8 | 21.22 | 8.30 | 103.8% |
| 16 | 9.72 | 18.12 | 113.2% |
| 32 | 5.55 | 31.77 | 99.3% |
Analysis: Super-linear speedup up to 16 cores due to improved cache locality. Near-perfect efficiency at 32 cores.
| Cores | Time (s) | Speedup | Efficiency |
|---|---|---|---|
| 1 | 2482.75 | 1.00 | 100.0% |
| 2 | 1250.90 | 1.98 | 99.2% |
| 4 | 795.84 | 3.12 | 78.0% |
| 8 | 389.62 | 6.37 | 79.7% |
| 16 | 168.07 | 14.77 | 92.3% |
| 32 | 114.42 | 21.70 | 67.8% |
| 64 | 306.86 | 8.09 | 12.6% |
Analysis: Efficiency degrades at 64 cores due to communication overhead exceeding computation benefits.

| Cores | Level | Intervals | Time (s) | Efficiency |
|---|---|---|---|---|
| 1 | 12 | 4,096 | 9.99 | 100.0% |
| 2 | 13 | 8,192 | 11.51 | 43.4% |
| 4 | 14 | 16,384 | 15.87 | 15.7% |
| 8 | 15 | 32,768 | 10.61 | 11.8% |
| 16 | 16 | 65,536 | 104.27 | 0.6% |
Analysis: Weak scaling degrades because Richardson extrapolation overhead grows as O(LΒ²), and each level requires completing all previous levels.

Configuration: p=4, Level 20
| Placement | Time (s) | Speedup |
|---|---|---|
| pack | 640.97 | 0.84 |
| scatter | 767.95 | 0.70 |
| pack:excl | 536.17 | 1.00 |
| scatter:excl | 658.30 | 0.81 |
Conclusion: Minimal impact (<2% variation) for communication-light algorithms. pack:excl slightly optimal due to eliminating resource contention.

| Total Cores | MPI Ranks | OMP Threads | Time (s) | Speedup | Efficiency |
|---|---|---|---|---|---|
| 4 | 2 | 2 | 1211.02 | 2.05 | 51.3% |
| 8 | 4 | 2 | 1781.68 | 1.39 | 17.4% |
| 16 | 8 | 2 | 305.26 | 8.13 | 50.8% |
Conclusion: Hybrid model underperforms pure MPI due to thread management overhead and suboptimal load balancing between parallelism levels.

Empirical fitting to Level 20 data estimates serial fraction f β 4.8%, yielding:
Maximum theoretical speedup = 1/0.048 β 20.8Γ
Observed speedup at p=32 (21.7Γ) slightly exceeds this due to cache effects. Performance degradation at p=64 confirms communication overhead dominance.
Block partitioning with remainder distribution:
base = n / comm_sz;
remainder = n % comm_sz;
// First 'remainder' processes get base+1 intervals
// Remaining processes get base intervals
local_n = base + (my_rank < remainder ? 1 : 0);
Example: n=10 intervals, p=3 processes
Potential improvements and extensions:
MPI_Iallreduce for overlapping computation/communicationThis project was developed as part of the HPC4DS course (2025-2026) at the University of Trento, taught by Prof. Sandro Fiore.
Resources:
Special thanks to the DISI (Department of Information Engineering and Computer Science) system administrators for cluster support.
This project is licensed under the MIT License - see the LICENSE file for details.
Authors:
Course Instructor:
Made with β€οΈ at University of Trento