MPI Programming
This chapter covers important considerations when writing MPI programs using SafePETSc, particularly regarding error handling and collective operations.
The Challenge of Exceptions in MPI
Traditional exception handling does not carry over safely to MPI: exceptions and assertions are rank-local, while MPI operations are collective. When you write parallel code that runs across multiple MPI ranks, standard assertions or exceptions can cause serious problems:
The Problem: If one rank throws an exception while others continue execution, the MPI cluster becomes desynchronized and will hang. For example:
```julia
# DANGEROUS: Don't do this in MPI code!
using SafePETSc
SafePETSc.Init()

x = Vec_uniform([1.0, 2.0, NaN, 4.0])  # NaN only on some ranks

# This will hang! Some ranks will pass, others will fail
@assert all(isfinite.(Vector(x)))  # ❌ Causes hang if ranks disagree
```

In the example above, if some ranks have finite values but others have NaN, some ranks will assert while others continue. The MPI cluster becomes desynchronized and will deadlock.
Safe Exception Handling
To handle errors safely in MPI programs, exceptions must be collective operations that either fail on all ranks simultaneously or pass on all ranks. SafePETSc provides several tools for this.
Using @mpiassert for Collective Assertions
The SafeMPI.@mpiassert macro provides a collective assertion mechanism:
```julia
using SafePETSc
using SafePETSc.SafeMPI
SafePETSc.Init()

x = Vec_uniform([1.0, 2.0, 3.0, 4.0])

# Safe: All ranks check together
SafeMPI.@mpiassert all(isfinite.(Vector(x))) "Vector contains non-finite values"
```

How `@mpiassert` works:
- Each rank evaluates the condition locally
- All ranks communicate to determine if ANY rank failed
- If any rank's condition is false, ALL ranks throw an error simultaneously
- If all ranks' conditions are true, ALL ranks continue
Important: @mpiassert is a collective operation and therefore slower than regular assertions. Use it only when necessary for correctness in MPI contexts.
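For intuition, here is a minimal sketch of the idea behind such a collective assertion, assuming only MPI.jl's `MPI.Allreduce`. It is an illustration of the pattern, not SafePETSc's actual implementation of `@mpiassert`, and the helper name is hypothetical.

```julia
using MPI

# Sketch only: each rank reports whether its local check passed; a single
# reduction gives every rank the same global verdict, so all ranks either
# throw together or continue together.
function collective_assert_sketch(local_ok::Bool, msg::AbstractString,
                                  comm::MPI.Comm = MPI.COMM_WORLD)
    n_failed = MPI.Allreduce(local_ok ? 0 : 1, +, comm)  # identical result on every rank
    n_failed == 0 || error("$msg ($n_failed rank(s) failed the check)")
    return nothing
end
```

The single reduction is also what makes the check collective, and hence slower than a purely local assertion.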
Using mpi_any for Conditional Logic
When you need to make decisions based on conditions that might differ across ranks, use mpi_any:
```julia
using SafePETSc
using SafePETSc.SafeMPI
SafePETSc.Init()

# Each rank computes some local condition
local_has_error = some_local_check()

# Collective operation: true if ANY rank has an error
any_rank_has_error = mpi_any(local_has_error)

if any_rank_has_error
    # All ranks execute this branch together
    println(io0(), "Error detected on at least one rank")
    # Handle error collectively
else
    # All ranks execute this branch together
    println(io0(), "All ranks are healthy")
end
```

This ensures all ranks make the same decision, preventing desynchronization.
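Conceptually, a collective "any" is a logical OR reduction across ranks. The snippet below is a rough sketch of that idea written directly against MPI.jl; it is not SafePETSc's `mpi_any`, whose actual implementation may differ, and the function name is hypothetical.

```julia
using MPI

# Sketch only: returns the same Bool on every rank, true if at least one
# rank contributed `true`.
mpi_any_sketch(flag::Bool, comm::MPI.Comm = MPI.COMM_WORLD) =
    MPI.Allreduce(flag ? 1 : 0, MPI.MAX, comm) == 1
```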
See the SafeMPI API Reference for more details.
Using mpi_uniform to Verify Consistency
The mpi_uniform function checks whether a value is identical across all ranks:
```julia
using SafePETSc
using SafePETSc.SafeMPI
SafePETSc.Init()

# Create a matrix that should be the same on all ranks
A = [1.0 2.0; 3.0 4.0]

# Verify it's actually uniform across all ranks
SafeMPI.@mpiassert mpi_uniform(A) "Matrix A is not uniform across ranks"

# Safe to use A as a uniform matrix
A_petsc = Mat_uniform(A)
```

This is particularly useful for debugging distributed algorithms where you expect certain values to be synchronized.
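As a mental model, a uniformity check for a real-valued array can be built from element-wise min and max reductions: the data is identical on all ranks exactly when the two reductions agree. The sketch below assumes MPI.jl and ignores NaN corner cases; it is not SafePETSc's `mpi_uniform`, and the function name is hypothetical.

```julia
using MPI

# Sketch only: for a real-valued array, the data is identical on all ranks
# exactly when its element-wise minimum and maximum across ranks coincide.
function uniform_sketch(A::AbstractArray{<:Real}, comm::MPI.Comm = MPI.COMM_WORLD)
    lo = MPI.Allreduce(A, MPI.MIN, comm)  # element-wise minimum across ranks
    hi = MPI.Allreduce(A, MPI.MAX, comm)  # element-wise maximum across ranks
    return lo == hi                       # same Bool on every rank
end
```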
See the SafeMPI API Reference for more details.
Best Practices
- Never use standard `@assert` or `throw` in MPI code unless you are certain all ranks will agree on the outcome.
- Use `@mpiassert` for correctness checks that involve distributed data:
  ```julia
  SafeMPI.@mpiassert size(A) == size(B) "Matrix dimensions must match"
  ```
- Use `mpi_any` for error detection when local conditions might differ:
  ```julia
  if mpi_any(local_error_condition)
      # Handle error collectively on all ranks
  end
  ```
- Use `mpi_uniform` to verify assumptions about distributed data:
  ```julia
  SafeMPI.@mpiassert mpi_uniform(config) "Configuration must be uniform"
  ```
- Remember that collective operations are slow - use them judiciously. They require communication between all ranks, so they can impact performance if used excessively.
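As a quick illustration of the first few guidelines, the following fragment uses only calls shown in this chapter to validate inputs collectively before constructing distributed objects. Treat it as a pattern sketch rather than a prescribed recipe.

```julia
using SafePETSc
using SafePETSc.SafeMPI
SafePETSc.Init()

A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]

# Collective checks: every rank passes or every rank fails, never a mixture
SafeMPI.@mpiassert mpi_uniform(A) "Matrix A is not uniform across ranks"
SafeMPI.@mpiassert size(A) == size(B) "Matrix dimensions must match"

# Only after all ranks agree the inputs are valid do we build PETSc objects
A_petsc = Mat_uniform(A)
B_petsc = Mat_uniform(B)
```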
Example: Safe Error Handling Pattern
Here's a complete example showing safe error handling in an MPI context:
```julia
using SafePETSc
using SafePETSc.SafeMPI
SafePETSc.Init()

function safe_computation(x::Vec)
    # Convert to local array for checking
    x_local = Vector(x)

    # Check for problems locally
    local_has_nan = any(isnan, x_local)
    local_has_inf = any(isinf, x_local)

    # Collective check: any rank has problems?
    if mpi_any(local_has_nan)
        error("NaN detected in vector on at least one rank")
    end
    if mpi_any(local_has_inf)
        error("Inf detected in vector on at least one rank")
    end

    # All ranks confirmed data is good, proceed with computation
    result = x .* 2.0
    return result
end

# Usage
x = Vec_uniform([1.0, 2.0, 3.0, 4.0])
y = safe_computation(x)  # Safe: all ranks execute together
```

Common Pitfalls to Avoid
Pitfall 1: Rank-Dependent Assertions
```julia
# ❌ WRONG: Will hang if condition differs by rank
if MPI.Comm_rank(MPI.COMM_WORLD) == 0
    @assert check_something()  # Only rank 0 might assert!
end
```

```julia
# ✓ CORRECT: Use collective operations
local_check = (MPI.Comm_rank(MPI.COMM_WORLD) == 0) ? check_something() : true
SafeMPI.@mpiassert local_check "Check failed on rank 0"
```

Pitfall 2: File I/O Errors
```julia
# ❌ WRONG: File might exist on some ranks but not others
@assert isfile("config.txt")  # Might differ by rank!
```

```julia
# ✓ CORRECT: Use collective check
SafeMPI.@mpiassert isfile("config.txt") "config.txt not found"
```

Pitfall 3: Floating-Point Comparisons
```julia
# ❌ WRONG: Floating-point round-off might differ by rank
@assert computed_value ≈ expected_value
```

```julia
# ✓ CORRECT: Use collective assertion
SafeMPI.@mpiassert computed_value ≈ expected_value "Value mismatch detected"
```

Summary
MPI programming requires careful handling of exceptions and error conditions:
- Use `@mpiassert` for collective assertions that must pass or fail on all ranks together
- Use `mpi_any` to make collective decisions based on local conditions
- Use `mpi_uniform` to verify data consistency across ranks
- Never use standard assertions or exceptions that might execute differently on different ranks
- Remember that collective operations have performance costs - use them wisely, but don't hesitate to use them for correctness
By following these patterns, you can write robust MPI programs that won't hang or deadlock due to desynchronized exception handling.