API Reference
This page provides detailed documentation for all exported types and functions in HPCSparseArrays.jl.
Unless otherwise noted, all functions are MPI collective operations. Every MPI rank must call these functions together.
Distributed Types
HPCVector
HPCSparseArrays.HPCVector — Type
HPCVector{T, B<:HPCBackend{T}}

A distributed dense vector partitioned across MPI ranks.
Type Parameters
- T: Element type (e.g., Float64, ComplexF64)
- B<:HPCBackend{T}: Backend configuration with matching element type
Fields
- structural_hash::Blake3Hash: 256-bit Blake3 hash of the partition
- partition::Vector{Int}: Partition boundaries, length = nranks + 1 (always on CPU)
- v::AbstractVector{T}: Local vector elements owned by this rank
- backend::B: The HPC backend configuration
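A minimal sketch of constructing a HPCVector from a global vector (the global constructor HPCVector(v_global, backend) is referenced under Local Constructors below) and inspecting these fields:

using MPI, HPCSparseArrays
MPI.Init()
backend = backend_cpu_mpi(MPI.COMM_WORLD)
v = HPCVector([1.0, 2.0, 3.0, 4.0], backend)   # global vector, partitioned across ranks
v.partition    # e.g. [1, 3, 5] on two ranks
v.v            # the locally owned elements on this rank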
HPCMatrix
HPCSparseArrays.HPCMatrix — Type
HPCMatrix{T, B<:HPCBackend{T}}

A distributed dense matrix partitioned by rows across MPI ranks.
Type Parameters
- T: Element type (e.g., Float64, ComplexF64)
- B<:HPCBackend{T}: Backend configuration with matching element type
Fields
- structural_hash::Blake3Hash: 256-bit Blake3 hash of the structural pattern
- row_partition::Vector{Int}: Row partition boundaries, length = nranks + 1 (always on CPU)
- col_partition::Vector{Int}: Column partition boundaries, length = nranks + 1 (always on CPU)
- A::AbstractMatrix{T}: Local rows (NOT transposed), size = (local_nrows, ncols)
- backend::B: The HPC backend configuration
Invariants
- row_partition and col_partition are sorted
- row_partition[nranks+1] = total number of rows + 1
- col_partition[nranks+1] = total number of columns + 1
- size(A, 1) == row_partition[rank+2] - row_partition[rank+1]
- size(A, 2) == col_partition[end] - 1
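As a sketch, these invariants can be spot-checked on each rank (assumes a CommMPI backend, whose communicator is stored in backend.comm.comm per the Backend Types section below):

rank = MPI.Comm_rank(A.backend.comm.comm)   # 0-based MPI rank
@assert issorted(A.row_partition) && issorted(A.col_partition)
@assert size(A.A, 1) == A.row_partition[rank + 2] - A.row_partition[rank + 1]
@assert size(A.A, 2) == A.col_partition[end] - 1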
HPCSparseMatrix
HPCSparseArrays.HPCSparseMatrix — Type
HPCSparseMatrix{T,Ti,AV}

A distributed sparse matrix partitioned by rows across MPI ranks.
Type Parameters
- T: Element type (e.g., Float64, ComplexF64)
- Ti: Index type (e.g., Int, Int32), defaults to Int
- AV<:AbstractVector{T}: Storage type for nonzero values (Vector{T} for CPU, MtlVector{T} for GPU)
Fields
- structural_hash::Blake3Hash: 256-bit Blake3 hash of the structural pattern
- row_partition::Vector{Int}: Row partition boundaries, length = nranks + 1
- col_partition::Vector{Int}: Column partition boundaries, length = nranks + 1 (placeholder for transpose)
- col_indices::Vector{Int}: Global column indices that appear in the local part (local→global mapping)
- rowptr::Vector{Ti}: Row pointers for CSR format (always CPU)
- colval::Vector{Ti}: LOCAL column indices 1:length(col_indices) for each nonzero (always CPU)
- nzval::AV: Nonzero values (CPU or GPU)
- nrows_local::Int: Number of local rows
- ncols_compressed::Int: Number of unique columns = length(col_indices)
- cached_transpose: Cached materialized transpose (bidirectionally linked)
Invariants
- col_indices, row_partition, and col_partition are sorted
- row_partition[nranks+1] = total number of rows + 1
- col_partition[nranks+1] = total number of columns + 1
- nrows_local == row_partition[rank+2] - row_partition[rank+1] (number of local rows)
- ncols_compressed == length(col_indices) (compressed column dimension)
- colval contains local indices in 1:ncols_compressed
- rowptr has length nrows_local + 1
Storage Details
The local rows are stored in CSR format (Compressed Sparse Row), which enables efficient row-wise iteration - essential for a row-partitioned distributed matrix.
The CSR storage consists of:
- rowptr: Row pointers where row i has nonzeros at positions rowptr[i]:(rowptr[i+1]-1)
- colval: LOCAL column indices (1:ncols_compressed), not global indices
- nzval: Nonzero values
- col_indices[local_idx] maps local→global column indices
This compression avoids "hypersparse" storage where the column dimension would be the global number of columns even if only a few columns have nonzeros locally.
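For illustration, a sketch of iterating the local nonzeros of a HPCSparseMatrix A, built directly on the fields above, recovering global column indices through col_indices:

for i in 1:A.nrows_local
    for p in A.rowptr[i]:(A.rowptr[i+1] - 1)
        j_local  = A.colval[p]              # compressed index in 1:ncols_compressed
        j_global = A.col_indices[j_local]   # local → global column mapping
        val      = A.nzval[p]
        # ... use the entry (local row i, global column j_global, val) ...
    end
end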
GPU Support
Structure arrays (rowptr, colval) always stay on CPU for MPI communication and indexing. Only nzval can live on GPU, with type determined by the backend device:
- DeviceCPU: Vector{T} storage
- DeviceMetal: MtlVector{T} storage (macOS)
- DeviceCUDA: CuVector{T} storage
Use to_backend(A, target_backend) to convert between backends.
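For example, a sketch of moving the values to a Metal GPU (requires the Metal extension; backend_metal_mpi is documented under Backend Types below):

gpu_backend = backend_metal_mpi(MPI.COMM_WORLD)
A_gpu = to_backend(A, gpu_backend)   # nzval becomes a MtlVector; rowptr/colval stay on CPU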
SparseMatrixCSR
HPCSparseArrays.SparseMatrixCSR — Type
SparseMatrixCSR{Tv,Ti} = Transpose{Tv, SparseMatrixCSC{Tv,Ti}}

Type alias for CSR (Compressed Sparse Row) storage format.
The Dual Life of Transpose{SparseMatrixCSC}
In Julia, the type Transpose{Tv, SparseMatrixCSC{Tv,Ti}} has two interpretations:
- Semantic interpretation: A lazy transpose wrapper around a CSC matrix. When you call transpose(A) on a SparseMatrixCSC, you get this wrapper that represents A^T without copying data.
- Storage interpretation: CSR (row-major) access to sparse data. The underlying CSC stores columns contiguously, but through the transpose wrapper, we can iterate efficiently over rows instead of columns.
This alias clarifies intent: use SparseMatrixCSR when you want row-major storage semantics, and transpose(A) when you want the mathematical transpose.
CSR vs CSC Storage
- CSC (Compressed Sparse Column): Julia's native sparse format. Efficient for column-wise operations, matrix-vector products with column access.
- CSR (Compressed Sparse Row): Efficient for row-wise operations, matrix-vector products with row access, and row-partitioned distributed matrices.
For SparseMatrixCSR, the underlying parent::SparseMatrixCSC stores the transposed matrix. If B = SparseMatrixCSR(A) represents matrix M, then B.parent is a CSC storing M^T. This means:
- B.parent.colptr acts as row pointers for M
- B.parent.rowval contains column indices for M
- B.parent.nzval contains values in row-major order
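A self-contained sketch using only SparseArrays, showing the storage interpretation in action:

using SparseArrays
M = sparse([1, 2, 2], [1, 1, 3], [10.0, 20.0, 30.0], 2, 3)
B = transpose(sparse(transpose(M)))   # CSR storage of the same matrix M
@assert B == M
B.parent.colptr   # row pointers for M (length nrows(M) + 1)
B.parent.rowval   # column indices of M, in row-major order
B.parent.nzval    # values of M, in row-major order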
Usage Note
Julia will still display this type as Transpose{Float64, SparseMatrixCSC{...}}, not as SparseMatrixCSR. The alias improves code clarity but doesn't affect type printing.
Local Constructors
These constructors create distributed types from local data without global communication.
HPCSparseArrays.HPCVector_local — Function
HPCVector_local(v_local::AbstractVector{T}, backend::HPCBackend) where T

Create a HPCVector from a local vector on each rank.
Unlike HPCVector(v_global, backend) which takes a global vector and partitions it, this constructor takes only the local portion of the vector that each rank owns. The partition is computed by gathering the local sizes from all ranks.
Arguments
- v_local: Local vector portion owned by this rank
- backend: The HPCBackend configuration (determines communication)
Example
backend = backend_cpu_mpi(MPI.COMM_WORLD)
# Rank 0 has [1.0, 2.0], Rank 1 has [3.0, 4.0, 5.0]
v = HPCVector_local([1.0, 2.0], backend) # on rank 0
v = HPCVector_local([3.0, 4.0, 5.0], backend) # on rank 1
# Result: distributed vector [1.0, 2.0, 3.0, 4.0, 5.0] with partition [1, 3, 6]
HPCSparseArrays.HPCMatrix_local — Function
HPCMatrix_local(A_local::AbstractMatrix{T}, backend::HPCBackend; col_partition=...) where T

Create a HPCMatrix from a local matrix on each rank.
Unlike HPCMatrix(M_global, backend) which takes a global matrix and partitions it, this constructor takes only the local rows of the matrix that each rank owns. The row partition is computed by gathering the local row counts from all ranks.
The type of A_local determines the storage backend (CPU Matrix, GPU MtlMatrix, etc.).
All ranks must have local matrices with the same number of columns. A collective error is raised if the column counts don't match.
Arguments
- A_local::AbstractMatrix{T}: Local rows owned by this rank
- backend::HPCBackend: The HPC backend configuration
Keyword Arguments
- col_partition::Vector{Int}: Column partition boundaries (default: uniform_partition(size(A_local,2), nranks))
Example
backend = backend_cpu_mpi(MPI.COMM_WORLD)
# Rank 0 has 2×3 matrix, Rank 1 has 3×3 matrix
A = HPCMatrix_local(randn(2, 3), backend) # on rank 0
A = HPCMatrix_local(randn(3, 3), backend) # on rank 1
# Result: 5×3 distributed matrix with row_partition [1, 3, 6]
HPCSparseArrays.HPCSparseMatrix_local — Function
HPCSparseMatrix_local(A_local::SparseMatrixCSR{T,Ti}, backend::HPCBackend; col_partition=...) where {T,Ti}
HPCSparseMatrix_local(A_local::Adjoint{T,SparseMatrixCSC{T,Ti}}, backend::HPCBackend; col_partition=...) where {T,Ti}

Create a HPCSparseMatrix from a local sparse matrix on each rank.
Unlike HPCSparseMatrix{T}(A_global, backend) which takes a global matrix and partitions it, this constructor takes only the local rows of the matrix that each rank owns. The row partition is computed by gathering the local row counts from all ranks.
The input A_local must be a SparseMatrixCSR{T,Ti} (or Adjoint of SparseMatrixCSC{T,Ti}) where:
- A_local.parent.n = number of local rows on this rank
- A_local.parent.m = global number of columns (must match on all ranks)
- A_local.parent.rowval = global column indices
All ranks must have local matrices with the same number of columns (block widths must match). A collective error is raised if the column counts don't match.
Note: For Adjoint inputs, the values are conjugated to match the adjoint semantics.
Arguments
- A_local: Local sparse matrix in CSR format
- backend::HPCBackend: Backend configuration (device, comm, solver)
Keyword Arguments
- col_partition::Vector{Int}: Column partition boundaries (default: uniform_partition(A_local.parent.m, nranks))
Example
backend = backend_cpu_mpi(MPI.COMM_WORLD)
# Create local rows in CSR format
# Rank 0 owns rows 1-2 of a 5×3 matrix, Rank 1 owns rows 3-5
local_csc = sparse([1, 1, 2], [1, 2, 3], [1.0, 2.0, 3.0], 2, 3) # 2 local rows, 3 cols
A = HPCSparseMatrix_local(SparseMatrixCSR(local_csc), backend)
Row-wise Operations
HPCSparseArrays.map_rows — Function
map_rows(f, A...)

Apply function f to corresponding rows of distributed arrays, with CPU fallback.
This is the safe version that handles functions with arbitrary closures by converting GPU arrays to CPU, applying the function, and converting back. Use map_rows_gpu for performance-critical inner loops where f is isbits-compatible.
Arguments
- f: Function to apply to each row (can capture non-isbits data)
- A...: One or more distributed arrays (HPCVector or HPCMatrix)
Returns
- HPCVector or HPCMatrix depending on the return type of f
See also: map_rows_gpu for GPU-native version (requires isbits closures)
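A sketch of a closure capturing non-isbits data, which map_rows handles by falling back to CPU (constructor style follows the map_rows_gpu examples below):

lookup = Dict(1 => 10.0, 2 => 20.0)                   # non-isbits captured data
v = HPCVector([1.0, 2.0, 3.0])
w = map_rows(x -> x + get(lookup, Int(x), 0.0), v)    # HPCVector([11.0, 22.0, 3.0])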
HPCSparseArrays.map_rows_gpu — Function
map_rows_gpu(f, A...)

Apply function f to corresponding rows of distributed vectors/matrices (GPU-native).
Each argument in A... must be either a HPCVector or HPCMatrix. All inputs are repartitioned to match the partition of the first argument before applying f.
This implementation uses GPU-friendly broadcasting: matrices are converted to Vector{SVector} via transpose+reinterpret, then f is broadcast over all arguments. This avoids GPU->CPU->GPU round-trips when the underlying arrays are on GPU.
Important: The function f must be isbits-compatible (no captured non-isbits data) for GPU execution. Use map_rows for functions with arbitrary closures.
For each row index i, f is called with:
- For HPCVector: the scalar element at index i
- For HPCMatrix: an SVector containing the i-th row
Result Type
The result type depends on what f returns:
| f returns | Result |
|---|---|
| scalar (Number) | HPCVector (one element per input row) |
| SVector{K,T} | HPCMatrix (K columns, one row per input row) |
Examples
# Element-wise product of two vectors
u = HPCVector([1.0, 2.0, 3.0])
v = HPCVector([4.0, 5.0, 6.0])
w = map_rows_gpu((a, b) -> a * b, u, v) # HPCVector([4.0, 10.0, 18.0])
# Row norms of a matrix
A = HPCMatrix(randn(5, 3))
norms = map_rows_gpu(r -> norm(r), A) # HPCVector of row norms
# Return SVector to build a matrix
A = HPCMatrix(randn(3, 2))
result = map_rows_gpu(r -> SVector(sum(r), prod(r)), A) # 3×2 HPCMatrix
# Mixed inputs: matrix rows combined with vector elements
A = HPCMatrix(randn(4, 3))
w = HPCVector([1.0, 2.0, 3.0, 4.0])
result = map_rows_gpu((row, wi) -> sum(row) * wi, A, w) # HPCVector
See also: map_rows for CPU fallback version (handles arbitrary closures)
Linear System Solvers
HPCSparseArrays.solve — Function
solve(F::MUMPSFactorization{Tin, Bin, Tinternal}, b::HPCVector) where {Tin, Bin, Tinternal}

Solve the linear system A*x = b using the precomputed MUMPS factorization.
The input vector b can have any compatible element type and backend. The result is returned with the same element type and backend as the original matrix used for factorization.
HPCSparseArrays.solve! — Function
solve!(x::HPCVector, F::MUMPSFactorization{Tin, Bin, Tinternal}, b::HPCVector) where {Tin, Bin, Tinternal}

Solve A*x = b in-place using MUMPS factorization.
Automatically converts inputs to CPU Float64/ComplexF64 for MUMPS, then converts results back to the element type and backend of x.
HPCSparseArrays.finalize! — Function
finalize!(F::MUMPSFactorization)

Release MUMPS resources. Must be called on all ranks together.
This properly finalizes the MUMPS object and frees all associated memory. After calling finalize!, the factorization should not be used.
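A sketch contrasting the two solve entry points (assumes F is a MUMPSFactorization obtained from lu(A) or ldlt(A), documented under Factorization Types below, and b is a compatible HPCVector):

x = solve(F, b)      # allocating: result matches the factored matrix's element type and backend
solve!(x, F, b)      # in-place: converts to CPU Float64/ComplexF64 for MUMPS, writes back into x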
Partition Utilities
HPCSparseArrays.uniform_partition — Function
uniform_partition(n::Int, nranks::Int) -> Vector{Int}

Compute a balanced partition of n elements across nranks ranks. Returns a vector of length nranks + 1 with 1-indexed partition boundaries.
The first mod(n, nranks) ranks get div(n, nranks) + 1 elements, the remaining ranks get div(n, nranks) elements.
Example
partition = uniform_partition(10, 4) # [1, 4, 7, 9, 11]
# Rank 0: 1:3 (3 elements)
# Rank 1: 4:6 (3 elements)
# Rank 2: 7:8 (2 elements)
# Rank 3: 9:10 (2 elements)
HPCSparseArrays.repartition — Function
repartition(x::HPCVector{T}, p::Vector{Int}) where T

Redistribute a HPCVector to a new partition p.
The partition p must be a valid partition vector of length nranks + 1 with p[1] == 1 and p[end] == length(x) + 1.
Returns a new HPCVector with the same data but partition == p.
Example
v = HPCVector([1.0, 2.0, 3.0, 4.0]) # uniform partition
new_partition = [1, 2, 5] # rank 0 gets 1 element, rank 1 gets 3
v_repart = repartition(v, new_partition)

repartition(A::HPCMatrix{T}, p::Vector{Int}) where T

Redistribute a HPCMatrix to a new row partition p. The col_partition remains unchanged.
The partition p must be a valid partition vector of length nranks + 1 with p[1] == 1 and p[end] == size(A, 1) + 1.
Returns a new HPCMatrix with the same data but row_partition == p.
Example
A = HPCMatrix(randn(6, 4)) # uniform partition
new_partition = [1, 2, 4, 5, 7] # 1, 2, 1, 2 rows per rank
A_repart = repartition(A, new_partition)

repartition(A::HPCSparseMatrix{T}, p::Vector{Int}) where T

Redistribute a HPCSparseMatrix to a new row partition p. The col_partition remains unchanged.
The partition p must be a valid partition vector of length nranks + 1 with p[1] == 1 and p[end] == size(A, 1) + 1.
Returns a new HPCSparseMatrix with the same data but row_partition == p.
Example
A = HPCSparseMatrix{Float64}(sprand(6, 4, 0.5)) # uniform partition
new_partition = [1, 2, 4, 5, 7] # 1, 2, 1, 2 rows per rank
A_repart = repartition(A, new_partition)
Cache Management
HPCSparseArrays.clear_plan_cache! — Function
clear_plan_cache!()

Clear all memoized plan caches, including the MUMPS analysis cache. This is a collective operation that must be called on all MPI ranks together.
HPCSparseArrays.clear_mumps_analysis_cache! — Function
clear_mumps_analysis_cache!()

Clear the MUMPS analysis cache. This is a collective operation that must be called on all MPI ranks together.
IO Utilities
HPCSparseArrays.io0 — Function
io0(io=stdout; r::Set{Int}=Set{Int}([0]), dn=devnull)

Return io if the current MPI rank is in set r, otherwise return dn (default: devnull).
This is useful for printing only from specific ranks:
println(io0(), "Hello from rank 0!")
println(io0(r=Set([0,1])), "Hello from ranks 0 and 1!")
With string interpolation:
println(io0(), "Matrix A = $A")Backend Types
HPCSparseArrays.HPCBackend — Type
HPCBackend{T, Ti<:Integer, D<:AbstractDevice, C<:AbstractComm, S<:AbstractSolver}

Unified backend type that encapsulates element type, index type, device, communication, and solver configuration.
Type Parameters
- T: Element type for array values (e.g., Float64, ComplexF64)
- Ti: Index type for sparse matrix indices (e.g., Int32, Int64)
- D: Device type (DeviceCPU, DeviceMetal, DeviceCUDA)
- C: Communication type (CommSerial, CommMPI)
- S: Solver type (SolverMUMPS, or cuDSS variants)
Fields
- device::D: The compute device
- comm::C: The communication backend
- solver::S: The linear solver
Example
# CPU with MPI and MUMPS, Float64 values and Int32 indices
backend = backend_cpu_mpi(Float64, Int32)
# CPU serial (single process) with default types (Float64, Int)
backend = backend_cpu_serial()
HPCSparseArrays.DeviceCPU — Type
DeviceCPU <: AbstractDevice

CPU device - uses standard Julia arrays (Vector, Matrix).
HPCSparseArrays.DeviceMetal — Type
DeviceMetal <: AbstractDevice

Metal GPU device (macOS) - uses Metal.jl arrays (MtlVector, MtlMatrix). Defined here as a placeholder; actual Metal support is in the Metal extension.
HPCSparseArrays.DeviceCUDA — Type
DeviceCUDA <: AbstractDevice

CUDA GPU device - uses CUDA.jl arrays (CuVector, CuMatrix). Defined here as a placeholder; actual CUDA support is in the CUDA extension.
HPCSparseArrays.CommSerial — Type
CommSerial <: AbstractComm

Serial (single-process) communication - all collective operations become no-ops.
HPCSparseArrays.CommMPI — Type
CommMPI <: AbstractComm

MPI-based distributed communication.
Fields
comm::MPI.Comm: The MPI communicator to use for all operations.
HPCSparseArrays.SolverMUMPS — Type
SolverMUMPS <: AbstractSolver

MUMPS sparse direct solver. Works on CPU with MPI or serial communication.
HPCSparseArrays.BACKEND_CPU_MPI — Constant
BACKEND_CPU_MPI

Pre-constructed CPU backend with MPI communication, Float64 values, and Int indices.
HPCSparseArrays.BACKEND_CPU_SERIAL — Constant
BACKEND_CPU_SERIAL

Pre-constructed CPU backend with serial communication, Float64 values, and Int indices.
HPCSparseArrays.backend_metal_mpi — Function
backend_metal_mpi(comm::MPI.Comm) -> HPCBackend

Create a Metal GPU backend with MPI communication. Requires the Metal extension to be loaded.
HPCSparseArrays.backend_cuda_mpi — Function
backend_cuda_mpi(comm::MPI.Comm) -> HPCBackend

Create a CUDA GPU backend with MPI communication and cuDSS solver. Requires the CUDA extension to be loaded.
HPCSparseArrays.to_backend — Function
to_backend(v::HPCVector, backend::HPCBackend) -> HPCVector

Convert a HPCVector to use a different backend.

to_backend(A::HPCMatrix, backend::HPCBackend) -> HPCMatrix

Convert a HPCMatrix to use a different backend.
to_backend(A::HPCSparseMatrix, backend::HPCBackend) -> HPCSparseMatrix

Convert a HPCSparseMatrix to use a different backend. Only nzval is moved to the target device; the structure arrays (rowptr, colval) always remain on CPU.
Type Mappings
Native to Distributed Conversions
| Native Type | Distributed Type | Description |
|---|---|---|
| Vector{T} | HPCVector{T,B} | Distributed vector |
| Matrix{T} | HPCMatrix{T,B} | Distributed dense matrix |
| SparseMatrixCSC{T,Ti} | HPCSparseMatrix{T,Ti,B} | Distributed sparse matrix |
The B<:HPCBackend type parameter specifies the backend configuration (device, communication, solver). Use pre-constructed backends like BACKEND_CPU_MPI or factory functions like backend_cuda_mpi(comm).
Distributed to Native Conversions
| Distributed Type | Native Type | Function |
|---|---|---|
| HPCVector{T,B} | Vector{T} | Vector(v) |
| HPCMatrix{T,B} | Matrix{T} | Matrix(A) |
| HPCSparseMatrix{T,Ti,B} | SparseMatrixCSC{T,Ti} | SparseMatrixCSC(A) |
Supported Operations
HPCVector Operations
- Arithmetic: +, -, * (scalar)
- Linear algebra: norm, dot, conj
- Indexing: v[i] (global index)
- Conversion: Vector(v)
HPCMatrix Operations
- Arithmetic: * (scalar), matrix-vector product
- Transpose: transpose(A)
- Indexing: A[i, j] (global indices)
- Conversion: Matrix(A)
HPCSparseMatrix Operations
- Arithmetic: +, -, * (scalar, matrix-vector, matrix-matrix)
- Transpose: transpose(A)
- Linear solve: A \ b, Symmetric(A) \ b
- Utilities: nnz, norm, issymmetric
- Conversion: SparseMatrixCSC(A)
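A short sketch exercising several of these operations (assumes an initialized MPI session and a backend as in the constructor examples above):

using SparseArrays, LinearAlgebra
A = HPCSparseMatrix{Float64}(sprand(6, 6, 0.5) + I, backend)   # nonsingular sparse matrix
b = HPCVector(ones(6), backend)
x = A \ b                       # sparse direct solve via MUMPS
r = A * x - b                   # matrix-vector product and vector arithmetic
norm(r), nnz(A), issymmetric(A)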
Factorization Types
HPCSparseArrays uses MUMPS for sparse direct solves:
- lu(A): LU factorization (general matrices)
- ldlt(A): LDLT factorization (symmetric matrices, faster)
Both return factorization objects that support:
- F \ b: Solve with factorization
- finalize!(F): Release MUMPS resources
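A sketch of the factorize-once, solve-many pattern (b1 and b2 are assumed HPCVectors compatible with A):

F = lu(A)        # or ldlt(A) when A is symmetric
x1 = F \ b1      # first solve
x2 = F \ b2      # second solve reuses the factorization
finalize!(F)     # collective: release MUMPS resources on all ranks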