API Reference

This page provides detailed documentation for all exported types and functions in HPCSparseArrays.jl.

MPI Collective Operations

Unless otherwise noted, all functions are MPI collective operations. Every MPI rank must call these functions together.
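
For orientation, a minimal driver sketch (every rank runs the same program, so each collective call below is issued by all ranks together; constructors and io0 are documented later on this page):

using MPI, LinearAlgebra, HPCSparseArrays
MPI.Init()
backend = backend_cpu_mpi(MPI.COMM_WORLD)
v = HPCVector(collect(1.0:10.0), backend)  # collective constructor
println(io0(), norm(v))                    # norm is collective; only rank 0 prints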

Distributed Types

HPCVector

HPCSparseArrays.HPCVector (Type)
HPCVector{T, B<:HPCBackend{T}}

A distributed dense vector partitioned across MPI ranks.

Type Parameters

  • T: Element type (e.g., Float64, ComplexF64)
  • B<:HPCBackend{T}: Backend configuration with matching element type

Fields

  • structural_hash::Blake3Hash: 256-bit Blake3 hash of the partition
  • partition::Vector{Int}: Partition boundaries, length = nranks + 1 (always on CPU)
  • v::AbstractVector{T}: Local vector elements owned by this rank
  • backend::B: The HPC backend configuration
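
For illustration, the global index range owned by a rank can be read off the partition field (a sketch using direct field access; assumes an MPI-backed backend and the 0-based MPI rank):

rank = MPI.Comm_rank(v.backend.comm.comm)                  # CommMPI stores the communicator
owned = v.partition[rank + 1]:(v.partition[rank + 2] - 1)  # global indices owned by this rank
@assert length(owned) == length(v.v)                       # matches the local block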

HPCMatrix

HPCSparseArrays.HPCMatrix (Type)
HPCMatrix{T, B<:HPCBackend{T}}

A distributed dense matrix partitioned by rows across MPI ranks.

Type Parameters

  • T: Element type (e.g., Float64, ComplexF64)
  • B<:HPCBackend{T}: Backend configuration with matching element type

Fields

  • structural_hash::Blake3Hash: 256-bit Blake3 hash of the structural pattern
  • row_partition::Vector{Int}: Row partition boundaries, length = nranks + 1 (always on CPU)
  • col_partition::Vector{Int}: Column partition boundaries, length = nranks + 1 (always on CPU)
  • A::AbstractMatrix{T}: Local rows (NOT transposed), size = (local_nrows, ncols)
  • backend::B: The HPC backend configuration

Invariants

  • row_partition and col_partition are sorted
  • row_partition[nranks+1] = total number of rows + 1
  • col_partition[nranks+1] = total number of columns + 1
  • size(A, 1) == row_partition[rank+2] - row_partition[rank+1]
  • size(A, 2) == col_partition[end] - 1

HPCSparseMatrix

HPCSparseArrays.HPCSparseMatrix (Type)
HPCSparseMatrix{T,Ti,AV}

A distributed sparse matrix partitioned by rows across MPI ranks.

Type Parameters

  • T: Element type (e.g., Float64, ComplexF64)
  • Ti: Index type (e.g., Int, Int32), defaults to Int
  • AV<:AbstractVector{T}: Storage type for nonzero values (Vector{T} for CPU, MtlVector{T} for GPU)

Fields

  • structural_hash::Blake3Hash: 256-bit Blake3 hash of the structural pattern
  • row_partition::Vector{Int}: Row partition boundaries, length = nranks + 1
  • col_partition::Vector{Int}: Column partition boundaries, length = nranks + 1 (placeholder for transpose)
  • col_indices::Vector{Int}: Global column indices that appear in the local part (local→global mapping)
  • rowptr::Vector{Ti}: Row pointers for CSR format (always CPU)
  • colval::Vector{Ti}: LOCAL column indices 1:length(col_indices) for each nonzero (always CPU)
  • nzval::AV: Nonzero values (CPU or GPU)
  • nrows_local::Int: Number of local rows
  • ncols_compressed::Int: Number of unique columns = length(col_indices)
  • cached_transpose: Cached materialized transpose (bidirectionally linked)

Invariants

  • col_indices, row_partition, and col_partition are sorted
  • row_partition[nranks+1] = total number of rows + 1
  • col_partition[nranks+1] = total number of columns + 1
  • nrows_local == row_partition[rank+2] - row_partition[rank+1] (number of local rows, with 0-based MPI rank)
  • ncols_compressed == length(col_indices) (compressed column dimension)
  • colval contains local indices in 1:ncols_compressed
  • rowptr has length nrows_local + 1

Storage Details

The local rows are stored in CSR (Compressed Sparse Row) format, which enables efficient row-wise iteration, essential for a row-partitioned distributed matrix.

The CSR storage consists of:

  • rowptr: Row pointers where row i has nonzeros at positions rowptr[i]:(rowptr[i+1]-1)
  • colval: LOCAL column indices (1:ncols_compressed), not global indices
  • nzval: Nonzero values
  • col_indices[local_idx] maps local→global column indices

This compression avoids "hypersparse" storage where the column dimension would be the global number of columns even if only a few columns have nonzeros locally.
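
As a sketch, iterating the local nonzeros with their global column indices uses exactly these fields (direct field access, for illustration only):

for i in 1:A.nrows_local                      # local row index
    for p in A.rowptr[i]:(A.rowptr[i+1] - 1)  # nonzeros of row i
        j_local  = A.colval[p]                # compressed column index
        j_global = A.col_indices[j_local]     # map back to the global column
        val      = A.nzval[p]
        # ... use (i, j_global, val) ...
    end
end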

GPU Support

Structure arrays (rowptr, colval) always stay on CPU for MPI communication and indexing. Only nzval can live on GPU, with type determined by the backend device:

  • DeviceCPU: Vector{T} storage
  • DeviceMetal: MtlVector{T} storage (macOS)
  • DeviceCUDA: CuVector{T} storage

Use to_backend(A, target_backend) to convert between backends.


SparseMatrixCSR

HPCSparseArrays.SparseMatrixCSR (Type)
SparseMatrixCSR{Tv,Ti} = Transpose{Tv, SparseMatrixCSC{Tv,Ti}}

Type alias for CSR (Compressed Sparse Row) storage format.

The Dual Life of Transpose{SparseMatrixCSC}

In Julia, the type Transpose{Tv, SparseMatrixCSC{Tv,Ti}} has two interpretations:

  1. Semantic interpretation: A lazy transpose wrapper around a CSC matrix. When you call transpose(A) on a SparseMatrixCSC, you get this wrapper that represents A^T without copying data.

  2. Storage interpretation: CSR (row-major) access to sparse data. The underlying CSC stores columns contiguously, but through the transpose wrapper, we can iterate efficiently over rows instead of columns.

This alias clarifies intent: use SparseMatrixCSR when you want row-major storage semantics, and transpose(A) when you want the mathematical transpose.

CSR vs CSC Storage

  • CSC (Compressed Sparse Column): Julia's native sparse format. Efficient for column-wise operations, matrix-vector products with column access.
  • CSR (Compressed Sparse Row): Efficient for row-wise operations, matrix-vector products with row access, and row-partitioned distributed matrices.

For SparseMatrixCSR, the underlying parent::SparseMatrixCSC stores the transposed matrix. If B = SparseMatrixCSR(A) represents matrix M, then B.parent is a CSC storing M^T. This means:

  • B.parent.colptr acts as row pointers for M
  • B.parent.rowval contains column indices for M
  • B.parent.nzval contains values in row-major order
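
A small self-contained demonstration of the alias, using only SparseArrays (the alias itself adds no new behavior):

using SparseArrays

M = sparse([1, 2, 2], [1, 1, 3], [1.0, 2.0, 3.0], 2, 3)  # 2×3 matrix in CSC
B = transpose(sparse(transpose(M)))  # same matrix M, now with row-major (CSR) storage
B == M                               # true: identical entries, different layout
B.parent.colptr                      # length 3 = nrows(M) + 1: row pointers for M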

Usage Note

Julia will still display this type as Transpose{Float64, SparseMatrixCSC{...}}, not as SparseMatrixCSR. The alias improves code clarity but doesn't affect type printing.


Local Constructors

These constructors create distributed types from local data without global communication.

HPCSparseArrays.HPCVector_local (Function)
HPCVector_local(v_local::AbstractVector{T}, backend::HPCBackend) where T

Create an HPCVector from a local vector on each rank.

Unlike HPCVector(v_global, backend) which takes a global vector and partitions it, this constructor takes only the local portion of the vector that each rank owns. The partition is computed by gathering the local sizes from all ranks.

Arguments

  • v_local: Local vector portion owned by this rank
  • backend: The HPCBackend configuration (determines communication)

Example

backend = backend_cpu_mpi(MPI.COMM_WORLD)
# Rank 0 has [1.0, 2.0], Rank 1 has [3.0, 4.0, 5.0]
v = HPCVector_local([1.0, 2.0], backend)  # on rank 0
v = HPCVector_local([3.0, 4.0, 5.0], backend)  # on rank 1
# Result: distributed vector [1.0, 2.0, 3.0, 4.0, 5.0] with partition [1, 3, 6]

HPCSparseArrays.HPCMatrix_local (Function)
HPCMatrix_local(A_local::AbstractMatrix{T}, backend::HPCBackend; col_partition=...) where T

Create an HPCMatrix from a local matrix on each rank.

Unlike HPCMatrix(M_global, backend) which takes a global matrix and partitions it, this constructor takes only the local rows of the matrix that each rank owns. The row partition is computed by gathering the local row counts from all ranks.

The type of A_local determines the storage backend (CPU Matrix, GPU MtlMatrix, etc.).

All ranks must have local matrices with the same number of columns. A collective error is raised if the column counts don't match.

Arguments

  • A_local::AbstractMatrix{T}: Local rows owned by this rank
  • backend::HPCBackend: The HPC backend configuration

Keyword Arguments

  • col_partition::Vector{Int}: Column partition boundaries (default: uniform_partition(size(A_local,2), nranks))

Example

backend = backend_cpu_mpi(MPI.COMM_WORLD)
# Rank 0 has 2×3 matrix, Rank 1 has 3×3 matrix
A = HPCMatrix_local(randn(2, 3), backend)  # on rank 0
A = HPCMatrix_local(randn(3, 3), backend)  # on rank 1
# Result: 5×3 distributed matrix with row_partition [1, 3, 6]

HPCSparseArrays.HPCSparseMatrix_local (Function)
HPCSparseMatrix_local(A_local::SparseMatrixCSR{T,Ti}, backend::HPCBackend; col_partition=...) where {T,Ti}
HPCSparseMatrix_local(A_local::Adjoint{T,SparseMatrixCSC{T,Ti}}, backend::HPCBackend; col_partition=...) where {T,Ti}

Create an HPCSparseMatrix from a local sparse matrix on each rank.

Unlike HPCSparseMatrix{T}(A_global, backend) which takes a global matrix and partitions it, this constructor takes only the local rows of the matrix that each rank owns. The row partition is computed by gathering the local row counts from all ranks.

The input A_local must be a SparseMatrixCSR{T,Ti} (or Adjoint of SparseMatrixCSC{T,Ti}) where:

  • A_local.parent.n = number of local rows on this rank
  • A_local.parent.m = global number of columns (must match on all ranks)
  • A_local.parent.rowval = global column indices

All ranks must have local matrices with the same number of columns (block widths must match). A collective error is raised if the column counts don't match.

Note: For Adjoint inputs, the values are conjugated to match the adjoint semantics.

Arguments

  • A_local: Local sparse matrix in CSR format
  • backend::HPCBackend: Backend configuration (device, comm, solver)

Keyword Arguments

  • col_partition::Vector{Int}: Column partition boundaries (default: uniform_partition(A_local.parent.m, nranks))

Example

backend = backend_cpu_mpi(MPI.COMM_WORLD)
# Create local rows in CSR format. SparseMatrixCSR is a lazy-transpose
# wrapper, so the CSC argument must store the TRANSPOSE of the local rows.
# Rank 0 owns rows 1-2 of a 5×3 matrix, Rank 1 owns rows 3-5
local_cscT = sparse([1, 2, 3], [1, 1, 2], [1.0, 2.0, 3.0], 3, 2)  # (2×3 local block)^T
A = HPCSparseMatrix_local(SparseMatrixCSR(local_cscT), backend)   # 2 local rows, 3 cols

Row-wise Operations

HPCSparseArrays.map_rows (Function)
map_rows(f, A...)

Apply function f to corresponding rows of distributed arrays, with CPU fallback.

This is the safe version that handles functions with arbitrary closures by converting GPU arrays to CPU, applying the function, and converting back. Use map_rows_gpu for performance-critical inner loops where f is isbits-compatible.

Arguments

  • f: Function to apply to each row (can capture non-isbits data)
  • A...: One or more distributed arrays (HPCVector or HPCMatrix)

Returns

  • HPCVector or HPCMatrix depending on the return type of f

See also: map_rows_gpu for GPU-native version (requires isbits closures)
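
Example

A hedged sketch: the closure below captures a Dict, which is not isbits, so the safe CPU path is required (constructors as in the map_rows_gpu examples below):

weights = Dict(:scale => 2.0)                    # non-isbits capture
A = HPCMatrix(randn(4, 3))
s = map_rows(r -> weights[:scale] * sum(r), A)   # HPCVector of scaled row sums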

HPCSparseArrays.map_rows_gpu (Function)
map_rows_gpu(f, A...)

Apply function f to corresponding rows of distributed vectors/matrices (GPU-native).

Each argument in A... must be either an HPCVector or an HPCMatrix. All inputs are repartitioned to match the partition of the first argument before applying f.

This implementation uses GPU-friendly broadcasting: matrices are converted to Vector{SVector} via transpose+reinterpret, then f is broadcast over all arguments. This avoids GPU->CPU->GPU round-trips when the underlying arrays are on GPU.

Important: The function f must be isbits-compatible (no captured non-isbits data) for GPU execution. Use map_rows for functions with arbitrary closures.

For each row index i, f is called with:

  • For HPCVector: the scalar element at index i
  • For HPCMatrix: an SVector containing the i-th row

Result Type

The result type depends on what f returns:

  • scalar (Number): HPCVector (one element per input row)
  • SVector{K,T}: HPCMatrix (K columns, one row per input row)

Examples

# Element-wise product of two vectors
u = HPCVector([1.0, 2.0, 3.0])
v = HPCVector([4.0, 5.0, 6.0])
w = map_rows_gpu((a, b) -> a * b, u, v)  # HPCVector([4.0, 10.0, 18.0])

# Row norms of a matrix
A = HPCMatrix(randn(5, 3))
norms = map_rows_gpu(r -> norm(r), A)  # HPCVector of row norms

# Return SVector to build a matrix
A = HPCMatrix(randn(3, 2))
result = map_rows_gpu(r -> SVector(sum(r), prod(r)), A)  # 3×2 HPCMatrix

# Mixed inputs: matrix rows combined with vector elements
A = HPCMatrix(randn(4, 3))
w = HPCVector([1.0, 2.0, 3.0, 4.0])
result = map_rows_gpu((row, wi) -> sum(row) * wi, A, w)  # HPCVector

See also: map_rows for CPU fallback version (handles arbitrary closures)


Linear System Solvers

HPCSparseArrays.solve (Function)
solve(F::MUMPSFactorization{Tin, Bin, Tinternal}, b::HPCVector) where {Tin, Bin, Tinternal}

Solve the linear system A*x = b using the precomputed MUMPS factorization.

The input vector b can have any compatible element type and backend. The result is returned with the same element type and backend as the original matrix used for factorization.

HPCSparseArrays.solve! (Function)
solve!(x::HPCVector, F::MUMPSFactorization{Tin, Bin, Tinternal}, b::HPCVector) where {Tin, Bin, Tinternal}

Solve A*x = b in-place using MUMPS factorization.

Automatically converts inputs to CPU Float64/ComplexF64 for MUMPS, then converts results back to the element type and backend of x.
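
Example

A usage sketch, assuming lu(A) (see Factorization Types below) returns a MUMPSFactorization and b is a compatible HPCVector:

F = lu(A)          # collective MUMPS factorization
x = solve(F, b)    # allocating solve
solve!(x, F, b)    # in-place solve reusing the same factorization
finalize!(F)       # release MUMPS resources on all ranks together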

HPCSparseArrays.finalize! (Function)
finalize!(F::MUMPSFactorization)

Release MUMPS resources. Must be called on all ranks together.

This properly finalizes the MUMPS object and frees all associated memory. After calling finalize!, the factorization should not be used.


Partition Utilities

HPCSparseArrays.uniform_partition (Function)
uniform_partition(n::Int, nranks::Int) -> Vector{Int}

Compute a balanced partition of n elements across nranks ranks. Returns a vector of length nranks + 1 with 1-indexed partition boundaries.

The first mod(n, nranks) ranks get div(n, nranks) + 1 elements, the remaining ranks get div(n, nranks) elements.

Example

partition = uniform_partition(10, 4)  # [1, 4, 7, 9, 11]
# Rank 0: 1:3 (3 elements)
# Rank 1: 4:6 (3 elements)
# Rank 2: 7:8 (2 elements)
# Rank 3: 9:10 (2 elements)

HPCSparseArrays.repartition (Function)
repartition(x::HPCVector{T}, p::Vector{Int}) where T

Redistribute an HPCVector to a new partition p.

The partition p must be a valid partition vector of length nranks + 1 with p[1] == 1 and p[end] == length(x) + 1.

Returns a new HPCVector with the same data but partition == p.

Example

v = HPCVector([1.0, 2.0, 3.0, 4.0])  # uniform partition
new_partition = [1, 2, 5]  # rank 0 gets 1 element, rank 1 gets 3
v_repart = repartition(v, new_partition)

repartition(A::HPCMatrix{T}, p::Vector{Int}) where T

Redistribute an HPCMatrix to a new row partition p. The col_partition remains unchanged.

The partition p must be a valid partition vector of length nranks + 1 with p[1] == 1 and p[end] == size(A, 1) + 1.

Returns a new HPCMatrix with the same data but row_partition == p.

Example

A = HPCMatrix(randn(6, 4))  # uniform partition
new_partition = [1, 2, 4, 5, 7]  # 1, 2, 1, 2 rows per rank
A_repart = repartition(A, new_partition)

repartition(A::HPCSparseMatrix{T}, p::Vector{Int}) where T

Redistribute an HPCSparseMatrix to a new row partition p. The col_partition remains unchanged.

The partition p must be a valid partition vector of length nranks + 1 with p[1] == 1 and p[end] == size(A, 1) + 1.

Returns a new HPCSparseMatrix with the same data but row_partition == p.

Example

A = HPCSparseMatrix{Float64}(sprand(6, 4, 0.5))  # uniform partition
new_partition = [1, 2, 4, 5, 7]  # 1, 2, 1, 2 rows per rank
A_repart = repartition(A, new_partition)

Cache Management

HPCSparseArrays.clear_plan_cache! (Function)
clear_plan_cache!()

Clear all memoized plan caches, including the MUMPS analysis cache. This is a collective operation that must be called on all MPI ranks together.


IO Utilities

HPCSparseArrays.io0 (Function)
io0(io=stdout; r::Set{Int}=Set{Int}([0]), dn=devnull)

Return io if the current MPI rank is in set r, otherwise return dn (default: devnull).

This is useful for printing only from specific ranks:

println(io0(), "Hello from rank 0!")
println(io0(r=Set([0,1])), "Hello from ranks 0 and 1!")

With string interpolation:

println(io0(), "Matrix A = $A")

Backend Types

HPCSparseArrays.HPCBackend (Type)
HPCBackend{T, Ti<:Integer, D<:AbstractDevice, C<:AbstractComm, S<:AbstractSolver}

Unified backend type that encapsulates element type, index type, device, communication, and solver configuration.

Type Parameters

  • T: Element type for array values (e.g., Float64, ComplexF64)
  • Ti: Index type for sparse matrix indices (e.g., Int32, Int64)
  • D: Device type (DeviceCPU, DeviceMetal, DeviceCUDA)
  • C: Communication type (CommSerial, CommMPI)
  • S: Solver type (SolverMUMPS, or cuDSS variants)

Fields

  • device::D: The compute device
  • comm::C: The communication backend
  • solver::S: The linear solver

Example

# CPU with MPI and MUMPS, Float64 values and Int32 indices
backend = backend_cpu_mpi(Float64, Int32)

# CPU serial (single process) with default types (Float64, Int)
backend = backend_cpu_serial()

HPCSparseArrays.DeviceMetal (Type)
DeviceMetal <: AbstractDevice

Metal GPU device (macOS) - uses Metal.jl arrays (MtlVector, MtlMatrix). Defined here as a placeholder; actual Metal support is in the Metal extension.

HPCSparseArrays.DeviceCUDA (Type)
DeviceCUDA <: AbstractDevice

CUDA GPU device - uses CUDA.jl arrays (CuVector, CuMatrix). Defined here as a placeholder; actual CUDA support is in the CUDA extension.

HPCSparseArrays.CommMPI (Type)
CommMPI <: AbstractComm

MPI-based distributed communication.

Fields

  • comm::MPI.Comm: The MPI communicator to use for all operations.

HPCSparseArrays.BACKEND_CPU_MPI (Constant)
BACKEND_CPU_MPI

Pre-constructed CPU backend with MPI communication, Float64 values, and Int indices.

Deprecated

Use backend_cpu_mpi(T, Ti) instead for explicit type control.

HPCSparseArrays.BACKEND_CPU_SERIAL (Constant)
BACKEND_CPU_SERIAL

Pre-constructed CPU backend with serial communication, Float64 values, and Int indices.

Deprecated

Use backend_cpu_serial(T, Ti) instead for explicit type control.

HPCSparseArrays.backend_cuda_mpi (Function)
backend_cuda_mpi(comm::MPI.Comm) -> HPCBackend

Create a CUDA GPU backend with MPI communication and cuDSS solver. Requires the CUDA extension to be loaded.

HPCSparseArrays.to_backend (Function)
to_backend(v::HPCVector, backend::HPCBackend) -> HPCVector

Convert an HPCVector to use a different backend.

to_backend(A::HPCMatrix, backend::HPCBackend) -> HPCMatrix

Convert an HPCMatrix to use a different backend.

to_backend(A::HPCSparseMatrix, backend::HPCBackend) -> HPCSparseMatrix

Convert an HPCSparseMatrix to use a different backend. The nzval array is converted to the target device; the structure arrays (rowptr, colval) always remain on CPU.
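
Example

A sketch of moving a sparse matrix to the GPU and back (requires the CUDA extension; backend_cuda_mpi is documented above):

gpu = backend_cuda_mpi(MPI.COMM_WORLD)
A_gpu = to_backend(A, gpu)                                  # nzval becomes a CuVector
A_cpu = to_backend(A_gpu, backend_cpu_mpi(MPI.COMM_WORLD))  # back to CPU storage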


Type Mappings

Native to Distributed Conversions

  • Vector{T} → HPCVector{T,B}: distributed vector
  • Matrix{T} → HPCMatrix{T,B}: distributed dense matrix
  • SparseMatrixCSC{T,Ti} → HPCSparseMatrix{T,Ti,B}: distributed sparse matrix

The B<:HPCBackend type parameter specifies the backend configuration (device, communication, solver). Use the factory functions backend_cpu_mpi(comm), backend_cpu_serial(), or backend_cuda_mpi(comm) to construct backends; the pre-constructed constants BACKEND_CPU_MPI and BACKEND_CPU_SERIAL are deprecated.

Distributed to Native Conversions

  • HPCVector{T,B} → Vector{T} via Vector(v)
  • HPCMatrix{T,B} → Matrix{T} via Matrix(A)
  • HPCSparseMatrix{T,Ti,B} → SparseMatrixCSC{T,Ti} via SparseMatrixCSC(A)
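
A round-trip sketch using these conversions (the identity matrix keeps every rank's input identical, so no shared RNG seed is needed):

using SparseArrays, LinearAlgebra

backend = backend_cpu_mpi(MPI.COMM_WORLD)
S  = sparse(1.0I, 100, 100)                # native sparse matrix, same on all ranks
A  = HPCSparseMatrix{Float64}(S, backend)  # native -> distributed (row-partitioned)
S2 = SparseMatrixCSC(A)                    # distributed -> native
S2 == S                                    # true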

Supported Operations

HPCVector Operations

  • Arithmetic: +, -, * (scalar)
  • Linear algebra: norm, dot, conj
  • Indexing: v[i] (global index)
  • Conversion: Vector(v)

HPCMatrix Operations

  • Arithmetic: * (scalar), matrix-vector product
  • Transpose: transpose(A)
  • Indexing: A[i, j] (global indices)
  • Conversion: Matrix(A)

HPCSparseMatrix Operations

  • Arithmetic: +, -, * (scalar, matrix-vector, matrix-matrix)
  • Transpose: transpose(A)
  • Linear solve: A \ b, Symmetric(A) \ b
  • Utilities: nnz, norm, issymmetric
  • Conversion: SparseMatrixCSC(A)
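
A sketch exercising several of these operations on a small symmetric tridiagonal system (spdiagm keeps the global matrix identical on every rank):

using SparseArrays, LinearAlgebra

backend = backend_cpu_mpi(MPI.COMM_WORLD)
S = spdiagm(0 => fill(2.0, 8), 1 => fill(-1.0, 7), -1 => fill(-1.0, 7))  # SPD tridiagonal
A = HPCSparseMatrix{Float64}(S, backend)
v = HPCVector(ones(8), backend)
b = A * v               # distributed matrix-vector product
x = Symmetric(A) \ b    # sparse direct solve (symmetric LDLT path)
nnz(A), issymmetric(A)  # collective utilities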

Factorization Types

HPCSparseArrays uses MUMPS for sparse direct solves:

  • lu(A): LU factorization (general matrices)
  • ldlt(A): LDLT factorization (symmetric matrices, faster)

Both return factorization objects that support:

  • F \ b: Solve with factorization
  • finalize!(F): Release MUMPS resources
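
A factor-once, solve-many sketch (A is a square HPCSparseMatrix; b1 and b2 stand for compatible right-hand-side HPCVectors):

F = lu(A)       # factor once (collective)
x1 = F \ b1     # repeated solves reuse the factorization
x2 = F \ b2
finalize!(F)    # release MUMPS resources on all ranks together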