API Reference

This page provides detailed documentation for all exported types and functions in HPCSparseArrays.jl.

MPI Collective Operations

Unless otherwise noted, all functions are MPI collective operations. Every MPI rank must call these functions together.
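
For orientation, a minimal driver sketch (every rank runs the same program, so each collective call below is issued by all ranks together; constructors and io0 are documented later on this page):

using MPI, LinearAlgebra, HPCSparseArrays
MPI.Init()
backend = backend_cpu_mpi(MPI.COMM_WORLD)
v = HPCVector(collect(1.0:10.0), backend)  # collective constructor
println(io0(), norm(v))                    # norm is collective; only rank 0 prints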

Distributed Types

HPCVector

HPCSparseArrays.HPCVector (Type)
HPCVector{T, B<:HPCBackend{T}}

A distributed dense vector partitioned across MPI ranks.

Type Parameters

  • T: Element type (e.g., Float64, ComplexF64)
  • B<:HPCBackend{T}: Backend configuration with matching element type

Fields

  • structural_hash::Blake3Hash: 256-bit Blake3 hash of the partition
  • partition::Vector{Int}: Partition boundaries, length = nranks + 1 (always on CPU)
  • v::AbstractVector{T}: Local vector elements owned by this rank
  • backend::B: The HPC backend configuration
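
For illustration, the global index range owned by a rank can be read off the partition field (a sketch using direct field access; assumes an MPI-backed backend and the 0-based MPI rank):

rank = MPI.Comm_rank(v.backend.comm.comm)                  # CommMPI stores the communicator
owned = v.partition[rank + 1]:(v.partition[rank + 2] - 1)  # global indices owned by this rank
@assert length(owned) == length(v.v)                       # matches the local block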

HPCMatrix

HPCSparseArrays.HPCMatrix (Type)
HPCMatrix{T, B<:HPCBackend{T}}

A distributed dense matrix partitioned by rows across MPI ranks.

Type Parameters

  • T: Element type (e.g., Float64, ComplexF64)
  • B<:HPCBackend{T}: Backend configuration with matching element type

Fields

  • structural_hash::Blake3Hash: 256-bit Blake3 hash of the structural pattern
  • row_partition::Vector{Int}: Row partition boundaries, length = nranks + 1 (always on CPU)
  • col_partition::Vector{Int}: Column partition boundaries, length = nranks + 1 (always on CPU)
  • A::AbstractMatrix{T}: Local rows (NOT transposed), size = (local_nrows, ncols)
  • backend::B: The HPC backend configuration

Invariants

  • row_partition and col_partition are sorted
  • row_partition[nranks+1] = total number of rows + 1
  • col_partition[nranks+1] = total number of columns + 1
  • size(A, 1) == row_partition[rank+2] - row_partition[rank+1]
  • size(A, 2) == col_partition[end] - 1

HPCSparseMatrix

HPCSparseArrays.HPCSparseMatrix (Type)
HPCSparseMatrix{T,Ti,AV}

A distributed sparse matrix partitioned by rows across MPI ranks.

Type Parameters

  • T: Element type (e.g., Float64, ComplexF64)
  • Ti: Index type (e.g., Int, Int32), defaults to Int
  • AV<:AbstractVector{T}: Storage type for nonzero values (Vector{T} for CPU, MtlVector{T} for GPU)

Fields

  • structural_hash::Blake3Hash: 256-bit Blake3 hash of the structural pattern
  • row_partition::Vector{Int}: Row partition boundaries, length = nranks + 1
  • col_partition::Vector{Int}: Column partition boundaries, length = nranks + 1 (placeholder for transpose)
  • col_indices::Vector{Int}: Global column indices that appear in the local part (local→global mapping)
  • rowptr::Vector{Ti}: Row pointers for CSR format (always CPU)
  • colval::Vector{Ti}: LOCAL column indices 1:length(col_indices) for each nonzero (always CPU)
  • nzval::AV: Nonzero values (CPU or GPU)
  • nrows_local::Int: Number of local rows
  • ncols_compressed::Int: Number of unique columns = length(col_indices)
  • cached_transpose: Cached materialized transpose (bidirectionally linked)

Invariants

  • col_indices, row_partition, and col_partition are sorted
  • row_partition[nranks+1] = total number of rows + 1
  • col_partition[nranks+1] = total number of columns + 1
  • nrows_local == row_partition[rank+2] - row_partition[rank+1] (number of local rows, with 0-based MPI rank)
  • ncols_compressed == length(col_indices) (compressed column dimension)
  • colval contains local indices in 1:ncols_compressed
  • rowptr has length nrows_local + 1

Storage Details

The local rows are stored in CSR (Compressed Sparse Row) format, which enables efficient row-wise iteration, essential for a row-partitioned distributed matrix.

The CSR storage consists of:

  • rowptr: Row pointers where row i has nonzeros at positions rowptr[i]:(rowptr[i+1]-1)
  • colval: LOCAL column indices (1:ncols_compressed), not global indices
  • nzval: Nonzero values
  • col_indices[local_idx] maps local→global column indices

This compression avoids "hypersparse" storage where the column dimension would be the global number of columns even if only a few columns have nonzeros locally.
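
As a sketch, iterating the local nonzeros with their global column indices uses exactly these fields (direct field access, for illustration only):

for i in 1:A.nrows_local                      # local row index
    for p in A.rowptr[i]:(A.rowptr[i+1] - 1)  # nonzeros of row i
        j_local  = A.colval[p]                # compressed column index
        j_global = A.col_indices[j_local]     # map back to the global column
        val      = A.nzval[p]
        # ... use (i, j_global, val) ...
    end
end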

GPU Support

Structure arrays (rowptr, colval) always stay on CPU for MPI communication and indexing. Only nzval can live on GPU, with type determined by the backend device:

  • DeviceCPU: Vector{T} storage
  • DeviceMetal: MtlVector{T} storage (macOS)
  • DeviceCUDA: CuVector{T} storage

Use to_backend(A, target_backend) to convert between backends.


SparseMatrixCSR

HPCSparseArrays.SparseMatrixCSR (Type)
SparseMatrixCSR{Tv,Ti} = Transpose{Tv, SparseMatrixCSC{Tv,Ti}}

Type alias for CSR (Compressed Sparse Row) storage format.

The Dual Life of Transpose{SparseMatrixCSC}

In Julia, the type Transpose{Tv, SparseMatrixCSC{Tv,Ti}} has two interpretations:

  1. Semantic interpretation: A lazy transpose wrapper around a CSC matrix. When you call transpose(A) on a SparseMatrixCSC, you get this wrapper that represents A^T without copying data.

  2. Storage interpretation: CSR (row-major) access to sparse data. The underlying CSC stores columns contiguously, but through the transpose wrapper, we can iterate efficiently over rows instead of columns.

This alias clarifies intent: use SparseMatrixCSR when you want row-major storage semantics, and transpose(A) when you want the mathematical transpose.

CSR vs CSC Storage

  • CSC (Compressed Sparse Column): Julia's native sparse format. Efficient for column-wise operations, matrix-vector products with column access.
  • CSR (Compressed Sparse Row): Efficient for row-wise operations, matrix-vector products with row access, and row-partitioned distributed matrices.

For SparseMatrixCSR, the underlying parent::SparseMatrixCSC stores the transposed matrix. If B = SparseMatrixCSR(A) represents matrix M, then B.parent is a CSC storing M^T. This means:

  • B.parent.colptr acts as row pointers for M
  • B.parent.rowval contains column indices for M
  • B.parent.nzval contains values in row-major order
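
A small self-contained demonstration of the alias, using only SparseArrays (the alias itself adds no new behavior):

using SparseArrays

M = sparse([1, 2, 2], [1, 1, 3], [1.0, 2.0, 3.0], 2, 3)  # 2×3 matrix in CSC
B = transpose(sparse(transpose(M)))  # same matrix M, now with row-major (CSR) storage
B == M                               # true: identical entries, different layout
B.parent.colptr                      # length 3 = nrows(M) + 1: row pointers for M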

Usage Note

Julia will still display this type as Transpose{Float64, SparseMatrixCSC{...}}, not as SparseMatrixCSR. The alias improves code clarity but doesn't affect type printing.


Local Constructors

These constructors create distributed types from local data without global communication.

HPCSparseArrays.HPCVector_local (Function)
HPCVector_local(v_local::AbstractVector{T}, backend::HPCBackend) where T

Create an HPCVector from a local vector on each rank.

Unlike HPCVector(v_global, backend) which takes a global vector and partitions it, this constructor takes only the local portion of the vector that each rank owns. The partition is computed by gathering the local sizes from all ranks.

Arguments

  • v_local: Local vector portion owned by this rank
  • backend: The HPCBackend configuration (determines communication)

Example

backend = backend_cpu_mpi(MPI.COMM_WORLD)
# Rank 0 has [1.0, 2.0], Rank 1 has [3.0, 4.0, 5.0]
v = HPCVector_local([1.0, 2.0], backend)  # on rank 0
v = HPCVector_local([3.0, 4.0, 5.0], backend)  # on rank 1
# Result: distributed vector [1.0, 2.0, 3.0, 4.0, 5.0] with partition [1, 3, 6]

HPCSparseArrays.HPCMatrix_local (Function)
HPCMatrix_local(A_local::AbstractMatrix{T}, backend::HPCBackend; col_partition=...) where T

Create an HPCMatrix from a local matrix on each rank.

Unlike HPCMatrix(M_global, backend) which takes a global matrix and partitions it, this constructor takes only the local rows of the matrix that each rank owns. The row partition is computed by gathering the local row counts from all ranks.

The type of A_local determines the storage backend (CPU Matrix, GPU MtlMatrix, etc.).

All ranks must have local matrices with the same number of columns. A collective error is raised if the column counts don't match.

Arguments

  • A_local::AbstractMatrix{T}: Local rows owned by this rank
  • backend::HPCBackend: The HPC backend configuration

Keyword Arguments

  • col_partition::Vector{Int}: Column partition boundaries (default: uniform_partition(size(A_local,2), nranks))

Example

backend = backend_cpu_mpi(MPI.COMM_WORLD)
# Rank 0 has 2×3 matrix, Rank 1 has 3×3 matrix
A = HPCMatrix_local(randn(2, 3), backend)  # on rank 0
A = HPCMatrix_local(randn(3, 3), backend)  # on rank 1
# Result: 5×3 distributed matrix with row_partition [1, 3, 6]

HPCSparseArrays.HPCSparseMatrix_local (Function)
HPCSparseMatrix_local(A_local::SparseMatrixCSR{T,Ti}, backend::HPCBackend; col_partition=...) where {T,Ti}
HPCSparseMatrix_local(A_local::Adjoint{T,SparseMatrixCSC{T,Ti}}, backend::HPCBackend; col_partition=...) where {T,Ti}

Create an HPCSparseMatrix from a local sparse matrix on each rank.

Unlike HPCSparseMatrix{T}(A_global, backend) which takes a global matrix and partitions it, this constructor takes only the local rows of the matrix that each rank owns. The row partition is computed by gathering the local row counts from all ranks.

The input A_local must be a SparseMatrixCSR{T,Ti} (or Adjoint of SparseMatrixCSC{T,Ti}) where:

  • A_local.parent.n = number of local rows on this rank
  • A_local.parent.m = global number of columns (must match on all ranks)
  • A_local.parent.rowval = global column indices

All ranks must have local matrices with the same number of columns (block widths must match). A collective error is raised if the column counts don't match.

Note: For Adjoint inputs, the values are conjugated to match the adjoint semantics.

Arguments

  • A_local: Local sparse matrix in CSR format
  • backend::HPCBackend: Backend configuration (device, comm, solver)

Keyword Arguments

  • col_partition::Vector{Int}: Column partition boundaries (default: uniform_partition(A_local.parent.m, nranks))

Example

backend = backend_cpu_mpi(MPI.COMM_WORLD)
# Create local rows in CSR format. SparseMatrixCSR is a lazy-transpose
# wrapper, so the CSC argument must store the TRANSPOSE of the local rows.
# Rank 0 owns rows 1-2 of a 5×3 matrix, Rank 1 owns rows 3-5
local_cscT = sparse([1, 2, 3], [1, 1, 2], [1.0, 2.0, 3.0], 3, 2)  # (2×3 local block)^T
A = HPCSparseMatrix_local(SparseMatrixCSR(local_cscT), backend)   # 2 local rows, 3 cols

Row-wise Operations

HPCSparseArrays.map_rows (Function)
map_rows(f, A...)

Apply function f to corresponding rows of distributed arrays, with CPU fallback.

This is the safe version that handles functions with arbitrary closures by converting GPU arrays to CPU, applying the function, and converting back. Use map_rows_gpu for performance-critical inner loops where f is isbits-compatible.

Arguments

  • f: Function to apply to each row (can capture non-isbits data)
  • A...: One or more distributed arrays (HPCVector or HPCMatrix)

Returns

  • HPCVector or HPCMatrix depending on the return type of f

See also: map_rows_gpu for GPU-native version (requires isbits closures)
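
Example

A hedged sketch: the closure below captures a Dict, which is not isbits, so the safe CPU path is required (constructors as in the map_rows_gpu examples below):

weights = Dict(:scale => 2.0)                    # non-isbits capture
A = HPCMatrix(randn(4, 3))
s = map_rows(r -> weights[:scale] * sum(r), A)   # HPCVector of scaled row sums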

HPCSparseArrays.map_rows_gpu (Function)
map_rows_gpu(f, A...)

Apply function f to corresponding rows of distributed vectors/matrices (GPU-native).

Each argument in A... must be either an HPCVector or an HPCMatrix. All inputs are repartitioned to match the partition of the first argument before applying f.

This implementation uses GPU-friendly broadcasting: matrices are converted to Vector{SVector} via transpose+reinterpret, then f is broadcast over all arguments. This avoids GPU->CPU->GPU round-trips when the underlying arrays are on GPU.

Important: The function f must be isbits-compatible (no captured non-isbits data) for GPU execution. Use map_rows for functions with arbitrary closures.

For each row index i, f is called with:

  • For HPCVector: the scalar element at index i
  • For HPCMatrix: an SVector containing the i-th row

Result Type

The result type depends on what f returns:

  • scalar (Number): HPCVector (one element per input row)
  • SVector{K,T}: HPCMatrix (K columns, one row per input row)

Examples

# Element-wise product of two vectors
u = HPCVector([1.0, 2.0, 3.0])
v = HPCVector([4.0, 5.0, 6.0])
w = map_rows_gpu((a, b) -> a * b, u, v)  # HPCVector([4.0, 10.0, 18.0])

# Row norms of a matrix
A = HPCMatrix(randn(5, 3))
norms = map_rows_gpu(r -> norm(r), A)  # HPCVector of row norms

# Return SVector to build a matrix
A = HPCMatrix(randn(3, 2))
result = map_rows_gpu(r -> SVector(sum(r), prod(r)), A)  # 3×2 HPCMatrix

# Mixed inputs: matrix rows combined with vector elements
A = HPCMatrix(randn(4, 3))
w = HPCVector([1.0, 2.0, 3.0, 4.0])
result = map_rows_gpu((row, wi) -> sum(row) * wi, A, w)  # HPCVector

See also: map_rows for CPU fallback version (handles arbitrary closures)


Linear System Solvers

HPCSparseArrays.solve (Function)
solve(F::MUMPSFactorization{Tin, Bin, Tinternal}, b::HPCVector) where {Tin, Bin, Tinternal}

Solve the linear system A*x = b using the precomputed MUMPS factorization.

The input vector b can have any compatible element type and backend. The result is returned with the same element type and backend as the original matrix used for factorization.

HPCSparseArrays.solve! (Function)
solve!(x::HPCVector, F::MUMPSFactorization{Tin, Bin, Tinternal}, b::HPCVector) where {Tin, Bin, Tinternal}

Solve A*x = b in-place using MUMPS factorization.

Automatically converts inputs to CPU Float64/ComplexF64 for MUMPS, then converts results back to the element type and backend of x.
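
Example

A usage sketch, assuming lu(A) (see Factorization Types below) returns a MUMPSFactorization and b is a compatible HPCVector:

F = lu(A)          # collective MUMPS factorization
x = solve(F, b)    # allocating solve
solve!(x, F, b)    # in-place solve reusing the same factorization
finalize!(F)       # release MUMPS resources on all ranks together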

HPCSparseArrays.finalize! (Function)
finalize!(F::MUMPSFactorization)

Release MUMPS resources. Must be called on all ranks together.

This properly finalizes the MUMPS object and frees all associated memory. After calling finalize!, the factorization should not be used.


Partition Utilities

HPCSparseArrays.uniform_partition (Function)
uniform_partition(n::Int, nranks::Int) -> Vector{Int}

Compute a balanced partition of n elements across nranks ranks. Returns a vector of length nranks + 1 with 1-indexed partition boundaries.

The first mod(n, nranks) ranks get div(n, nranks) + 1 elements, the remaining ranks get div(n, nranks) elements.

Example

partition = uniform_partition(10, 4)  # [1, 4, 7, 9, 11]
# Rank 0: 1:3 (3 elements)
# Rank 1: 4:6 (3 elements)
# Rank 2: 7:8 (2 elements)
# Rank 3: 9:10 (2 elements)

HPCSparseArrays.repartition (Function)
repartition(x::HPCVector{T}, p::Vector{Int}) where T

Redistribute an HPCVector to a new partition p.

The partition p must be a valid partition vector of length nranks + 1 with p[1] == 1 and p[end] == length(x) + 1.

Returns a new HPCVector with the same data but partition == p.

Example

v = HPCVector([1.0, 2.0, 3.0, 4.0])  # uniform partition
new_partition = [1, 2, 5]  # rank 0 gets 1 element, rank 1 gets 3
v_repart = repartition(v, new_partition)

repartition(A::HPCMatrix{T}, p::Vector{Int}) where T

Redistribute an HPCMatrix to a new row partition p. The col_partition remains unchanged.

The partition p must be a valid partition vector of length nranks + 1 with p[1] == 1 and p[end] == size(A, 1) + 1.

Returns a new HPCMatrix with the same data but row_partition == p.

Example

A = HPCMatrix(randn(6, 4))  # uniform partition
new_partition = [1, 2, 4, 5, 7]  # 1, 2, 1, 2 rows per rank
A_repart = repartition(A, new_partition)

repartition(A::HPCSparseMatrix{T}, p::Vector{Int}) where T

Redistribute an HPCSparseMatrix to a new row partition p. The col_partition remains unchanged.

The partition p must be a valid partition vector of length nranks + 1 with p[1] == 1 and p[end] == size(A, 1) + 1.

Returns a new HPCSparseMatrix with the same data but row_partition == p.

Example

A = HPCSparseMatrix{Float64}(sprand(6, 4, 0.5))  # uniform partition
new_partition = [1, 2, 4, 5, 7]  # 1, 2, 1, 2 rows per rank
A_repart = repartition(A, new_partition)

Cache Management

HPCSparseArrays.clear_plan_cache! (Function)
clear_plan_cache!()

Clear all memoized plan caches, including the MUMPS analysis cache. This is a collective operation that must be called on all MPI ranks together.


IO Utilities

HPCSparseArrays.io0 (Function)
io0(io=stdout; r::Set{Int}=Set{Int}([0]), dn=devnull)

Return io if the current MPI rank is in set r, otherwise return dn (default: devnull).

This is useful for printing only from specific ranks:

println(io0(), "Hello from rank 0!")
println(io0(r=Set([0,1])), "Hello from ranks 0 and 1!")

With string interpolation:

println(io0(), "Matrix A = $A")

Backend Types

HPCSparseArrays.HPCBackend (Type)
HPCBackend{T, Ti<:Integer, D<:AbstractDevice, C<:AbstractComm, S<:AbstractSolver}

Unified backend type that encapsulates element type, index type, device, communication, and solver configuration.

Type Parameters

  • T: Element type for array values (e.g., Float64, ComplexF64)
  • Ti: Index type for sparse matrix indices (e.g., Int32, Int64)
  • D: Device type (DeviceCPU, DeviceMetal, DeviceCUDA)
  • C: Communication type (CommSerial, CommMPI)
  • S: Solver type (SolverMUMPS, or cuDSS variants)

Fields

  • device::D: The compute device
  • comm::C: The communication backend
  • solver::S: The linear solver

Example

# CPU with MPI and MUMPS, Float64 values and Int32 indices
backend = backend_cpu_mpi(Float64, Int32)

# CPU serial (single process) with default types (Float64, Int)
backend = backend_cpu_serial()

HPCSparseArrays.DeviceMetal (Type)
DeviceMetal <: AbstractDevice

Metal GPU device (macOS) - uses Metal.jl arrays (MtlVector, MtlMatrix). Defined here as a placeholder; actual Metal support is in the Metal extension.

HPCSparseArrays.DeviceCUDA (Type)
DeviceCUDA <: AbstractDevice

CUDA GPU device - uses CUDA.jl arrays (CuVector, CuMatrix). Defined here as a placeholder; actual CUDA support is in the CUDA extension.

HPCSparseArrays.CommMPI (Type)
CommMPI <: AbstractComm

MPI-based distributed communication.

Fields

  • comm::MPI.Comm: The MPI communicator to use for all operations.

HPCSparseArrays.BACKEND_CPU_MPI (Constant)
BACKEND_CPU_MPI

Pre-constructed CPU backend with MPI communication, Float64 values, and Int indices.

Deprecated

Use backend_cpu_mpi(T, Ti) instead for explicit type control.

HPCSparseArrays.BACKEND_CPU_SERIAL (Constant)
BACKEND_CPU_SERIAL

Pre-constructed CPU backend with serial communication, Float64 values, and Int indices.

Deprecated

Use backend_cpu_serial(T, Ti) instead for explicit type control.

HPCSparseArrays.backend_cuda_mpi (Function)
backend_cuda_mpi(comm::MPI.Comm) -> HPCBackend

Create a CUDA GPU backend with MPI communication and cuDSS solver. Requires the CUDA extension to be loaded.

HPCSparseArrays.to_backend (Function)
to_backend(v::HPCVector, backend::HPCBackend) -> HPCVector

Convert an HPCVector to use a different backend.

to_backend(A::HPCMatrix, backend::HPCBackend) -> HPCMatrix

Convert an HPCMatrix to use a different backend.

to_backend(A::HPCSparseMatrix, backend::HPCBackend) -> HPCSparseMatrix

Convert an HPCSparseMatrix to use a different backend. The nzval array is converted to the target device; the structure arrays (rowptr, colval) always remain on CPU.
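
Example

A sketch of moving a sparse matrix to the GPU and back (requires the CUDA extension; backend_cuda_mpi is documented above):

gpu = backend_cuda_mpi(MPI.COMM_WORLD)
A_gpu = to_backend(A, gpu)                                  # nzval becomes a CuVector
A_cpu = to_backend(A_gpu, backend_cpu_mpi(MPI.COMM_WORLD))  # back to CPU storage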


Type Mappings

Native to Distributed Conversions

  • Vector{T} → HPCVector{T,B}: distributed vector
  • Matrix{T} → HPCMatrix{T,B}: distributed dense matrix
  • SparseMatrixCSC{T,Ti} → HPCSparseMatrix{T,Ti,B}: distributed sparse matrix

The B<:HPCBackend type parameter specifies the backend configuration (device, communication, solver). Use the factory functions backend_cpu_mpi(comm), backend_cpu_serial(), or backend_cuda_mpi(comm) to construct backends; the pre-constructed constants BACKEND_CPU_MPI and BACKEND_CPU_SERIAL are deprecated.

Distributed to Native Conversions

  • HPCVector{T,B} → Vector{T} via Vector(v)
  • HPCMatrix{T,B} → Matrix{T} via Matrix(A)
  • HPCSparseMatrix{T,Ti,B} → SparseMatrixCSC{T,Ti} via SparseMatrixCSC(A)
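
A round-trip sketch using these conversions (the identity matrix keeps every rank's input identical, so no shared RNG seed is needed):

using SparseArrays, LinearAlgebra

backend = backend_cpu_mpi(MPI.COMM_WORLD)
S  = sparse(1.0I, 100, 100)                # native sparse matrix, same on all ranks
A  = HPCSparseMatrix{Float64}(S, backend)  # native -> distributed (row-partitioned)
S2 = SparseMatrixCSC(A)                    # distributed -> native
S2 == S                                    # true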

Supported Operations

HPCVector Operations

  • Arithmetic: +, -, * (scalar)
  • Linear algebra: norm, dot, conj
  • Indexing: v[i] (global index)
  • Conversion: Vector(v)

HPCMatrix Operations

  • Arithmetic: * (scalar), matrix-vector product
  • Transpose: transpose(A)
  • Indexing: A[i, j] (global indices)
  • Conversion: Matrix(A)

HPCSparseMatrix Operations

  • Arithmetic: +, -, * (scalar, matrix-vector, matrix-matrix)
  • Transpose: transpose(A)
  • Linear solve: A \ b, Symmetric(A) \ b
  • Utilities: nnz, norm, issymmetric
  • Conversion: SparseMatrixCSC(A)
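
A sketch exercising several of these operations on a small symmetric tridiagonal system (spdiagm keeps the global matrix identical on every rank):

using SparseArrays, LinearAlgebra

backend = backend_cpu_mpi(MPI.COMM_WORLD)
S = spdiagm(0 => fill(2.0, 8), 1 => fill(-1.0, 7), -1 => fill(-1.0, 7))  # SPD tridiagonal
A = HPCSparseMatrix{Float64}(S, backend)
v = HPCVector(ones(8), backend)
b = A * v               # distributed matrix-vector product
x = Symmetric(A) \ b    # sparse direct solve (symmetric LDLT path)
nnz(A), issymmetric(A)  # collective utilities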

Factorization Types

HPCSparseArrays uses MUMPS for sparse direct solves:

  • lu(A): LU factorization (general matrices)
  • ldlt(A): LDLT factorization (symmetric matrices, faster)

Both return factorization objects that support:

  • F \ b: Solve with factorization
  • finalize!(F): Release MUMPS resources
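
A factor-once, solve-many sketch (A is a square HPCSparseMatrix; b1 and b2 stand for compatible right-hand-side HPCVectors):

F = lu(A)       # factor once (collective)
x1 = F \ b1     # repeated solves reuse the factorization
x2 = F \ b2
finalize!(F)    # release MUMPS resources on all ranks together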