Julia bindings for the NVIDIA CUSPARSE library. CUSPARSE is a high-performance sparse matrix linear algebra library.
CUSPARSE.jl proves bindings to a subset of the CUSPARSE library. It extends the amazing CUDArt.jl library to provide four new sparse matrix classes:
which implement compressed sparse row/column storage, block CSR, and NVIDIA hybrid (
ELL format on the GPU. Since the native sparse type in Julia is
CSC, and in CUSPARSE is
CSR, automatic format conversion is provided, so that when you write
A = sprand(10,8,0.2) d_A = CudaSparseMatrixCSR(A)
A is transformed into
CSC format moved to the GPU, then auto-converted to
CSR format for you. Thus,
d_A is not a transpose of
A! Similarly, if you have a matrix in dense format on the GPU (in a
CudaArray), you can simply call
sparse to turn it into a sparse representation. Right now
sparse by default turns the matrix it is given into
CSR format. It takes an optional argument that lets you select
d_A = CudaArray(rand(10,20)) d_A = sparse(d_A) #now in CSR format d_B = CudaArray(rand(10,20)) d_B = sparse(d_B,'C') #now in CSC format d_C = CudaArray(rand(10,20)) d_C = sparse(d_C,'H') #now in HYB format d_D = CudaArray(rand(10,20)) d_D = sparse(d_C,'B') #now in BSR format
CUSPARSE.jl currently supports a subset of all the CUSPARSE functionality. What is implemented right now:
This is a big, ugly looking list. The actual operations CUSPARSE.jl supports are:
CUSPARSE provides incomplete LU and Cholesky factorization. Often, for a sparse matrix, the full LU or Cholesky factorization is much less sparse than the original matrix.
This is a problem if the sparse matrix is very large, since GPU memory is limited. CUSPARSE provides incomplete versions of these factorizations, such that
approximatily equal to
L * U or
L* L. In particular, the incomplete factorizations have the same sparsity pattern as
A, so that they occupy the same amount of GPU
memory. They are preconditioners - we can solve the problem
y = A \ x by applying them iteratively. You should not expect
ic0 in CUSPARSE to have matrix elements
equal to those from Julia
cholfact, because the Julia factorizations are complete.
CUSPARSE.jl exports its matrix types, so you do not have to prepend them with anything. To use a CUSPARSE function, just
using CUSPARSE ### stuff happens here CUSPARSE.mv( #arguments! )
Important Note: CUSPARSE solvers (
sm) assume the matrix you are solving is triangular. If you pass them a general matrix you will get the wrong answer!
A simple example of creating two sparse matrices
B on the CPU, moving them to the GPU, adding them, and bringing the result back:
using CUSPARSE # dimensions and fill proportion N = 20 M = 10 p = 0.1 # create matrices A,B on the CPU A = sprand(N,M,p) B = sprand(N,M,p) # convert A,B to CSR format and # move them to the GPU - one step d_A = CudaSparseMatrixCSR(A) d_B = CudaSparseMatrixCSR(B) # generate scalar parameters alpha = rand() beta = rand() # perform alpha * A + beta * B d_C = CUSPARSE.geam(alpha, d_A, beta, d_B, 'O', 'O', 'O') # bring the result back to the CPU C = CUSPARSE.to_host(d_C) # observe a zero matrix alpha*A + beta*B - C
Some questions you might have:
'O's tell CUSPARSE that our matrices are one-based. If you had a zero-based matrix from an external library, you can tell CUSPARSE using
betato the GPU?
betafrom the host (CPU) memory. You can just pass them to the function and CUSPARSE.jl handles telling the CUDA functions where they are for you. If you have an array, like
B, you do need to move it to the GPU before CUSPARSE can work on it. Similarly, to see results, if they are in array form, you will need to move them back to the CPU with
Cthe transpose of what we want?
CSCformat for Julia, and not the transpose of the correct answer.
Moving data between the CPU and GPU memory is very time-intensive. In general, if you only do one operation on the GPU (e.g. one matrix-vector multiplication), the computation is dominated by the time spent copying data. However, if you do many operations with the data you have on the GPU, like doing twenty matrix-vector multiplications, then the GPU can easily beat the CPU. Below you can see some timing tests for the CPU vs the GPU for 20 operations:
The GPU does very well in these tests, but if we only did one operation, the GPU would do as well as or worse than the CPU. It is not worth it to use the GPU if most of your time will be spent copying data around!
Contributions are very welcome! If you write wrappers for one of the CUSPARSE functions, please include some tests in
test/runtests.jl for your wrapper. Ideally test each of the types the function you wrap can accept, e.g.
Float64, and possibly
4 months ago