dummy-link

0

0

0

1

# Block version of SparseMatrixCSC

Main idea: we divide the matrix A in block elements of size 2x2:

``````+--+--+
|11|33|
|11|33|
+--+--+
|22|44|
|22|44|
+--+--+
``````

The `nzval` array will always store cells of `4` floating point numbers -- even when there is a nonzero within a cell. This makes it possible to use SSE intrinsics when doing a sparse matrix-vector multiplication. Obviously the `nzval` array can be 1 to 4 times as large compared to the `SparseMatrixCSC` version. The `colptr` array will always be twice as small, and the `rowval` can be 1 to 2 times as small.

The main motivation is to trade additional computational work for SSE instructions. Since the sparse matrix-vector product is often memory-bound, this is perfectly feasible.

The routine in pseudo-code reads:

``````for j = 1 : n
load X <- x[2j-1 : 2j]
for i = colptr[j] : colptr[j+1]-1
row = rowval[i]
load Y <- y[2row-1 : 2row]
load A <- reshape(nzval[i : i+3], 2, 2)
y[2row-1 : 2row] <- Y + A * X # Here are the flops
end
end
``````

We have to optimize `Y + A * X`. In the case of `Float64`, we can pack things perfectly in a 128-bit registers:

``````# Outer loop:
X1 = load [x1, x1]      # vmovddup
X2 = load [x2, x2]      # vmovddup
``````
``````# Inner loop:
Y = load [y1, y2]
tmp1 = load [a11, a12]
tmp1 = tmp1 * X1 + Y    #vfmadd
tmp2 = load [a21, a22]
tmp2 = tmp2 * X2 + Y    #vfmadd
store Y in y
``````

# Results

Currently I've implemented the routine `Y <- Y + A * X` where X and Y are vectors or matrices of size `n x 2`. Benchmarks were run on a Macbook Air (Early 2015).

The first benchmark result is the BlockSparseMatrixCSC, second is SparseMatrixCSC. The relative storage costs are shown as well: `B_size / A_size` means the total storage costs of the block format over the storage costs for SparseMatrixCSC.

## Tridiagonal matrix of order 1_000_000

First with `x` and `y` vectors:

``````> block, scalar = benchmark_banded(1_000_000, 1)
length(B.nzval) / length(A.nzval) = 1.9999986666657779 # Twice as many values stored
length(B.colptr) / length(A.colptr) = 0.5000004999995
length(B.rowval) / length(A.rowval) = 0.49999966666644446
B_size / A_size = 1.1428563469384343 # Slightly larger total storage costs
(Trial(5.811 ms), Trial(8.393 ms))
``````

Speedup of 1.4x.

Now with `x` and `y` skinny matrices of size 1_000_000 by 2:

``````> block, scalar = benchmark2_banded(1_000_000, 1)
length(B.nzval) / length(A.nzval) = 1.9999986666657779
length(B.colptr) / length(A.colptr) = 0.5000004999995
length(B.rowval) / length(A.rowval) = 0.49999966666644446
B_size / A_size = 1.1428563469384343
(Trial(7.456 ms), Trial(15.718 ms))
``````

Speedup of 2.1x.

## Pentadiagonal matrix of order 1_000_000

With `x` and `y` vectors:

``````> block, scalar = benchmark_banded(1_000_000, 2)
length(B.nzval) / length(A.nzval) = 1.199999839999808
length(B.colptr) / length(A.colptr) = 0.5000004999995
length(B.rowval) / length(A.rowval) = 0.299999959999952
B_size / A_size = 0.7272726363635454 # More efficient storage format!
(Trial(5.797 ms), Trial(13.323 ms))
``````

Speedup of 2.2x.

With `x` and `y` matrices of size 1_000_000 by 2:

``````> block, scalar = benchmark2_banded(1_000_000, 2)
length(B.nzval) / length(A.nzval) = 1.199999839999808
length(B.colptr) / length(A.colptr) = 0.5000004999995
length(B.rowval) / length(A.rowval) = 0.299999959999952
B_size / A_size = 0.7272726363635454
(Trial(7.454 ms), Trial(23.956 ms))
``````

Speedup of 3.2x.

## Random matrix of order 1_000_000 with 3 nonzeros per row

Matrices with a random sparsity pattern store a lot of structural zeros, so they serve as a worst-case test case:

``````> benchmark_random(1_000_000, 3)
length(B.nzval) / length(A.nzval) = 3.9999826554215985 # We store 4x as many values
length(B.colptr) / length(A.colptr) = 0.5000004999995
length(B.rowval) / length(A.rowval) = 0.9999956638553996
B_size / A_size = 2.2141173860734913 # In the end we need ~2x as much memory
(Trial(44.003 ms), Trial(42.215 ms))
``````

Speedup of 0.96x.

Now with `x` and `y` matrices of size 1_000_000 by 2:

``````> benchmark2_random(1_000_000, 3)
length(B.nzval) / length(A.nzval) = 3.9999906660350684
length(B.colptr) / length(A.colptr) = 0.5000004999995
length(B.rowval) / length(A.rowval) = 0.9999976665087671
B_size / A_size = 2.214263968129585
(Trial(100.798 ms), Trial(91.912 ms))
``````

Speedup of 0.91x.

04/02/2018

#### Last Touched

about 3 years ago

10 commits