
Cholesky (LDLT) Decomposition of a symmetric positive definite matrix using MPI

Introduction

This project was done as the final assignment for the course 4810-1115 Parallel Numerical Computation. The course dealt with parallelizing numerical algorithms using OpenMP (shared-memory parallelization) and MPI (distributed-memory systems).

Problem statement

Given a symmetric positive definite matrix

\[A=\begin{bmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n}\\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn} \end{bmatrix}\]

factorize it into two matrices

\[L = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0\\ l_{21} & 1 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ l_{n1} & l_{n2} & l_{n3} & \cdots & 1 \end{bmatrix}\]

and

\[D = \begin{bmatrix} d_{11} & 0 & 0 & \cdots & 0\\ 0 & d_{22} & 0 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & \cdots & d_{nn} \end{bmatrix}\]

such that \(A = LDL^T\)

Algorithm

Values of the $L$ and $D$ matrices can be calculated using the following formulae

\[l_{ij} = \frac{a_{ij} - \sum_{k=1}^{j-1}d_{kk}l_{jk}l_{ik}}{d_{jj}}\] \[d_{ii} = a_{ii} - \sum_{k=1}^{i-1}d_{kk}l_{ik}^2\]
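For example, applying these formulae to a small $2 \times 2$ symmetric positive definite matrix (a worked example added here just for illustration) gives

\[A=\begin{bmatrix} 4 & 2\\ 2 & 3 \end{bmatrix}, \qquad d_{11} = a_{11} = 4, \qquad l_{21} = \frac{a_{21}}{d_{11}} = 0.5, \qquad d_{22} = a_{22} - d_{11}l_{21}^2 = 3 - 4 \cdot 0.25 = 2,\]

so that

\[L=\begin{bmatrix} 1 & 0\\ 0.5 & 1 \end{bmatrix}, \qquad D=\begin{bmatrix} 4 & 0\\ 0 & 2 \end{bmatrix}, \qquad LDL^T=\begin{bmatrix} 4 & 2\\ 2 & 3 \end{bmatrix}=A.\]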

Sequential Code

The formulae given above translate very directly into code. The sequential implementation used in my report is as follows.

#include <algorithm>  // std::fill

void ldlt_sequential(const double* const A, double *L, double *d, const int n)
{
    // Fill upper part with zeroes
    for (int r=0; r < n - 1; ++r)
    {
        std::fill(L + r * n + (r + 1), L + (r + 1) * n, 0.0);
    }
    for (int j=0; j<n; ++j)
    {
        d[j] = A[j*n+j];
        for (int k=0; k<j; ++k)
            d[j] -= d[k] * L[j*n+k] * L[j*n+k];
        
        L[j*n+j] = 1;
        for (int i=j+1; i<n; ++i)
        {
            L[i*n+j] = A[i*n+j];
            for (int k=0; k<j; ++k)
                L[i*n+j] -= d[k] * L[j*n+k] * L[i*n+k];
            L[i*n+j] /= d[j];
        }
    }
}
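
As a quick usage sketch (not from the report), the routine can be exercised on the $2 \times 2$ example from the Algorithm section; the driver below is purely illustrative:

#include <cstdio>

int main()
{
    const int n = 2;
    const double A[] = {4.0, 2.0,
                        2.0, 3.0};   // the 2x2 SPD example from above
    double L[n * n], d[n];

    ldlt_sequential(A, L, d, n);

    // Expected: L = [1 0; 0.5 1] and d = [4, 2]
    std::printf("L = [%.2f %.2f; %.2f %.2f], d = [%.2f, %.2f]\n",
                L[0], L[1], L[2], L[3], d[0], d[1]);
    return 0;
}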

Parallelization using MPI

Figure 1: Data needed from $L$ to calculate an element of $D$

Figure 2: Full rows $i$ and $j$ of $L$ are needed to calculate $L[i, j]$

Data Dependencies

In these calculations, every column of $L$ and every entry of $D$ needs the values of all previous columns to be already calculated; within a single column, however, the rows can be calculated largely independently of each other.

$D$ Matrix

For $D$, we can see from the equations that to get $d_{ii}$ we need all the previous diagonal entries of $D$ and the corresponding row of $L$ (up to the diagonal). The dependency is shown in figure 1, where the entries shown in dark blue are the ones that are needed.

$L$ Matrix

If we are calculating an element $L[i, j]$ in column $j$, then we need all the previously computed values of $D$ together with rows $i$ and $j$ of $L$. In figure 2 we can see that to calculate $L[r, 4]$ we need $L[4, :]$ and $L[r, :]$.

How the parallelization is done

Since every column needs the values of its previous columns, there is no easy way to parallelize across columns. I therefore go through the columns sequentially, but within each column I divide the work on the rows among the different processes.

Calculating $d_{ii}$

Every rank is responsible for calculating the rows whose index, modulo the number of processes, equals its rank. In this way the computation of $D$ is also divided among the processes: the rank that owns row $i$ already holds the previously computed entries of that row, so no data needs to be communicated to it to calculate $d_{ii}$. After the calculation, the value of $d_{ii}$ is broadcast to all other processes, since it is needed for the calculation of $L_{ij}$.
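
This is a simple cyclic (round-robin) distribution; the helper below is mine, introduced only to make the mapping explicit, and is not part of the actual code:

// Illustrative helper: with the cyclic distribution used in ldlt_mpi,
// rank r owns rows r, r + n_pr, r + 2*n_pr, ...
inline bool owns_row(int i, int rank, int n_pr)
{
    return i % n_pr == rank;   // the same test appears inside ldlt_mpi below
}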

Calculating $L_{ij}$

As soon as we start to calculate the values of a column, say column $j$, we broadcast the $j^{th}$ row of $L$ to all processes, since it is needed as shown in figure 2. All processes then do their share of the work independently and, at the end, finish their updates once they receive the value of $d_{jj}$. They do not send their calculated values anywhere at this point, so there is no further communication overhead.

Parallelized code using MPI

#include <mpi.h>
#include <algorithm>  // std::fill

void ldlt_mpi(const double *const A, double *L, double *d, const int n)
{
    int rank, n_pr;
    MPI_Request req_for_row, req_for_dj;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n_pr);

    // Fill with zeroes
    std::fill(L, L + n * n, 0.0);

    // Have to go column wise, can not be parallelized
    for (int j = 0; j < n; ++j)
    {
        L[j * n + j] = 1;

        // This row is needed by everyone so start sending it
        MPI_Ibcast(L + j * n, j, MPI_DOUBLE, j % n_pr, MPI_COMM_WORLD, &req_for_row);
        if (j % n_pr == rank)
        {
            d[j] = A[j * n + j];
            for (int k = 0; k < j; ++k) 
                d[j] -= d[k] * L[j * n + k] * L[j * n + k];
        }
        // Value for d[j] will be needed by every row so start sending it
        MPI_Ibcast(d + j, 1, MPI_DOUBLE, j % n_pr, MPI_COMM_WORLD, &req_for_dj);
        // Now the row is needed to compute values
        MPI_Wait(&req_for_row, MPI_STATUS_IGNORE);

        // Distribute the work for rows
        for (int i = j + 1; i < n; ++i)
        {
            if (i % n_pr == rank)
            {
                L[i * n + j] = A[i * n + j];
                for (int k = 0; k < j; ++k)
                    L[i * n + j] -= d[k] * L[j * n + k] * L[i * n + k];
            }
        } // work for each row ends

        MPI_Wait(&req_for_dj, MPI_STATUS_IGNORE);
        // Finish the column by dividing by d[j]; entries of rows owned by
        // other ranks are still zero locally, so the extra divisions are harmless
        for (int i = j + 1; i < n; ++i)
            L[i * n + j] /= d[j];
    } // work for each column ends
} // function ldlt_mpi ends
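
For completeness, here is a minimal driver sketch (my own, with an assumed test matrix; the report's actual driver and timing setup are not shown here). Every rank builds the same full matrix $A$, as the routine requires:

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1000;
    // Every rank builds the same symmetric positive definite test matrix:
    // A = n*I + (matrix of ones), which is symmetric and diagonally dominant
    std::vector<double> A(n * n, 1.0), L(n * n), d(n);
    for (int i = 0; i < n; ++i)
        A[i * n + i] += n;

    const double t0 = MPI_Wtime();
    ldlt_mpi(A.data(), L.data(), d.data(), n);
    const double t1 = MPI_Wtime();

    if (rank == 0)
        std::printf("n = %d, time = %.1f ms\n", n, (t1 - t0) * 1000.0);

    MPI_Finalize();
    return 0;
}

With a standard MPI toolchain this can be compiled with `mpicxx` and launched with, for example, `mpirun -np 8 ./ldlt`; the runs in the report were submitted through a batch scheduler (figure 5 shows the `pjstat` output).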

Verification of Results of Parallel Code

To verify that the results obtained with the parallel algorithm are correct, I simply computed the product $LDL^T$ and checked it against $A$; the result was correct for all combinations of MPI process counts and matrix sizes. The matrix multiplication was done sequentially.
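
A minimal sketch of such a check (illustrative; the helper name is mine, not from the report): since the code above broadcasts each row prefix of $L$ and every entry of $d$, each rank ends up with complete copies of both, so rank 0 can reconstruct $LDL^T$ sequentially and compare it with $A$.

#include <cmath>

// Illustrative check: largest entry of |L D L^T - A|, computed sequentially
double max_error_ldlt(const double *A, const double *L, const double *d, const int n)
{
    double max_err = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
        {
            // (L D L^T)_{ij} = sum_k L_{ik} * d_k * L_{jk}
            double s = 0.0;
            for (int k = 0; k < n; ++k)
                s += L[i * n + k] * d[k] * L[j * n + k];
            max_err = std::fmax(max_err, std::fabs(s - A[i * n + j]));
        }
    return max_err;
}

// Usage after ldlt_mpi, e.g. on rank 0 only:
//   if (rank == 0)
//       std::printf("max |LDL^T - A| = %g\n", max_error_ldlt(A, L, d, n));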

Performance

The code was run for various combinations of matrix size and number of MPI processes, and considerable speedup was seen in terms of both strong and weak scaling. Complete results are shown in table 1.

Table 1: Running time (in ms) for the sequential code and for the MPI-parallelized code with different numbers of processes and matrix sizes

| Mat size | Sequential | 2 | 4 | 8 | 16 | 32 | 40 | 64 | 80 | 128 | 256 | 512 |
|---------:|-----------:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|----:|
| 100 | 0.7 | 1.1 | 1.27 | 2 | 2.96 | 4.7 | 6.7 | 16.3 | 20.4 | 23.75 | 27.4 | 28.2 |
| 200 | 5.6 | 4.63 | 3.58 | 4.1 | 4.95 | 6.65 | 7.9 | 13.67 | 13 | 26.83 | 54.2 | 53.65 |
| 500 | 86.8 | 50 | 30 | 24 | 22.2 | 23.85 | 25 | 49.1 | 75.3 | 95.34 | 101.7 | 163 |
| 750 | 293.4 | 159 | 89 | 60 | 47.7 | 45.24 | 48 | 48.27 | 46 | 70.3 | 214.7 | 159 |
| 1000 | 696.2 | 367 | 199 | 123 | 85.9 | 75.3 | 75 | 76.3 | 72 | 81.4 | 137 | 332 |
| 1500 | 2490 | 1215 | 638 | 363 | 227.8 | 180 | 178 | 169.14 | 148 | 160 | 159 | 437 |
| 2000 | 6234 | 2901 | 1477 | 806 | 477 | 361 | 348 | 300 | 257 | 273.5 | 261 | 472 |
| 2500 | 12124 | 6057 | 2876 | 1537 | 905 | 660 | 622 | 494 | 412 | 416 | 397 | 672 |
| 3000 | 20767 | 10482 | 5062 | 2715 | 1569 | 1056 | 981 | 763 | 612 | 618 | 579 | 940 |
| 3500 | 32733 | 16588 | 8217 | 4357 | 2425 | 1536 | 1397 | 1111 | 860 | 869 | 786 | 1241 |
| 4000 | 48573 | 24454 | 12376 | 6527 | 3614 | 2239 | 1956 | 1551 | 1193 | 1197 | 1053 | 1604 |
| 5000 | 93340 | 47194 | 23773 | 12335 | 6658 | 4015 | 3512 | 2840 | 2248 | 2251 | 1769 | 2504 |

Weak Scaling

From table 1 we see that, up to a certain level (approximately $64$ processes), increasing the number of processes allows us to solve bigger and bigger problems in the same amount of time. For example, the sequential algorithm solves the problem for a matrix of size $750 \times 750$ in around $300$ ms, while with $64$ processes we can solve the problem for a matrix of size $2000 \times 2000$ in the same time. In figure 3 it is clearly seen that execution time decreases with increasing parallelism.

Strong Scaling

It is clearly seen that using more processes helps us solve problems faster (provided they are big enough to benefit from parallelization). For a matrix of size $5000 \times 5000$ the sequential algorithm takes $93340$ ms, which is around $1.6$ minutes, while using $80$ processes in parallel the same problem is solved in less than $2.3$ s. This can be seen in table 2 and figure 4.
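
Here the speedup is simply the ratio of the sequential running time to the parallel running time from table 1; for the $5000 \times 5000$ case on $80$ processes, for instance,

\[S(p) = \frac{T_{\text{seq}}}{T_p}, \qquad S(80) = \frac{93340\ \text{ms}}{2248\ \text{ms}} \approx 41.5,\]

which matches the corresponding entry in table 2.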

Table 2: Speedup by parallelization for different numbers of MPI processes

| Matrix size | 2 | 4 | 8 | 32 | 64 | 80 | 128 | 256 | 512 |
|------------:|-----:|-----:|-----:|------:|------:|------:|------:|------:|------:|
| 100 | 0.64 | 0.55 | 0.35 | 0.15 | 0.04 | 0.03 | 0.03 | 0.03 | 0.02 |
| 200 | 1.21 | 1.56 | 1.37 | 0.84 | 0.41 | 0.43 | 0.21 | 0.10 | 0.10 |
| 500 | 1.74 | 2.89 | 3.62 | 3.64 | 1.77 | 1.15 | 0.91 | 0.85 | 0.53 |
| 750 | 1.85 | 3.30 | 4.89 | 6.49 | 6.08 | 6.38 | 4.17 | 1.37 | 1.85 |
| 1000 | 1.90 | 3.50 | 5.66 | 9.25 | 9.12 | 9.67 | 8.55 | 5.08 | 2.10 |
| 1500 | 2.05 | 3.90 | 6.86 | 13.83 | 14.72 | 16.82 | 15.56 | 15.66 | 5.70 |
| 2000 | 2.15 | 4.22 | 7.73 | 17.27 | 20.78 | 24.26 | 22.79 | 23.89 | 13.21 |
| 2500 | 2.00 | 4.22 | 7.89 | 18.37 | 24.54 | 29.43 | 29.14 | 30.54 | 18.04 |
| 3000 | 1.98 | 4.10 | 7.65 | 19.67 | 27.22 | 33.93 | 33.60 | 35.87 | 22.09 |
| 3500 | 1.97 | 3.98 | 7.51 | 21.31 | 29.46 | 38.06 | 37.67 | 41.65 | 26.38 |
| 4000 | 1.99 | 3.92 | 7.44 | 21.69 | 31.32 | 40.72 | 40.58 | 46.13 | 30.28 |
| 5000 | 1.98 | 3.93 | 7.57 | 23.25 | 32.87 | 41.52 | 41.47 | 52.76 | 37.28 |

Conclusion

The problem was successfully parallelized using MPI, and almost linear speedup was seen for sufficiently large problems, where the computation outweighs the communication overhead.

Figure 3: Matrix size vs execution time for various numbers of MPI processes

Figure 4: Speedup of $5000 \times 5000$ matrix with different numbers of MPI processes

Figure 5: Snapshot of submission using pjstat

This post is licensed under CC BY 4.0 by the author.