Matrix multiplication

The other day I started wondering how the highly optimised routines of numpy would compare against a multithreaded matrix multiplication in C++, so I decided to make my own comparison between apples and oranges. That is, to compare different tools in different languages for performing said multiplications. What follows, therefore, is not a comparison between languages. Not even the algorithms used are the same (far be it from us the foolishness of comparing triple nested loops with optimised BLAS…). The focus is on comparing options available to someone who, for some reason, needs code that multiplies matrices fast. The preconditions were that the software had to be free and easily obtainable and usable.

Moreover, the timing of the execution is just a crude estimate obtained by averaging elapsed times over multiple runs. A disciplined measurement of execution time that can be taken seriously for production code is delicate business; both art and science. And, to elaborate on the cliché, it’s art because you need intuition to navigate the complexity (of, for example, long listings of assembly code) and science because you need tools, measurements, and methodology. If I had to pick one example, from the abundance of material available online, to introduce the uninitiated to the subtlety of performance measurements, it would be Chandler Carruth’s talk at CppCon 2015. The talk is eye-opening for the non-expert and instructive for the expert. However, if you don’t like precise quantification and arguments backed by careful measurement, don’t watch the CppCon talk; instead, go to stackoverflow and continue reading there.
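Throughout the post, the crude averaged timing just described boils down to a loop like the following. This is a hypothetical helper, not code from the post; note that `perf_counter` measures wall-clock time, whereas the scripts below use `process_time`, which sums CPU time over the threads of the process, so the two can disagree for multithreaded code.

```python
import time

def average_time_ms(fn, num_loops=10):
    # Crude estimate: average the elapsed wall-clock time of fn()
    # over num_loops runs and report it in msec.
    start = time.perf_counter()
    for _ in range(num_loops):
        fn()
    return (time.perf_counter() - start) / num_loops * 1000

print(average_time_ms(lambda: sum(range(100000))))
```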


The first step is to use numpy from the anaconda distribution to create a sequence of matrices of increasing size with the entries of the matrices being random numbers. In all cases (especially when multithreading was involved later on) I also used matrices with special structure (e.g., permutation matrices) as an elementary means of testing. That is, to make sure that there isn’t some blatant error in the code leading to something other than matrix multiplication being performed. At some point the results of timing were so surprising that I had to double-check.
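As a sketch of the kind of sanity check described above (the post does not show its actual test code, so this is only an illustration): left-multiplying by a permutation matrix merely reorders rows, which is trivial to verify by inspection or by fancy indexing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

A = rng.random((n, n))

# A permutation matrix: the rows of the identity, reordered.
perm = rng.permutation(n)
P = np.eye(n)[perm]

# Left-multiplying by P permutes the rows of A, so the product
# can be checked against plain fancy indexing ...
assert np.allclose(P @ A, A[perm])

# ... and P is orthogonal, so P @ P.T must be the identity.
assert np.allclose(P @ P.T, np.eye(n))
```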

Here is the python code that creates the sequence of matrices, multiplies them, and times the multiplications.

#!/usr/bin/env python

import math
import numpy as np
import time

# A short script that, well, removes successive white spaces from strings
from remove_multiple_white_spaces import remove_multiple_white_spaces

# Create a list with the sizes of the arrays
array_sizes = [5] + list(range(10, 2001, 10))
# Pre-allocate a numpy array to store the timing results
times_elapsed = np.full((len(array_sizes), 1), 0.0)

# Average over 10 multiplications for each size -- ten is a small number
num_loops = 10
the_loop = np.array(range(1, num_loops + 1))

k = 0
for array_size in array_sizes:

    A = np.random.rand(array_size, array_size)
    B = np.random.rand(array_size, array_size)
    C = np.full((array_size, array_size), 0.0)

    # Or use the following matrices. They make inspection of the
    # matrix C possible, to check correctness.
    # A = np.full((array_size, array_size), 1/math.pi)
    # B = np.full((array_size, array_size), math.pi)
    # C = np.full((array_size, array_size), 0.0)

    start = time.process_time()
    for loop_counter in the_loop:
        C = np.matmul(A, B)

    times_elapsed[k] = (time.process_time() - start)/num_loops*1000 # in msec
    k += 1

I skip the code for storing results to files or plotting. It’s even more trivial than the code above. The resulting (average) times are shown in the following chart.


Just for the fun of it, and since it’s so easy with PyCharm, I also ran the above code with the Intel Distribution for Python (IDP). Use of IDP leads to only a minor improvement in speed as the size of the matrices increases, but I did not explore further to find out if there is an underlying cause (why the improvement isn’t more dramatic). In any case, below are the results from both distributions in the same graph. Assuming that python is a symlink to the IDP, typing

python -m TBB

will run the python script using the IDP interpreter and will enable multithreading for select numpy operations using TBB.



If you are like me and you wonder at the marvels of technology, the results in the previous section are astonishing. Python, a high-level multi-paradigm interpreted language, multiplies two matrices of 4M entries each in slightly more than 3.5 sec and puts the results in a new matrix. That is because python offers a paradigm that strikes a great balance between ease of use and fast execution. The code is short and readable, and the computationally intensive work is outsourced to C when necessary (through numpy, in this case). I, therefore, started wondering if the Queen of fast languages could do better in a way that is as easy to implement. That is, without writing assembly and without installing libraries with endless dependencies and endless lists of linker flags. Just calling BLAS from C++ is of no interest because this is what numpy does internally. Writing code from scratch that would beat BLAS is an outlandish thought to entertain (it’s good to have confidence in one’s abilities, but being delusional is a whole different story). That led me to the other commodity that is available for free and in abundance and that would make switching from python to C++ worthwhile: multithreading. Intel generously offers excellent libraries for such tasks. The most prominent are Intel MKL and Intel TBB. So here is the code that does the same thing as the python code, but calls dgemm from BLAS [1].

#include <vector>
#include <iostream>
#include <fstream>
#include <string>

// Easy printing and timing facilities
#include "../../preprocessor_convenience.h"
#include "/opt/intel/compilers_and_libraries_2016.3.170/mac/mkl/include/mkl.h"

// Create alias for convenience
using vectord = std::vector<double>;

int main(int argc, const char * argv[]) {

    // Number of times the same matrix multiplication is performed.
    // Then, time = total_time/loop_count
    constexpr unsigned int loop_count = 10;

    // Store the different array sizes in a vector.
    vectord array_sizes {};

    // Open the file that contains the array sizes used in python. [...]
    // If the file could not be opened, release resource and return. [...]
    // Read the file with the array sizes line by line [...]

    // Store in a vector the time it takes to perform each multiplication.
    vectord times_elapsed;

    // Array sizes, loop counters, number of threads
    int m, n, p, i, j, r, max_threads;

    // Coefficients/arguments to be passed to cblas_dgemm
    double alpha {1.0};
    double beta {0.0};

    // Auxiliary variable used in timing.
    double s_initial {0.0};

    // Main loop
    for (auto& x : array_sizes) { // for each array size

        // Initialise the array sizes.
        m = x;
        p = m;
        n = m;

        // Allocate memory for matrices aligned on a 64-byte boundary for better performance.
        double *A = (double *)mkl_malloc( m*p*sizeof( double ), 64 );
        double *B = (double *)mkl_malloc( p*n*sizeof( double ), 64 );
        double *C = (double *)mkl_malloc( m*n*sizeof( double ), 64 );

        if (A == NULL || B == NULL || C == NULL) {
            std::cout << "\n ERROR: Can't allocate memory for matrices. Aborting... \n\n";
            return 1;
        }

        // Initialise the matrix data.
        for (i = 0; i < (m*p); i++) {
            A[i] = std::rand();
        }
        for (i = 0; i < (p*n); i++) {
            B[i] = std::rand();
        }
        for (i = 0; i < (m*n); i++) {
            C[i] = 0.0;
        }

        // Running Intel(R) MKL for the max number of threads.

        // Set the elements of the matrix C equal to 0.0.
        for (j = 0; j < (m*n); j++) {
            C[j] = 0.0;
        }

        // Find the max number of threads Intel(R) MKL can use for parallel runs.
        max_threads = mkl_get_max_threads();

        // Make a first run of the matrix product using the Intel(R) MKL dgemm
        // function via the CBLAS interface to get stable run time measurements.
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, p, alpha, A, p, B, n, beta, C, n);

        // Measure the performance of the matrix product using the Intel(R) MKL
        // dgemm function via the CBLAS interface on the max number of threads.
        s_initial = dsecnd();

        for (r = 0; r < loop_count; r++) {
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        m, n, p, alpha, A, p, B, n, beta, C, n);
        }

        // Time elapsed in msec after running all loops for a given size of matrices.
        times_elapsed.push_back((dsecnd() - s_initial) / loop_count * 1000);

        // Deallocate memory.
        mkl_free(A);
        mkl_free(B);
        mkl_free(C);
    }

    // Write results to files [...]

    return 0;
}

And here are the results:


Note: If floats provide sufficient accuracy for an application, the C++ code should be approximately twice as fast when using floats instead of doubles, because of the utilisation of vector instructions.
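The same effect can be glimpsed from numpy, which dispatches to sgemm instead of dgemm for single precision. A rough sketch (actual speedups vary by machine, so none are claimed here):

```python
import time
import numpy as np

n = 1000
A64 = np.random.rand(n, n)
B64 = np.random.rand(n, n)
A32, B32 = A64.astype(np.float32), B64.astype(np.float32)

for A, B, label in [(A64, B64, "float64"), (A32, B32, "float32")]:
    A @ B  # warm-up run, as in the C++ code
    start = time.perf_counter()
    for _ in range(5):
        C = A @ B
    elapsed = (time.perf_counter() - start) / 5 * 1000
    print(f"{label}: {elapsed:.1f} msec (result dtype {C.dtype})")
```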

If you have seen examples where Cython accelerates python code by a factor of 100, a factor of 3 may not seem impressive. However, such a comparison would be flawed: here, we are not converting python for-loops to C++ code. Rather, the comparison is between the performance of a python extension module that calls a C library and C++ code that calls the same library, but also uses multithreading.

Of course, there are many factors to consider when pondering over the introduction of a completely different language into your code base—even of a new library in the same language, for that matter—so I will not jump to aphoristic conclusions of the type “you’d better be using C++” etc. In any case, there is more to the story.

You can have it all

Note: As Pierre indicated in the comments, dask performs lazy evaluation and, therefore, to compare actual computing times, one has to force the evaluation of the product of the matrices. Hopefully, there will be an updated version of this section, if and when I have the luxury of free time.
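The caveat is easy to demonstrate (a minimal sketch, assuming dask is installed): `da.matmul` returns a lazy task graph almost instantly, and the arithmetic only happens when `.compute()` is called.

```python
import numpy as np
import dask.array as da

n = 1000
A = da.random.random((n, n), chunks=500)
B = da.random.random((n, n), chunks=500)

# This returns almost immediately: C is a lazy task graph,
# not a computed matrix.
C = da.matmul(A, B)
print(type(C))

# Only .compute() actually performs the multiplications and
# materialises a numpy array -- this is what has to be timed.
result = C.compute()
print(type(result), result.shape)
```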

It doesn’t happen very often in life, but, yes, you can actually have it all.

So far we have ignored the memory aspect of matrix multiplication. Both in python and in C++ we allocated contiguous blocks of memory (as far as I know, numpy arrays are guaranteed to be contiguous, at least initially; that can change with transposes and other views) and just fiddled with efficient algorithms (BLAS) and multithreading (MKL and TBB). However, as the size of the arrays increases, cache friendliness, or lack thereof, dominates the overall performance (for example, this is the reason behind the sudden drops in performance for some of the operations and libraries in the Eigen benchmarks).
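The chunking idea can be sketched in plain numpy: accumulate products of small tiles, so that each partial product works on cache-sized pieces of the operands. The helper `block_matmul` below is hypothetical (roughly what blocked BLAS kernels and dask do, in spirit, though real implementations are far more sophisticated):

```python
import numpy as np

def block_matmul(A, B, block=256):
    # Accumulate C tile by tile; each tile product touches only
    # about 3 * block**2 entries, which can be sized to stay in cache.
    rows, inner = A.shape
    inner2, cols = B.shape
    assert inner == inner2
    C = np.zeros((rows, cols))
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            for l in range(0, inner, block):
                # numpy clips out-of-range slices, so ragged edge
                # tiles are handled automatically.
                C[i:i+block, j:j+block] += (
                    A[i:i+block, l:l+block] @ B[l:l+block, j:j+block]
                )
    return C
```

Checking against numpy: `np.allclose(block_matmul(A, B, block=64), A @ B)` holds for any operands with a matching inner dimension.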

Long story short, the dask python library allows the user to break the arrays into chunks. Even better, the syntax is reminiscent of the numpy syntax and, as such, converting code that uses numpy to code that uses dask requires only minor modifications. It’s almost difficult to spot the differences (that is not true of the performance improvement). If you

import dask.array as da

then you, practically, just have to substitute the `np` calls with their `da` counterparts (adding a `chunks` argument where the arrays are created).

Here is the initial python code modified to use dask:

#!/usr/bin/env python

import numpy as np
import math
import time
import dask
import dask.array as da

from remove_multiple_white_spaces import remove_multiple_white_spaces

array_sizes = [5] + list(range(10, 2001, 10))

times_elapsed = np.full((len(array_sizes), 1), 0.0)

num_loops = 10
the_loop = np.array(range(1, num_loops + 1))
chunk_size = 500

k = 0
for array_size in array_sizes:

    A = da.random.random((array_size, array_size), chunks = chunk_size)
    B = da.random.random((array_size, array_size), chunks = chunk_size)
    C = da.full((array_size, array_size), 0.0, chunks = chunk_size)

    # Or:
    # A = da.full((array_size, array_size), 1/math.pi, chunks = chunk_size) # use to check correctness of the calculation
    # B = da.full((array_size, array_size), math.pi, chunks = chunk_size) # use to check correctness of the calculation
    # C = da.full((array_size, array_size), 0.0, chunks = chunk_size)

    start = time.process_time()

    for loop_counter in the_loop:
        C = da.matmul(A, B)
        # del A
        # del B
        # del C

    times_elapsed[k] = (time.process_time() - start)/num_loops*1000 # in msec
    k += 1

And the results:


Zooming in because of the difference in scale:


This is quite impressive. Initially, it was hard to believe the performance boost, but several tests showed that the code is producing correct results. One conclusion is that with recent distributions and libraries in python you can have it all: both out-of-the-box functionality and hard-to-beat performance. Of course, Intel MKL and Intel TBB provide the tools for performing the same task (decomposition of arrays into smaller memory chunks) and it would be interesting to make yet another comparison between python and C++ on more equal ground. Maybe I will add such a comparison in the future. My guess is that the C++ ecosystem can deliver results at least as good as the ones delivered by dask, and the trade-off will be in code complexity and how much steeper one learning curve is than the other. It’s true that the ease of transition from numpy to dask is hard to beat.



6 thoughts on “Matrix multiplication”

  1. I stumbled upon your post looking for resources to optimize matrix multiplication in my own library.

    You might be interested in Nim, speed of C, syntax of Python.

    And shameless ad, the library I wrote is 10x faster than Julia and 22x faster than Numpy on int64 1500×1500 matrix multiplication (integer so no BLAS/MKL cheat code) on my machine. Code is also generic, any type (complex, int64 …), no assembler, compilable to Javascript! Link here:

