BLAS / LAPACK on Windows

Windows has no default BLAS / LAPACK library. By “default” we mean a library installed with the operating system.

Numpy needs a BLAS library that has CBLAS C language wrappers.
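
By CBLAS we mean the C-convention interface to the BLAS routines, as opposed to the underlying Fortran interface. As a minimal sketch of what that interface looks like, here is a direct call to the CBLAS dot-product routine via ctypes, assuming a BLAS DLL named libopenblas.dll on the DLL search path (the name is illustrative only):

import ctypes
import numpy as np

# Load the BLAS library; "libopenblas.dll" is an assumed name.
blas = ctypes.CDLL("libopenblas.dll")

# double cblas_ddot(int n, double *x, int incx, double *y, int incy)
cblas_ddot = blas.cblas_ddot
cblas_ddot.restype = ctypes.c_double
cblas_ddot.argtypes = [ctypes.c_int,
                       ctypes.POINTER(ctypes.c_double), ctypes.c_int,
                       ctypes.POINTER(ctypes.c_double), ctypes.c_int]

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(cblas_ddot(x.size,
                 x.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), 1,
                 y.ctypes.data_as(ctypes.POINTER(ctypes.c_double)), 1))
# -> 32.0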

Here is a list of the options that we know about.

ATLAS

The ATLAS libraries have been the default BLAS / LAPACK libraries for numpy binary installers on Windows to date (end of 2015).

ATLAS uses comprehensive tests of parameters on a particular machine to choose from a range of algorithms to optimize BLAS and some LAPACK routines. Modern versions (>= 3.9) perform reasonably well on BLAS benchmarks. Each ATLAS build is optimized for a particular machine (CPU capabilities, L1 / L2 cache size, memory speed), and ATLAS selects routines at build time rather than at runtime, meaning that a default ATLAS build can be badly optimized for a particular processor. The main developer of ATLAS is Clint Whaley. His main priority is optimizing for HPC machines, and he does not give much time to supporting Windows builds. Not surprisingly, ATLAS is difficult to build on Windows, and is not well optimized for Windows 64 bit.

Advantages:

  • Very reliable;
  • BSD license.

Disadvantages:

  • By design, the compilation step of ATLAS tunes the output library to the exact architecture on which it is compiling. This means good performance for machines very like the build machine, but worse performance on other machines;
  • No runtime optimization for running CPU;
  • Has only one major developer (Clint Whaley);
  • Compilation is difficult, slow and error-prone on Windows;
  • Not optimized for Windows 64 bit.

Because there is no run-time adaptation to the CPU, an ATLAS library built for a CPU with SSE3 instructions will likely crash on a CPU that does not have SSE3 instructions, and an ATLAS library built for an SSE2 CPU will not be able to use SSE3 instructions. Therefore, numpy installers on Windows use the “superpack” format, for which we build three ATLAS libraries:

  • without CPU support for SSE instructions;
  • with support for SSE2 instructions;
  • with support for SSE3 instructions.

We build three Windows .exe installers, one for each of these ATLAS versions, and combine them into a single “superpack” installer. When run, the superpack checks which instructions the CPU of the target machine supports, and installs the matching numpy / ATLAS package.
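
In outline, the superpack’s selection logic looks something like the following sketch (this is not the actual installer code; the Windows processor feature codes are 6 for SSE, 10 for SSE2 and 13 for SSE3):

from ctypes import windll, wintypes

has_feature = windll.kernel32.IsProcessorFeaturePresent
has_feature.argtypes = [wintypes.DWORD]

# PF_XMMI_INSTRUCTIONS_AVAILABLE = 6 (SSE)
# PF_XMMI64_INSTRUCTIONS_AVAILABLE = 10 (SSE2)
# PF_SSE3_INSTRUCTIONS_AVAILABLE = 13 (SSE3)
if has_feature(13):
    build = 'sse3'
elif has_feature(10):
    build = 'sse2'
else:
    build = 'nosse'
print("Matching numpy / ATLAS build:", build)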

There is no way of doing this when installing from binary wheels, because the wheel installation process consists of unpacking files to given destinations, and does not allow pre-install or post-install scripts.

One option would be to build a binary wheel with ATLAS that depends on SSE2 instructions. It seems that 99.5% of Windows machines have SSE2 (see: Windows versions). It is not technically difficult to put a check in the numpy __init__.py file to give a helpful error message and die when the CPU does not have SSE2:

try:
    from ctypes import windll, wintypes
except (ImportError, ValueError):
    # Not on Windows (no windll); skip the check.
    pass
else:
    # Feature code 10 is PF_XMMI64_INSTRUCTIONS_AVAILABLE, i.e. SSE2.
    has_feature = windll.kernel32.IsProcessorFeaturePresent
    has_feature.argtypes = [wintypes.DWORD]
    if not has_feature(10):
        msg = ("This version of numpy needs a CPU capable of SSE2, "
               "but Windows says - not so.\n"
               "Please reinstall numpy using a superpack installer")
        raise RuntimeError(msg)

Intel Math Kernel Library

The MKL has a reputation for being fast, particularly on Intel chips (see the MKL Wikipedia entry). It performs well on BLAS / LAPACK benchmarks across the board, except on AMD processors.
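
To check which BLAS / LAPACK library a particular numpy binary was linked against, numpy can print the configuration recorded at build time; an MKL-linked build will list mkl libraries here:

import numpy as np
# Prints the BLAS / LAPACK libraries and include paths recorded
# when this numpy binary was built.
np.show_config()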

MKL is closed-source, but available for free under the Community Licensing program.

The MKL is covered by the Intel Simplified Software License (see the Intel license page). The Simplified Software License does allow us, the developers, to distribute copies of the MKL with our built binaries, where we include their terms of use in our distribution. These include:

YOU AGREE TO INDEMNIFY AND HOLD INTEL HARMLESS AGAINST ANY CLAIMS AND EXPENSES RESULTING FROM YOUR USE OR UNAUTHORIZED USE OF THE SOFTWARE.

This clause appears to apply to the users of our binaries, not to us, the authors of the binaries. This is a change from Intel’s previous MKL license, which required us, the authors, to pay Intel’s legal fees if a user sued Intel.

See discussions about MKL on numpy mailing list and MKL on Julia issues.

Advantages:

  • At or near maximum speed;
  • Runtime processor selection, giving good performance on a range of different CPUs.

Disadvantages:

  • Closed source.

AMD Core Math Library

The ACML was AMD’s equivalent to the MKL, with similar or moderately worse performance. As of the time of writing (December 2015), AMD has marked the ACML as “end of life”, and suggests using the AMD compute libraries instead.

The ACML does not appear to contain a CBLAS interface.

Binaries linked against the ACML have to conform to the ACML license which, like the older MKL license, requires software linked against the ACML to subject its users to the ACML license terms, including:

2. Restrictions. The Software contains copyrighted and patented material, trade secrets and other proprietary material. In order to protect them, and except as permitted by applicable legislation, you may not:

a) decompile, reverse engineer, disassemble or otherwise reduce the Software to a human-perceivable form;

b) modify, network, rent, lend, loan, distribute or create derivative works based upon the Software in whole or in part [...]

AMD compute libraries

AMD advertise the AMD compute libraries (ACL) as the successor to the ACML.

The ACL page points us to BLAS-like instantiation software framework for BLAS and libflame for LAPACK.

BLAS-like instantiation software framework

BLIS is “a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries.”

It provides a superset of BLAS, along with a CBLAS layer.

It can be compiled into a BLAS library. As of writing (December 2015), Windows builds are experimental. BLIS does not currently do run-time hardware detection.

As of December 2015 the developer mailing list was fairly quiet, with only a few emails since August 2015.

Advantages:

  • portable across platforms;
  • modern architecture.

Disadvantages:

  • Windows builds are experimental;
  • No runtime hardware detection.

libflame

libflame is an implementation of some LAPACK routines. See the libflame project page for more detail.

libflame can also be built to include a full LAPACK implementation. It is a sister project to BLIS.

COBLAS

COBLAS is a “Reference BLAS library in C99”, under a BSD license. A quick look at the code in April 2014 suggested that it uses very straightforward implementations that are not highly optimized.

Netlib reference implementation

See netlib BLAS and netlib LAPACK.

Most available benchmarks (e.g. R benchmarks, the BLAS LAPACK review) show the reference BLAS / LAPACK to be considerably slower than any optimized library.
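
For a rough comparison on your own machine, note that a large matrix multiply spends nearly all of its time in the BLAS dgemm routine, so timing it gives a crude measure of BLAS performance. A minimal sketch:

import time
import numpy as np

# A 2000 x 2000 double-precision multiply is dominated by dgemm.
a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

start = time.time()
a.dot(b)
print("2000 x 2000 matrix multiply: %.2f seconds" % (time.time() - start))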

Eigen

Eigen is “a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.”

Eigen is mostly covered by the Mozilla Public License 2, but some features are covered by the LGPL. Non-MPL2 features can be disabled.

It is technically possible to compile Eigen into a BLAS library, but there is currently no CBLAS interface.

See Eigen FAQ entry discussing BLAS / LAPACK.

GotoBLAS2

GotoBLAS2 is the predecessor to OpenBLAS. It was written by Kazushige Goto, who now works for Intel, and released under a BSD license, but it is no longer maintained. It was at or near the top of the benchmarks on which it was tested (e.g. the BLAS LAPACK review and the Eigen benchmarks). Like MKL and ACML, GotoBLAS2 chooses routines at runtime according to the processor, but it does not detect modern processors (those released after about 2011).

OpenBLAS

OpenBLAS is a fork of GotoBLAS2 updated for newer processors. It uses the 3-clause BSD license.

Julia uses OpenBLAS by default.

See OpenBLAS on github for current code state. It appears to be actively merging pull requests. There have been some worries about bugs and lack of tests on the numpy mailing list and the octave list.

It appears to be fast on benchmarks.

OpenBLAS on Win32 seems to be quite stable. Some OpenBLAS issues on Win64 can be addressed by using a single-threaded version of the library.
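
An alternative to a separate single-threaded build is to limit OpenBLAS to one thread at runtime. A minimal sketch, assuming numpy is linked against OpenBLAS; the OPENBLAS_NUM_THREADS environment variable must be set before the library is first loaded:

import os
# Must be set before numpy (and hence OpenBLAS) is imported;
# OpenBLAS reads it once, when the library loads.
os.environ['OPENBLAS_NUM_THREADS'] = '1'

import numpy as np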

Advantages:

  • at or near fastest implementation;
  • runtime hardware detection.

Disadvantages:

  • questions about quality control.