Close

Speed up Python with native modules using Cython

Todays post is all about one of the tools to speed up your Python code.

Cython can be installed with pip:

pip install cython

What the heck is Cython anyway?

From one point of view, Cython is a language that is very close to a superset of Python syntax. That is to say, most Python expressions are valid Cython. There are a very few cases where valid Python is not valid Cython, and that gap gets smaller and smaller for every Cython release.

Alternatively, Cython is a tool which takes Python like code and generates C (or C++) code which uses the CPython API. When compiled as a shared object, this library can be imported just like any standard Python module. For more details, read it from the horse’s mouth.

Let’s consider a Python module which exposes a function to compute Fibonacci numbers. The implementation is terrible (exponential complexity) but that’s convenient for benchmarking purposes.

This is the function (I put it in a file called fib_python.py):

def fib(n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

Now let’s make sure it works correctly:

In [1]: from fib_python import fib

In [2]: [fib(n) for n in range(10)]
Out[2]: [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

Looks good to me.

Ok, now to establish a baseline performance metric:

In [3]: %timeit fib(25)
10 loops, best of 3: 26.8 ms per loop

26.8 ms is a long time. The ‘right’ way to speed this up is to fix the implementation (see my memoization post), but let’s see what Cython can do for us.

As I said, most Python programs are valid Cython programs. Let’s transplant our function into a new file which we’ll call fib_cython.pyx (By convention, we use the pyx file extension for Cython source files.)

fib_cython.pyx contains:

def fib(n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

To compile, we’ll use the cythonize command:

$ cythonize -i fib_cython.pyx

This produces the following output on my machine:

Compiling /Users/harding/Documents/blogs/cython/fib_cython.pyx because it changed.
[1/1] Cythonizing /Users/harding/Documents/blogs/cython/fib_cython.pyx
running build_ext
building 'fib_cython' extension
creating /Users/harding/Documents/blogs/cython/tmpVVAe2F/Users
creating /Users/harding/Documents/blogs/cython/tmpVVAe2F/Users/harding
creating /Users/harding/Documents/blogs/cython/tmpVVAe2F/Users/harding/Documents
creating /Users/harding/Documents/blogs/cython/tmpVVAe2F/Users/harding/Documents/blogs
creating /Users/harding/Documents/blogs/cython/tmpVVAe2F/Users/harding/Documents/blogs/cython
gcc -fno-strict-aliasing -I/Users/harding/anaconda/include -arch x86_64 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/harding/anaconda/include/python2.7 -c /Users/harding/Documents/blogs/cython/fib_cython.c -o /Users/harding/Documents/blogs/cython/tmpVVAe2F/Users/harding/Documents/blogs/cython/fib_cython.o
gcc -bundle -undefined dynamic_lookup -L/Users/harding/anaconda/lib -arch x86_64 -arch x86_64 /Users/harding/Documents/blogs/cython/tmpVVAe2F/Users/harding/Documents/blogs/cython/fib_cython.o -L/Users/harding/anaconda/lib -o /Users/harding/Documents/blogs/cython/fib_cython.so

As you can see, the command made a C file which was then compiled using gcc and
produces an output file fib_cython.so. This is the native module which we can
import and use in Python. Let’s give it a try:

In [4]: from fib_cython import fib as fib_cython

In [5]: [fib_cython(n) for n in range(10)]
Out[5]: [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

In [6]: %timeit fib_cython(25)
100 loops, best of 3: 8.7 ms per loop

Holy cow!!! That’s about three times faster with no code changes!

The only thing we had to do was compile it! This is super promising right?!

It gets way better.

One of the coolest tools that comes with Cython is the annotation tool:
(note that we use cython and not cythonize here.)

cython -a fib_cython.pyx

This produces an html file called fib_cython.html which visualizes the Cython code and uses yellow highlighting with varying shades of yellow to indicate how much C code is generated for the given line. If you click on the line it shows you the C code associated with that line. You can use the brightness of the line as a heuristic for where you should focus your attention to add type annotations to speed thing up.

This is the rendered html for our Cython file:


Generated by Cython 0.25.2

Yellow lines hint at Python interaction.
Click on a line that starts with a “+” to see the C code that Cython generated for it.

Raw output: fib_cython.c

 1: 
+2: def fib(n):
/* Python wrapper */
static PyObject *__pyx_pw_10fib_cython_1fib(PyObject *__pyx_self, PyObject *__pyx_v_n); /*proto*/
static PyMethodDef __pyx_mdef_10fib_cython_1fib = {"fib", (PyCFunction)__pyx_pw_10fib_cython_1fib, METH_O, 0};
static PyObject *__pyx_pw_10fib_cython_1fib(PyObject *__pyx_self, PyObject *__pyx_v_n) {
  PyObject *__pyx_r = 0;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("fib (wrapper)", 0);
  __pyx_r = __pyx_pf_10fib_cython_fib(__pyx_self, ((PyObject *)__pyx_v_n));

  /* function exit code */
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}

static PyObject *__pyx_pf_10fib_cython_fib(CYTHON_UNUSED PyObject *__pyx_self, PyObject *__pyx_v_n) {
  PyObject *__pyx_r = NULL;
  __Pyx_RefNannyDeclarations
  __Pyx_RefNannySetupContext("fib", 0);
/* … */
  /* function exit code */
  __pyx_L1_error:;
  __Pyx_XDECREF(__pyx_t_1);
  __Pyx_XDECREF(__pyx_t_2);
  __Pyx_XDECREF(__pyx_t_4);
  __Pyx_XDECREF(__pyx_t_5);
  __Pyx_XDECREF(__pyx_t_6);
  __Pyx_XDECREF(__pyx_t_7);
  __Pyx_XDECREF(__pyx_t_8);
  __Pyx_AddTraceback("fib_cython.fib", __pyx_clineno, __pyx_lineno, __pyx_filename);
  __pyx_r = NULL;
  __pyx_L0:;
  __Pyx_XGIVEREF(__pyx_r);
  __Pyx_RefNannyFinishContext();
  return __pyx_r;
}
/* … */
  __pyx_tuple_ = PyTuple_Pack(1, __pyx_n_s_n); if (unlikely(!__pyx_tuple_)) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_tuple_);
  __Pyx_GIVEREF(__pyx_tuple_);
/* … */
  __pyx_t_1 = PyCFunction_NewEx(&__pyx_mdef_10fib_cython_1fib, NULL, __pyx_n_s_fib_cython); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
  if (PyDict_SetItem(__pyx_d, __pyx_n_s_fib, __pyx_t_1) < 0) __PYX_ERR(0, 2, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;
+3:     return 1 if n <= 1 else fib(n - 1) + fib(n - 2)
  __Pyx_XDECREF(__pyx_r);
  __pyx_t_2 = PyObject_RichCompare(__pyx_v_n, __pyx_int_1, Py_LE); __Pyx_XGOTREF(__pyx_t_2); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 3, __pyx_L1_error)
  __pyx_t_3 = __Pyx_PyObject_IsTrue(__pyx_t_2); if (unlikely(__pyx_t_3 < 0)) __PYX_ERR(0, 3, __pyx_L1_error)
  __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
  if (__pyx_t_3) {
    __Pyx_INCREF(__pyx_int_1);
    __pyx_t_1 = __pyx_int_1;
  } else {
    __pyx_t_4 = __Pyx_GetModuleGlobalName(__pyx_n_s_fib); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 3, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_4);
    __pyx_t_5 = __Pyx_PyInt_SubtractObjC(__pyx_v_n, __pyx_int_1, 1, 0); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 3, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_5);
    __pyx_t_6 = NULL;
    if (CYTHON_UNPACK_METHODS && unlikely(PyMethod_Check(__pyx_t_4))) {
      __pyx_t_6 = PyMethod_GET_SELF(__pyx_t_4);
      if (likely(__pyx_t_6)) {
        PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_4);
        __Pyx_INCREF(__pyx_t_6);
        __Pyx_INCREF(function);
        __Pyx_DECREF_SET(__pyx_t_4, function);
      }
    }
    if (!__pyx_t_6) {
      __pyx_t_2 = __Pyx_PyObject_CallOneArg(__pyx_t_4, __pyx_t_5); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 3, __pyx_L1_error)
      __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
      __Pyx_GOTREF(__pyx_t_2);
    } else {
      #if CYTHON_FAST_PYCALL
      if (PyFunction_Check(__pyx_t_4)) {
        PyObject *__pyx_temp[2] = {__pyx_t_6, __pyx_t_5};
        __pyx_t_2 = __Pyx_PyFunction_FastCall(__pyx_t_4, __pyx_temp+1-1, 1+1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 3, __pyx_L1_error)
        __Pyx_XDECREF(__pyx_t_6); __pyx_t_6 = 0;
        __Pyx_GOTREF(__pyx_t_2);
        __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
      } else
      #endif
      #if CYTHON_FAST_PYCCALL
      if (__Pyx_PyFastCFunction_Check(__pyx_t_4)) {
        PyObject *__pyx_temp[2] = {__pyx_t_6, __pyx_t_5};
        __pyx_t_2 = __Pyx_PyCFunction_FastCall(__pyx_t_4, __pyx_temp+1-1, 1+1); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 3, __pyx_L1_error)
        __Pyx_XDECREF(__pyx_t_6); __pyx_t_6 = 0;
        __Pyx_GOTREF(__pyx_t_2);
        __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
      } else
      #endif
      {
        __pyx_t_7 = PyTuple_New(1+1); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 3, __pyx_L1_error)
        __Pyx_GOTREF(__pyx_t_7);
        __Pyx_GIVEREF(__pyx_t_6); PyTuple_SET_ITEM(__pyx_t_7, 0, __pyx_t_6); __pyx_t_6 = NULL;
        __Pyx_GIVEREF(__pyx_t_5);
        PyTuple_SET_ITEM(__pyx_t_7, 0+1, __pyx_t_5);
        __pyx_t_5 = 0;
        __pyx_t_2 = __Pyx_PyObject_Call(__pyx_t_4, __pyx_t_7, NULL); if (unlikely(!__pyx_t_2)) __PYX_ERR(0, 3, __pyx_L1_error)
        __Pyx_GOTREF(__pyx_t_2);
        __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
      }
    }
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
    __pyx_t_7 = __Pyx_GetModuleGlobalName(__pyx_n_s_fib); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 3, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_7);
    __pyx_t_5 = __Pyx_PyInt_SubtractObjC(__pyx_v_n, __pyx_int_2, 2, 0); if (unlikely(!__pyx_t_5)) __PYX_ERR(0, 3, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_5);
    __pyx_t_6 = NULL;
    if (CYTHON_UNPACK_METHODS && unlikely(PyMethod_Check(__pyx_t_7))) {
      __pyx_t_6 = PyMethod_GET_SELF(__pyx_t_7);
      if (likely(__pyx_t_6)) {
        PyObject* function = PyMethod_GET_FUNCTION(__pyx_t_7);
        __Pyx_INCREF(__pyx_t_6);
        __Pyx_INCREF(function);
        __Pyx_DECREF_SET(__pyx_t_7, function);
      }
    }
    if (!__pyx_t_6) {
      __pyx_t_4 = __Pyx_PyObject_CallOneArg(__pyx_t_7, __pyx_t_5); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 3, __pyx_L1_error)
      __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
      __Pyx_GOTREF(__pyx_t_4);
    } else {
      #if CYTHON_FAST_PYCALL
      if (PyFunction_Check(__pyx_t_7)) {
        PyObject *__pyx_temp[2] = {__pyx_t_6, __pyx_t_5};
        __pyx_t_4 = __Pyx_PyFunction_FastCall(__pyx_t_7, __pyx_temp+1-1, 1+1); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 3, __pyx_L1_error)
        __Pyx_XDECREF(__pyx_t_6); __pyx_t_6 = 0;
        __Pyx_GOTREF(__pyx_t_4);
        __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
      } else
      #endif
      #if CYTHON_FAST_PYCCALL
      if (__Pyx_PyFastCFunction_Check(__pyx_t_7)) {
        PyObject *__pyx_temp[2] = {__pyx_t_6, __pyx_t_5};
        __pyx_t_4 = __Pyx_PyCFunction_FastCall(__pyx_t_7, __pyx_temp+1-1, 1+1); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 3, __pyx_L1_error)
        __Pyx_XDECREF(__pyx_t_6); __pyx_t_6 = 0;
        __Pyx_GOTREF(__pyx_t_4);
        __Pyx_DECREF(__pyx_t_5); __pyx_t_5 = 0;
      } else
      #endif
      {
        __pyx_t_8 = PyTuple_New(1+1); if (unlikely(!__pyx_t_8)) __PYX_ERR(0, 3, __pyx_L1_error)
        __Pyx_GOTREF(__pyx_t_8);
        __Pyx_GIVEREF(__pyx_t_6); PyTuple_SET_ITEM(__pyx_t_8, 0, __pyx_t_6); __pyx_t_6 = NULL;
        __Pyx_GIVEREF(__pyx_t_5);
        PyTuple_SET_ITEM(__pyx_t_8, 0+1, __pyx_t_5);
        __pyx_t_5 = 0;
        __pyx_t_4 = __Pyx_PyObject_Call(__pyx_t_7, __pyx_t_8, NULL); if (unlikely(!__pyx_t_4)) __PYX_ERR(0, 3, __pyx_L1_error)
        __Pyx_GOTREF(__pyx_t_4);
        __Pyx_DECREF(__pyx_t_8); __pyx_t_8 = 0;
      }
    }
    __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
    __pyx_t_7 = PyNumber_Add(__pyx_t_2, __pyx_t_4); if (unlikely(!__pyx_t_7)) __PYX_ERR(0, 3, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_7);
    __Pyx_DECREF(__pyx_t_2); __pyx_t_2 = 0;
    __Pyx_DECREF(__pyx_t_4); __pyx_t_4 = 0;
    __pyx_t_1 = __pyx_t_7;
    __pyx_t_7 = 0;
  }
  __pyx_r = __pyx_t_1;
  __pyx_t_1 = 0;
  goto __pyx_L0;

This is where the Cython specific syntax comes in. In Python, everything is an object. From a C API perspective, that means that everything is an instance of the PyObject struct. Since every Python operation involves interfacing with the PyObject at runtime there is a bit of overhead that you can eliminate.

Cython allows you to specify the types of variables, function return values etc. which allows the resulting C code to use those types directly without any of the PyObject boxing.

One other super important point. Cython also provides alternatives for the def keyword: cdef and cpdef. cdef indicates that the function will only be called from within Cython and can use the C calling convention instead of the Python calling convention. This can be a huge speedup for problems which involve many function calls. cpdef provides a compromise. Functions declared as cpdef end up generating two versions. One with the C calling convention and one with the Python calling convention. Calls from Cython code will use the c version and calls from Python code will use the Python convention.

Let’s try that out. We’ll make a new file called fib_cpdef.pyx:

# fib_cpdef.pyx

cpdef fib(n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

Here’s the results:

In [7]: from fib_cpdef import fib as fib_cpdef

In [8]: [fib_cpdef(n) for n in range(10)]
Out[8]: [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

In [9]: %timeit fib_cpdef(25)
100 loops, best of 3: 5.35 ms per loop

Yeeeeehaw! That’s one and a half times faster yet!

Now let’s annotate the types. We’ll use a file called fib_typed.pyx:

# fib_typed.pyx

cpdef int fib(int n):
    return 1 if n <= 1 else fib(n - 1) + fib(n - 2)

The only additions are the int specifying the return type and the int specifying the type of the parameter n.

In [10]: from fib_typed import fib as fib_typed

In [11]: [fib_typed(n) for n in range(10)]
Out[11]: [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

In [12]: %timeit fib_typed(25)
1000 loops, best of 3: 255 µs per loop

Dear God. That’s more than twenty times faster than the cpdef version!

The final version is 105 times faster than the original Python version.

The real test is to compare this performance to an actual hand-written c implementation. Let’s write a little C library to test against (fib.c):

int fib(int n)
{
    if(n <= 1) {
        return 1;
    }
    return fib(n - 1) + fib(n - 2);
}

Make the lib by doing:

gcc -O3 -c fib.c
gcc -O3 -shared -fPIC fib.o -o fib.so

This produces fib.so which is a standard C dynamically linked library (i.e. not a Python native module)

Let’s use the Python C foreign function interface to test performance:

In [13]: from ctypes import CDLL

In [14]: fib_lib = CDLL('fib.so')

In [15]: fib_c = fib_lib.fib

In [16]: %timeit fib_c(25)
1000 loops, best of 3: 255 µs per loop

Hooray! We got the exact same performance as a straight C library compiled with -O3!

I think this is really amazing. Cython blends the fantastic syntax and rapid development features of Python with the performance of raw C.

I’ve barely scratched the surface of what’s possible with Cython, but hopefully I’ve inspired you to play around with this fantastic tool and hopefully you’ll consider using Cython instead of porting perfectly good Python code to C/C++.

Leave a Reply

Your email address will not be published. Required fields are marked *