Differences between revisions 2 and 3
Revision 2 as of 2006-10-01 00:03:42
Size: 1797
Editor: DavidHarvey
Comment:
Revision 3 as of 2006-10-01 03:27:42
Size: 5152
Editor: DavidHarvey
Comment: flesh out "cdef functions vs def functions"
Deletions are marked like this. Additions are marked like this.
Line 11: Line 11:
Apart from the stuff on this page, by far the best way to learn how to make Pyrex code faster is to '''study the C code that Pyrex generates'''.
Line 15: Line 17:
 * When you give an example, make sure to constrast a ''fast'' way of doing things with a ''slow'' way, especially if the slow way is more obvious. Show both versions of the Pyrex code, and show the generated C code as well (if you think that this is useful to see).  * When you give an example, make sure to constrast a ''fast'' way of doing things with a ''slow'' way, especially if the slow way is more obvious. Show both versions of the Pyrex code, and show the generated C code as well, if you think that this is useful to see. (Remove the irrelevant parts of the C code, because it can get pretty verbose :-).)
Line 23: Line 25:
[ todo ] A {{{cdef}}} function in Pyrex is basically a C function, so calling it is basically a few lines of assembly language. A {{{def}}} function on the other hand is a ''Python function'', and incurs the following overhead:
Line 25: Line 27:
 * function call overhead -- constructing tuples
 * parseTuple stuff inside the def function
 * The caller needs to construct a tuple object that holds the arguments.
 * The call itself has to go via the Python API.
 * The function being called needs to call the Python API to decode the tuple.

Additionally, in most cases:

 * The caller has to do a name lookup to find the function. (But you can cache this if you're calling the same function many times.)

All of this overhead is incurred '''whether you are calling from Python, or from Pyrex, or from Mars.'''

=== Example ===

Here's the Pyrex code:

{{{
cdef class X:
    def def_func(X self):
        pass

    cdef cdef_func(X self):
        pass

def call_def_func(X x):
    cdef int i
    for i from 0 <= i < 10000000:
        x.def_func()

def call_cdef_func(X x):
    cdef int i
    for i from 0 <= i < 10000000:
        x.cdef_func()
}}}

Performance data:

{{{
sage: x = X()

sage: time call_def_func(x)
CPU times: user 1.82 s, sys: 0.00 s, total: 1.82 s
Wall time: 1.82

sage: time call_cdef_func(x)
CPU times: user 0.08 s, sys: 0.00 s, total: 0.08 s
Wall time: 0.08
}}}

Pretty striking difference.

Here's the C code for {{{def_func}}}, note the {{{PyArg_ParseTupleAndKeywords}}} API call:

{{{
static PyObject *__pyx_f_7integer_1X_def_func(PyObject *__pyx_v_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
  PyObject *__pyx_r;
  static char *__pyx_argnames[] = {0};
  if (!PyArg_ParseTupleAndKeywords(__pyx_args, __pyx_kwds, "", __pyx_argnames)) return 0;
  Py_INCREF(__pyx_v_self);

  __pyx_r = Py_None; Py_INCREF(Py_None);
  Py_DECREF(__pyx_v_self);
  return __pyx_r;
}
}}}

In contrast, here's the C code for {{{cdef_func}}}:

{{{
static PyObject *__pyx_f_7integer_1X_cdef_func(struct __pyx_obj_7integer_X *__pyx_v_self) {
  PyObject *__pyx_r;
  Py_INCREF(__pyx_v_self);

  __pyx_r = Py_None; Py_INCREF(Py_None);
  Py_DECREF(__pyx_v_self);
  return __pyx_r;
}
}}}

Now here are the two versions of the loop that actually does the calling. First, {{{call_def_func}}}, with error handling suppressed:

{{{
  for (__pyx_v_i = 0; __pyx_v_i < 10000000; ++__pyx_v_i) {
    __pyx_1 = PyObject_GetAttr(((PyObject *)__pyx_v_x), __pyx_n_def_func);
    __pyx_2 = PyTuple_New(0);
    __pyx_3 = PyObject_CallObject(__pyx_1, __pyx_2);
    Py_DECREF(__pyx_1); __pyx_1 = 0;
    Py_DECREF(__pyx_2); __pyx_2 = 0;
    Py_DECREF(__pyx_3); __pyx_3 = 0;
  }
}}}

Notice it needs to do (a) a name lookup for {{{def_func}}}, (b) construct a tuple, and (c) call the Python API to make the function call.

Here's the much much slicker {{{call_cdef_func}}} version:

{{{
  for (__pyx_v_i = 0; __pyx_v_i < 10000000; ++__pyx_v_i) {
    __pyx_1 = ((struct __pyx_vtabstruct_7integer_X *)__pyx_v_x->__pyx_vtab)->cdef_func(__pyx_v_x);
    Py_DECREF(__pyx_1); __pyx_1 = 0;
  }
}}}
Line 50: Line 150:

== use void return type when possible ==

[ todo ]

== exception handling is not as bad as you'd think ==

[ todo ]

 * If no exception occurs, overhead is really tiny

[ this page is under construction ]

Introduction

This page describes some techniques for writing really fast Pyrex code. It is aimed at SAGE developers who are working on low-level SAGE internals, where performance is absolutely crucial.

Pyrex is a very unusual language. If you write Pyrex as if it were Python, it can end up running as slowly as Python. If you write it like you're writing C, it can run almost as fast as pure C. The amazing thing about Pyrex is that the programmer gets to choose, pretty much line by line, where along the Python-C spectrum they want to work.

HOWEVER... it is hard work to make your Pyrex as fast as C. It's very easy to get it wrong, essentially because Pyrex makes it all look so easy. This purpose of this document is to collect together the experiences of SAGE developers who have learned the hard way.

Apart from the stuff on this page, by far the best way to learn how to make Pyrex code faster is to study the C code that Pyrex generates.

How to contribute to this document

Yes, please do! Make sure to follow these guidelines:

  • When you give an example, make sure to constrast a fast way of doing things with a slow way, especially if the slow way is more obvious. Show both versions of the Pyrex code, and show the generated C code as well, if you think that this is useful to see. (Remove the irrelevant parts of the C code, because it can get pretty verbose :-).)

  • Let's keep this scientific: show some evidence of performance (e.g. timing data).

Examples

cdef functions vs def functions

A cdef function in Pyrex is basically a C function, so calling it is basically a few lines of assembly language. A def function on the other hand is a Python function, and incurs the following overhead:

  • The caller needs to construct a tuple object that holds the arguments.
  • The call itself has to go via the Python API.
  • The function being called needs to call the Python API to decode the tuple.

Additionally, in most cases:

  • The caller has to do a name lookup to find the function. (But you can cache this if you're calling the same function many times.)

All of this overhead is incurred whether you are calling from Python, or from Pyrex, or from Mars.

Example

Here's the Pyrex code:

cdef class X:
    def def_func(X self):
        pass

    cdef cdef_func(X self):
        pass

def call_def_func(X x):
    cdef int i
    for i from 0 <= i < 10000000:
        x.def_func()

def call_cdef_func(X x):
    cdef int i
    for i from 0 <= i < 10000000:
        x.cdef_func()

Performance data:

sage: x = X()

sage: time call_def_func(x)
CPU times: user 1.82 s, sys: 0.00 s, total: 1.82 s
Wall time: 1.82

sage: time call_cdef_func(x)
CPU times: user 0.08 s, sys: 0.00 s, total: 0.08 s
Wall time: 0.08

Pretty striking difference.

Here's the C code for def_func, note the PyArg_ParseTupleAndKeywords API call:

static PyObject *__pyx_f_7integer_1X_def_func(PyObject *__pyx_v_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
  PyObject *__pyx_r;
  static char *__pyx_argnames[] = {0};
  if (!PyArg_ParseTupleAndKeywords(__pyx_args, __pyx_kwds, "", __pyx_argnames)) return 0;
  Py_INCREF(__pyx_v_self);

  __pyx_r = Py_None; Py_INCREF(Py_None);
  Py_DECREF(__pyx_v_self);
  return __pyx_r;
}

In contrast, here's the C code for cdef_func:

static PyObject *__pyx_f_7integer_1X_cdef_func(struct __pyx_obj_7integer_X *__pyx_v_self) {
  PyObject *__pyx_r;
  Py_INCREF(__pyx_v_self);

  __pyx_r = Py_None; Py_INCREF(Py_None);
  Py_DECREF(__pyx_v_self);
  return __pyx_r;
}

Now here are the two versions of the loop that actually does the calling. First, call_def_func, with error handling suppressed:

  for (__pyx_v_i = 0; __pyx_v_i < 10000000; ++__pyx_v_i) {
    __pyx_1 = PyObject_GetAttr(((PyObject *)__pyx_v_x), __pyx_n_def_func);
    __pyx_2 = PyTuple_New(0);
    __pyx_3 = PyObject_CallObject(__pyx_1, __pyx_2);
    Py_DECREF(__pyx_1); __pyx_1 = 0;
    Py_DECREF(__pyx_2); __pyx_2 = 0;
    Py_DECREF(__pyx_3); __pyx_3 = 0;
  }

Notice it needs to do (a) a name lookup for def_func, (b) construct a tuple, and (c) call the Python API to make the function call.

Here's the much much slicker call_cdef_func version:

  for (__pyx_v_i = 0; __pyx_v_i < 10000000; ++__pyx_v_i) {
    __pyx_1 = ((struct __pyx_vtabstruct_7integer_X *)__pyx_v_x->__pyx_vtab)->cdef_func(__pyx_v_x);
    Py_DECREF(__pyx_1); __pyx_1 = 0;
  }

python attributes vs cdef attributes

[ todo ]

avoid python name lookups

[ todo ]

type checking

[ todo ]

"==" vs "is"

[ todo ]

  • e.g. testing for None

use void return type when possible

[ todo ]

exception handling is not as bad as you'd think

[ todo ]

  • If no exception occurs, overhead is really tiny