Differences between revisions 1 and 20 (spanning 19 versions)

[ this page is permanently under construction ]

Contents

Prologue
Introduction
1. How to contribute to this document
cdef functions vs def functions
1. Example
python attributes vs cdef attributes
avoid python name lookups
Type checking
"==" vs "is"
tuple, list, and dict access
integer "for" loops
exception handling is not as bad as you'd think
a.__add__(b) is worse than a+b in Pyrex speed-wise
Offtopic: Malloc Replacements

Prologue

With the recent improvements in Cython/SageX, many of these suggestions are handled implicitly and give very little benefit at a large cost in code readability.

Introduction

This page describes some techniques for writing really fast Pyrex code. It is aimed at SAGE developers who are working on low-level SAGE internals, where performance is absolutely crucial.

Pyrex is a very unusual language. If you write Pyrex as if it were Python, it can end up running more slowly than Python. If you write it like you're writing C, it can run almost as fast as pure C. The amazing thing about Pyrex is that the programmer gets to choose, pretty much line by line, where along the Python-C spectrum they want to work.

HOWEVER... it is hard work to make your Pyrex as fast as C. It's very easy to get it wrong, essentially because Pyrex makes it all look so easy. The purpose of this document is to collect together the experiences of SAGE developers who have learned the hard way. [Cython does a lot of this hard work for you.]

Apart from the stuff on this page, by far the best way to learn how to make Pyrex code faster is to study the C code that Pyrex generates.

How to contribute to this document

Yes, please do! Make sure to follow these guidelines:

When you give an example, make sure to contrast a fast way of doing things with a slow way, especially if the slow way is more obvious. Show both versions of the Pyrex code, and show the generated C code as well, if you think that this is useful to see. (Remove the irrelevant parts of the C code, because it can get pretty verbose :-).)
Let's keep this scientific: show some evidence of performance (e.g. timing data).
William Stein is writing a chapter for the programming guide: See http://sage.math.washington.edu/sage/pyrex_guide.
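
One way to gather such timing data outside the Sage prompt is Python's standard timeit module. The following is only a sketch: version_a and version_b are hypothetical stand-ins for the two implementations being compared, and the loop count is arbitrary.

```python
import timeit

def version_a(n):
    # Hypothetical first implementation: builds a list, then sums it.
    return sum([i * i for i in range(n)])

def version_b(n):
    # Hypothetical second implementation: accumulates directly.
    total = 0
    for i in range(n):
        total += i * i
    return total

def best_time(func, n, repeat=3):
    # Run func(n) several times and keep the best (least noisy) timing.
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeat))

if __name__ == "__main__":
    n = 100000
    # Check that both versions agree before comparing their speed.
    assert version_a(n) == version_b(n)
    print("version_a: %.4f s" % best_time(version_a, n))
    print("version_b: %.4f s" % best_time(version_b, n))
```

Taking the minimum over a few repeats is a common way to reduce noise from other processes; which version wins depends on the machine and interpreter, so post the numbers you actually measured.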

cdef functions vs def functions

A cdef function in Pyrex is basically a C function, so calling it is basically a few lines of assembly language. A def function on the other hand is a Python function, and incurs the following overhead:

The caller needs to construct a tuple object that holds the arguments.
The call itself has to go via the Python API.
The function being called needs to call the Python API to decode the tuple.

Additionally, in most cases:

The caller has to do a name lookup to find the function. (But you can cache this if you're calling the same function many times.)

All of this overhead is incurred whether you are calling from Python, or from Pyrex, or from Mars.
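
The remark about caching the name lookup can be illustrated in plain Python. This is a minimal sketch, not Pyrex: the class X here is an ordinary Python class, and both function names are made up for the illustration.

```python
class X(object):
    def def_func(self):
        return 1

def call_with_lookup(x, n):
    total = 0
    for _ in range(n):
        total += x.def_func()   # attribute lookup on every iteration
    return total

def call_with_cached_lookup(x, n):
    f = x.def_func              # look the bound method up once
    total = 0
    for _ in range(n):
        total += f()            # still a Python-level call, but no lookup
    return total
```

Caching removes only the lookup. The tuple construction and the trip through the Python API remain, which is why a cdef function is still much cheaper to call.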

Example

Here's the Pyrex code:

    cdef class X:
        def def_func(X self):
            pass

        cdef cdef_func(X self):
            pass

    def call_def_func(X x):
        cdef int i
        for i from 0 <= i < 10000000:
            x.def_func()

    def call_cdef_func(X x):
        cdef int i
        for i from 0 <= i < 10000000:
            x.cdef_func()

Performance data:

    sage: x = X()

    sage: time call_def_func(x)
    CPU times: user 1.82 s, sys: 0.00 s, total: 1.82 s
    Wall time: 1.82

    sage: time call_cdef_func(x)
    CPU times: user 0.08 s, sys: 0.00 s, total: 0.08 s
    Wall time: 0.08

Pretty striking difference. (And by the way, the second one goes twice as fast again if you declare a void return type.)

Here's the C code for def_func; note the PyArg_ParseTupleAndKeywords API call:

    static PyObject *__pyx_f_7integer_1X_def_func(PyObject *__pyx_v_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
      PyObject *__pyx_r;
      static char *__pyx_argnames[] = {0};
      if (!PyArg_ParseTupleAndKeywords(__pyx_args, __pyx_kwds, "", __pyx_argnames)) return 0;
      Py_INCREF(__pyx_v_self);

      __pyx_r = Py_None; Py_INCREF(Py_None);
      Py_DECREF(__pyx_v_self);
      return __pyx_r;
    }

In contrast, here's the C code for cdef_func:

    static PyObject *__pyx_f_7integer_1X_cdef_func(struct __pyx_obj_7integer_X *__pyx_v_self) {
      PyObject *__pyx_r;
      Py_INCREF(__pyx_v_self);

      __pyx_r = Py_None; Py_INCREF(Py_None);
      Py_DECREF(__pyx_v_self);
      return __pyx_r;
    }

Now here are the two versions of the loop that actually does the calling. First, call_def_func, with error handling suppressed:

    for (__pyx_v_i = 0; __pyx_v_i < 10000000; ++__pyx_v_i) {
      __pyx_1 = PyObject_GetAttr(((PyObject *)__pyx_v_x), __pyx_n_def_func);
      __pyx_2 = PyTuple_New(0);
      __pyx_3 = PyObject_CallObject(__pyx_1, __pyx_2);
      Py_DECREF(__pyx_1); __pyx_1 = 0;
      Py_DECREF(__pyx_2); __pyx_2 = 0;
      Py_DECREF(__pyx_3); __pyx_3 = 0;
    }

Notice that it needs to (a) look up the name def_func, (b) construct a tuple, and (c) call the Python API to make the function call.

Here's the much much slicker call_cdef_func version:

    for (__pyx_v_i = 0; __pyx_v_i < 10000000; ++__pyx_v_i) {
      __pyx_1 = ((struct __pyx_vtabstruct_7integer_X *)__pyx_v_x->__pyx_vtab)->cdef_func(__pyx_v_x);
      Py_DECREF(__pyx_1); __pyx_1 = 0;
    }

python attributes vs cdef attributes

[ todo ]

avoid python name lookups

[ todo ]

sage.rings.integer.Integer
PyObject_HasAttrString

Type checking

Generally speaking, using isinstance is very slow. It's slow because it needs to cover many cases, and it's slow because it's a Python function. If you are checking for a particular Pyrex-generated type (or indeed any extension type), it's much faster to use the Python API function PyObject_TypeCheck. In fact, PyObject_TypeCheck is a macro, so there isn't even any C function call overhead. The prototype is declared in cdefs.pxi.

Example Pyrex code:

    cdef class X:
        pass

    def test1(x):
        cdef int i
        for i from 0 <= i < 5000000:
            if isinstance(x, X):
                pass

    def test2(x):
        cdef int i
        for i from 0 <= i < 5000000:
            if PyObject_TypeCheck(x, X):
                pass

Here's the performance comparison:

    sage: time test1(47)
    CPU times: user 3.89 s, sys: 0.00 s, total: 3.89 s
    Wall time: 3.89

    sage: time test2(47)
    CPU times: user 0.16 s, sys: 0.00 s, total: 0.16 s
    Wall time: 0.16

(Note: there is also some overhead in the name lookup for isinstance, but in this case it's a fairly small part of the time, about 10% or so.)

Here's the C code for the first loop (error checking suppressed):

    for (__pyx_v_i = 0; __pyx_v_i < 5000000; ++__pyx_v_i) {
      __pyx_1 = __Pyx_GetName(__pyx_b, __pyx_n_isinstance);
      __pyx_2 = PyTuple_New(2);
      Py_INCREF(__pyx_v_x);
      PyTuple_SET_ITEM(__pyx_2, 0, __pyx_v_x);
      Py_INCREF(((PyObject*)__pyx_ptype_7integer_X));
      PyTuple_SET_ITEM(__pyx_2, 1, ((PyObject*)__pyx_ptype_7integer_X));
      __pyx_3 = PyObject_CallObject(__pyx_1, __pyx_2);
      Py_DECREF(__pyx_1); __pyx_1 = 0;
      Py_DECREF(__pyx_2); __pyx_2 = 0;
      __pyx_4 = PyObject_IsTrue(__pyx_3);
      Py_DECREF(__pyx_3); __pyx_3 = 0;
      if (__pyx_4) {
        goto __pyx_L4;
      }
      __pyx_L4:;
    }

And here's the much tighter code for the second loop:

    for (__pyx_v_i = 0; __pyx_v_i < 5000000; ++__pyx_v_i) {
      __pyx_1 = PyObject_TypeCheck(__pyx_v_x,((PyObject*)__pyx_ptype_7integer_X));
      if (__pyx_1) {
        goto __pyx_L4;
      }
      __pyx_L4:;
    }

"==" vs "is"

"a is b" checks whether a and b are the same object, while "a == b" returns the result of either a.__class__.__richcmp__(a,b,2) or b.__class__.__richcmp__(b,a,2), depending on whether a's class can handle comparison with b. (The 2 is the operator code for ==.)

Identity checks are used quite often, for example to check whether an attribute of a class has already been computed, like this:

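    cdef class MyClass:
        cdef object __cached_result    # cached result, None until computed
        ...
        def calculate_result(MyClass self):
            if self.__cached_result is not None:
                return self.__cached_result
            else:
                # _compute_result is a hypothetical helper that does the real work;
                # calling calculate_result() here would recurse forever
                self.__cached_result = self._compute_result()
                return self.__cached_result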

The default value of any object attribute of a cdef class is None, so the attribute stays None until the result has been computed. There is only one None object in Python, so it's safe to check for identity rather than equality. Consider these timing examples.

    from sage.rings.integer_ring import ZZ

    n = 600000

    def test1():
      one = ZZ(1)
      for i in range(n):
        one is None


    def test2():
      one = ZZ(1)
      for i in range(n):
        one == None

Now consider the times these functions take:

    %time test1()
    # CPU time: 0.18 s,  Wall time: 0.40 s

    %time test2()
    # CPU time: 14.85 s,  Wall time: 44.80 s

So it's way faster to test for identity with None than to compare for equality with None.

This is also true for pure Python.
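
The reason for the difference can be seen in a small pure-Python sketch. The class Noisy is hypothetical and exists only to count how often its equality method runs; it stands in for any type with a user-defined comparison, such as a Sage integer.

```python
class Noisy(object):
    """Counts how often its equality method is invoked."""
    eq_calls = 0

    def __eq__(self, other):
        Noisy.eq_calls += 1
        return self is other

    # Keep instances hashable even though __eq__ is overridden.
    __hash__ = object.__hash__

obj = Noisy()

identity_result = obj is None    # pointer comparison, no method call
calls_after_is = Noisy.eq_calls

equality_result = obj == None    # dispatches to Noisy.__eq__
calls_after_eq = Noisy.eq_calls
```

Both tests give the same answer here, but only the equality test runs user-level comparison code, and that dispatch is the cost that shows up in the timings above.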

tuple, list, and dict access

Pyrex plays it safe when it comes to list, tuple, and dict access:

    def test():
        t = tuple([1,2])
        t[0]

t[0] gets translated to:

    __pyx_1 = PyInt_FromLong(0);

    if (!__pyx_1) {
      __pyx_filename = __pyx_f[0];
      __pyx_lineno = 10;
      goto __pyx_L1;
    }
    __pyx_3 = PyObject_GetItem(__pyx_v_t, __pyx_1);

    if (!__pyx_3) {
      __pyx_filename = __pyx_f[0];
      __pyx_lineno = 10;
      goto __pyx_L1;
    }
    Py_DECREF(__pyx_1); __pyx_1 = 0;
    Py_DECREF(__pyx_3); __pyx_3 = 0;

This is faster:

    cdef extern from "Python.h":
        void* PyTuple_GET_ITEM(object p, int pos)

    def test2():
        cdef object w
        t = tuple([1,2])
        w = <object> PyTuple_GET_ITEM(t,0)
        return 0

As it gets translated to:

    _4 = (PyObject *)PyTuple_GET_ITEM(t,0);
    Py_INCREF(_4);
    Py_DECREF(w);
    w = _4;
    _4 = 0;

Cython Update: There is still a slight speed difference, but it is almost negligible. The code automatically detects lists/tuples at runtime and uses the PyXxx_GET_ITEM code if it can. This is much safer too, as calling PyTuple_GET_ITEM on a non-tuple or with bad bounds results in a segfault. (Not to mention easier to read.)

integer "for" loops

    from sage.rings.integer_ring import ZZ

    cdef int n
    n = 3000000

    def test0():
      cdef int j,k
      j = 1
      k = 2
      for i in range(n):
        j = k * j**2
      return j

    def test0a():
      for i in range(n):
        pass

    def test1():
      cdef int j,k
      j = 1
      k = 2
      for i from 0 <= i < n:
        j = k * j**2
      return j

    def test1a():
      for i from 0 <= i < n:
        pass

    def test2():
      cdef int i,j,k
      j = 1
      k = 2
      for i from 0 <= i < n:
        j = k * j**2
      return j

    %time s = test0()
    # CPU time: 1.34 s,  Wall time: 3.38 s

    %time s = test0a()
    # CPU time: 0.93 s,  Wall time: 1.99 s

    %time s = test1()
    # CPU time: 0.61 s,  Wall time: 1.26 s

    %time s = test1a()
    # CPU time: 0.15 s,  Wall time: 0.43 s

    %time s = test2()
    # CPU time: 0.44 s,  Wall time: 1.50 s

exception handling is not as bad as you'd think

[ todo ]

If no exception occurs, overhead is really tiny.
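
In the generated C shown earlier on this page, error handling is just a NULL test followed by a goto. As a rough way to check the claim yourself, here is a pure-Python analogue (not Pyrex, so it says nothing about the C-level cost): it times the same loop with and without a try block that never fires. The function names are made up for this sketch.

```python
import timeit

def plain_sum(n):
    # Baseline: no exception handling at all.
    total = 0
    for i in range(n):
        total += i
    return total

def guarded_sum(n):
    # Same loop, but each addition sits inside a try block that never fires.
    total = 0
    for i in range(n):
        try:
            total += i
        except OverflowError:
            total = 0
    return total

def best_time(func, n, repeat=3):
    # Best of several runs, to reduce timing noise.
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeat))

if __name__ == "__main__":
    n = 200000
    print("plain:   %.4f s" % best_time(plain_sum, n))
    print("guarded: %.4f s" % best_time(guarded_sum, n))
```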

a.__add__(b) is worse than a+b in Pyrex speed-wise

[ todo ] but headline says it all
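
Until this section is written, the headline can at least be tried out in plain Python, where a similar effect exists: a + b dispatches straight to the type's addition slot, while a.__add__(b) first looks up an attribute and then makes an ordinary method call. This is a pure-Python sketch with made-up function names, not a Pyrex measurement.

```python
import timeit

def add_with_operator(a, b, n):
    result = None
    for _ in range(n):
        result = a + b           # operator syntax
    return result

def add_with_method(a, b, n):
    result = None
    for _ in range(n):
        result = a.__add__(b)    # explicit attribute lookup plus call
    return result

if __name__ == "__main__":
    n = 200000
    for func in (add_with_operator, add_with_method):
        seconds = min(timeit.repeat(lambda: func(3, 4, n), number=1, repeat=3))
        print("%-18s %.4f s" % (func.__name__, seconds))
```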

Offtopic: Malloc Replacements

Malloc Replacements

-  ⇤ ← Revision 1 as of 2006-09-30 03:40:47 → 
  Size: 1735
  Editor: DavidHarvey
  Comment: started page
+   ← Revision 20 as of 2009-05-04 19:36:24 → ⇥
  Size: 11436
  Editor: robertwb
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-[ this page is under construction ]
+[ this page is permanently under construction ]

<<TableOfContents>>

= Prologue =

With the recent improvements in Cython/SageX, many of these suggestions are handled implicitly and give very little benefit at a large cost in code readability.
-Line 7:
+Line 13:
-Pyrex is a very unusual language. If you write Pyrex as if it were Python, it can end up running as slowly as Python. If you write it like you're writing C, it can run almost as fast as pure C. The amazing thing about Pyrex is that the programmer gets to choose, pretty much line by line, where along the Python-C spectrum they want to work.

HOWEVER... it is ''hard work'' to make your Pyrex as fast as C. It's very easy to get it wrong, essentially ''because'' Pyrex makes it all look so easy. This purpose of this document is to collect together the experiences of SAGE developers who have learned the hard way.

=== How to contribute to this document ===
+Pyrex is a very unusual language. If you write Pyrex as if it were Python, it can end up running ''more'' slowly than Python. If you write it like you're writing C, it can run almost as fast as pure C. The amazing thing about Pyrex is that the programmer gets to choose, pretty much line by line, where along the Python-C spectrum they want to work.

HOWEVER... it is ''hard work'' to make your Pyrex as fast as C. It's very easy to get it wrong, essentially ''because'' Pyrex makes it all look so easy. The purpose of this document is to collect together the experiences of SAGE developers who have learned the hard way. ['''Cython''' does a lot of this hard work for you.]

Apart from the stuff on this page, by far the best way to learn how to make Pyrex code faster is to '''study the C code that Pyrex generates'''.

== How to contribute to this document ==
-Line 15:
+Line 23:
- * When you give an example, make sure to constrast a ''fast'' way of doing things with a ''slow'' way, especially if the slow way is more obvious. Show both versions of the Pyrex code, and show the generated C code as well (if you think that this is useful to see).
+ * When you give an example, make sure to contrast a ''fast'' way of doing things with a ''slow'' way, especially if the slow way is more obvious. Show both versions of the Pyrex code, and show the generated C code as well, if you think that this is useful to see. (Remove the irrelevant parts of the C code, because it can get pretty verbose :-).)
-Line 18:
+Line 26:
-= Examples =

== cdef functions vs def functions ==
+ * William Stein is writing a chapter for the programming guide: See [[http://sage.math.washington.edu/sage/pyrex_guide]]. 

= cdef functions vs def functions =

A {{{cdef}}} function in Pyrex is basically a C function, so calling it is basically a few lines of assembly language. A {{{def}}} function on the other hand is a ''Python function'', and incurs the following overhead:

 * The caller needs to construct a tuple object that holds the arguments.
 * The call itself has to go via the Python API.
 * The function being called needs to call the Python API to decode the tuple.

Additionally, in most cases:

 * The caller has to do a name lookup to find the function. (But you can cache this if you're calling the same function many times.)

All of this overhead is incurred '''whether you are calling from Python, or from Pyrex, or from Mars.'''

== Example ==

Here's the Pyrex code:

{{{#!python
cdef class X:
    def def_func(X self):
        pass

    cdef cdef_func(X self):
        pass

def call_def_func(X x):
    cdef int i
    for i from 0 <= i < 10000000:
        x.def_func()

def call_cdef_func(X x):
    cdef int i
    for i from 0 <= i < 10000000:
        x.cdef_func()
}}}

Performance data:

{{{#!python
sage: x = X()

sage: time call_def_func(x)
CPU times: user 1.82 s, sys: 0.00 s, total: 1.82 s
Wall time: 1.82

sage: time call_cdef_func(x)
CPU times: user 0.08 s, sys: 0.00 s, total: 0.08 s
Wall time: 0.08
}}}

Pretty striking difference. (And by the way, the second one goes twice as fast again if you declare a {{{void}}} return type.)

Here's the C code for {{{def_func}}}; note the {{{PyArg_ParseTupleAndKeywords}}} API call:

{{{#!cplusplus
static PyObject *__pyx_f_7integer_1X_def_func(PyObject *__pyx_v_self, PyObject *__pyx_args, PyObject *__pyx_kwds) {
  PyObject *__pyx_r;
  static char *__pyx_argnames[] = {0};
  if (!PyArg_ParseTupleAndKeywords(__pyx_args, __pyx_kwds, "", __pyx_argnames)) return 0;
  Py_INCREF(__pyx_v_self);

  __pyx_r = Py_None; Py_INCREF(Py_None);
  Py_DECREF(__pyx_v_self);
  return __pyx_r;
}
}}}

In contrast, here's the C code for {{{cdef_func}}}:

{{{#!cplusplus
static PyObject *__pyx_f_7integer_1X_cdef_func(struct __pyx_obj_7integer_X *__pyx_v_self) {
  PyObject *__pyx_r;
  Py_INCREF(__pyx_v_self);

  __pyx_r = Py_None; Py_INCREF(Py_None);
  Py_DECREF(__pyx_v_self);
  return __pyx_r;
}
}}}

Now here are the two versions of the loop that actually does the calling. First, {{{call_def_func}}}, with error handling suppressed:

{{{#!cplusplus
  for (__pyx_v_i = 0; __pyx_v_i < 10000000; ++__pyx_v_i) {
    __pyx_1 = PyObject_GetAttr(((PyObject *)__pyx_v_x), __pyx_n_def_func);
    __pyx_2 = PyTuple_New(0);
    __pyx_3 = PyObject_CallObject(__pyx_1, __pyx_2);
    Py_DECREF(__pyx_1); __pyx_1 = 0;
    Py_DECREF(__pyx_2); __pyx_2 = 0;
    Py_DECREF(__pyx_3); __pyx_3 = 0;
  }
}}}

Notice that it needs to (a) look up the name {{{def_func}}}, (b) construct a tuple, and (c) call the Python API to make the function call.

Here's the much much slicker {{{call_cdef_func}}} version:

{{{#!cplusplus
  for (__pyx_v_i = 0; __pyx_v_i < 10000000; ++__pyx_v_i) {
    __pyx_1 = ((struct __pyx_vtabstruct_7integer_X *)__pyx_v_x->__pyx_vtab)->cdef_func(__pyx_v_x);
    Py_DECREF(__pyx_1); __pyx_1 = 0;
  }
}}}

= python attributes vs cdef attributes =
-Line 25:
+Line 137:
- * function call overhead -- constructing tuples
 * parseTuple stuff inside the def function

== python attributes vs cdef attributes ==

[ todo ]

== avoid python name lookups ==
+= avoid python name lookups =
-Line 39:
+Line 144:
-== type checking ==
+= Type checking =

Generally speaking, using {{{isinstance}}} is very slow. It's slow because it needs to cover many cases, and it's slow because it's a ''Python function''. If you are checking for a particular Pyrex-generated type (or indeed any extension type), it's much faster to use the Python API function {{{PyObject_TypeCheck}}}. In fact, {{{PyObject_TypeCheck}}} is a macro, so there isn't even any C function call overhead. The prototype is declared in {{{cdefs.pxi}}}.

Example Pyrex code:

{{{#!python
cdef class X:
    pass

def test1(x):
    cdef int i
    for i from 0 <= i < 5000000:
        if isinstance(x, X):
            pass

def test2(x):
    cdef int i
    for i from 0 <= i < 5000000:
        if PyObject_TypeCheck(x, X):
            pass
}}}

Here's the performance comparison:

{{{#!python
sage: time test1(47)
CPU times: user 3.89 s, sys: 0.00 s, total: 3.89 s
Wall time: 3.89

sage: time test2(47)
CPU times: user 0.16 s, sys: 0.00 s, total: 0.16 s
Wall time: 0.16
}}}

(Note: there is also some overhead in the name lookup for {{{isinstance}}}, but in this case it's a fairly small part of the time, about 10% or so.)

Here's the C code for the first loop (error checking suppressed):

{{{#!cplusplus
  for (__pyx_v_i = 0; __pyx_v_i < 5000000; ++__pyx_v_i) {
    __pyx_1 = __Pyx_GetName(__pyx_b, __pyx_n_isinstance);
    __pyx_2 = PyTuple_New(2);
    Py_INCREF(__pyx_v_x);
    PyTuple_SET_ITEM(__pyx_2, 0, __pyx_v_x);
    Py_INCREF(((PyObject*)__pyx_ptype_7integer_X));
    PyTuple_SET_ITEM(__pyx_2, 1, ((PyObject*)__pyx_ptype_7integer_X));
    __pyx_3 = PyObject_CallObject(__pyx_1, __pyx_2);
    Py_DECREF(__pyx_1); __pyx_1 = 0;
    Py_DECREF(__pyx_2); __pyx_2 = 0;
    __pyx_4 = PyObject_IsTrue(__pyx_3);
    Py_DECREF(__pyx_3); __pyx_3 = 0;
    if (__pyx_4) {
      goto __pyx_L4;
    }
    __pyx_L4:;
  }
}}}

And here's the much tighter code for the second loop:

{{{#!cplusplus
  for (__pyx_v_i = 0; __pyx_v_i < 5000000; ++__pyx_v_i) {
    __pyx_1 = PyObject_TypeCheck(__pyx_v_x,((PyObject*)__pyx_ptype_7integer_X));
    if (__pyx_1) {
      goto __pyx_L4;
    }
    __pyx_L4:;
  }
}}}

= "==" vs "is" =

"a is b" checks whether a and b are the same object, while "a == b" returns the result of either {{{a.__class__.__richcmp__(a,b,2)}}} or {{{b.__class__.__richcmp__(b,a,2)}}}, depending on whether a's class can handle comparison with b. (The 2 is the operator code for ==.)

Identity checks are used quite often, for example to check whether an attribute of a class has already been computed, like this:

{{{#!python
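cdef class MyClass:
    cdef object __cached_result    # cached result, None until computed
    ...
    def calculate_result(MyClass self):
        if self.__cached_result is not None:
            return self.__cached_result
        else:
            # _compute_result is a hypothetical helper that does the real work;
            # calling calculate_result() here would recurse forever
            self.__cached_result = self._compute_result()
            return self.__cached_result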
}}}

The default value of any object attribute of a cdef class is None, so the attribute stays None until the result has been computed. There is only one None object in Python, so it's safe to check for identity rather than equality. Consider these timing examples.

{{{#!python
from sage.rings.integer_ring import ZZ

n = 600000

def test1():
  one = ZZ(1)
  for i in range(n):
    one is None


def test2():
  one = ZZ(1)
  for i in range(n):
    one == None
}}}

Now consider the times these functions take:
{{{#!python
%time test1()
# CPU time: 0.18 s,  Wall time: 0.40 s

%time test2()
# CPU time: 14.85 s,  Wall time: 44.80 s
}}}

So it's way faster to test for identity with None than to compare for equality with None.

This is also true for pure Python.

= tuple, list, and dict access =
Pyrex plays it safe when it comes to list, tuple, and dict access:
{{{#!python
def test():
    t = tuple([1,2])
    t[0]
}}}
{{{t[0]}}} gets translated to:

{{{#!cplusplus

__pyx_1 = PyInt_FromLong(0); 

if (!__pyx_1) {
  __pyx_filename = __pyx_f[0]; 
  __pyx_lineno = 10; 
  goto __pyx_L1;
}
__pyx_3 = PyObject_GetItem(__pyx_v_t, __pyx_1); 

if (!__pyx_3) {
  __pyx_filename = __pyx_f[0];
  __pyx_lineno = 10;
  goto __pyx_L1;
}
Py_DECREF(__pyx_1); __pyx_1 = 0;
Py_DECREF(__pyx_3); __pyx_3 = 0;
}}}

This is faster:
{{{#!python
 cdef extern from "Python.h":
      void* PyTuple_GET_ITEM(object p, int pos)

def test2():
     cdef object w
     t = tuple([1,2])
     w = <object> PyTuple_GET_ITEM(t,0)
     return 0
}}}
As it gets translated to:
{{{#!cplusplus
_4 = (PyObject *)PyTuple_GET_ITEM(t,0);
Py_INCREF(_4);
Py_DECREF(w);
w = _4;
_4 = 0;
}}}

'''Cython Update:''' There is still a slight speed difference, but it is almost negligible. The code automatically detects lists/tuples at runtime and uses the PyXxx_GET_ITEM code if it can. This is much safer too, as calling PyTuple_GET_ITEM on a non-tuple or with bad bounds results in a segfault. (Not to mention easier to read.)

= integer "for" loops =
{{{#!python
from sage.rings.integer_ring import ZZ

cdef int n
n = 3000000

def test0():
  cdef int j,k
  j = 1
  k = 2
  for i in range(n):
    j = k * j**2
  return j

def test0a():
  for i in range(n):
    pass

def test1():
  cdef int j,k
  j = 1
  k = 2
  for i from 0 <= i < n:
    j = k * j**2
  return j

def test1a():
  for i from 0 <= i < n:
    pass

def test2():
  cdef int i,j,k
  j = 1
  k = 2
  for i from 0 <= i < n:
    j = k * j**2
  return j
}}}

{{{#!python
%time s = test0()
# CPU time: 1.34 s,  Wall time: 3.38 s

%time s = test0a()
# CPU time: 0.93 s,  Wall time: 1.99 s

%time s = test1()
# CPU time: 0.61 s,  Wall time: 1.26 s

%time s = test1a()
# CPU time: 0.15 s,  Wall time: 0.43 s

%time s = test2()
# CPU time: 0.44 s,  Wall time: 1.50 s
}}}

= exception handling is not as bad as you'd think =
-Line 43:
+Line 377:
- * PyObject_CheckType vs isinstance
+ * If no exception occurs, overhead is really tiny. 

= a.__add__(b) is worse than a+b in Pyrex speed-wise =
[ todo ] but headline says it all

= Offtopic: Malloc Replacements =
[[MallocReplacements|Malloc Replacements]]

Diff for "WritingFastPyrexCode"
