HotPy: June 2011

Friday 17 June 2011

Simplifying Type and Frame Objects

Having made binary operations faster for the common case and simplified builtin-functions, it is possible to simplify the implementation of the type class.

Conceptually, for modern of versions of Python (2.4 onwards) a type object is little more than a wrapper around its dictionary.
However, the CPython implementation still has many aspects it inherited from earlier versions of Python. It has a large number of internal "slots" which are C functions of differing types, that sometimes don't quite match the apparent type of the Python objects they are supposed to represent.

For example, each class has a sq_item slot which, supports item lookup on sequences, provided that the index can be passed as a C integer. However, the Python [ ] operator can take arbitrary objects as indices, making the slot useless.

These slots can be replaced with a smaller, regular array which will hold regular Python objects.

The original slots must be retained, so that third party extensions can define types. It is just that they will not be used internally by the VM. Each original slot will be used to create a builtin-function.

By replacing the old slots with a more regular alternative, the code for updating and modifying these slots can be much simplified, making it more amenable to subsequent optimisation.

The frame class is another class whose objects can be slimmed down. Frames are allocated and deallocated at a very high rate, which puts a lot of pressure on the garbage collection subsystem.
CPython maintains a "zombie" frame for each function, so that frames do not have to be created every time a function is called (unless it is called recursively). This creates problems of its own, as all the local variables have to be set to NULL when the function returns.

A CPython frame object (on a 32 bit machine) occupies 326 bytes, plus 4 bytes per local variable. A frame object in HotPy takes 44 bytes, plus 4 per local variable. Why is the CPython frame so enormous?
The largest part of the CPython frame is the block stack, which is implemented as a fixed size array, which takes up 244 bytes. Reimplementing it as linked list of objects would reduce the amount of memory used by a large amount. The remaining overhead is the evaluation stack and exception state which can be moved to the thead-state, and the scopes (builtins, globals and locals) which could be moved to a separate object, retained by the owning function.

Once the CPython frame has been shunk to a reasonable size, it can be allocated and deallocated like any other object.

Thursday 16 June 2011

Enhancing Builtin Functions.

A Builtin Function is a function written in a low-level programming langauge.
In CPython, they are written in C. In Jython, its Java, IronPython uses C# and PyPy uses RPython.

Currently, in CPython, builtin functions come in four different types:

>>> type(list.append)
<class 'method_descriptor'>
>>> type(list.__add__)
<class 'wrapper_descriptor'>
>>> type(list.__new__)
<class 'builtin_function_or_method'>
>>> type(int.__add__)
<class 'wrapper_descriptor'>

The third one of these, 'builtin_function_or_method', doesn't seem to know what whether it is a function or a method.

Only two types are actually required: builtin-functions and builtin-methods.
The difference between the two is how they act when used as a descriptor.
Builtin-methods act like Python functions, in that they return a bound-method when used a descriptor.
Builtin-functions are not descriptors and are not bound to class instances.

>>> class MyList(list): pass

>>> MyList.m0 = print
>>> MyList.m1 = list.append
>>> l = MyList()

>>> l.m0
<built-in function print>
>>> l.m1
<built-in method append of MyList object at 0xb7ae6a04

>>> MyList.m0 == l.m0
True
>>> MyList.m1 == l.m1
False

There is no requirement for l.m1 to be a "built-in method", a normal bound-method would be fine.
Bizarrely, in CPython the type of the bound builtin method, l.m1 is the same as the type of the builtin function print.

>>> type(l.m1) == type(print)
True

First of all these types need to be rationalised a bit:

Rename method_descriptor to builtin_method.
Rename builtin_function_or_method to builtin_function.
When a builtin_method is bound to a class instance it should produce a (bound) method not a builtin_function_or_method.

In order to optimise calls to builtin function, we need to know something about them. Currently a builtin function can take a limited range of parameter formats.
The allowed formats are:

f(self)
f(self, other)
f(self, *pos, **kws)

The type of its parameters cannot be specified.
For example, the __add__ method of int will fail if its first parameter is not an int, but this information is not available to the VM or the programmer.

I propose allowing a wider range of parameter formats and to specify the allowed types. Any number of parameters between 0 and 3 (maybe 4) should be allowed, with or without * parameters or ** keyword parameters.

For example the int.__add__ builtin_method would take 2 parameters,
with the parameter types (int, object).
The list.__setitem__ builtin_method would take 3 parameters, with the parameter types (list, object, object).
The print builtin_function would take 0 parameters plus * and ** parameters: print(*args, **kws).

The wrapper_descriptor class can then be deleted as builtin_method can
fulfil its role.

In summary, this change would allow all builtin-functions to have a consistent interface to the VM, which would assist optimisation. It would also reduce code size by removing a number of classes.

Improving the Garbage Collector

It is often stated that "CPython does not have Garbage Collection".
Of course it does, otherwise Python programs would die from lack of memory almost before they started.
What people mean is that CPython does not have a very good garbage collection system.

So what's wrong with the system that does CPython have?
"Its reference counting, and reference counting is slow" is the standard answer.
But that's not all there is to it.

It is possible to make a high-performance GC using Reference Counting:
http://www.cs.umass.edu/~steveb/pubs/papers/urc-oopsla-2003.pdf
but not with the sort of simple reference counting that CPython uses.

So what exactly is wrong with Python's memory management system?

First, it maintains reference counts for references from the stack, as well as from global variables and object heap objects. This is very expensive, constantly incrementing and then decrementing the reference counts for short-lived objects.

The second problem is that the garbage collection algorithm determines how an object can be allocated. Fast allocation is impossible with a reference counted GC (this is also true for a mark & sweep collector).

Third of all, its non-generational.A generational collector divides all the objects on the heap into young and old objects and treats them differently. The young objects are collected frequently whereas the older objects are collected infrequently.
Generational collectors are generally faster.

The fourth, and final, problem is that reference counting is not complete,
meaning that it does not collect all garbage. It must be enhanced with a cycle collector. The CPython cycle collector has quite a high overhead, both in terms of time and memory used.

So how to change the Garbage Collector?
First we need to define an interface between the memory management sub-system and the rest of the VM.
The current interface is superficially simple: the rest of the system modifies the ref-counts for objects, when the ref-count drops to zero, the garbage collector frees the object.

Unfortunately its not that simple.
Objects in Python can be finalised and it is possible to have weakrefs to an object.

So once an object has zero ref-count its weakrefs must be set to None, possibly calling call-backs. Once this is done the finaliser __del__ should be called. In order to pass a reference to an object to a finaliser then there must be a reference to it, in which case it ref-count should be non-zero.
To free an object, there can be no reference to it, so its ref-count should be zero.

CPython muddles this up a bit, it fails to keep finalization and deallocation separate.

So before a new GC can be implemented, finalisation needs sorting out:

So when an object has no references to it do the following
If object has a finalizer or a weakref:
      finalize_list.append(obj)
      while finalize_list:
o = finalize_list.pop()
            process_weakrefs(o)
call_finalizer(o)
Otherwise:
      free_memory(obj)

A final detail is that objects must be finalized in topological order.
(PyPy handles this issue explicitly, it happens implicitly in ref-counted system)
This means that if object a refers to object b then a must be finalized before b. What happens when a and b form a cycle? CPython finalizes neither, PyPy choses an arbitrary order. I think PyPy is correct, but that's just my opinion, it hasn't been formalized.

Once that is done, we can implement a new GC. The new GC needs to provide ref-counting in order to support 3rd party code, but it shouldn't use ref-counting internally.

It is possible to have a mixed mode GC that supports ref-counting and tracing.
This paper explains how the ref-counting and tracing GC can work together:
http://www.research.ibm.com/people/d/dfb/papers/Bacon04Unified.pdf

The proposed implementation would work as follows:
It would be a generational collector.
Objects allocated internally by the VM are allocated using a bump-pointer allocator (the fastest possible allocator) in the nursery.
We will call these objects T-objects (traced-objects).

During a minor collection, live T-objects in the nursery are copied to the mature space.
In order to support generational collection, all T-objects must use a write-barrier when then are modified. A code analysis tool will be required to do this correctly.

Objects allocated by 3rd party code are allocated using a malloc-style allocator. We will call these R-objects (reference-counted objects). R-objects do not have a generation, they are always deallocated in the first minor collection that they become unreachable.

If any T-object has its reference count set, it is added to the T-counted set.
At collection, all objects in the T-counted set are removed from the T-counted set if their ref-count is 0, otherwise they are treated as roots by the garbage collector.

If an R-object has its reference count set to zero, it is added to a R-zero set.
At collection, if there are no traced references to an object in the R-zero set,
then it is freed, taking care to handle finalizers properly.

Collections would work as follows.

Minor Collection:

Scan C stack to find (conservatively) all T-objects that may be referred to (this includes internal pointers), appending to C-list.
Remove all objects from T-counted set that have 0 ref-count.
Trace heap (excluding any mature-space T-objects or any R-objects), from the following roots:

The T-counted set.
The C-list.
Global variables.
Modified objects in the mature space.

Reachable T-objects in the nursery are copied if they have a zero ref-count, otherwise they are promoted inplace.

Major Collection:

Scan C stack to find (conservatively) all T-objects that may be referred to (this includes internal pointers), appending to C-list.
Remove all objects from T-counted set that have 0 ref-count.
Trace heap, but do not trace any R-objects, with the following roots:

The T-counted set.
The C-list.
Global variables

In order to promote inplace, the heap must be divided into pages. This division can be used to rapidly determine if an object is a T-object or a R-object, by marking the page.