HotPy

Friday, 17 June 2011

Simplifying Type and Frame Objects

Having made binary operations faster for the common case and simplified builtin-functions, it is possible to simplify the implementation of the type class.

Conceptually, for modern of versions of Python (2.4 onwards) a type object is little more than a wrapper around its dictionary.
However, the CPython implementation still has many aspects it inherited from earlier versions of Python. It has a large number of internal "slots" which are C functions of differing types, that sometimes don't quite match the apparent type of the Python objects they are supposed to represent.

For example, each class has a sq_item slot which, supports item lookup on sequences, provided that the index can be passed as a C integer. However, the Python [ ] operator can take arbitrary objects as indices, making the slot useless.

These slots can be replaced with a smaller, regular array which will hold regular Python objects.

The original slots must be retained, so that third party extensions can define types. It is just that they will not be used internally by the VM. Each original slot will be used to create a builtin-function.

By replacing the old slots with a more regular alternative, the code for updating and modifying these slots can be much simplified, making it more amenable to subsequent optimisation.

The frame class is another class whose objects can be slimmed down. Frames are allocated and deallocated at a very high rate, which puts a lot of pressure on the garbage collection subsystem.
CPython maintains a "zombie" frame for each function, so that frames do not have to be created every time a function is called (unless it is called recursively). This creates problems of its own, as all the local variables have to be set to NULL when the function returns.

A CPython frame object (on a 32 bit machine) occupies 326 bytes, plus 4 bytes per local variable. A frame object in HotPy takes 44 bytes, plus 4 per local variable. Why is the CPython frame so enormous?
The largest part of the CPython frame is the block stack, which is implemented as a fixed size array, which takes up 244 bytes. Reimplementing it as linked list of objects would reduce the amount of memory used by a large amount. The remaining overhead is the evaluation stack and exception state which can be moved to the thead-state, and the scopes (builtins, globals and locals) which could be moved to a separate object, retained by the owning function.

Once the CPython frame has been shunk to a reasonable size, it can be allocated and deallocated like any other object.

Thursday, 16 June 2011

Enhancing Builtin Functions.

A Builtin Function is a function written in a low-level programming langauge.
In CPython, they are written in C. In Jython, its Java, IronPython uses C# and PyPy uses RPython.

Currently, in CPython, builtin functions come in four different types:

>>> type(list.append)
<class 'method_descriptor'>
>>> type(list.__add__)
<class 'wrapper_descriptor'>
>>> type(list.__new__)
<class 'builtin_function_or_method'>
>>> type(int.__add__)
<class 'wrapper_descriptor'>

The third one of these, 'builtin_function_or_method', doesn't seem to know what whether it is a function or a method.

Only two types are actually required: builtin-functions and builtin-methods.
The difference between the two is how they act when used as a descriptor.
Builtin-methods act like Python functions, in that they return a bound-method when used a descriptor.
Builtin-functions are not descriptors and are not bound to class instances.

>>> class MyList(list): pass

>>> MyList.m0 = print
>>> MyList.m1 = list.append
>>> l = MyList()

>>> l.m0
<built-in function print>
>>> l.m1
<built-in method append of MyList object at 0xb7ae6a04

>>> MyList.m0 == l.m0
True
>>> MyList.m1 == l.m1
False

There is no requirement for l.m1 to be a "built-in method", a normal bound-method would be fine.
Bizarrely, in CPython the type of the bound builtin method, l.m1 is the same as the type of the builtin function print.

>>> type(l.m1) == type(print)
True

First of all these types need to be rationalised a bit:

Rename method_descriptor to builtin_method.
Rename builtin_function_or_method to builtin_function.
When a builtin_method is bound to a class instance it should produce a (bound) method not a builtin_function_or_method.

In order to optimise calls to builtin function, we need to know something about them. Currently a builtin function can take a limited range of parameter formats.
The allowed formats are:

f(self)
f(self, other)
f(self, *pos, **kws)

The type of its parameters cannot be specified.
For example, the __add__ method of int will fail if its first parameter is not an int, but this information is not available to the VM or the programmer.

I propose allowing a wider range of parameter formats and to specify the allowed types. Any number of parameters between 0 and 3 (maybe 4) should be allowed, with or without * parameters or ** keyword parameters.

For example the int.__add__ builtin_method would take 2 parameters,
with the parameter types (int, object).
The list.__setitem__ builtin_method would take 3 parameters, with the parameter types (list, object, object).
The print builtin_function would take 0 parameters plus * and ** parameters: print(*args, **kws).

The wrapper_descriptor class can then be deleted as builtin_method can
fulfil its role.

In summary, this change would allow all builtin-functions to have a consistent interface to the VM, which would assist optimisation. It would also reduce code size by removing a number of classes.

Improving the Garbage Collector

It is often stated that "CPython does not have Garbage Collection".
Of course it does, otherwise Python programs would die from lack of memory almost before they started.
What people mean is that CPython does not have a very good garbage collection system.

So what's wrong with the system that does CPython have?
"Its reference counting, and reference counting is slow" is the standard answer.
But that's not all there is to it.

It is possible to make a high-performance GC using Reference Counting:
http://www.cs.umass.edu/~steveb/pubs/papers/urc-oopsla-2003.pdf
but not with the sort of simple reference counting that CPython uses.

So what exactly is wrong with Python's memory management system?

First, it maintains reference counts for references from the stack, as well as from global variables and object heap objects. This is very expensive, constantly incrementing and then decrementing the reference counts for short-lived objects.

The second problem is that the garbage collection algorithm determines how an object can be allocated. Fast allocation is impossible with a reference counted GC (this is also true for a mark & sweep collector).

Third of all, its non-generational.A generational collector divides all the objects on the heap into young and old objects and treats them differently. The young objects are collected frequently whereas the older objects are collected infrequently.
Generational collectors are generally faster.

The fourth, and final, problem is that reference counting is not complete,
meaning that it does not collect all garbage. It must be enhanced with a cycle collector. The CPython cycle collector has quite a high overhead, both in terms of time and memory used.

So how to change the Garbage Collector?
First we need to define an interface between the memory management sub-system and the rest of the VM.
The current interface is superficially simple: the rest of the system modifies the ref-counts for objects, when the ref-count drops to zero, the garbage collector frees the object.

Unfortunately its not that simple.
Objects in Python can be finalised and it is possible to have weakrefs to an object.

So once an object has zero ref-count its weakrefs must be set to None, possibly calling call-backs. Once this is done the finaliser __del__ should be called. In order to pass a reference to an object to a finaliser then there must be a reference to it, in which case it ref-count should be non-zero.
To free an object, there can be no reference to it, so its ref-count should be zero.

CPython muddles this up a bit, it fails to keep finalization and deallocation separate.

So before a new GC can be implemented, finalisation needs sorting out:

So when an object has no references to it do the following
If object has a finalizer or a weakref:
      finalize_list.append(obj)
      while finalize_list:
o = finalize_list.pop()
            process_weakrefs(o)
call_finalizer(o)
Otherwise:
      free_memory(obj)

A final detail is that objects must be finalized in topological order.
(PyPy handles this issue explicitly, it happens implicitly in ref-counted system)
This means that if object a refers to object b then a must be finalized before b. What happens when a and b form a cycle? CPython finalizes neither, PyPy choses an arbitrary order. I think PyPy is correct, but that's just my opinion, it hasn't been formalized.

Once that is done, we can implement a new GC. The new GC needs to provide ref-counting in order to support 3rd party code, but it shouldn't use ref-counting internally.

It is possible to have a mixed mode GC that supports ref-counting and tracing.
This paper explains how the ref-counting and tracing GC can work together:
http://www.research.ibm.com/people/d/dfb/papers/Bacon04Unified.pdf

The proposed implementation would work as follows:
It would be a generational collector.
Objects allocated internally by the VM are allocated using a bump-pointer allocator (the fastest possible allocator) in the nursery.
We will call these objects T-objects (traced-objects).

During a minor collection, live T-objects in the nursery are copied to the mature space.
In order to support generational collection, all T-objects must use a write-barrier when then are modified. A code analysis tool will be required to do this correctly.

Objects allocated by 3rd party code are allocated using a malloc-style allocator. We will call these R-objects (reference-counted objects). R-objects do not have a generation, they are always deallocated in the first minor collection that they become unreachable.

If any T-object has its reference count set, it is added to the T-counted set.
At collection, all objects in the T-counted set are removed from the T-counted set if their ref-count is 0, otherwise they are treated as roots by the garbage collector.

If an R-object has its reference count set to zero, it is added to a R-zero set.
At collection, if there are no traced references to an object in the R-zero set,
then it is freed, taking care to handle finalizers properly.

Collections would work as follows.

Minor Collection:

Scan C stack to find (conservatively) all T-objects that may be referred to (this includes internal pointers), appending to C-list.
Remove all objects from T-counted set that have 0 ref-count.
Trace heap (excluding any mature-space T-objects or any R-objects), from the following roots:

The T-counted set.
The C-list.
Global variables.
Modified objects in the mature space.

Reachable T-objects in the nursery are copied if they have a zero ref-count, otherwise they are promoted inplace.

Major Collection:

Scan C stack to find (conservatively) all T-objects that may be referred to (this includes internal pointers), appending to C-list.
Remove all objects from T-counted set that have 0 ref-count.
Trace heap, but do not trace any R-objects, with the following roots:

The T-counted set.
The C-list.
Global variables

In order to promote inplace, the heap must be divided into pages. This division can be used to rapidly determine if an object is a T-object or a R-object, by marking the page.

Thursday, 14 April 2011

Separating the Python frame stack from the hardware stack

First of all, what do I mean by Python frame stack and hardware stack?

Well, the hardware stack is the stack used by C, C++, etc to store local variables that can't fit in registers, track return addresses and the like.
Strictly, it is the operating system, not the hardware, that determines the stack layout and calling conventions. C uses this stack, and for that reason this stack is also commonly known as the C stack.

Since CPython is written in C, every time CPython makes a call internally (not necessarily corresponding to a Python-level call) then the C stack grows in size.

When your Python program makes a call, the Python VM creates a new a frame object to hold local variables (all Python VMs do this, not just CPython). The frame object is allocated on the heap (not the C stack) and is linked to the previous frame to form a singly-linked list. In fact, the frame object contains not only local variables, but pointers to the currently executing code object and the VM instruction pointer. This means that the stack of frames contains the complete execution state of the current thread.

When a Python-level call is made in CPython, a corresponding call is made in the C code, so that Python stack is tied to the C stack. If a return statement is executed in the Python program then a return statement must be executed in the C code of the CPython VM.

Breaking this direct link between the two stacks, that of the Python program and that of the hardware, would allow a great deal of freedom in implementing the VM.
Once the stacks are disconnected, the Python frame stack can be freely passed around and manipulated as data, like any other object. This allows interesting and useful approaches to executing the Python program:

Arbitrary chunks of code can be executed, not just whole functions.
Sequences of bytecodes that cross function call boundaries can be optimized, massively reducing the overhead of function calls.
Sequences of bytecodes can be dynamically replaced with optimized versions or even machine code.
Execution of code can be transferred from unoptimized code to optimized code (and vice versa) at arbitrary points in the bytecode.

There are also two beneficial side-effects of this transformation.

Stack overflow terminates gracefully with an exception, rather than crashing the interpreter.
Coroutines (aka Greenlets) can be implemented in a portable fashion.

Possible CPython Implementation

Currently CPython (ignoring, for now, shortcuts in the interpreter for calling function objects) calls objects via the PyObject_Call function, which delegates to the tp_call slot in the type of the callable object:

PyObject_Call(callable, args, kws) 

is equivalent to (ignoring error handling)
Py_TYPE(callable)->tp_call(callable, args, kws)

By adding a new slot, tp_prepare_call, the interpreter, in the CALL_FUNCTION bytecode, could call
Py_TYPE(callable)->tp_prepare_call(threadstate, callable, args, kws)
before continuing execution.
In order to reduce the number of empty dicts required, NULL can represent an empty dict for kws.
Callable objects implemented in C would not provide tp_prepare_call, but function objects would. The tp_prepare_call of the function type would create a new frame, initializing it from args and kws push it to the frame stack, and set the instruction pointer to the first bytecode of the called function.

Bound methods should also use the tp_prepare_call slot. A little trickery is required, the tp_prepare_call of a bound method would do the following:

Push callable->method to the evaluation stack
Push (callable->self,) + args to the evaluation stack
Push kws to the evaluation stack
Set the instruction back to the original call :)

The interpreter would need to record which frame it was first called with, so that when executing a RETURN bytecode, it would either pop the frame (if current_frame != entry_frame) or perform a return at the C level (if current_frame = entry_frame).

The FOR_ITER and YIELD bytecodes will also need special treatment when dealing with generator objects.

Wednesday, 13 April 2011

Faster Binary Operators

Before moving on to the more complex modifications, we will start with an optimization that will speed up almost all code, and requires only small changes to the code base.

Python has a unique way of implementing binary operators, which works as follows; using the addition a + b as the example:

Call the special method __add__ on the left hand operator a.__add__(b)
If that fails, either because a does not have an __add__ method or because a.__add__(b) returns NotImplemented, then:
Call the special method __radd__ on the right hand operator b.__radd__(a).

There is a special case where the type of b is a subtype of the type of a, in which case the __radd__ method is tried first.

The optimization we wish to implement relies on the fact that built-in types such as int and list cannot be monkey-patched (altered at runtime) and that they are commonly used in binary expressions.

Binary operators can be sped up by attaching an index to all types.
Common built-in types are given an index between 1 to 15 inclusive, all other types have index 0.
Each binary operator consists of a table of 256 functions, indexed from 0 to 255. A binary expression is evaluated by generating an index from the types of a and b: indexof(a) * 16 + indexof(b). If a function is present in the table at that location then it is called, if not then the standard implementation is used. For example, suppose that int has an index of 3 and float has an index of 5, then the expression 1 + 1.0 will be resolved by calculating
indexof(int) * 16 + indexof(float) = 3 * 16 + 5 = 53.
A pointer to the function for adding ints to floats is stored at index 53, and the addition is performed with no additional look-up.

This will work perfectly correctly if all entries are NULL, functions can then be added and tested one at a time. Operator functions do not need to perform type checking, since they can only be accessed through the table.

Although the speed up will not be large, it will be measurable and has two other benefits.
First, it makes specializing (modification number 14) binary operators easier and faster.
Second, it prevents degradation of the performance of unspecialized binary operators that might occur due to enhanced builtin-functions (modification number 4).

Possible CPython Implementation

The type index could be stored in the tp_flags field of the type object, using bits 0 to 3. This would mean that subtype flags would need to be masked when inheriting, to prevent subtypes of built-in types from having non-zero indices.

Function look up can be performed with the expression:
operator_table[((Py_TYPE(a)->tp_flags&15)<<4)+(Py_TYPE(b)->tp_flags&15)]

Rather than storing NULL in unused slots, the original function, eg. PyNumber_Add, can be stored instead.

The following types are probably worth giving no-zero indices:

int
float
complex
bool
list
dict
set
frozenset
tuple
str
bytes
bytearray

Leaving three spare.

Thursday, 7 April 2011

Introduction

Welcome to the HotPy blog.

The original HotPy virtual machine was a VM for the Python language,
which I developed as part of my PhD. It can be found here.

It is my intention in this blog to explain how the techniques and
optimizations that I developed for the HotPy VM can be applied to
the CPython virtual machine.

I believe that, eventually, CPython can be made as fast or faster than PyPy without any changes to the Python language or the API. Of course PyPy will get faster as well, but the big improvements in PyPy have already been made (or maybe the PyPy people have something big up their sleeves).

Quite a few changes are required, but these can be done incrementally.
Each change should result in a fully working and testable VM.
The changes to be made are listed below, roughly in order of proposed implementation:

Faster binary operators, using table lookup for common type pairings.
Separate the VM frame stack from the hardware stack.
Implement a "proper" (tracing rather than reference counting) garbage collector.
Enhance builtin-function, incorporating various slot wrappers, to handle a wider range of C functions. This will allow better optimization of calling C functions from Python.
Refactor some standard types, notably type and frame to better match execution model.
Add lower-level bytecodes. This will allow complex bytecodes to be expressed in terms of a core of simpler bytecodes.
Centralize management of execution in a execution supervisor. Supervisor will delegate execution to the interpreter or compiled code.
Implement tagged integers. Tagged integers are widely used in VMs for dynamic languages. Tagged integers will improve several aspects of performance.
Add extra bytecodes for tracing; guard and exit instructions.
Implement Tracing Interpreter. Tracing interpreter records bytecodes as they are executed, producing bytecode sequences that represent the dynamic structure of the program. Although this amke little difference to performance, it is the key to following optimizations.
Add specialised bytecodes for int and float operations, fast member access and fast C function calls.
Implement guard mechanism, for deoptimisation of traces.
Modify dict to internals to allow sharing of keys and adding guards.
Implement Specialiser. This is the heart of the optimisations. It will generate significant speedups and unlocks the following optimizations.
Implement Escape Analysis. This works by lazily deferring the instantiation of objects, many of which are never required. This also generates notable speedups.
Implement JIT Compilation. JIT compilation is assumed to be complex, due to the number of optimisations involved. But by this point the interesting optimisations have been done. Only stack erasure and register allocation is required. This can be done comparitively easily using LLVM, libJIT or similar.

Most of these changes depend, to varying degrees, on previous changes.
Here is a diagram showing the dependencies.

In upcoming posts I will explain the rationale for each of these of changes and describe in detail a possible implementation.