Thursday, 14 April 2011

Separating the Python frame stack from the hardware stack

First of all, what do I mean by Python frame stack and hardware stack?

Well, the hardware stack is the stack used by C, C++, etc to store local variables that can't fit in registers, track return addresses and the like.
Strictly, it is the operating system, not the hardware, that determines the stack layout and calling conventions. C uses this stack, and for that reason this stack is also commonly known as the C stack.

Since CPython is written in C, every time CPython makes a call internally (not necessarily corresponding to a Python-level call) then the C stack grows in size.

When your Python program makes a call, the Python VM creates a new a frame object to hold local variables (all Python VMs do this, not just CPython). The frame object is allocated on the heap (not the C stack) and is linked to the previous frame to form a singly-linked list. In fact, the frame object contains not only local variables, but pointers to the currently executing code object and the VM instruction pointer. This means that the stack of frames contains the complete execution state of the current thread.

When a Python-level call is made in CPython, a corresponding call is made in the C code, so that Python stack is tied to the C stack. If a return statement is executed in the Python program then a return statement must be executed in the C code of the CPython VM.

Breaking this direct link between the two stacks, that of the Python program and that of the hardware, would allow a great deal of freedom in implementing the VM.
Once the stacks are disconnected, the Python frame stack can be freely passed around and manipulated as data, like any other object. This allows interesting and useful approaches to executing the Python program:
  • Arbitrary chunks of code can be executed, not just whole functions.
  • Sequences of bytecodes that cross function call boundaries can be optimized, massively reducing the overhead of function calls.
  • Sequences of bytecodes can be dynamically replaced with optimized versions or even machine code.
  • Execution of code can be transferred from unoptimized code to optimized code (and vice versa) at arbitrary points in the bytecode.
There are also two beneficial side-effects of this transformation.
  • Stack overflow terminates gracefully with an exception, rather than crashing the interpreter.
  • Coroutines (aka Greenlets) can be implemented in a portable fashion.
Possible CPython Implementation 

Currently CPython (ignoring, for now, shortcuts in the interpreter for calling function objects) calls objects via the PyObject_Call function, which delegates to the tp_call slot in the type of the callable object:
PyObject_Call(callable, args, kws) 
is equivalent to (ignoring error handling)
Py_TYPE(callable)->tp_call(callable, args, kws)

By adding a new slot, tp_prepare_call, the interpreter, in the CALL_FUNCTION bytecode, could call
Py_TYPE(callable)->tp_prepare_call(threadstate, callable, args, kws)
before continuing execution.
In order to reduce the number of empty  dicts required, NULL can represent an empty dict for kws.
Callable objects implemented in C would not provide tp_prepare_call, but function objects would. The  tp_prepare_call of the function type would create a new frame, initializing it from args and kws push it to the frame stack, and set the instruction pointer to the first bytecode of the called function.

Bound methods should also use the tp_prepare_call slot. A little trickery is required, the tp_prepare_call of a bound method would do the following:
  • Push callable->method to the evaluation stack
  • Push (callable->self,) + args to the evaluation stack
  • Push kws to the evaluation stack
  • Set the instruction back to the original call :)
The interpreter would need to record which frame it was first called with, so that when executing a RETURN bytecode, it would either pop the frame (if current_frame != entry_frame) or perform a return at the C level (if current_frame = entry_frame).

The FOR_ITER and YIELD bytecodes will also need special treatment when dealing with generator objects.

Wednesday, 13 April 2011

Faster Binary Operators

Before moving on to the more complex modifications, we will start with an optimization that will speed up almost all code, and requires only small changes to the code base.

Python has a unique way of implementing binary operators, which works as follows; using the addition a + b as the example:
  • Call the special method __add__ on the left hand operator a.__add__(b)
  • If that fails, either because a does not have an __add__ method or because a.__add__(b) returns NotImplemented, then:
  • Call the special method __radd__ on the right hand operator b.__radd__(a).
There is a special case where the type of b is a subtype of the type of a, in which case the __radd__ method is tried first.

The optimization we wish to implement relies on the fact that built-in types such as int and list cannot be monkey-patched (altered at runtime) and that they are commonly used in binary expressions.

Binary operators can be sped up by attaching an index to all types.
Common built-in types are given an index between 1 to 15 inclusive, all other types have index 0.
Each binary operator consists of a table of 256 functions, indexed from 0 to 255. A binary expression is evaluated by generating an index from the types of a and b: indexof(a) * 16 + indexof(b). If a function is present in the table at that location then it is called, if not then the standard implementation is used. For example, suppose that int has an index of 3 and float has an index of 5, then the expression 1 + 1.0 will be resolved by calculating
indexof(int) * 16 + indexof(float) = 3 * 16 + 5 = 53.
A pointer to the function for adding ints to floats is stored at index 53, and the addition is performed with no additional look-up.

This will work perfectly correctly if all entries are NULL, functions can then be added and tested one at a time. Operator functions do not need to perform type checking, since they can only be accessed through the table.

Although the speed up will not be large, it will be measurable and has two other benefits.
First, it makes specializing (modification number 14) binary operators easier and faster.
Second, it prevents degradation of the performance of unspecialized binary operators that might occur due to enhanced builtin-functions (modification number 4).

Possible CPython Implementation

The type index could be stored in the tp_flags field of the type object, using bits 0 to 3. This would mean that subtype flags would need to be masked when inheriting, to prevent subtypes of built-in types from having non-zero indices.

Function look up can be performed with the expression:
operator_table[((Py_TYPE(a)->tp_flags&15)<<4)+(Py_TYPE(b)->tp_flags&15)]

Rather than storing NULL in unused slots, the original function, eg. PyNumber_Add, can be stored instead.

The following types are probably worth giving no-zero indices:
  1. int
  2. float
  3. complex
  4. bool
  5. list
  6. dict
  7. set
  8. frozenset
  9. tuple
  10. str
  11. bytes
  12. bytearray
Leaving three spare.

Thursday, 7 April 2011

Introduction

Welcome to the HotPy blog.

The original HotPy virtual machine was a VM for the Python language,
which I developed as part of my PhD. It can be found here.

It is my intention in this blog to explain how the techniques and
optimizations that I developed for the HotPy VM can be applied to
the CPython virtual machine.

I believe that, eventually, CPython can be made as fast or faster than PyPy  without any changes to the Python language or the API. Of course PyPy will get faster as well, but the big improvements in PyPy have already been made (or maybe the PyPy people have something big up their sleeves).

Quite a few changes are required, but these can be done incrementally.
Each change should result in a fully working and testable VM.
The changes to be made are listed below, roughly in order of proposed implementation:
  1. Faster binary operators, using table lookup for common type pairings.
  2. Separate the VM frame stack from the hardware stack.
  3. Implement a "proper" (tracing rather than reference counting) garbage collector.
  4. Enhance builtin-function, incorporating various slot wrappers, to handle a wider range of C functions. This will allow better optimization of calling C functions from Python.
  5. Refactor some standard types, notably type and frame to better match execution model.
  6. Add lower-level bytecodes. This will allow complex bytecodes to be expressed in terms of a core of simpler bytecodes.
  7. Centralize management of execution in a execution supervisor. Supervisor will delegate execution to the interpreter or compiled code.
  8. Implement tagged integers. Tagged integers are widely used in VMs for dynamic languages. Tagged integers will improve several aspects of performance.
  9. Add extra bytecodes for tracing; guard and exit instructions.
  10. Implement Tracing Interpreter. Tracing interpreter records bytecodes as they are executed, producing bytecode sequences that represent the dynamic structure of the program. Although this amke little difference to performance, it is the key to following optimizations.
  11. Add specialised bytecodes for int and float operations, fast member access and fast C function calls.
  12. Implement guard mechanism, for deoptimisation of traces.
  13. Modify dict to internals to allow sharing of keys and adding guards.
  14. Implement Specialiser. This is the heart of the optimisations. It will generate significant speedups and unlocks the following optimizations.
  15. Implement Escape Analysis. This works by lazily deferring the instantiation of objects, many of which are never required. This also generates notable speedups.
  16. Implement JIT Compilation. JIT compilation is assumed to be complex, due to the number of optimisations involved. But by this point the interesting optimisations have been done. Only stack erasure and register allocation is required. This can be done comparitively easily using LLVM, libJIT or similar.
Most of these changes depend, to varying degrees, on previous changes.
Here is a diagram showing the dependencies.

In upcoming posts I will explain the rationale for each of these of changes and describe in detail a possible implementation.