The original HotPy virtual machine was a VM for the Python language,
which I developed as part of my PhD. It can be found here.
It is my intention in this blog to explain how the techniques and
optimizations that I developed for the HotPy VM can be applied to
the CPython virtual machine.
I believe that, eventually, CPython can be made as fast or faster than PyPy without any changes to the Python language or the API. Of course PyPy will get faster as well, but the big improvements in PyPy have already been made (or maybe the PyPy people have something big up their sleeves).
Quite a few changes are required, but these can be done incrementally.
Each change should result in a fully working and testable VM.
The changes to be made are listed below, roughly in order of proposed implementation:
- Faster binary operators, using table lookup for common type pairings.
- Separate the VM frame stack from the hardware stack.
- Implement a "proper" (tracing rather than reference counting) garbage collector.
- Enhance builtin-function, incorporating various slot wrappers, to handle a wider range of C functions. This will allow better optimization of calling C functions from Python.
- Refactor some standard types, notably type and frame to better match execution model.
- Add lower-level bytecodes. This will allow complex bytecodes to be expressed in terms of a core of simpler bytecodes.
- Centralize management of execution in a execution supervisor. Supervisor will delegate execution to the interpreter or compiled code.
- Implement tagged integers. Tagged integers are widely used in VMs for dynamic languages. Tagged integers will improve several aspects of performance.
- Add extra bytecodes for tracing; guard and exit instructions.
- Implement Tracing Interpreter. Tracing interpreter records bytecodes as they are executed, producing bytecode sequences that represent the dynamic structure of the program. Although this amke little difference to performance, it is the key to following optimizations.
- Add specialised bytecodes for int and float operations, fast member access and fast C function calls.
- Implement guard mechanism, for deoptimisation of traces.
- Modify dict to internals to allow sharing of keys and adding guards.
- Implement Specialiser. This is the heart of the optimisations. It will generate significant speedups and unlocks the following optimizations.
- Implement Escape Analysis. This works by lazily deferring the instantiation of objects, many of which are never required. This also generates notable speedups.
- Implement JIT Compilation. JIT compilation is assumed to be complex, due to the number of optimisations involved. But by this point the interesting optimisations have been done. Only stack erasure and register allocation is required. This can be done comparitively easily using LLVM, libJIT or similar.
Here is a diagram showing the dependencies.
In upcoming posts I will explain the rationale for each of these of changes and describe in detail a possible implementation.