Well, the hardware stack is the stack used by C, C++ and similar languages to store local variables that don't fit in registers, to track return addresses, and the like.
Strictly, it is the operating system's ABI, not the hardware itself, that determines the stack layout and calling conventions. Since C uses this stack, it is also commonly known as the C stack.
Since CPython is written in C, every time CPython makes a call internally (not necessarily corresponding to a Python-level call), the C stack grows.
When your Python program makes a call, the Python VM creates a new frame object to hold local variables (all Python VMs do this, not just CPython). The frame object is allocated on the heap (not the C stack) and is linked to the previous frame to form a singly-linked list. In fact, the frame object contains not only local variables, but also a pointer to the currently executing code object and the VM instruction pointer. This means that the stack of frames contains the complete execution state of the current thread.
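As a rough illustration, here is a minimal sketch of such a frame in C. The field names loosely follow CPython's PyFrameObject, but the struct is pared down for clarity and is not the actual definition:

    #include <Python.h>

    /* Simplified heap-allocated frame; the f_back link forms the
     * singly-linked list of frames described above. */
    typedef struct _frame_sketch {
        struct _frame_sketch *f_back; /* caller's frame, NULL at the bottom */
        PyCodeObject *f_code;         /* the currently executing code object */
        int f_lasti;                  /* VM instruction pointer (bytecode offset) */
        PyObject *f_locals[1];        /* local variables, variable-length */
    } FrameSketch;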
When a Python-level call is made in CPython, a corresponding call is made in the C code, so the Python stack is tied to the C stack. If a return statement is executed in the Python program, then a return statement must be executed in the C code of the CPython VM.
Breaking this direct link between the two stacks, that of the Python program and that of the hardware, would allow a great deal of freedom in implementing the VM.
Once the stacks are disconnected, the Python frame stack can be freely passed around and manipulated as data, like any other object. This allows interesting and useful approaches to executing the Python program:
- Arbitrary chunks of code can be executed, not just whole functions.
- Sequences of bytecodes that cross function call boundaries can be optimized, massively reducing the overhead of function calls.
- Sequences of bytecodes can be dynamically replaced with optimized versions or even machine code.
- Execution of code can be transferred from unoptimized code to optimized code (and vice versa) at arbitrary points in the bytecode.
- Stack overflow terminates gracefully with an exception, rather than crashing the interpreter.
- Coroutines (aka Greenlets) can be implemented in a portable fashion.
Currently CPython (ignoring, for now, shortcuts in the interpreter for calling function objects) calls objects via the PyObject_Call function, which delegates to the tp_call slot in the type of the callable object:
PyObject_Call(callable, args, kws)
is equivalent to (ignoring error handling)
Py_TYPE(callable)->tp_call(callable, args, kws)
By adding a new slot, tp_prepare_call, the interpreter could, in the CALL_FUNCTION bytecode, call
Py_TYPE(callable)->tp_prepare_call(threadstate, callable, args, kws)
before continuing execution.
To reduce the number of empty dicts that need to be allocated, NULL can stand in for an empty kws dict.
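To make that concrete, here is a rough sketch of how the dispatch might look inside the interpreter loop. The PUSH/POP macros, the shape of the switch case, and the omitted error handling are all simplifications, not CPython's actual code:

    /* Illustrative sketch of CALL_FUNCTION dispatching through the
     * proposed tp_prepare_call slot. */
    case CALL_FUNCTION: {
        PyObject *kws = NULL;        /* NULL stands in for an empty kws dict */
        PyObject *args = POP();      /* packed positional arguments */
        PyObject *callable = POP();
        if (Py_TYPE(callable)->tp_prepare_call != NULL) {
            /* Pushes a new Python frame and sets the instruction pointer;
             * the interpreter loop then just continues -- no recursive
             * C-level call is made. */
            Py_TYPE(callable)->tp_prepare_call(threadstate, callable, args, kws);
        }
        else {
            /* Objects implemented in C still go through tp_call. */
            PUSH(Py_TYPE(callable)->tp_call(callable, args, kws));
        }
        break;
    }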
Callable objects implemented in C would not provide tp_prepare_call, but function objects would. The tp_prepare_call of the function type would create a new frame, initialize it from args and kws, push it onto the frame stack, and set the instruction pointer to the first bytecode of the called function.
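A sketch of what that might look like, building on the frame sketch above. The helpers Frame_New and frame_init_locals, and the current_frame field on the thread state, are hypothetical stand-ins:

    /* Hypothetical tp_prepare_call for the function type. */
    static int
    function_prepare_call(PyThreadState *tstate, PyObject *callable,
                          PyObject *args, PyObject *kws)
    {
        PyFunctionObject *func = (PyFunctionObject *)callable;
        FrameSketch *frame = Frame_New(func);        /* heap allocation */
        if (frame == NULL)
            return -1;
        if (frame_init_locals(frame, args, kws) < 0) /* bind parameters */
            return -1;
        frame->f_back = tstate->current_frame;       /* link to the caller */
        tstate->current_frame = frame;               /* push onto the frame stack */
        frame->f_lasti = 0;                          /* first bytecode of the callee */
        return 0;
    }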
Bound methods should also use the tp_prepare_call slot. A little trickery is required: the tp_prepare_call of a bound method would do the following (see the sketch after this list):
- Push callable->method to the evaluation stack
- Push (callable->self,) + args to the evaluation stack
- Push kws to the evaluation stack
- Set the instruction pointer back to the original call :)
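In code, that trick might look something like the following. The field names im_func/im_self, the PUSH macro and back_up_instruction_pointer are illustrative; rather than building a frame itself, the method rewrites the call as a plain function call and lets the interpreter re-execute the CALL_FUNCTION instruction, which then dispatches to the function's own tp_prepare_call:

    /* Hypothetical tp_prepare_call for bound methods. */
    static int
    method_prepare_call(PyThreadState *tstate, PyObject *callable,
                        PyObject *args, PyObject *kws)
    {
        PyMethodObject *m = (PyMethodObject *)callable;
        PyObject *self_tuple = PyTuple_Pack(1, m->im_self);
        if (self_tuple == NULL)
            return -1;
        PyObject *full_args = PySequence_Concat(self_tuple, args);
        Py_DECREF(self_tuple);
        if (full_args == NULL)
            return -1;
        PUSH(m->im_func);                    /* the underlying function */
        PUSH(full_args);                     /* (self,) + args */
        PUSH(kws);                           /* keyword args, possibly NULL */
        back_up_instruction_pointer(tstate); /* re-execute the original call */
        return 0;
    }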
The FOR_ITER and YIELD bytecodes will also need special treatment when dealing with generator objects.
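For instance, FOR_ITER could special-case generators by resuming the generator's heap-allocated frame directly, instead of making a C-level call into it. This is only a speculative sketch; generator_frame and the other non-standard names are hypothetical:

    /* Illustrative sketch of FOR_ITER special-casing generator objects. */
    case FOR_ITER: {
        PyObject *iter = TOP();
        if (PyGen_Check(iter)) {
            FrameSketch *gen_frame = generator_frame(iter);
            gen_frame->f_back = tstate->current_frame;
            tstate->current_frame = gen_frame;  /* resume the generator */
        }
        else {
            /* Ordinary iterators still go through tp_iternext. */
            PUSH((*Py_TYPE(iter)->tp_iternext)(iter));
        }
        break;
    }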