There are a number of stages involved in turning a plain-text PHP file into executable code:
The IL is then transformed into a phix opcode stream:
Depending on the final target, at this point the code may be written out to file as an assembly, fed into a JIT (or interpreter) to execute, or both if the executor performs caching.
The JIT is also capable of generating runtime code in native format without executing it.
The source PHP code is read in and tokenized. The tokens are split up into a number of types:
| Type |
|---|
| Operators |
| Integers |
| Floats |
| Strings |
| Identifiers |
| Variables |
| Comments |
| EOL |
| EOF |
As the tokens are read in, they are added into a database created for each type, and assigned an identifier.
The following information is stored for every token:
| Name | Description |
|---|---|
| filePos | Position within the file where token was found |
| text | The actual text in the source |
| id | The identifier assigned to this token |
Keywords and operators are treated identically to identifiers, except that they have a predefined identifier value. For a complete list of keywords, see keywords and operators.
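The token records described above can be sketched roughly as follows. This is only an illustration: it uses PHP's built-in `token_get_all()` as a stand-in for the phix lexer, the function names `sketch_classify` and `sketch_tokenize` are hypothetical, keywords are lumped in with operators for brevity, and EOL/EOF tokens are not modelled.

```php
<?php
// Sketch only: PHP's built-in tokenizer stands in for the phix lexer.

// Map a PHP tokenizer id onto one of the type names from the table above.
function sketch_classify(int $tokenId): ?string
{
    switch ($tokenId) {
        case T_LNUMBER:                  return 'Integers';
        case T_DNUMBER:                  return 'Floats';
        case T_CONSTANT_ENCAPSED_STRING: return 'Strings';
        case T_STRING:                   return 'Identifiers';
        case T_VARIABLE:                 return 'Variables';
        case T_COMMENT:
        case T_DOC_COMMENT:              return 'Comments';
        case T_OPEN_TAG:
        case T_WHITESPACE:               return null;        // not recorded in this sketch
        default:                         return 'Operators'; // keywords and multi-character operators
    }
}

function sketch_tokenize(string $source): array
{
    $ids = [];     // per-type table: text => identifier (assumed: same text shares one id)
    $tokens = [];  // one record per token, in source order
    $offset = 0;   // byte offset of the current token within the source

    foreach (token_get_all($source) as $token) {
        // token_get_all() yields either [id, text, line] or a bare one-character string.
        $text = is_array($token) ? $token[1] : $token;
        $type = is_array($token) ? sketch_classify($token[0]) : 'Operators';

        if ($type !== null) {
            if (!isset($ids[$type][$text])) {
                $ids[$type][$text] = count($ids[$type] ?? []) + 1;
            }
            $tokens[] = [
                'type'    => $type,
                'filePos' => $offset,
                'text'    => $text,
                'id'      => $ids[$type][$text],
            ];
        }
        $offset += strlen($text);
    }
    return $tokens;
}
```

Calling `sketch_tokenize('<?php $x = 1 + 2.5;')` returns one record per token, each carrying the `filePos`, `text` and `id` fields from the table.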
The syntax tree (aka CodeDom) is built from the lex token stream which directly represents the source code. Creating a CodeDom and then outputting it as PHP should generate identical code minus a few small differences:
We now perform a number of checks to ensure that the syntax tree is semantically valid. For example, that the left-hand value in an assignment is not a constant value.
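As a rough illustration of such a check, the sketch below walks a tree and reports assignments whose left-hand side is a literal. The node layout (nested arrays with a `kind` key) and the function name `sketch_check_assignments` are assumptions made for this example and are not the real CodeDom classes.

```php
<?php
// Hypothetical node shape: ['kind' => ..., ...]; not the real CodeDom classes.
function sketch_check_assignments(array $node, array &$errors): void
{
    if (($node['kind'] ?? null) === 'assign') {
        $leftKind = $node['left']['kind'];
        if ($leftKind === 'int' || $leftKind === 'float' || $leftKind === 'string') {
            $errors[] = 'Cannot assign to a constant value';
        }
    }
    // Recurse into any child nodes (every nested array is treated as a child).
    foreach ($node as $child) {
        if (is_array($child)) {
            sketch_check_assignments($child, $errors);
        }
    }
}
```

A tree for `$x = 1;` passes, while a tree for `5 = $x;` adds one entry to `$errors`.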
We also modify the syntax tree to add implied casts, for example:
$moo = $x . $y;
becomes:
$moo = (string)((string)$x . (string)$y);
There are a number of cases where we can’t know the type at compile time, for example:
$res = $x * $y;
In the above example, we can’t tell if the result is going to be an integer or a float, as it depends on the types of $x and $y. However, we can convert this to:

    if (is_float($x) === true || is_float($y) === true) {
        $res = (float)((float)$x * (float)$y);
    } else {
        $res = (int)((int)$x * (int)$y);
    }

Although this looks like it will cause a huge slowdown at runtime, it gives the optimizer a chance to eliminate many of the branches, and lets us use profile feedback to prefer the most frequently hit branch.
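To make the expanded form easy to experiment with, here it is wrapped in a function and compared against PHP's native operator. The name `sketch_mul` is hypothetical, and the sketch only covers plain integer and float operands; numeric strings and integer overflow are not considered.

```php
<?php
// Sketch: the type-dispatching multiply from the text, wrapped in a function.
function sketch_mul($x, $y)
{
    if (is_float($x) === true || is_float($y) === true) {
        // At least one float operand: the result is a float.
        return (float)((float)$x * (float)$y);
    }
    // Otherwise both operands are treated as integers.
    return (int)((int)$x * (int)$y);
}
```

For integer and float inputs that do not overflow, `sketch_mul($x, $y)` yields the same value and type as `$x * $y`.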
Because class properties are completely dynamic in PHP (i.e. they can be assigned even when not declared), and because a class declaration can be held back until runtime, object property access gets changed from:
$x = $moo->some_property; $moo->some_property = $x;
to:
$x = phix_internal_object_get_property($moo, 'some_property'); phix_internal_object_set_property($moo, 'some_property', $x);
Because an undefined variable is converted to point to an object when a property is assigned to it, the variable must be passed by reference.
For the same reason, class method calls are rewritten from:
$x = $moo->func(1, 2, 3);
to:
$x = phix_internal_object_call_function($moo, 'func', 1, 2, 3);
Sadly this means we lose any chance of inlining, although that might be fixable at some point in the future.
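The behaviour of these helpers can be approximated in plain PHP. The following is a minimal sketch under stated assumptions: the `sketch_` function names and their signatures are invented for illustration (the real helpers are internal to phix), and the conversion of an undefined variable into an object is emulated explicitly with `stdClass`.

```php
<?php
// Names and signatures are assumptions for illustration only.

function sketch_object_get_property(&$obj, string $name)
{
    // Missing objects and missing properties both read as null in this sketch.
    return $obj->$name ?? null;
}

function sketch_object_set_property(&$obj, string $name, $value): void
{
    if ($obj === null) {
        // Emulate the rule described above: an undefined variable becomes an object.
        $obj = new stdClass();
    }
    $obj->$name = $value;
}

function sketch_object_call_function(&$obj, string $name, ...$args)
{
    // The method is looked up by name at runtime, which is why inlining is lost.
    return $obj->$name(...$args);
}

// Small class used only to demonstrate the call helper.
class SketchAdder
{
    public function func($a, $b, $c)
    {
        return $a + $b + $c;
    }
}
```

Because the first parameter is taken by reference, `sketch_object_set_property($moo, 'some_property', 42)` works even when `$moo` has never been assigned.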
When something is accessed using the array operator ([]), it doesn’t mean it’s an array. It could be:
Also, the expression can be a read only one (accessing by value), or a read/write one (accessing by reference).
Let’s look at how these different cases are handled:
When the lvalue is an empty variable, PHP automatically turns it into an array and then performs the array operation:
<?php assert(is_null($a)); $a[] = 1; assert(is_array($a));
When the lvalue is an object, we currently raise an error. SPL will allow ArrayAccess one day.
$str = 'Theo'; $str[1] = 'w'; assert($str === 'Tweo');
When the lvalue is a string, the array access operator sets the character at that position to the rvalue, although note that you can’t (in PHP) use [] to append a char, which seems a bit mad :)
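The write cases described above can be sketched as a single dispatch function. This is an assumption-laden illustration, not the phix implementation: `sketch_index_set` is a hypothetical name, a `null` key stands for the `[]` append form, and objects simply raise an exception.

```php
<?php
// Sketch: dispatch an indexed write on the runtime type of the target.
// A null $key represents the append form, $target[] = $value.
function sketch_index_set(&$target, $key, $value): void
{
    if ($target === null) {
        $target = [];                      // empty variable: becomes an array first
    }
    if (is_array($target)) {
        if ($key === null) {
            $target[] = $value;
        } else {
            $target[$key] = $value;
        }
    } elseif (is_string($target)) {
        if ($key === null) {
            throw new LogicException('[] cannot be used to append to a string');
        }
        $target[$key] = $value;            // replaces the character at that offset
    } else {
        throw new LogicException('Array access is not supported on this type');
    }
}
```

With `$str = 'Theo'`, calling `sketch_index_set($str, 1, 'w')` leaves `$str` equal to `'Tweo'`, matching the example above.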
Arrays are handled as classes internally, which means we can do SPL style things but better in the future.
However, there are two differences between normal objects and arrays:
Note that strings can be accessed as arrays.
However, because of PHP’s magic that creates an array when an un-initialised variable is first used as one, we need to handle it slightly differently.
Arrays are handled in C as well for now, although note that we must always pass by reference, as if we attempt to set or append an array element that does not exist, PHP converts the variable into an array.
Also note that using SPL we can make a class behave like an array.
$x[] = $y; $x[$y] = $z; $x = $y[$z]; $x = &$y[$z];
The above code would actually be converted to this by the semantic analyzer:

    $x[] = $y;                                      // essentially phix_internal_array_append_ref($x) = $y;
    $x[phix_internal_array_get_key($x, $y)] = $z;   // essentially phix_internal_array_get_ref($x, $y) = $z;
    $x = $y[phix_internal_array_get_key($y, $z)];   // essentially phix_internal_array_get($y, $z);
    $x = &$y[phix_internal_array_get_key($y, $z)];  // essentially phix_internal_array_get_ref($y, $z);

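PHP functions can return references, so the by-reference helpers can be approximated in userland. The sketch below is illustrative only: the `sketch_` names and signatures are assumptions, and it ignores key normalisation (the job of `phix_internal_array_get_key`).

```php
<?php
// Sketch: reference-returning helpers. The array is passed by reference because
// an unset variable has to be turned into an array before the element exists.

function &sketch_array_get_ref(&$arr, $key)
{
    if ($arr === null) {
        $arr = [];
    }
    if (!array_key_exists($key, $arr)) {
        $arr[$key] = null;        // create the slot so there is something to reference
    }
    return $arr[$key];
}

function &sketch_array_append_ref(&$arr)
{
    if ($arr === null) {
        $arr = [];
    }
    $arr[] = null;                // append an empty slot
    $key = array_key_last($arr);  // requires PHP 7.3 or later
    return $arr[$key];
}
```

A caller binds the result with `=&` and then assigns through it, for example `$slot = &sketch_array_append_ref($x); $slot = $y;`.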
The foreach() concept is actually just a for loop with an iterator helper:
foreach ($x as $key => &$val) { /* some code */ }
gets re-written when lowering to the frontend tree format to become:

    for ($_i = phix_internal_iterator_make($x); ($_i !== null) && ($_i->valid()); $_i->next()) {
        $key = $_i->key();
        $val = &$_i->current_ref();
        /* some code */
    }

phix_internal_iterator_make($x) analyzes the argument to detect its type. If it is an array, it returns an ArrayObject; if it is an object, it returns the result of a call to getIterator() (which is defined by default in the opaque object class to iterate through all of the properties).

    // [InternalFunction()]
    function & phix_internal_iterator_make($val) {
        if (is_array($val)) {
            return new ArrayObject($val);
        } else if (is_object($val)) {
            return $val->getIterator();
        } else if (is_resource($val)) {
            return phix_internal_resource_get_iterator($val);
        } else {
            trigger_error(Warning, "Invalid argument supplied for foreach()", __CALLER);
        }
    }

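The lowering can be tried out in stock PHP using SPL's `ArrayIterator`. This is a by-value sketch only: `current_ref()` has no SPL equivalent, resources are not handled, and the names `sketch_iterator_make` and `sketch_foreach_pairs` are hypothetical.

```php
<?php
// Sketch: by-value iteration only, built on standard SPL classes.
function sketch_iterator_make($val): ?Iterator
{
    if (is_array($val)) {
        $it = new ArrayIterator($val);
    } elseif ($val instanceof Iterator) {
        $it = $val;
    } elseif ($val instanceof IteratorAggregate) {
        return sketch_iterator_make($val->getIterator());
    } elseif (is_object($val)) {
        $it = new ArrayIterator(get_object_vars($val)); // public properties only
    } else {
        trigger_error('Invalid argument supplied for foreach()', E_USER_WARNING);
        return null;
    }
    $it->rewind();
    return $it;
}

// The lowered loop: collects [key, value] pairs in iteration order.
function sketch_foreach_pairs($x): array
{
    $pairs = [];
    for ($_i = sketch_iterator_make($x); ($_i !== null) && $_i->valid(); $_i->next()) {
        $key = $_i->key();
        $val = $_i->current();
        $pairs[] = [$key, $val];
    }
    return $pairs;
}
```

For arrays and plain objects, `sketch_foreach_pairs($x)` visits the same keys and values, in the same order, as `foreach ($x as $key => $val)`.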
At this point we perform a number of passes on the tree.
Each pass returns true if modifications were made, or false if none were made. Passes may be run multiple times until no modifications are made at all.
We need to be careful that one pass doesn’t change something that another pass then changes back, which could cause an endless loop.
Each pass can be enabled or disabled - a CLI runtime would probably not want to do anything but the cheapest optimizations with the largest benefits - a compiler would probably want to do all of them.
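The pass loop described above might look roughly like this. It is a sketch under several assumptions: the tree is simplified to a flat list of instruction strings, the two example passes and all `sketch_` names are invented for illustration, and the `$maxRounds` cap is one possible guard against passes that keep undoing each other.

```php
<?php
// Sketch: the "tree" is simplified to a flat list of instruction strings.

// Run every enabled pass repeatedly until a full round makes no changes.
function sketch_run_passes(array &$code, array $passes, int $maxRounds = 10): int
{
    $rounds = 0;
    do {
        $changed = false;
        foreach ($passes as $pass) {
            if ($pass['enabled'] && $pass['run']($code)) {
                $changed = true;
            }
        }
        $rounds++;
    } while ($changed && $rounds < $maxRounds);
    return $rounds;
}

// Example pass: drop a "goto X" that is immediately followed by "X:".
function sketch_pass_redundant_goto(array &$code): bool
{
    foreach ($code as $i => $line) {
        if (strncmp($line, 'goto ', 5) === 0 && ($code[$i + 1] ?? null) === substr($line, 5) . ':') {
            array_splice($code, $i, 1);
            return true;
        }
    }
    return false;
}

// Example pass: drop a label that no goto refers to.
function sketch_pass_unused_label(array &$code): bool
{
    foreach ($code as $i => $line) {
        if (substr($line, -1) === ':' && !in_array('goto ' . substr($line, 0, -1), $code, true)) {
            array_splice($code, $i, 1);
            return true;
        }
    }
    return false;
}
```

Each pass reports whether it changed anything, so the driver stops as soon as a whole round passes with no modifications. Disabling a pass is just a matter of setting its `enabled` flag to `false`.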

    if ($x || something() == false) {
        doSomething($y);
        doSomethingElse($z);
    } else {
        doSomething($z);
        doSomethingElse($y);
    }

Gets converted to:

    if ($x || something() == false) goto label_1; else goto label_2;
    label_1:
        doSomething($y);
        doSomethingElse($z);
        goto label_3;
    label_2:
        doSomething($z);
        doSomethingElse($y);
    label_3:


    switch (getValue()) {
        case getAnotherValue(): // 1
            $z = 1;
            break;
        case 546555: // 2
            doSomethingElse();
            // Fall through
        case 345: // 3
        case 346: // 3
            $z = 2;
            break;
        case 'moo': // 4
            $z = 3;
            break;
        default: // 5
            $z = 0;
            break;
    }

gets rewritten to:

    $T1 = getValue();
    if ($T1 == getAnotherValue()) goto label_1;
    else if ($T1 == 546555) goto label_2;
    else if ($T1 == 345 || $T1 == 346) goto label_3;
    else if ($T1 == 'moo') goto label_4;
    goto label_5;
    label_1:
        $z = 1;
        goto label_end;
    label_2:
        doSomethingElse();
    label_3:
        $z = 2;
        goto label_end;
    label_4:
        $z = 3;
        goto label_end;
    label_5:
        $z = 0;
        goto label_end;
    label_end:

Note that we don’t worry about things like the extra goto at the end of label_5, as it will get cleaned up later.
We also move constant values of a type without side effects into a separate lookup table, as it means not only switch, but also if statements can be optimized in this way.
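Since PHP itself supports `goto`, the equivalence of the two forms can be checked directly. The sketch below uses a cut-down version of the switch with constant cases only; the function names and the trace string are invented so that fall-through is observable.

```php
<?php
// Sketch: a reduced switch and its goto-lowered form, for side-by-side comparison.

function sketch_switch_original($value): string
{
    $trace = '';
    switch ($value) {
        case 546555:
            $trace .= 'A';
            // Fall through
        case 345:
        case 346:
            $trace .= 'B';
            break;
        case 'moo':
            $trace .= 'C';
            break;
        default:
            $trace .= 'D';
            break;
    }
    return $trace;
}

function sketch_switch_lowered($value): string
{
    $trace = '';
    $T1 = $value;                       // the switch subject is evaluated once
    if ($T1 == 546555) goto label_2;
    if ($T1 == 345 || $T1 == 346) goto label_3;
    if ($T1 == 'moo') goto label_4;
    goto label_5;
    label_2:
    $trace .= 'A';                      // no goto here: falls through into label_3
    label_3:
    $trace .= 'B';
    goto label_end;
    label_4:
    $trace .= 'C';
    goto label_end;
    label_5:
    $trace .= 'D';
    label_end:
    return $trace;
}
```

Both functions use loose (`==`) comparison, as `switch` does, so they agree on inputs such as the string `'345'` as well.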
We now convert the parse tree into three address code format, which is the format used by the backend bytecode generator and most of the optimizers.
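As a rough idea of what this conversion involves, the sketch below flattens a nested binary expression into instructions that each have at most one operator and two operands. The tree layout (`[operator, left, right]` arrays), the `$T` temporaries, and the name `sketch_to_tac` are assumptions for this example and may not match the real format.

```php
<?php
// Sketch: lower a nested binary expression to three-address instructions.
// A node is either a leaf (variable name or constant) or [operator, left, right].
function sketch_to_tac($node, array &$out, int &$next)
{
    if (!is_array($node)) {
        return $node;                       // leaves need no instruction
    }
    [$op, $left, $right] = $node;
    $l = sketch_to_tac($left, $out, $next);
    $r = sketch_to_tac($right, $out, $next);
    $tmp = '$T' . $next++;                  // fresh temporary for this result
    $out[] = "$tmp = $l $op $r";
    return $tmp;
}
```

For the tree `['+', '$a', ['*', '$b', '$c']]` with the counter starting at 1, this emits `$T1 = $b * $c` followed by `$T2 = $a + $T1`.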
Three Address Code has its own page, as it’s a separate subject.