Parsing and evaluating PHP in Haskell: Part 1

Tags: haskell, php

The other day I uploaded a new experimental project on GitHub - A Haskell PHP parser / evaluator. It doesn’t understand 100% of all PHP syntax, but it was an interesting experiment nevertheless.

Here’s some insights and other thoughts from working on the code.

Why?!

Why not?

I had wanted to experiment in writing a parser in Haskell for a while, and the thought of writing a small language of my own had crossed my mind in the past. I figured I might as well try parsing something that I already know pretty well.

The evaluator part really came in as an afterthought. “Hey wouldn’t it be cool if I could actually run PHP code instead of just parsing it?”

I also wrote a small code generator that actually generated valid Haskell code based on the parsed PHP code, which could be in turn be compiled using the GHC Haskell compiler.

How?

Haskell seems extremely suitable for parsing languages. This is probably partially thanks to the very good Parsec library, which makes writing parsers for this sort of purposes surprisingly straightforward.

When I think of parsing in most languages, I think of fiddling with characters… Imagine parsing PHP’s global keyword:

Parse a bunch of text checking if it matches “global”, then parse a space, a $ character and then a bunch of characters until a space, comma, end of line, or semicolon… yeah I’m not even going to try describing it accurately - essentially you’re dealing a lot with error-prone things like matching characters, backtracking if you encounter something unexpected, etc.

The great thing about Parsec is it handles this for you. In Haskell I just say “ok parse a reserved word global, then parse one or more variable names separated by commas, ending with a semicolon”. Of course for this I need to define a function that parses a variable name (which is also very easy) but the rest is handled by Parsec. It takes care of ignoring white space and newlines, matching the commas, etc. - and if a match happens to fail, it can backtrack automatically without requiring separate handling of whether or not it succeeded.

Obviously there were some issues along the way where it didn’t behave exactly like I wanted, but overall it was really much much easier than I expected.

Weak typing in a strongly typed language

PHP is a weakly typed language. You can join strings and ints, multiply floats with nulls, arrays can contain values of multiple types, etc.

Haskell is a strongly typed language. It doesn’t allow any of that nonsense!

So how do we deal with this in Haskell?

We create a new data type with different constructors for different types. For those unfamiliar with how Haskell types work, essentially a data type is strict: You can’t mix and match types, for example if you attempt to multiply a float and an int, it will complain. A Haskell list can only contain values of single type - a list containing a string can only hold other strings.

However, since Haskell’s type system is pretty flexible, we can define a data type such as PHPValue. We can then have different constructors to create different kinds of PHPValues - We can have a PHPFloat, PHPString and so on, which all belong to the same PHPValue type. We can have a float constructor take a float, or a string constructor take a string. Using this approach, we can have a single data type that contains different kinds of values. Sneaky!

In order to do things like multiplication or division between these kinds of values, we need a function to convert between them. We’ll need to convert PHPStrings into PHPFloats and so on. Thankfully it’s very straightforward by using pattern matching.

So with things like this, we can see that strong typing is not an issue even if we need flexibility like this.

How the parsing works

When parsing code you’ll obviously need to think of a format for storing it as data your program can then process. Typically this is called the Abstract Syntax Tree, or AST.

Parsing source into an AST can be tricky. When writing code, it may seem simple enough: If I want a for-loop, I just put down a for and the iteration variable stuff in the parentheses. But when you get down to it, it’s actually more complex than that: You can actually put any PHP expression inside the initialization block of a for. However, you can’t put another for loop in there…

You need a way to differentiate between different types of expressions. In my case, the parser differentiates between statements and expressions. A statement may contain an expression, but an expression may not contain a statement. A for-loop is a statement, but $foo + 1 is an expression. An expression becomes a statement when it ends in a semicolon. Confusing? :)

Much of the parsing involves this kind of work. Another great thing about Parsec is that you can easily combine different “parsers” together:

For example, I can write a parser that parses a variable, or a parser that parses an assignment. I can combine these (and others) to create a parser that parses a complete expression - an expression can be many things, like a plain value such as true, or a variable, or maybe an assignment - even things like function calls are expressions that produce some value.

In many cases you do the same parsing. A global statement has a list of variables after it, so you end up parsing variable names. Same applies to static - again it has a list of variables after it. Using parsec, you can easily break it into small pieces that parse specific kinds of things, and then combine them together to form bigger parsers. This makes it very easy to add new things once you have the basic blocks in place, and it also makes for easy bugfixing since the issues usually crop up in the small parsers and once you fix it there, you fix it everywhere.

For example, a parser for a while loop is kind of like this: parse reserved word “while”, parse an expression inside parentheses and then parse either one statement, or a list of statements inside curly braces. You can see the combining approach here: An expression parser and a statement parser is used to create the while parser. The expression parser in turn is comprised of other parsers like the variable or assignment parsers, and a statement parser works in much the same way. Parsec also makes it easy to write “options” like mentioned here with “either one statement, or list of statements inside braces”.

Representing a parsed PHP source as data

A parser produces an AST from a string of code. However, an AST can be represented in various ways depending on the programming language used for the parser.

If you take a look at how the code is represented in my Haskell module…

data PHPValue = PHPString String
              | PHPInt Integer
              | PHPFloat Double
              | PHPBool Bool
              | PHPNull

data PHPExpr = Literal PHPValue
             | Variable PHPVariable
             | Assign PHPVariable PHPExpr
             | BinaryExpr BinOp PHPExpr PHPExpr
             | Call FunctionCall [PHPExpr]
             | Isset [PHPVariable]
             | Print PHPExpr

data PHPStmt = Seq [PHPStmt]
             | Expression PHPExpr
             | If PHPExpr PHPStmt (Maybe ElseExpr)
             | Function String [FunctionArgumentDef] PHPStmt
             | Return PHPExpr
             | For [PHPExpr] [PHPExpr] [PHPExpr] PHPStmt
             | Echo [PHPExpr]
             | Global PHPVariable

…even if you’re not a Haskell programmer, you can see a structure. A PHPStmt can be a sequence of other statements, or a single expression, or an if-clause, a function definition, a return, and so on. A PHPExpr can be a literal value, a variable, an assignment, a binary operation (eg. 1 + 1) and so on.

This is one of the pros of using Haskell: It makes a difference between data and the operations on the data.

In Haskell, we define data types like seen above. We then separately define functions that operate on the data.

In comparison, an object-oriented language does not separate this: A class contains both data and the functions that operate on it. This can make it harder to see the big picture when it comes to more complicated data structures like this.

We can compare this to an object-oriented language: There’s a PHP parser written in PHP, called PHPPHP.

If we check out the PHPPHP project’s data structures, we won’t really get this kind of a clear picture: PHPPHP/Engine on GitHub

Do not take this as criticism against PHPPHP - it’s a great project. This is simply a comparison of how different the representation of data is between these two languages, and what advantages in my opinion Haskell has over object-oriented languages in a scenario like this. PHPPHP’s approach is also much closer to how PHP itself is implemented than my Haskell implementation, so it naturally adds some complexity to it.

Evaluating the AST

Now the other interesting thing… evaluating the code.

You can read about that in part 2