Jump to content

Building A Programming Language Compiler - Part 2 : Parsing And Code Generation


Helios

Recommended Posts

Building A Programming Language Compiler - Part 2 : Parsing And Code Generation

Warning: I assume at least a very basic understanding of programming in this tutorial, if you don't have it please turn back now.

Welcome back folks, it seems you've managed to get through the previous tutorial in this series

and have come back for more! Well your in luck this tutorial will cover the far more intresting

side of parsing and code generation, yes thats right you will finally be able to do something

with those pesky Tokens you spent so much time lexing in the last tutorial.

Parsing

So you might be asking yourself write about now - "What does parsing mean?" - well theres

a very simple answer to that. Parsing is the process where by chunks of data ( Tokens ) are processed

and converted into something usable, so in this case its the process of taking the Tokens making

up an SimpleScript source file and converting it into Assembly.

Now before we start writing our parser I must stress, its far easier than it looks, don't get

intimidated by this complexity ( if any ) of this source code, I've tried to keep it as simple as possible

- which is why there are some bad programming practices going on in the source :P - so that you don't have

trouble, bu there are a few areas which you may stuggle with, just PM me at the BinaryPhoenix.com forums

if you need help.

Corrections

The first thing we have to do before parsing our source code is modify a few bits of code that we created

in the last tutorial, namely the Compile method.

Please remove the following code from the bottom of the Compile method.

// Go through each token and print it out.
foreach (Token token in _tokenList)
Console.WriteLine(token.ID.ToString() + ": " + token.Mnemonic);

This code as you probably remember was responsible for printing the tokens to the console,

as we will be having a testable executable file by the end of this tutorial thats no longer needed.

Building The Parser

Right then, lets start! First thing we need to do is add the following variables into the Compile

class.

private string _compiledCode = null; // Contains the compiled assembly version of the source code.

private int _tokenIndex = 0; // Stores the index of the current token being parsed.
private Token _currentToken = null; // Stores the current token being parsed.

private string[] _stringTable = new string[0]; // Keeps track of all the string data inside our program.
private string[] _includeTable = new string[]  // Keeps track of all the include files that need to be included.
{
	"includewin32a.inc" // This includes the declarations for the whole WinApi.
};
private string[] _libraryTable = new string[] // Keeps track of all the librarys that need to be imported for this program to work.
{
	"kernel32", // This is mandatory as it includes our ExitProcess function.
	"user32",
	"winmm"
};
private Function[] _functionTable = new Function[] // Keeps track of all the functions that need to be imported for this program to work.
{
	new Function("user32", "MessageBoxA", "MessageBox",4), // This is just a simple message box function, its syntax is MessageBox 
	new Function("kernel32", "ExitProcess", "End",1), // Ends the current process.
	new Function("winmm", "mciSendStringA", "SendMessage",4) // Sends an MCI message. Fun to open the CD drive with :D.
};

Wow, that looks complex dosen't it? Well don't worry its all quite straight forward.

private string _compiledCode = null; // Contains the compiled assembly version of the source code.

This variable is used to store the compiled assembly form of our script, ready to be assembled

by FASM into an executable file.

private int _tokenIndex = 0; // Stores the index of the current token being parsed.
private Token _currentToken = null; // Stores the current token being parsed.

These 2 variables are used by the parser to keep track of the token currently being parsed

and the index into the _tokenList variable of said token. The _tokenIndex variable is also

used to check if we've read all the tokens or not and to illiterate through all the tokens correctly.

private string[] _stringTable = new string[0]; // Keeps track of all the string data inside our program.

Every string we use in our scripts gets added to this array so it can be added to the

compiled assembly code for our script before compilation.

private string[] _includeTable = new string[]  // Keeps track of all the include files that need to be included.
{
	"includewin32a.inc" // This includes the declarations for the whole WinApi.
};

This is an array of files that need to be included into our scripts to work, feel free to add more if you want

to use other functions, just have a poke around in the INCLUDE folder to see what other includable files you can use.

private string[] _libraryTable = new string[] // Keeps track of all the librarys that need to be imported for this program to work.
{
	"kernel32", // This is mandatory as it includes our ExitProcess function.
	"user32",
	"winmm"
};

This is the library array, it contains a list of all the .dll library files that need to be linked

to the compiled executable for it to run correctly.

private Function[] _functionTable = new Function[] // Keeps track of all the functions that need to be imported for this program to work.
{
	new Function("user32", "MessageBoxA", "MessageBox",4), // This is just a simple message box function, its syntax is MessageBox 
	new Function("kernel32", "ExitProcess", "End",1), // Ends the current process.
	new Function("winmm", "mciSendStringA", "SendMessage",4) // Sends an MCI message. Fun to open the CD drive with :P.
};

Owh, this is the most intresting array yet, it contains a list of functions that scripts can used, the Function class will be declared in a

minute so just hold on! Feel free to add any WinApi functions you wish to use into this array.

Well that was easy wasn't it? We just need to add a couple of other things in before we

start the parsing.

First thing we need to add is the Function class, you've already seen this before in the

_functionTable variable, all it does is store several details on a single function that

can be used from a script.

public class Function
{
public string Library;      // The library that this function needs to be imported from. 
public string InternalName; // The internal name of this function (eg. MessageBoxA).
public string Identifier;   // The name used in a programs source code to represent this function.
public int ParameterCount;  // How manys parameters this function takes.

public Function(string library, string internalName, string identifier, int parameterCount)
{
	Library = library;
	InternalName = internalName;
	Identifier = identifier;
	ParameterCount = parameterCount;
}
}

Much like the Token class isn't it? Very simple, all it does is store the library that the

function resides in ( the library must be declared in the _libraryTable ), the internal name of the

function inside the dll and the identifier which is the name used to call the function from a SimpleScript

program.

The ParameterCount variable is just used to make sure that a function is called with the correct

number of parameters, you may wish to add the ability to distiguish correct parameter data types at a later date,

to provide more security.

Now that we have the that out of the way you might be wondering how we are going to go about

parsing all those nosensical tokens created by the lexer, well what we are going to do is create

a loop that will keep calling a parsing function ( ParseStatement ) that will parse the next statement

in the source code until all tokens have been read ( As shown in Fiqure A. ).

a.png

Before we right the main parsing function though we need to create a couple of simple

functions to help with the parsing process.

public bool EndOfTokens()
{
if (_tokenIndex >= _tokenList.Count) return true;
return false;
}

public Token NextToken()
{
_currentToken = (Token)_tokenList[_tokenIndex++];
return _currentToken;
}

These functions are quite simple and require little explanation, but just because I'm so generous ...

The EndOfTokens function returns true if the _tokenIndex is above the size of the _tokenList, or in english

when we have run out of tokens to read.

The NextToken function reads in the next token in the list, places it in the _currentToken variable and returns

it to the caller.

Now first things first lets set up our parsing loop, place the following code at the end of your

Compile function.

// Keep parsing the code until we get to the end of the token list.
while (!EndOfTokens())
{
// Parse the next statement in the source.
ParseStatement();

// If an error has occured while parsing the last statement
// the abort compilation immediatly.
if (Error != null) return;
}

This code basically just keeps parsing until we run out of tokens or an error occurs.

The ParseStatement function parses the highest level of code possible in a script, a statement.

A statement can encompass just about anything in a scripting language from an If block to a Variable assignment, or in

the case of SimpleScript just the FunctionCall.

So lets start by creating a function call ParseStatement() and plcing the following code into it.

// Check the ID of the next token in the source and parse it depending on what it is.
switch (NextToken().ID)
{
// The token is invalid in this position so throw up an error.
default:
	Error = "Unknown statement '"+_currentToken.Mnemonic+"'";
	break;
}

This code basically gets the next token in the token list and allows you to check its ID

and parse it accordingly, if it finds an unhandled token it assumes it is out of place and gives

an error to the user.

Wow, we have a parser running! Its basic and it dosen't do anything yet but loop through our

tokens and emit errors for each one. So lets add something intresting to it so that it actuall

does something usefull, lets implement the Function Call statement, coincidently the only statement

supported by SimpleScript.

First thing we need to do before parsing the Function Call statement though is setup the code used

to produce the header and footer compiled assembly code required to create an executable.

So first things first add the following code just above the parsing loop in the Compile function.

// Adds the executable format declaration to the assembly code.
_compiledCode += "format PE GUI 4.0n";

// Adds the code to include all the relevent declaration files into the compiled code.
foreach (string s in _includeTable)
_compiledCode += "include '"+s+"'n";
_compiledCode += "n";

This code basically sets up the header of the assembly file, the output will look similar to this.

format PE GUI 4.0
include 'includewin32a.inc'
...

The first line tells the assembler what type of executable we want our code to be assembled to, if

you look at the FAQ section at the bottom of this page I explain how to change this to emit DLL's and

object files. The lines after the first tell the assembler to include all the files contained in the

_includeTable array.

You now need to add the following code after the parser loop.

// If an EndProcess invoke just to make sure the process ends incase the user forgot to add an End tag.
_compiledCode += "invoke ExitProcess,0n";

// Appends the string data to the end of the compiled assembly code.
_compiledCode += "n";
for (int i = 0; i < _stringTable.Length; i++)
// The .Replace method call in this function replaces the ' character with 
// its ascii code, as FASM dosen't allow escape sequences.
_compiledCode += "_s" + i + " db '" + _stringTable[i].Replace("'", "', 39 ,'") + "',0n";

// Emits the assembly code required to import the librarys and functions
// this program uses.
_compiledCode += "ndata importnn";

if (_libraryTable.Length > 0) _compiledCode += "library ";
int index = 0; // Used to store the index of the current library being outputted.
foreach (string s in _libraryTable)
_compiledCode += ((index++) == 0 ? "" : ", ") + s + ", '" + s + ".dll'";
_compiledCode += "n";

foreach(Function function in _functionTable)
_compiledCode += "import " + function.Library + ", " + function.InternalName + ", '" + function.InternalName + "'n";

_compiledCode += "nend datann";

Core, that looks like a lot dosen't it, well don't worry its quite simple and you don't need to understand it - just what it does -, it generates the companion to the header, the footer.

Which is required to assemble the assembly code to a executable. The output looks something similar to this.

invoke ExitProcess, 0

_s0 db 'hello world'

data import

library kernel32, "KERNEL32.dll"

import kernel32, ExitProcess, ExitProcess

end data

The first line of assembly calls the ExitProcess function, which is stored in the kernel32.dll library, basically all this function

does is close down the program, this is meant as a last resort, the user should optimaly end their code with an End; call.

The lines begining with _s are the string definitions, they just define a space in memory to store each string

of characters defined in the source code.

The last lines are the most intresting and important, they import in all the functions we are going to use from the dll librarys that

they reside in. The lines begining with library just define the dll to open, and the lines begining with import actuall import a function

from a previously defined dll library.

Righto, headers and footers are now being generated, we just need to save the assembly code

to a file and assemble it into an executable ... yep, you've guessed it add the following code to the bottom of the Compile function.

// If an error has occured then abort immediatly.
if (Error != null) return;

// Dump the compiled assembly code into the assembly file.
File.WriteAllText(assemblyFile, _compiledCode);

// Run FASM.exe so it can compile the assembly code to 
// an executable file.
Process process = new Process();
process.StartInfo.FileName = "FASM.exe";
process.StartInfo.Verb = "Open";
process.StartInfo.Arguments = """ + assemblyFile + "" "" + executableFile + """;
process.Start();
process.WaitForExit();

The second line just aborts compilation if an error has occured in any of the previous steps.

The fifth line saves the compiled assembly code into the designated assembly file, and the last part of

the code - the most important - runs the FASM.exe executable with command lines telling it to compile

the assembly file to the designated executable file, it also waits for FASM to finish compiling before moving on.

Yay, the compiler is now almost done, you can run it now and create a fully running executable file, not very intresting

though is it? What we need now is a function call parser so that we can actually do something intresting our files.

To implement the function call statement place the following code in the switch block inside the ParseStatement function.

// Its an identifier so it must be a function call.
case TokenID.Identifier:

// Keep the mnemonic of this token for later as we will be using it to 
// find the correct function to call.
string functionName = _currentToken.Mnemonic;

// Look through the function array to see if this function is valid.
Function function = null;
foreach (Function searchFunction in _functionTable)
	if (searchFunction.Identifier == functionName)
		function = searchFunction;

// If we have found the function continue on with the parsing.
if (function != null)
{
	int parameterCount = 0; // Used to keep track of the amount of parameters this function is called with.

	// Emit the first part of the assembly code used to invoke this function.
	_compiledCode += "invoke "+function.InternalName;

	// Keep parsing arguments until we come across a semi-colon.
	while (NextToken().ID != TokenID.SemiColon)
	{
		// Keep track of the amount of paramters passed to this function.
		parameterCount++;

		// If we are at the end of the token list then throw up an error as we 
		// haven't retrieved our semi-colon yet.
		if (EndOfTokens() == true)
		{
			Error = "Encountered end of file but expected semi-colon";
			return;
		}

		// And a comma to seperate this argument from the last.
		_compiledCode += ", ";

		// Generate argument code depending on what type the argument is.
		switch (_currentToken.ID)
		{
			// Check if its an integer argument, in which case we can just 
			// emit its raw value as the argument.
			case TokenID.Integer:

				_compiledCode += _currentToken.Mnemonic;

				break;

			// Check if its a string argument, in which case add the string to the 
			// string table and create a reference to it for the argument.
			case TokenID.String:

				// Insert the string into the string data table. 
				// And store its index for later use.
				int stringIndex = _stringTable.Length;
				Array.Resize<string>(ref _stringTable, _stringTable.Length + 1);
				_stringTable[stringIndex] = _currentToken.Mnemonic;

				// Add a reference to the string into the compiled code for the next argument.
				_compiledCode += "_s" + stringIndex;

				break;

			// Its not a valid data-type for an argument so throw up an error.
			default:
				Error = "Invalid argument data-type '"+_currentToken.Mnemonic+"'";
				return;
		}
	}

	// Check that the amount of parameters on this call are equal to the amount
	// the function wants.
	if (parameterCount < function.ParameterCount)
	{
		Error = "Function '"+function.Identifier+"' expects "+function.ParameterCount+" arguments but recieved "+parameterCount;
		return;
	}

	// Append a new-line character onto the end of the invoke function.
	_compiledCode += "n";
}

// We haven't found the function so we can throw up an error saying that it is invalid.
else
{
	Error = "Attempt to call unknown function '"+functionName+"'";
	return;
}

break;

Link to comment
Share on other sites

Holy mother of Jesus, thats a lot of code. Sorry to dump it all onto you but its simpler than putting it together

piece by piece.

Now lets explain how this work shall we.

Basically what happens is if it comes across a identifier at the start of a statement

it assumes the statement is a function call - you may wish to implement

keywords and other types of identifiers based statements later - and parses it accordingly.

// Keep the mnemonic of this token for later as we will be using it to 
// find the correct function to call.
string functionName = _currentToken.Mnemonic;

// Look through the function array to see if this function is valid.
Function function = null;
foreach (Function searchFunction in _functionTable)
if (searchFunction.Identifier == functionName)
	function = searchFunction;

// If we have found the function continue on with the parsing.
if (function != null)
{
...
}

// We haven't found the function so we can throw up an error saying that it is invalid.
else
{
Error = "Attempt to call unknown function '"+functionName+"'";
return;
}

This code looks through the function table array and attempts to find a function with the identifier

as the current tokens mnemonic, if it is found it continues parsing, if not it it throws up

and unknown function error.

int parameterCount = 0; // Used to keep track of the amount of parameters this function is called with.

This variable is used to keep track of how many parameters we have parsed and to make sure the user

dosen't try to call a function with to few or to many.

_compiledCode += "invoke "+function.InternalName;

This part of the code emits the first chunk (the mnemonic for the invoke function) of the assembly code needed to invoke the given function.

// Keep parsing arguments until we come across a semi-colon.
while (NextToken().ID != TokenID.SemiColon)
{
// Keep track of the amount of paramters passed to this function.
parameterCount++;

...
}

The while loop shown above is used to keep parsing the function call until all parameters have been read, which is assumed

when a semi-colon is found.

It also updates the parameterCount variable so we can keep track of how many parameters have been parsed so far which is used for error checking

in later code.

// If we are at the end of the token list then throw up an error as we 
// haven't retrieved our semi-colon yet.
if (EndOfTokens() == true)
{
Error = "Encountered end of file but expected semi-colon";
return;
}

This code as its pretty obvious throws up an error if we read to the end of the token list (effectivly the end of the source code)

and don't find a semi-colon.

// And a comma to seperate this argument from the last.
_compiledCode += ", ";

Adds the colon used to seperate the arguments in the assembly version of the script.

// Generate argument code depending on what type the argument is.
switch (_currentToken.ID)
{
...

// Its not a valid data-type for an argument so throw up an error.
default:
	Error = "Invalid argument data-type '"+_currentToken.Mnemonic+"'";
	return;
}

The switch block shown above is used to check what type the current token is an throw an error

if its one that is unexpected or unknown to the compiler. SimpleScript will only accept strings

and integer parameters in funciton calls as is shown next.

// Check if its an integer argument, in which case we can just 
// emit its raw value as the argument.
case TokenID.Integer:

_compiledCode += _currentToken.Mnemonic;

break;

This code is amazingly simple, if the parameter type is an integer it just adds it straight

into the assembly code as no processing needs to be done to it.

// Check if its a string argument, in which case add the string to the 
// string table and create a reference to it for the argument.
case TokenID.String:

// Insert the string into the string data table. 
// And store its index for later use.
int stringIndex = _stringTable.Length;
Array.Resize<string>(ref _stringTable, _stringTable.Length + 1);
_stringTable[stringIndex] = _currentToken.Mnemonic;

// Add a reference to the string into the compiled code for the next argument.
_compiledCode += "_s" + stringIndex;

break;

This is like the integer parsing except a little more complex, instead of adding the raw

string to the assembly code it creates a space in the string table to store the string then

adds a reference to the string (all string data in the assembly of simple script is prefixed with _s and the index of the data)

into the assembly code.

// Check that the amount of parameters on this call are equal to the amount
// the function wants.
if (parameterCount < function.ParameterCount)
{
Error = "Function '"+function.Identifier+"' expects "+function.ParameterCount+" arguments but recieved "+parameterCount;
return;
}

This code simply checks the parameterCount variable against the amount of parameters expected

by the function being called, if they don't tally it throws up an error.

And last ...

// Append a new-line character onto the end of the invoke function.
_compiledCode += "n";

This code basically just adds a new line character onto the end of the function invoking assembly code, without

it we would end up with all the code on the same line, which really wouldn't look very nice now would it?

All this function call parsing code should produce some assembly code similar to this if all

goes well.

invoke InternalFunctionName, param1, param2, ...

The invoke function (well its actually a mnemonic for its opcode, but who cares) essentially

just calls the functions with the given funcitons and sets up its stack frame (basically

the space where local variables are stored) on the runtime stack.

And thats about it for simple parsing.

Tadah! You know have a completely functional compiler, admitedly it dosen't do a whole

lot at the moment but its a start. Be pround you've just finished step 1 down the road

the Compiler Mastery, one of hte hardest topics in programming!

Heres a quick test you can use, it shows a message to the user then opens the CD drive.

MessageBox 0 "Hello World from SimpleScript. Now look at your computer's audio drive and press ok!" "Hello World" 48;
SendMessage "open cdaudio" 0 0 0;
SendMessage "set cdaudio door open" 0 0 0;
MessageBox 0 "Wow, the awesome power of assembly!" "Hello World" 48;
SendMessage "close cdaudio" 0 0 0;
End 0;

So where from here you ask? Well my advise is you read up on expression parsers next as thats the

heart of a real programming language, as you've probably noticed SimpleScript is very limiting as

it has very little support for any kind of memory minipulation such as variables.

Source Code

Download

FAQ

Q. Ahh, every time I try to compile a script it tells me it can't find FASM.exe?

A. Make sure the latest copy of FASM.exe (and its INCLUDE folder) is in your output folder (the folder where the exe gets emitted to) of your project. If you are using VS Express the output folders are contained in the bin folder, you will need to copy it into all of them or change the output folder in the project settings.

Q. Can I compile dll's? Or COFF object files?

A. Yep, For DLL's change the format line to "format PE GUI 4.0 DLL at 7000000h", and for COFF object files change it to "format MS COFF".

Q. Can I compile a console application?

A. Yep just remove the 'GUI' part of the format line.

~ Helios

http://www.binaryphoenix.com

Link to comment
Share on other sites

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Paste as plain text instead

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

Loading...
×
×
  • Create New...