Building A Programming Language Compiler - Part 1 : Lexical Analysis

Helios · September 17, 2006

Building A Programming Language Compiler - Part 1 : Lexical Analysis

Warning: I assume at least a very basic understanding of programming in this tutorial, if you don't have it please turn back now.

Hey everyone, I've decided to write a in depth tutorial about writing a simple programming language compiler to try and teach the basics of lexical analysis and parsing to anyone who wishs to learn.

For this tutorial I'm going to be using C#, which you can download a - free - compiler for

Here, this shouldn't cause a problem with those of you who use other languages

as the theory is explained in detail, even so when I get the chance I'm going to port

it over to a few other languages.

Hopefully by the end of this tutorial you should be able to go out into the wide open world of compiler design and be capable of compiling simple languages.

SimpleScript

The language that we will be compiling in this tutorial is a very simple one, hence the name SimpleScript, which will do nothing more than allow the user to call several basic WinApi functions in a linear manor.

The language will be compiled in 2 steps. The first step is from the source language ( SimpleScript ) to the intermediate language ( x86 Assembly ) and then from the intermediate language to the target language ( Machine Code ).

The last step (Intermediate -> Target) will be done using the FASM assembler - which you can get Here - as creating a assembler is well beyond the scope of this tutorial, maybe another time :D.

The syntax for SimpleScript is as the name suggests very simple, each source file is a list of statements seperated by a semi-colon character.

The only statement that SimpleScript understands is the function call statement, though by the end of this tutorial you should easily be able to expand it to contain other statements such as variable assignment.

The only data types SimpleScript can understand are integer's ( numbers ) and string's ( strings of characters ).

SimpleScript Example:

messagebox "Hello World"
end

Setting Up The Project

Right now lets get onto the more intresting stuff shall we? First things first create a blank project with 1 source file ( SimpleScript.cs ) using your prefered C# IDE and save it into a folder called "SimpleScript".

Make sure the 'System.dll' dll is imported into your project.

Now past the FASM.exe file and INCLUDE folder from the FASM download ( Which you can get Here ) into the projects output folder.

Lexical Analysis

So, I bet your wondering, "What the hell does lexical analysis mean?". Well its quite simple, lexical analysis - or lexing as its more commonly known - is basically a complex name for the process of analysing a source language ( in this case SimpleScript ) and spliting it down into managable chunks ( known as tokens ).

A token is a simple structure that defines several characteristics of a section of code, the tokens in this compiler will only have 2 members, one describing what the token is ( Known as the Token ID ) and one containing the tokens data ( known as the Mnemonic ).

For example a the following source when lexed would be compiled into the following tokens.

Source

MessageBox "Hello World";

Tokens

ID: Identifier, Mnemonic: MessageBox
ID: String, Mnemonic: "Hello World"
ID: Semi-Colon, Mnemonic: ;

There are many different ways of lexing scripts the easiest and the one we will be using is a simple top-down approach that will read in each character in the script and process it according to what it is ( coincidently the same method LUA uses ) ( as shown in Fiqure A. ).

So lets start shall we? First place the following code at the top of your source file.

using System;
using System.IO;
using System.Threading;
using System.Diagnostics;
using System.Collections;

namespace SimpleScript
{

}

This basically just imports the parts of the .NET framework we are going to be using throughout this code, and creates

a namespace we can place all our symbols in so they don't clash with those in other source files.

Now that we have the framework setup its time to create an entry point for our compiler so create a new class named

Program and create a static Main(string[] args) method inside it and paste the following code into it.

// Make sure FASM is in this folder, if its not we can't compile.
if (File.Exists("FASM.exe") == false)
Console.WriteLine("Unable to find required file ("FASM.exe"), please make sure it is in the same folder as this executable.");
else
{
// Make sure we have all the correct command line arguments, if not
// then print out an error to the console.
if (args.Length == 3)
{
	// Create a new instance of our simple script compile and 
	// tell it to compile the script that was passed as 
	// a command line argument.
	SimpleScriptCompiler compiler = new SimpleScriptCompiler();
	compiler.Compile(args[0], args[1], args[2]);

	// If there was an error write it out to the console.
	if (compiler.Error != null)
		Console.WriteLine("Error: "" + compiler.Error + """);
	else
		Console.WriteLine("Compilation successfull");

}
else
{
	Console.WriteLine("Invalid command line arguments.");
	Console.WriteLine("Usage: <source> <assembly> <executable>");
}
}

// Write out the cliched 'Press any key' message and wait for a key press.
Console.WriteLine("Press any key to continue");
Console.ReadKey();

This method if you haven't already worked out is the entry point of our function, all it does is check that we have

passed the correct command line arguments to it ( The syntax is '<source> <assembly> <executable>' ), and creates

a new instance of our compiler class ( SimpleScriptCompiler ) and tells it to compile our source code.

Now that we have the base of the application sorted out its time to start writing the intresting stuff, thats right

the compiler.

So first things first create a new class named SimpleScriptCompiler and create a new method in it called Compile(string sourceFile, string assemblyFile, string executableFile)

and place the following code into it.

// Read in the source file and lexically analyse it.
_sourceCode = File.ReadAllText(sourceFile);
LexSource();

// If an error has occured then abort immediatly.
if (Error != null) return;

// Go through each token and print it out.
foreach (Token token in _tokenList)
Console.WriteLine(token.ID.ToString() + ": " + token.Mnemonic);

and add the following variable's into the class

public string Error = null;        // Used to store the description of the last message that occured.
private string _sourceCode = null;   // Contains the original source code.
private ArrayList _tokenList = new ArrayList(); // Stores every token that was lexed fromt the source code.

This code basically reads in the contents of the source code file into the variable named _sourceCode and then calls the LexSource() ( Added next ) method which lexically analyses the source code into a list of tokens. And stores them in the _tokenList variable. It then prints out the id and mnemonic of each token that has been lexed from the source file, this won't be in the finished compiler its just so you can see that it works.

The Error variable is just used to store a description of the last error that occurs, its made public so the Program class can access it ( you may wish to wrap it up in a property ).

Now comes the intresting part, the lexical analysis. First create a method inside the SimpleScriptCompiler class called LexSource(), and add the following code into the top of it.

int characterIndex = 0; // The index in the string of the current character being processed.
char character;    // The current character being processed.

while (characterIndex < _sourceCode.Length)
{

// Grab the character at the current character index.
character = _sourceCode[characterIndex];

// Have a look at the character and process it if its value.
switch (character)
{

}

}

Basically what this code is doing looping until the characterIndex is above the length of the _sourceCode string, or in english until we have lexed each character in the source code.

character = _sourceCode[characterIndex];

This code grabs the character in the source code at the position of characterIndex, the character that is grabbed is the one that is going to be lexed this loop, and as such is stored in the character variable.

switch (character)
{

}

This code is basically where our who lexing code goes. We check what the current character is and lex it depending on what it is.

Now the first type of character we are going to lex in our lexer is whitespace, being what it is its very simple to lex, all we have to do is more or less ignore it. So place the following code inside the switch block.

// If its whitespace (new lines, spaces ... etc) then just ignore it.
case ' ':
case 'n':
case 'r':
case 't':

// Increment the character index so we don't keep parsing this one.
characterIndex++; 

break;

All this does is move the characterIndex onto the next character, effectivley ignoring it.

Wow, how does that feel? You've just lexed your first piece of source code. If your anything like me your probably thinking that this is all well and good but it dosen't do anything usefull unless you have a large source file full of nothing! Well be happy for we are going to be adding in some more advanced lexing next.

So lets start by writing the Token class which is going to be used to store details on each token we lex from the source code.

public class Token
{
public TokenID ID;
public string Mnemonic;
public Token(TokenID id, string mnemonic)
{
	ID = id;
	Mnemonic = mnemonic;
}
}

This class is more or less self explanatory, it dosen't do anything but store 2 variables, one storing the id of the token the other storing the mnemonic of it. You will also need to place the following enumeration into your source code to.

public enum TokenID
{
Unknown,
SemiColon,
Identifier,
String,
Integer
}

This is just a list of possible token ID's that can be found inside a source file, it may not seem like many but its enough to create a simple compiler, feel free to add more in if you feel you need to.

Now that we have the basic lexing framework setup lets check everythings ok by adding in functionality for the most basic token type, a single character, in this case a semi-colon.

Add the following code into your LexSource switch block.

// Check if its a semi-colon if it is create a 
// token describing it.
case ';':

// Create a token describing this character and add it
// into the token list.
_tokenList.Add(new Token(TokenID.SemiColon, ";"));

// Increment the character index so we don't keep parsing this one.
characterIndex++; 

break;

Basically what this code does is if the character being lexed is a semi-colon ( it adds a new instance of the semi-colon token's class into the token list and moves the characterIndex onto the next character in the source code.

Got that? Right lets move on to string parsing, this is more advanced, but still relativly simple to understand.

Add the following code into your LexSource switch block.

// Check if its a quotation mark, if it is then its the start
// of a string of character. So read it all in and create a 
// token describing it.
case '"':
string stringMnemonic = ""; // Describes the data that this string contains.

// Increment the character index so we don't attempt to process the 
// starting quotation mark.
characterIndex++; 

// Repeat until we find the ending quotation mark or the end of the file.
while (true)
{
	// Check if we are at the end the source file, if so
	// then this string is open-ended and we need to give an error.
	if (characterIndex >= _sourceCode.Length)
	{
		Error = "Unfinished string literal";
		return;
	}

	// Grab the character at the current character index.
	character = _sourceCode[characterIndex++];

	// Check if its an ending quotation mark.
	if (character == '"')
		break;

	// Append this character to the mnemonic.
	stringMnemonic += character;
}

// Increment the character index so we don't keep parsing this one.
characterIndex++; 

// Create a token describing this string and add it
// into the token list.
_tokenList.Add(new Token(TokenID.String, stringMnemonic));

break;

This code is fairly simple, when the current character is a quotation mark moves the characterIndex onto the character after it, then loops until it finds either an end of file, where it throws an unfinished string error, or a end quotation mark where it breaks out of the loop. On each illiteration it appends the current character onto the stringMnemonic variable so that at the end of the loop we have the full mnemonic of the string we are parsing ( As shown in fiquire :P. After its lexed the mnemonic it creates a new instance of the Token class with a string ID and a mnemonic of the string's value.

The integer and identifier source code work in almost the exact same way except they look for an character that is not an integer or indentifier rather than an ending quotation mark.

So to add integer and identifier functionality into your lexer add the following code into your switch block.

// Its not a single character so check if its a multi-character token.
default:

// Check if the character is between 0 or 9 if it is its a integer
// so read in the rest of it and create a token describing it.
if (character >= '0' && character <= '9')
{
	string numberMnemonic = ""; // Describes the data that this number contains.

	// Repeat until we find a character thats not numeric or the end of the file.
	while (true)
	{
		// If we are at the end of the source code then we can exit the loop
		// under the assumption we have read it all in.
		if (characterIndex >= _sourceCode.Length)
			break;

		// Grab the character at the current character index.
		character = _sourceCode[characterIndex++];

		// Check if its not a numeric character.
		if (character < '0' || character > '9')
		{
			// Go back to the last character so we parse this non-numeric character
			// on the next pass.
			characterIndex--;
			character = _sourceCode[characterIndex];

			break;
		}

		// Append this character to the mnemonic.
		numberMnemonic += character;
	}

	// Create a token describing this string and add it
	// into the token list.
	_tokenList.Add(new Token(TokenID.Integer, numberMnemonic));
}

// Check if the character is between a and z or if its an under score.
else if ((character >= 'A' && character <= 'Z') ||
		 (character >= 'a' && character <= 'z') ||
		 character == '_')
{
	string identMnemonic = ""; // Describes the data that this identifier contains.

	// Repeat until we find a character thats not alphanumeric or the end of the file.
	while (true)
	{
		// If we are at the end of the source code then we can exit the loop
		// under the assumption we have read it all in.
		if (characterIndex >= _sourceCode.Length)
			break;

		// Grab the character at the current character index.
		character = _sourceCode[characterIndex++];

		// Check if its not a alphanumeric character.
		if ((character < 'A' || character > 'Z') &&
			(character < 'a' || character > 'z') &&
			 character != '_' && 
			 (character < '0' || character > '9'))
		{
			// Go back to the last character so we parse this non-alphanumeric character
			// on the next pass.
			characterIndex--;
			character = _sourceCode[characterIndex];		

			break;
		}

		// Append this character to the mnemonic.
		identMnemonic += character;
	}

	// Create a token describing this string and add it
	// into the token list.
	_tokenList.Add(new Token(TokenID.Identifier, identMnemonic));
}

// Its not a character that we know of so throw up an invalid character error.
else
{
	Error = "Invalid character ""+character+""";
	return;
}

break;

Tadah! You've just finished writing your own source file lexer, its now capable of lexically analysing source code like the following, try it out .

MessageBox "Hello World";
End;

Hope you liked this tutorial make sure to read the next one where we are actually going to parse the tokens and compile them into executable files!

Source Code

C#: Download

FAQ

Q. Ahh, every time I try to compile a script it tells me it can't find FASM.exe?

A. Make sure the latest copy of FASM.exe (and its INCLUDE folder) is in your output folder (the folder where the exe gets emitted to) of your project. If you are using VS Express the output folders are contained in the bin folder, you will need to copy it into all of them or change the output folder in the project settings.

~ Helios

http://www.binaryphoenix.com

Minalien · September 18, 2006

Not bad, Helios, I'll check this out after I get home from work today.

Helios · September 18, 2006

Thanks, the next tutorials will get more intresting as thats where I implement parsing and executable generation :D.

January 1, 2007

nice.

Sign In

Building A Programming Language Compiler - Part 1 : Lexical Analysis

Recommended Posts

Helios

Link to comment

Share on other sites

Minalien

Link to comment

Share on other sites

Helios

Link to comment

Share on other sites

Guest kaze

Link to comment

Share on other sites

Join the conversation

Browse

Activity