Writing a simple generic tokenizer

Table of Contents


Download C# source code (89 kb)

1. Introduction

In the life of a programmer, there is quite often need to parse any kind of text files, e.g. configuration files, script files or text-based data files.
The most common use cases (I came across) are to search and then to extract or replace specific parts of the files, read in configuration files, validate the structure of an input file or to convert the content of the file to another format.

If there is no native support to process the input file format (like it's the case for well-known formats like XML, JSON, ... ) but it's some proprietary or custom text file format, then, independent of the used programming language, there are basically two completely different approaches:

This article is about Option 2. It shows my own approach to write a simple (abstract) tokenizer in C# that provides useful base functionality and can easily be extended to implement a tokenizer for the specific use case.
This can be used as base to implement an own parser.


[Back to top]

2. What is a tokenizer?

A tokenizer gets a stream of characters as input and outputs a sequence of tokens.
A token is a string with additional meta data, e.g. the type of string or specific meaning like an keyword.

Example of a tokenizer process

Let's say the input stream is hello "world" 42. A tokenizer would process each single character of the stream in the order of occurrence. In this example it would be: 'h', 'e', 'l', 'l', 'o', ' ', '"', 'w', 'o', 'r', 'l', 'd', '"', '4', '2', '<EOF>'.
Considering that a space is treated as delimiter, the theoretical output of a tokenizer good be following sequence of tokens:
Token = hello, kind of token = string
Token = "world", kind of token = quoted string
Token = 42, kind of token = integer number, special value = 42 (as integer, not as string)

So the tokenizer provides a sequence of meaningful strings attached with additional data which are called tokens. The output of a tokenizer is normally input to a some kind of parser which knows when to expect what kind of token.
As side information, the first step of every compiler or interpreter is to tokenize the input source.

Unfortunately, there are often also other terms used like scanner or lexer beside a tokenizer which actually all mean nearly the same and it's hard to get the difference. From my understanding, the distinction is as follows:
A scanner proceeds the input character stream and outputs only low-level tokens. These are only string token, representing which characters if the input stream belong together to form a token, but without additional meaning.
A lexer also produces tokens like a scanner, but attaches additional meaning to the tokens as shown above.
Often, a tokenizer is said to be equal to a scanner, but sometimes also equal to a lexer.

In the following, a tokenizer and a lexer are treated the same. See appendix A (TODO) for an example scanner.


[Back to top]

3. Implementation approach

This chapter shows and explains my implementation approach.
In a nutshell, there is an Inputsource class which processes the input stream and provides each single character one after another.
The AbstractTokenizer class uses the Inputsource to forward the tokens to the concrete tokenizer implementation by the user. It is supported by default token implementations and some abstract helper classes.


[Back to top]

3.1 Input source class

The Inputsource implements the access to the single characters of the input stream. It provides an interface to retrieve the current character, the subsequent character and to move to the next character. However, once a character is "consumed", it is not possible to move back to it. The important public interface functions are the following:

There are two special characters that might be returned:

Internally, the StreamReader class is used to to read in one line of a string from the input which is generally a (text) file.
The current line and also the next line is stored to provide access to the current character and also to the subsequent character at any time.

> > Show source code of InputSource


[Back to top]

3.2 AbstractTokenizer class

The AbstractTokenizer does actually nothing. It is just the base class to implement the concrete Tokenizer. Especially, the abstract method ProcessToken needs to be implemented that produces one token after another using the InputSource class. It contains only an utility function to skip white spaces as this is a common used functionality, but it could have been located also in some utility class.

> > Show source code of AbstractTokenizer


[Back to top]

3.3 Token class

The Tokenizer produces tokens which are extending the base Token class. There is one concrete token class for each type of token. The main implementation part is located inside the token classes. A token contains:

Not only needs the token class to extract and store the current token, it must also decide if the current and subsequent character refer to the start of this token.
E.g. if the first character of a new token is a digit, this token will be treated as some kind of number token. This logic might either reside in the token class or in the tokenizer class.

The library provides default token implementations for numbers, strings, C & C++ comments, single quoted & double quoted strings as well as EOL and EOF tokens. They can be used as reference to implement own tokens.

> > Show source code of Token class


[Back to top]

4. How to use the GenericTokenizer library: Writing a concrete tokenizer

This chapter explains the usage of the generic tokenizer library by implementing an example tokenizer step-by-step.
The example tokenizer processes an INI configuration file (see also [2]) and supports following tokens:

Note that EOL tokens are skipped and not explicitly reported.

Example tokenization

Consider the following INI file:

; Header comment
[Section1]
MyKey1 = false

This will result in following tokens:
Type = Comment, String = "; Header comment", Value = " Header comment"
Type = Square bracket left, Value = String = "["
Type = NormalString, Value = String = "Section1"
Type = Square bracket right, Value = String = "]"
Type = NormalString, Value = String = "MyKey1"
Type = Equal sign, Value = String = "="
Type = NormalString, Value = String = "false"
Type = EOF

[Back to top]

4.1 Defining the token types

The recommended way is to define own token types by extending the AbstractTokenType and implementing an own enum type for the token types.
Following listing shows a possible implementation:

class IniFileTokenType : AbstractTokenType
{
    public enum ETokenType
    {
        EOF,
        COMMENT,
        SQUARE_BRACKET_LEFT,
        SQUARE_BRACKET_RIGHT,
        EQUAL_SIGN,
        STRING
    }

    public ETokenType Type { protected set; get; }

    private IniFileTokenType(ETokenType type)
    {
        Type = type;
    }

    public override string ToString()
    {
        return Type.ToString();
    }

    public static readonly IniFileTokenType EOF = new IniFileTokenType(ETokenType.EOF);
    public static readonly IniFileTokenType COMMENT = new IniFileTokenType(ETokenType.COMMENT);
    public static readonly IniFileTokenType SQUARE_BRACKET_LEFT = new IniFileTokenType(ETokenType.SQUARE_BRACKET_LEFT);
    public static readonly IniFileTokenType SQUARE_BRACKET_RIGHT = new IniFileTokenType(ETokenType.SQUARE_BRACKET_RIGHT);
    public static readonly IniFileTokenType EQUAL_SIGN = new IniFileTokenType(ETokenType.EQUAL_SIGN);
    public static readonly IniFileTokenType STRING = new IniFileTokenType(ETokenType.STRING);
}

[Back to top]

4.2 Implementing the token classes

For each token, a new class needs to be implemented extending the base Token class and overriding the extract() method. The extract() method uses the source parameter of type InputSource to "consume" the characters relevant for this token from the input stream. In particular, it sets the token type, token string and token value.
Following snippet shows the implementation of the comment token. If it starts with an ';' character, it is treated as a comment and all subsequent characters until the end of line belong to the comment. The token string contains the ';' character, the token value excludes it.

public class IniFileTokenComment : Token
{
    public IniFileTokenComment(InputSource source) : base(source)
    {
        /* nothing to do */
    }

    protected override void extract()
    {
        StringBuilder currentSymbol = new StringBuilder();

        if (IsCommentStart(source.GetCurrentChar()))
        {
            /* consume the ; */
            currentSymbol.Append(source.ConsumeCurrentChar());

            string curLine = source.CurrentLineStr;

            /* collect the complete line */
            char currentChar = source.GetCurrentChar();
            while (!(currentChar == InputSource.EOL ||
                     currentChar == InputSource.EOF))
            {
                currentSymbol.Append(source.ConsumeCurrentChar());
                currentChar = source.GetCurrentChar();
            }

            /* setup the token */
            TokenStr = currentSymbol.ToString();
            TokenValue = TokenStr.Remove(0, 1);
            TokenType = IniFileTokenType.COMMENT;
        }
    }

    public static bool IsCommentStart(char c)
    {
        return c == ';';
    }
}

The implementation of the special characters '[', ']' and '=' follow the same procedure. It is even simpler because each token is just a single character.

public class IniFileTokenSpecialSingleKey : Token
{
    public static Dictionary<char, AbstractTokenType> specialSingleKeyTokens = new Dictionary<char, AbstractTokenType>()
    {
        ['['] = IniFileTokenType.SQUARE_BRACKET_LEFT,
        [']'] = IniFileTokenType.SQUARE_BRACKET_RIGHT,
        ['='] = IniFileTokenType.EQUAL_SIGN,
    };

    protected override void extract()
    {
        StringBuilder currentSymbol = new StringBuilder();
        char c = source.GetCurrentChar();
        if (IsSpecialSingleKey(c))
        {
            TokenStr = c.ToString();
            TokenValue = TokenStr;
            TokenType = specialSingleKeyTokens[c];

            source.ConsumeCurrentChar();
        }
    }

    public static bool IsSpecialSingleKey(char c)
    {
        return specialSingleKeyTokens.ContainsKey(c);
    }
}

[Back to top]

4.3 Implementing the tokenizer

The actual tokenizer extends the abstract AbstractTokenizer class and implements the most important function ProcessToken(). This function produces the actual tokens. It previews the current character (and subsequent character if necessary) to determine the token type and creates the corresponding token.
The example uses the utility function from the base class to skip all white space characters. The implementation is as follows:

class IniFileTokenizer : AbstractTokenizer
{
    public IniFileTokenizer(InputSource sourcefile) : base(sourcefile)
    {
    }

    public override Token ProcessToken()
    {
        Token token;

        skipWhiteSpaces();

        char currentChar = source.GetCurrentChar();
        char nextChar = source.PeekNextChar();

        if (currentChar == InputSource.EOF)
        {
            token = new TokenEOF(source, IniFileTokenType.EOF);
        }
        else if (currentChar == InputSource.EOL)
        {
            token = new TokenEOL(source, IniFileTokenType.EOL);
        }
        else if (IniFileTokenComment.IsCommentStart(currentChar) )
        {
            token = new IniFileTokenComment(source);
        }
        else if (IniFileTokenSpecialSingleKey.IsSpecialSingleKey(currentChar))
        {
            token = new IniFileTokenSpecialSingleKey(source);
        }
        else
        {
            token = new IniFileTokenString(source);
        }


        return token;
    }
}

[Back to top]

4.3 Using the tokenizer

All parts of the tokenizer is implemented now. So now it can be used to process the tokens once after another.
Following shows an example implementation to process all tokens of the input file until the end-of-file is reached:

string fileNamePath = ... // path to init file
IniFileTokenizer tokenizer = new IniFileTokenizer(new InputSource(fileNamePath));

Token currentToken;

do
{
    currentToken = tokenizer.ProcessToken();
    /* process token here */
}
while (currentToken.TokenType != IniFileTokenType.EOF);

[Back to top]

5. Conclusion & References

The basic generic tokenizer provides me (and hopefully you) a good starting point to write own tokenizers and parsers.
That's it! Hope it was interesting for you and have fun and learned something! Keep coding!
Sunshine, January 2022


References

* [1] Lexical analysis @ wikipedia
* [2] INI file @ wikipedia



History