Writing a Command-Line Parser, Part 1

As with other compilers, causerie will need to accept a variety of command-line arguments from the user, or from make on behalf of the user. This necessity provides us with an excellent opportunity to test and fine-tune the base libraries before we tackle the more complex task of parsing a full-fledged computer language.

Why use command-line arguments at all? Unless you have written a very simple program, it is likely that you will want some way to adjust the behavior of your code based on some event or circumstance that you cannot know at the time the program is compiled. You could try to enumerate such circumstances and compile a separate iteration of your code for each one, but that entails a high degree of largely wasted effort, since it is simpler to allow the user of your code to indicate which behaviors are desired. Command-line arguments provide a way to do this, as do configuration files; but command-line arguments take up less space and are more portable, since they do not require an additional file besides the executable itself. Since simplicity is one of causerie's goals, we will use command-line arguments instead of a separate configuration file.
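In Free Pascal, the raw arguments are already available to any program through the standard ParamCount and ParamStr routines, and a parser like ours ultimately draws its input from this same raw material. As a point of reference, here is a minimal sketch that simply echoes whatever it is given:

  program ShowArgs;

  { A minimal sketch: echo each raw command-line argument.  ParamCount
    and ParamStr are standard Free Pascal routines; ParamStr(0) holds
    the path of the executable itself, so arguments proper begin at 1. }
  var
    i: integer;

  begin
    for i := 1 to ParamCount do
      writeln('Argument ', i, ': ', ParamStr(i));
  end.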

Causerie's parsing design is flexible enough that it can be adapted to other uses, such as parsing command-line arguments. Thus, while there are libraries already available for this purpose, such as the getopts unit that comes as part of Free Pascal's runtime library, applying the libraries we already have to a simple case like parsing command-line arguments will allow us to find bugs and to ensure that the parser can later handle a more complex task.

Parser Basics

While this series of articles is devoted to explaining the use of causerie's parsing routines, rather than to explaining how to create a parser in general, examining the design decisions and behaviors of causerie's parsing routines will provide a glimpse of how to write a certain kind of parser. All parsers share some things in common: they are all, for instance, designed to translate human-readable text into something a computer can understand and upon which it can act. To do this, they process text in many of the same ways that we do. We'll cover a few basic definitions and concepts to ensure that we're all on the same page.

Tokens and Opcodes

When we read a sentence such as this one:
In case of fire or flames, use the stairs -- not the elevator.

we typically don't notice the space between words, taking it for granted -- but the space is important, since it lets us know where one word ends and the next begins. Likewise, the period (.) at the end of the sentence makes it clear that we have, in fact, reached the end of the sentence, while the capitalization of "In" lets us know that we are at the beginning of it. We usually don't think much of any of these things -- unless they're absent -- but on an unconscious level, our minds must do the same thing a parser does: separate a series of words, whitespace, and punctuation so that the important ones can be identified and the meaning of the sentence understood.

In the context of a parser, the "words", the "whitespace", and the "punctuation" are all tokens: they all represent one or more values that, if given in a particular order or placed in a particular configuration, might have meaning. But tokens are more than just words or punctuation. If we wanted to assign codes to certain English words, for example:
In  case of fire or flames, use the stairs -- not the elevator.
384 93   28 8    19 215     9   42  52     4  11  42  803
then we would have a more concise way of representing the same sentence:
384,93,28,8,19,215,9,42,52,4,11,42,803

Anyone with access to the codes would be able to read the sentence and extract its meaning. While humans might have an easier time with (and a greater appreciation of) the full-text version of the sentence, computers process numbers far more quickly than text. A token represents a way of translating a human-readable piece of a program -- a keyword, operator, identifier, delimiter, etc. -- into a numeric code that your program can then use. Thus, a token has two parts: a text representation and an opcode. The opcode is a numeric value that is used by the program or computer in place of the text representation, and so greatly improves the overall performance of the parser. Sometimes opcodes directly represent a machine instruction (as when an assembler converts its tokens into the values recognized by a specific processor), and sometimes they represent an intermediate form of machine instruction (as when Python's parser converts the text of a Python source file into bytecode).
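To make the two-part nature of a token concrete, here is one way it might be modeled in Pascal. The record and field names below are invented for illustration; they are not the declarations used by causerie's parsing library:

  type
    { A hypothetical token record -- illustrative only, and not the
      actual declaration used by causerie's parsing library. }
    TExampleToken = record
      text: string;     { the human-readable representation, e.g. 'while' }
      opcode: longword; { the numeric code used in place of the text }
    end;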

Token types

Our example sentence is made up of words, of course; but these words can be classified according to their usage. English grammar assigns categories to words, such as noun, verb, preposition, and so forth. The category to which a word belongs determines where it may be placed in a sentence relative to other parts of that sentence.

What is true for grammar also applies to the syntax of a computer language: tokens are classified based on the idea they represent, and this classification determines where they may be placed in relation to other tokens. Most parsers recognize whitespace; many also recognize the end of a line of text. Parsers for computer languages such as causerie usually also recognize the following (a sketch of these categories in code appears after the list):

  • Numbers;
  • Special characters that are not letters or numbers; and
  • Identifiers, which may identify a symbol (not a symbolic character; see below) or a keyword. Keywords are words that are built into the language itself and usually identify the type of instruction you are giving to the computer, such as while (to start a loop) and if (to begin the test of one value against another).
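Expressed as code, such a classification might look like the enumeration below. Again, the names are hypothetical, not those defined by causerie's library:

  type
    { A hypothetical classification of tokens -- causerie's library
      defines its own types; this enumeration is only illustrative. }
    TExampleTokenCategory = (
      tcWhitespace, { spaces and tabs }
      tcLineEnding, { the end of a line of text }
      tcNumber,     { numeric literals }
      tcSpecial,    { characters that are not letters or numbers }
      tcIdentifier, { names that identify a symbol }
      tcKeyword     { words built into the language, e.g. 'while', 'if' }
    );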

Scanners

The parser's job is to act on a given token, but these tokens must come from somewhere. Typically, they are provided by a scanner (sometimes known as a tokenizer), which is responsible for proceeding in a linear fashion through the source, from start to finish, and returning each token that it finds. As the parser finishes with each token, it asks the scanner for the next one, until the scanner indicates that there are no more tokens left to process.
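To illustrate the idea, the short program below implements a toy scanner that does nothing but find token boundaries in a string, returning one token per call until the source is exhausted. It is a sketch only; causerie's scanner classes are considerably more capable:

  program ScannerSketch;

  {$mode objfpc}{$H+}

  { A toy scanner: it proceeds linearly through a source string and
    returns each whitespace-delimited token until none remain.  A real
    scanner also classifies each token; this sketch only finds their
    boundaries. }
  function NextToken(const source: string; var cursor: integer): string;
  begin
    result := '';
    { skip any whitespace before the token }
    while (cursor <= length(source)) and (source[cursor] = ' ') do
      inc(cursor);
    { collect characters until the next space or the end of the source }
    while (cursor <= length(source)) and (source[cursor] <> ' ') do
    begin
      result := result + source[cursor];
      inc(cursor);
    end;
  end;

  var
    src: string;
    position: integer;
    token: string;

  begin
    src := 'In case of fire or flames, use the stairs';
    position := 1;
    repeat
      token := NextToken(src, position);
      if token <> '' then
        writeln(token); { a parser would act on each token here }
    until token = '';
  end.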

Syntax

It is not enough to have tokens, however, just as it is not enough to have words. In order to have meaning, words must be placed in a specific order:
In the case of stairs, use flames or fire -- not the elevator.

The above sentence is nonsense: it has no logical meaning even though the first and last words, and the punctuation, are in the appropriate locations. To convey meaning, the words of our sentence must be assembled in a specific order. For human languages, this is called a "grammar"; for computers, it is known as a syntax: the rules that specify how, when, and in what order tokens must be given in order to mean something. Every language devised for computers has its own rules about what tokens, and in what order, have meaning. Because programs represent a series of instructions to the computer, it is the parser's job to enforce the syntax so that, when the time comes to act on those instructions, madness (such as using flames or fire in the event of stairs) does not ensue.
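In a parser, such rules are usually written down formally before they are enforced. As a purely hypothetical example (not Linear C's actual definition), a rule for a command-line option of the form --name=value might be recorded in an EBNF-like notation, shown here inside a Pascal comment:

  { A syntax rule for a hypothetical command-line option, written in an
    EBNF-like notation (this rule is an invented example, not Linear C's
    actual definition):

      option     = "--" identifier [ "=" value ] ;
      identifier = letter { letter | digit } ;

    A parser enforcing this rule would reject "--" followed by anything
    other than an identifier as a syntax error. }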

Statements

The statement:
In case of fire or flames, use the stairs -- not the elevator.

represents a complete sentence: it conveys a single, logical instruction about what a person should do if confronted with fire and the choice between taking the stairs or an elevator in order to escape it. Likewise, when tokens are organized into a logical and complete instruction to the computer, this organization is known as a statement: a single instruction on which the computer can act.

Expressions

The example sentence we've been using, while conveying a course of action, also expresses three distinct ideas:

  1. Is there a fire?
  2. If so, use the stairs.
  3. Do not use the elevator.

The first idea actually requires you to make an evaluation, since it is assumed that you will not always be surrounded by fire or flames. The second idea will be applied based on the outcome of your evaluation: if, in fact, flames surround you at the present time, then you must act by taking the stairs -- not the elevator. Even as tokens can have values assigned to them, the groups of tokens that make up a statement can have their own, independent values. "In case of fire", which is part of a larger statement, will either be true or false, and we may take different actions depending upon the value that we assign to it.

It is possible to construct a sequence of tokens that serve the same purpose: they require the computer to make an evaluation and then apply the resulting value to the remaining part of the statement. A statement is made up of parts which may have values associated with them, either directly (a value is explicitly assigned) or indirectly (a value is calculated, as it is when we determine whether or not there is currently a fire) -- these parts are known as expressions.
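Pascal itself illustrates the distinction. In the sketch below, the condition is an expression that is evaluated to produce a value, while the if..else construct built around that value is a single statement:

  program ExpressionSketch;

  {$mode objfpc}{$H+}

  { The condition 'fireDetected' is an expression: it is evaluated to
    produce a value (true or false).  The entire if..else construct,
    which acts on that value, is a single statement. }
  var
    fireDetected: boolean;

  begin
    fireDetected := true; { in reality, the outcome of an evaluation }
    if fireDetected then
      writeln('Use the stairs -- not the elevator.')
    else
      writeln('Carry on.');
  end.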

Symbols

In case of fire or flames, use the stairs -- not the elevator.

Is there a fire now? Did we check before? Should we keep checking, or should we act on the outcome of our previous check -- if, in fact, we did check previously? Without being able to remember whether or not we checked for a fire in the previous instant, we could get stuck in a cycle of indecision even as the flames rise all around us. Clearly, we must have some way of remembering the outcome of our evaluation.

Computer programs typically have access to portions of computer memory, such as the stack or the heap, where values can be stored for greater or lesser durations. However, if the outcome of our evaluation only occupies four to eight bytes, and we have billions of real (or virtual) memory locations in which to search, then finding this value again when we need it could prove next to impossible -- unless, of course, we have a way of marking this location somehow.

On the programming side of things, the marks or labels given to a memory location are known as variables or constants; on the parser's side, they are known as symbols. The difference is that, on the programming side, the variable or constant represents what is a position in memory, whereas a symbol represents what will be a position in memory. Variables and constants are simply aliases for a memory location; symbols usually store extra information that is of use to the parser, such as the type of data that is expected at that location and an initial value to place there when the program starts.
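A symbol, then, might be modeled as a record that carries this extra information alongside the name. The following declaration is invented for illustration and is not causerie's actual symbol type:

  type
    { A hypothetical symbol as a parser might record it: not merely an
      alias for a memory location, but extra information of use to the
      parser.  Illustrative only; causerie's library defines its own. }
    TExampleSymbol = record
      name: string;         { the label as written in the source }
      dataType: string;     { the type of data expected at the location }
      initialValue: string; { the value to place there at program start }
    end;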

Symbol Tables

Symbols typically must be unique, in order to avoid confusing one memory location with another; to make certain of this, they are usually gathered into a structure that is quick and easy for the parser to search -- a symbol table. Depending on the parser implementation, there may be more than one symbol table for a given program; in which case, a symbol usually only needs to be unique to the table in which it resides.
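As a rough illustration, a sorted string list from Free Pascal's Classes unit can serve as a toy symbol table: the sorted list is quick to search, and it can be told to reject duplicate names. Causerie's library defines its own symbol-table classes; this sketch merely demonstrates the idea:

  program SymbolTableSketch;

  {$mode objfpc}{$H+}

  uses
    Classes;

  { A toy symbol table built on a sorted TStringList: sorting keeps
    lookups quick, and dupError enforces uniqueness within the table. }
  var
    symbols: TStringList;

  begin
    symbols := TStringList.Create;
    try
      symbols.Sorted := true;
      symbols.Duplicates := dupError; { a duplicate name raises an error }
      symbols.Add('stairs');
      symbols.Add('elevator');
      { the sorted list searches quickly (by binary search) }
      if symbols.IndexOf('stairs') >= 0 then
        writeln('symbol found: stairs');
    finally
      symbols.Free;
    end;
  end.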

Linear C: Parser Design from the Bottom Up

The above elements must all be in place before a parser can do its work. Fortunately, the parsing library does much of the heavy lifting: it defines classes and types to represent opcodes, tokens, language syntax, and even entire language definitions; it also defines scanners and various types of parsers. We need only customize the provided classes to handle our specific use case.

Tokens are fairly straightforward, since the kinds of tokens processed by computer-language parsers are fairly uniform; however, we will need to define opcodes for the keywords, operators, and special symbols that we want our parser to recognize. To do this, we need a syntax for the parser to enforce, in order to ensure that we get coherent arguments on the command line. In tandem with this, we will define the types of expressions and statements which the parser can handle. With the syntax and the opcodes in place, we can finally write the parser itself.
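Defining opcodes typically amounts to assigning a unique numeric constant to each keyword, operator, and special symbol. The constants below are invented for illustration; they are not Linear C's actual opcode list:

  const
    { Hypothetical opcodes for a command-line "language": each token
      string the parser recognizes would map to one of these numeric
      codes.  These names and values are invented for illustration. }
    OP_NONE        = 0; { no token, or the end of the input }
    OP_OPTION_FLAG = 1; { the '--' that introduces an option }
    OP_ASSIGNMENT  = 2; { the '=' between an option name and its value }
    OP_IDENTIFIER  = 3; { an option name }
    OP_VALUE       = 4; { an option value }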

Thus, the order in which we construct the parser for our command-line "language" -- known as "Linear C" -- will be as follows:

  1. Define the language:
    1. Define the syntax of the language:
      • The types of options that we want the parser to handle.
      • The rules that govern how and where tokens should be placed.
      • The token strings that will represent the keywords, operators, and special symbols that we want the parser to recognize.
      • The types of expressions that we want the parser to handle.
      • The types of statements that we want the parser to handle.
  2. Implement the language:
    1. Construct and implement the classes that represent the option types we want the parser to handle, and a class to manage these options.
    2. Construct and implement the classes that represent the expression types handled by the parser.
    3. Construct and implement the classes that represent the statement types handled by the parser.
    4. Construct and implement the parser itself.

Tackling the smaller components and working our way up actually simplifies our parser design considerably: it takes less than 200 lines of code to implement the Linear C parser from start to finish (and many of those lines are comments for the purposes of documentation), while only 20 lines of code are needed for the main parsing loop itself. We'll handle the first component -- the types of options we want to handle -- next.

Next: Writing a Command-Line Parser, Part 2