circusmachina: Writing a Command-Line Parser, Part 2

Previously: Writing a Command-Line Parser, Part 1

In the previous article, we defined the basic elements that are required in order to produce a parser: tokens, opcodes, the general language syntax, statements, expressions, and symbols. Tokens and opcodes are fairly similar across parser implementations, typically differing only in the types of characters allowed to be part of a token and in the data type used to represent the corresponding opcode. The next item that must be addressed, then, is the syntax of LinearC, our command-line argument "language". The syntax will inform the opcodes used and the types of statement and expression handlers we must define. It will also help the tokenizer to correctly classify the tokens read from the command line.

Why and How

The syntax of a language defines the way in which the language expects to have its tokens strung together. The way in which the language will be used informs some of that syntax, but the systems on which the language will run, or for which it will generate programs, also play a role.

For command-line arguments, the syntax is based on the way the operating system and the runtime library represent arguments to a program. When terminal-based operating systems were widely used, the operating system required that any additional parameters meant to be passed to the program were specified on the same command line as the one which named the program to run. This tradition continues even into the present, when graphical operating systems have largely replaced (or covered) their terminal-based counterparts. One limit therefore placed on command-line parameters is that they cannot be broken into multiple lines by the user entering them -- they cannot be separated by line breaks. As a result, one or more spaces are typically used to separate one argument from the next, and most runtime libraries use this fact to determine how the arguments are presented to a running program. The argv vector passed to the main() function of a C program, and the ParamStr() function available to Pascal programs, both exemplify the way in which a runtime library typically presents command-line arguments to a program: as a series of strings.

Simple strings may be enough for simple programs, or for cases where the program only requires a single argument like a path or file name. It is easy enough to write a loop that checks each argument that is returned by ParamStr() or that is contained within argv against a hard-coded list of accepted values. However, when more complex behavior is desired, simple strings are no longer enough. For instance, a program that can accept multiple arguments needs some way to keep them organized, particularly for the user who must enter them. One could simply demand that each argument be entered in a specific order, but what if one option only needs to be specified if another is first specified? Naming the options resolves these headaches, but then there must be a way to delimit the names from values -- and not all options are named. How, then, does one separate a name from a value and a named option from one that has no name when all one has to work with is a series of strings that can only be separated by whitespace?
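To make the "simple loop" approach concrete, here is a minimal sketch in Python, whose sys.argv plays the same role as argv or ParamStr(). The ACCEPTED set and the check_arguments() helper are invented for illustration; they are not part of LinearC.

```python
import sys

# Hypothetical hard-coded list of accepted values for an imaginary program.
ACCEPTED = {"verbose", "quiet", "help"}

def check_arguments(args):
    """Split raw argument strings into recognized and unrecognized ones."""
    known, unknown = [], []
    for arg in args:
        if arg in ACCEPTED:
            known.append(arg)
        else:
            unknown.append(arg)
    return known, unknown

if __name__ == "__main__":
    # sys.argv[0] is the program name, so the real arguments start at index 1.
    known, unknown = check_arguments(sys.argv[1:])
    print("recognized:", known)
    print("unrecognized:", unknown)
```

This works only as long as every argument is a bare word. It has no notion of names, values, or ordering, which is exactly the limitation described above.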

As with any good programming question, this one has several answers. Option names can be given a special character prefix that unnamed options will not have; option names can be separated from their values by another special character; certain options can be abbreviated for easier entry; and so forth. Over time, certain standards have grown up around these answers; the two linked here are the two used to define the syntax for LinearC.

Defining the Syntax

Token Types

To parse text into various tokens, you must first determine the text that will fit each of the various token types. The parsing library used by LinearC breaks tokens into several types: tokens that represent identifiers, tokens that represent constant numeric values and string literals, and tokens that represent special characters such as delimiters and operators. These were touched on briefly in the previous article, but now we have to decide what they mean with regard to LinearC. Numeric constants and string literals are relatively straightforward, but what tokens should be considered keywords, operators, or delimiters? What character combinations, out of all combinations possible, can represent each token type?

Delimiters

The two standards referenced earlier define two kinds of options:

Short options, which are preceded by a single hyphen (-), and
Long options, which are prefixed with two hyphens (--).

Both usages of hyphenation have a special case:

When a single hyphen is encountered which is surrounded by whitespace, it indicates that input or output should occur using the standard input or output streams provided by the runtime or operating system (typically represented as stdin or stdout).
When a double-hyphen is encountered which is surrounded by whitespace, it indicates that any subsequent arguments should be treated as simple strings, even if they begin with hyphens.
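The prefix rules and their two special cases can be sketched as a small classifier. This is an illustration in Python rather than LinearC's actual code, and the category names it returns are made up. It assumes the runtime has already split the command line on whitespace, so a hyphen "surrounded by whitespace" arrives as an argument that is exactly "-" or "--".

```python
def classify(arg):
    """Classify one raw argument string by its hyphen prefix."""
    if arg == "-":
        return "stdio"            # lone hyphen: use stdin or stdout
    if arg == "--":
        return "end-of-options"   # lone double hyphen: the rest are plain strings
    if arg.startswith("--"):
        return "long-option"
    if arg.startswith("-"):
        return "short-option"
    return "plain"

def classify_all(args):
    """Classify a whole argument list, honoring the '--' terminator."""
    result = []
    literal = False  # becomes True once a lone '--' has been seen
    for arg in args:
        if literal:
            result.append((arg, "plain"))
            continue
        kind = classify(arg)
        if kind == "end-of-options":
            literal = True
        result.append((arg, kind))
    return result
```

For example, classify_all(["-v", "--", "--name"]) reports "--name" as a plain string because it follows the lone double hyphen.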

Thus, the hyphen and the double-hyphen become our first two special token types, or tokens that have meaning to the parser. Strictly speaking, these are delimiters: they indicate to the parser that special processing needs to occur. In the case of LinearC, coming across either of these means the parser can typically expect the name of an option to follow.

To ensure that the parser knows about these two delimiters, we define them as constant strings and marry them with their opcodes in the language specification that we will cover in more detail at the end of the article.

Identifiers

The names of the options themselves are identifiers: they identify a symbol which has a value and to which a value can potentially be assigned. Remember from the previous article that a symbol is simply a way of finding a value in memory.

The standards define two kinds of options -- long and short -- so there are at least two kinds of identifiers. As with other languages, however, LinearC also uses identifiers to represent pre-defined keywords; in this case, keywords that are used to represent boolean values. These allow the user to specify yes or no, on or off in addition to the typical true and false. This gives the user some flexibility when constructing command-line arguments, as they can use the value that makes the most sense; for instance, for an option named doOverwriteHardDrive they can specify an emphatic no instead of the less-forceful false.

The characters that are allowed to make up an identifier are limited by the underlying shell or operating system, which reserves certain characters to itself. For instance, bash assigns special meaning to symbolic characters such as the pipe (|) and angled brackets (<>). When such characters are encountered on the command line, they are acted upon by the underlying system instead of being passed directly to the program; as a result, these characters cannot usually be used to form identifiers. Traditionally, therefore, an identifier is limited to letters and numbers, with certain other symbolic characters allowed if they are not reserved or prohibited by the underlying operating system. These constraints are typical of most compiled and parsed languages, too.

As a result, we will require identifiers to begin with a letter, either upper- or lower-case, and then be followed by zero or more letters or numbers, so that single-letter short option names remain valid. Certain other characters, such as hyphens and underscores (_), are also allowed to be a part of the identifier. Most of these characters are already defined by the parsing library, but we add a few additional characters as part of a string constant which will be assigned to the language specification that we pass to the parser. The parser uses this specification to determine whether an identifier is valid.
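As a rough illustration of the identifier rule, the check can be written as a regular expression. This is a Python sketch, not the parsing library's mechanism (which works from a string of allowed characters). It assumes that a single letter on its own is acceptable, since short options are single letters.

```python
import re

# A letter first, then any mix of letters, digits, hyphens, and underscores.
# Assumption: a lone letter is valid, so that short option names pass.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")

def is_identifier(text):
    """Return True if the whole string is a valid identifier."""
    return IDENTIFIER.fullmatch(text) is not None
```

Under this rule doOverwriteHardDrive, dry-run, and v are all accepted, while 9lives (leading digit) and a|b (reserved shell character) are rejected.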

In addition to specifying the characters that are allowed to be parts of an identifier, we must define the keywords that have special meaning to the parser. These are paired with opcodes by our language specification, so that the parser can efficiently determine whether a given token represents a keyword simply by checking the opcode. LinearC is a fairly simple language in this regard; the only keywords we need to reserve for it are the boolean value representations we just discussed: true, false, yes, no, on, and off.
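A minimal sketch of the keyword-to-opcode pairing follows, in Python. The Opcode names, the one-opcode-per-keyword layout, and the exact-case matching are all assumptions made for this example; the real pairings live in the LinearC sources.

```python
from enum import Enum, auto

class Opcode(Enum):
    """Hypothetical opcodes, one per reserved word."""
    KW_TRUE = auto()
    KW_FALSE = auto()
    KW_YES = auto()
    KW_NO = auto()
    KW_ON = auto()
    KW_OFF = auto()

# Token string -> opcode. Assumption: keywords match in lower case only.
KEYWORDS = {
    "true": Opcode.KW_TRUE, "false": Opcode.KW_FALSE,
    "yes": Opcode.KW_YES, "no": Opcode.KW_NO,
    "on": Opcode.KW_ON, "off": Opcode.KW_OFF,
}

# Opcodes that stand for boolean true; the remaining keywords mean false.
TRUTHY = {Opcode.KW_TRUE, Opcode.KW_YES, Opcode.KW_ON}

def keyword_opcode(token):
    """Return the opcode for a keyword token, or None if it is not a keyword."""
    return KEYWORDS.get(token)

def boolean_value(token):
    """Translate a boolean keyword into a Python bool."""
    opcode = keyword_opcode(token)
    if opcode is None:
        raise ValueError(f"not a boolean keyword: {token!r}")
    return opcode in TRUTHY
```

Once the token has been reduced to an opcode, deciding whether it is a keyword (and which boolean it stands for) is a set lookup rather than a string comparison.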

Numeric constants

The boolean representations, in addition to being keywords and identifiers, also represent a kind of constant value. yes will always mean true, no false, and so on. The parser must accept other kinds of values, however: typically numbers, strings, and file paths are all accepted by programs. Together these form a category of tokens known as constant or literal values: values that do not require computation.

The characters allowed to make up a numeric constant are easy to define: any valid digit, right? But not all numeric values are base-ten integers. To support floating-point notation, we need to allow for a decimal point (.); to support scientific notation for very large or very small values, we need to allow e and E to be used; and to support hexadecimal notation, we must allow x or X as well as ABCDEF and abcdef to be used. Once again, most of these characters are already defined by the parsing library, but we will add two more: we will allow a decimal point to begin a number (for those floating-point numbers that omit the leading zero) and we will allow a dollar sign to be used at the beginning of a number to support Pascal's hexadecimal notation.
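The numeric forms just described can be sketched as follows. This is an illustrative Python function, not the parsing library's scanner; the pattern names and parse_number() are invented here.

```python
import re

HEX_C = re.compile(r"0[xX][0-9A-Fa-f]+")       # C-style hex, e.g. 0x1F
HEX_PASCAL = re.compile(r"\$[0-9A-Fa-f]+")     # Pascal-style hex, e.g. $1F
# Digits with an optional fraction, or a fraction with no leading zero,
# optionally followed by an exponent such as e10 or E-3.
DECIMAL = re.compile(r"(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")

def parse_number(text):
    """Convert a numeric token to an int or float; raise ValueError otherwise."""
    if HEX_C.fullmatch(text):
        return int(text[2:], 16)
    if HEX_PASCAL.fullmatch(text):
        return int(text[1:], 16)
    if DECIMAL.fullmatch(text):
        # Plain digit runs stay integers; anything else becomes a float.
        return int(text) if text.isdigit() else float(text)
    raise ValueError(f"not a numeric constant: {text!r}")
```

With these patterns, 42 parses as an integer, .5 and 1e3 as floating-point values, and both 0x1F and $1F as the integer 31.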

String literals

String literals are a special case. The parsing library does not implement them because various languages differ in the ways they allow strings to be defined (see, for instance, Python's string reference). The library provides a token type to represent a string literal, but it is up to the parser to determine when to use it.

String literals are typically delimited in some way, usually with a single (') or double quote ("). However, as with identifier names, one or both of these characters may have special meaning to the underlying system. Our safest bet is to allow either character to delimit a string and, so long as the string contains no spaces, to allow unquoted strings as well. But unquoted strings have a couple of special considerations:

If they name a known identifier (a known option name), then they should be treated as identifiers.
If they begin with a path separator (/ or \ depending on the underlying system), then they should be treated as paths.

Within quoted strings, any character allowed by the underlying system can be used, except for the quote delimiters themselves. To work around the limitations of the underlying system, we will allow C-style escape sequences (such as \t) within quoted strings.
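The string rules can be sketched in two small Python helpers. Both are illustrations rather than LinearC's code: the escape table covers only a handful of common C-style sequences, and the names unquote() and classify_unquoted() are hypothetical.

```python
import os

# A small, deliberately incomplete table of C-style escapes.
ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "0": "\0",
           "\\": "\\", "'": "'", '"': '"'}

def unquote(text):
    """Strip matching quotes and expand escapes; return unquoted text as-is."""
    if not (len(text) >= 2 and text[0] in "'\"" and text[-1] == text[0]):
        return text
    body, out, i = text[1:-1], [], 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            out.append(ESCAPES.get(nxt, nxt))  # unknown escapes keep the character
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

def classify_unquoted(text, known_options):
    """Apply the two special cases for unquoted strings."""
    if text in known_options:
        return "identifier"
    if text.startswith(os.sep):  # '/' or '\' depending on the platform
        return "path"
    return "string"
```

Escaping the delimiter itself, as in "say \"hi\"", is what lets a quoted string contain the character that would otherwise end it.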

Most languages do not allow string literals to stand on their own; they must be assigned to something. LinearC is no different. Consequently, we can treat the existence of a string literal as a special type of simple expression, and this is exactly what we do: we define an expression type for each of the three possible usages. We will cover these expression types in greater detail in another article; for now, just know that they are mini-parsers with the sole function of gathering the string, identifier, or path into a single value that can be assigned to an option.

Operators

The second command-line option standard referenced above shows how a value is separated from an option name through the use of an equals character (=). This makes the equals sign a valid delimiter, like the hyphen, but it is also an operator because it causes a change in value -- the value of the named option. When the parser encounters an equals sign immediately following an option name, it knows to expect a value that will be subsequently assigned to the option.
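In its simplest form, separating a name from its value is a single split on the first equals sign. The Python sketch below shows only that step; split_assignment() is a hypothetical helper, and a real parser would go on to evaluate the value rather than keep it as text.

```python
def split_assignment(arg):
    """Split 'name=value' on the first '='; value is None when no '=' is present."""
    name, sep, value = arg.partition("=")
    if not sep:
        return arg, None
    return name, value
```

Splitting on the first equals sign only means that a value may itself contain one: --filter=a=b yields the name --filter and the value a=b.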

Where operators are concerned, however, we need not stop at assignment. Although for most cases it is enough to simply assign a value to an option, there are times when that value needs to be calculated first. Thus the LinearC parser allows mathematical operators to be used, so that the values assigned to an option can be calculated, if desired. It is even possible to reference the value of another option as part of the calculation. Not only is this behavior relatively simple to implement, but it provides a template that can be used when writing causerie's expression-handling code.

LinearC thus supports the following operators:

Assignment: =
Basic math: + - * /
Extended math: ^ (exponent), mod (modulus division)
Logical operations: and (logical and), not (logical not), or (logical or)
Bitwise operations: shl (shift left), shr (shift right)
Equality testing: eq (equal), neq (not equal), gt (greater than), lt (less than), geq (greater than or equal to), leq (less than or equal to)

I mentioned above that the target operating system plays a part in defining the syntax. In this case, certain characters have special meaning to the underlying operating system -- characters such as angled brackets (<>), which typically direct the operating system to read from or write to a file; and the pipe symbol (|), which typically directs the system to pipe output through a second program. As a result, some of our operators (such as neq, or, gt, geq, lt, leq) have had to take on less-symbolic forms so that the operating system will pass them to the parser instead of attempting to act on them. To allow these operators to work without ambiguity, the parser will require that expressions which use them be enclosed in parentheses (()). Parentheses therefore become another delimiter.
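One way to picture the operator set is as a table that maps each token string to the function that carries it out. The Python sketch below is illustrative only: it handles numeric and boolean operands, ignores precedence and parentheses, and the table and helper names are made up.

```python
import operator

# Token string -> function for the binary operators listed above.
BINARY_OPERATORS = {
    "+": operator.add, "-": operator.sub,
    "*": operator.mul, "/": operator.truediv,
    "^": operator.pow, "mod": operator.mod,
    "and": lambda a, b: bool(a) and bool(b),
    "or": lambda a, b: bool(a) or bool(b),
    "shl": operator.lshift, "shr": operator.rshift,
    "eq": operator.eq, "neq": operator.ne,
    "gt": operator.gt, "lt": operator.lt,
    "geq": operator.ge, "leq": operator.le,
}

# 'not' takes a single operand, so it lives in its own table.
UNARY_OPERATORS = {"not": operator.not_}

def apply_binary(left, op, right):
    """Look up a binary operator by its token string and apply it."""
    try:
        func = BINARY_OPERATORS[op]
    except KeyError:
        raise ValueError(f"unknown operator: {op!r}") from None
    return func(left, right)
```

Because geq, shl, and the other word-form operators are ordinary alphabetic tokens, the shell passes them through untouched, which is the point of spelling them out.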

To make things interesting, we will allow some mathematical operators to apply to strings:

The + operator will concatenate two string values, as might be expected.
The - operator will remove all occurrences of the string on the right hand side of the operator from the string on the left hand side of the operator. For instance, the expression "elephant" - "ant" will evaluate to "eleph".
The * operator will repeat the string on the left hand side of the operator the number of times specified on the right hand side of the operator. For example, the expression "Ho!" * 3 will evaluate to "Ho!Ho!Ho!". However, to prevent buffer overruns, we will cap the value allowed on the right hand side at 255.
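The three string operators are small enough to sketch directly. This Python version is an illustration, not LinearC's implementation. In particular, it assumes that a repeat count above 255 is silently clamped rather than rejected, and that a negative count yields an empty string.

```python
MAX_REPEAT = 255  # cap on the right-hand side of the * operator

def string_add(left, right):
    """'+' concatenates two strings."""
    return left + right

def string_subtract(left, right):
    """'-' removes every occurrence of right from left."""
    return left.replace(right, "")

def string_repeat(left, count):
    """'*' repeats left count times, clamped to the range 0..MAX_REPEAT."""
    return left * max(0, min(count, MAX_REPEAT))

print(string_subtract("elephant", "ant"))  # eleph
print(string_repeat("Ho!", 3))             # Ho!Ho!Ho!
```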

The parsing library does not provide a predefined set of special characters, since these differ for each language; as a result, we must define our own as a string that will be assigned to the language specification we provide to the parser.

Putting it all together

We have now sketched out the syntax of our language. We have defined the token types we want to use:

identifier/keyword
special character
numeric constant
string literal

and the characters that are allowed for each token:

for identifiers: A to Z, a to z, 0 to 9, hyphens (-), and underscores (_). For identifiers that represent paths, we also allow path separators (/ and \).
for special characters: the mathematical operators (+ - * / ^), the equals sign (=) for assignments, path separators (/ and \), parentheses (( and )), and the long (--) and short (-) option prefixes.
for numeric constants: 0 to 9, the decimal point (.), E and e for scientific notation, and x, X, $, A to F, and a to f for hexadecimal notation.
for string literals: a single (') or double quote (") as delimiter, and any valid character allowed by the underlying system when the string is quoted. We also allow C-style escape sequences within quoted strings.

We have also defined our reserved words: true, false, yes, no, on, and off.

These definitions will all become part of our language specification.

The language specification

Now that we have sketched out the syntax of our language, we can define it for our parser to use. The parsing library provides a singleton class that represents a language specification. It is a singleton because only one instance of it ever needs to exist throughout the lifetime of the program; this instance is passed around as needed. To define our own language, we simply build a class which derives from this singleton class and override three methods:

defineCharacterCategories(): this is where we describe the characters that are valid for each character category recognized by the parser. These character categories are used to determine what kind of token is built as the source stream is scanned.
defineOpcodes(): this is where the token strings that represent keywords, operators, and delimiters known to the parser are described and associated with their corresponding opcodes. Remember that an opcode is an internal representation of a token which is more efficient to process than an entire token string. The token strings and opcodes are bound together in a mapping which is consulted by the parser to determine if a given token is a known keyword, operator, or delimiter.
defineRules(): we have not yet touched on this one except in a general sort of way. A rule defines a set of opcodes that can be used by a parser to determine whether a given token is allowed in the current context. It is a way of enforcing syntax. We will discuss the use of rules more in a later article.
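The shape of such a specification can be sketched in Python. Everything here is a stand-in: LanguageSpec is a hypothetical base class written for this example, the method names are snake_case versions of the three methods above, the opcode numbers are arbitrary, and the singleton is approximated with a cached module-level instance. The real class lives in the parsing library and the LinearC sources.

```python
import string

class LanguageSpec:
    """Hypothetical stand-in for the parsing library's specification class."""

    def __init__(self):
        self.categories = {}  # category name -> set of characters
        self.opcodes = {}     # token string -> opcode
        self.rules = {}       # rule name -> set of opcodes
        self.define_character_categories()
        self.define_opcodes()
        self.define_rules()

    def define_character_categories(self):
        pass

    def define_opcodes(self):
        pass

    def define_rules(self):
        pass

    def allows(self, rule, token):
        """Return True if the token's opcode belongs to the named rule."""
        opcode = self.opcodes.get(token)
        return opcode is not None and opcode in self.rules.get(rule, set())

class LinearCSpec(LanguageSpec):
    """A cut-down LinearC specification, for illustration only."""

    def define_character_categories(self):
        self.categories["letter"] = set(string.ascii_letters)
        self.categories["digit"] = set(string.digits)
        self.categories["special"] = set("+-*/^=()")

    def define_opcodes(self):
        # Arbitrary opcode values chosen for this sketch.
        self.opcodes.update({"-": 1, "--": 2, "=": 3, "true": 4, "false": 5})

    def define_rules(self):
        # Tokens that may introduce an option name.
        self.rules["option_prefix"] = {self.opcodes["-"], self.opcodes["--"]}

_SPEC = None

def linearc_spec():
    """Return the single shared specification, building it on first use."""
    global _SPEC
    if _SPEC is None:
        _SPEC = LinearCSpec()
    return _SPEC
```

A parser holding this object can ask linearc_spec().allows("option_prefix", "--") to decide whether a token is legal at the current point, which is the kind of syntax enforcement rules are meant to provide.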

You can see how all of this fits together in LinearC by consulting the sources. In particular, pay attention to linearctokenstrings.inc, which is where we define the various token strings that are known to the language; and linearc.inc where we define the language itself. Note that to make code maintenance easy, we gather all known token strings and opcodes into two separate constant arrays that are then passed to the opcode mapping in a single call. This makes the implementation of defineOpcodes() fairly simple but, of course, it is not the only way that such a thing can be done. The parsing library is designed with flexibility in mind.
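The two-parallel-arrays idea can be sketched like this in Python. The token strings, opcode values, and bind_opcodes() helper are placeholders chosen for the example; the actual arrays are in the include files named above.

```python
# Two parallel sequences: TOKEN_STRINGS[i] is represented by OPCODES[i].
TOKEN_STRINGS = ("-", "--", "=", "(", ")", "true", "false")
OPCODES = (1, 2, 3, 4, 5, 6, 7)

def bind_opcodes(token_strings, opcodes):
    """Bind each token string to its opcode in a single call."""
    if len(token_strings) != len(opcodes):
        raise ValueError("token strings and opcodes must pair up one-to-one")
    return dict(zip(token_strings, opcodes))

OPCODE_MAP = bind_opcodes(TOKEN_STRINGS, OPCODES)
```

Keeping the arrays side by side means that adding a token is a matter of appending one entry to each, and the length check catches the case where only one of them was updated.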

Next up, we'll discuss how to represent the options that a user can set!

by michael, on June 23, 2015, at 08:43 PM