2,890 reads

Building a Linux Shell [Part V]

by Mohammed IsamAugust 6th, 2020

Too Long; Didn't Read

This is part V of a tutorial on how to build a Linux shell. You can download the complete source code for Part V from this GitHub repository. In this part, we'll implement the 7 word expansions defined by POSIX. The shell performs tilde expansion, parameter expansion, arithmetic expansion, command substitution, field splitting, pathname expansion, and quote removal. After finishing this lesson, it would be a good exercise if you extended the shell by implementing the non-POSIX word expansions.

Companies Mentioned

featured image - Building a Linux Shell [Part V]

This is part V of a tutorial on how to build a Linux shell. You can read the previous parts of this tutorial from the following links: part I, part II, part III, part IV.

NOTE: You can download the complete source code for Part V from .

Introduction to Part V

As we've seen in the previous parts, a simple command consists of one or more arguments, also known as words. The first word contains the name of the command we want to execute, while the rest of the words contain the command's arguments.

Before the shell executes a command, it performs word expansion on the command words. is the process by which the shell takes a command word, checks it to see if it contains variable names, pathnames, commands, and arithmetic expressions, and substitutes each name/command/expression with its value. The resultant word, which is usually (but not always) longer than the original word, might be broken down into one or more subwords (or fields), in a process known as field splitting.

In this part, we'll implement the 7 word expansions defined by POSIX, which are: , , , , , , and . There are other word expansions, such as and , which are not defined by POSIX, and which we won't be discussing here. After finishing this lesson, it would be a good exercise if you extended the shell by implementing the non-POSIX word expansions.

Note About this Lesson's Code

The word expansion process is very complex, and we'll have to implement a host of functions in order to perform word expansion in our shell. As such, we won't be able to discuss each and every function in detail. Instead, we'll have a general overview of each word expansion and the functions we need to process that expansion. At the end of each section, you'll find a link to our , where you'll be able to read the whole function's code at your own pace.

After we finish discussing the individual word expansion functions, we'll write the main function that will tie everything together, then we'll update our existing shell's code to make our shell use the word expansion functions we've just implemented. To make the code easier to read, I've defined all our word expansion functions in the source file

wordexp.c

, which you'll find in (the pattern matching functions are defined in

pattern.c

, which you can read in ).

Now let's get started.

The Word Expansion Process

When the shell performs word expansion, it checks every word in the command line to see if it contains possible word expansions. Expansions can appear anywhere in the word: at the beginning, in the middle, or at the end of the word. An expansion might also include the whole word.

Word expansions are preceded by a

sign. The characters following the

sign indicate the type of expansion to be performed by the shell. These characters are interpreted by the shell as follows:

One or more digits, which indicate variable expansion of a (we'll discuss these in a later lesson in this tutorial).
One of
```
@
```
,
```
*
```
,
```
#
```
,
```
?
```
,
```
-
```
,
```
$
```
,
```
!
```
, or
```
0
```
, which indicate variable expansion of a (we'll discuss these in a later lesson in this tutorial).
One or more alphanumeric characters and/or underscores, starting with a letter or underscore, which indicates a shell variable name.
A variable name surrounded by curly brackets
```
{
```
and
```
}
```
.
An arithmetic expansion, surrounded by
```
(
```
and
```
)
```
.
A command substitution, surrounded by
```
((
```
and
```
))
```
.

The shell performs tilde expansion, parameter expansion, command substitution, and arithmetic expansion first, which is then followed by field splitting and pathname expansion. Lastly, the shell removes any quote characters from the expanded word(s) that have been part of the original word.

Working with Words

When the shell performs word expansion, the process can result in zero, one, or more words. We'll use a special structure to represent those words, which we'll define in our header file :

struct word_s
{
    char  *data;
    int    len;
    struct word_s *next;
};

The structure contains the following fields:

```
data
```
=> the string representing this word.
```
len
```
=> the length of the
```
data
```
field.
```
next
```
=> pointer to the next word, or
```
NULL
```
if this is the last word (we'll use a linked list to represent our expanded words).

Of course, we'll need some functions to allocate and free our

struct word_s

structures. For this, we'll use the following functions:

struct word_s *make_word(char *str);
void free_all_words(struct word_s *first);

The first function allocates memory for the structure, creates a copy of the word string, and returns the newly allocated word. The second function frees the memory used by a list of word structures. You can read the functions' code in our

wordexp.c

Defining Some Helper Functions

As we've said earlier, word expansion is a complex process, for which we need to define a host of different functions. Before we dive into the details of word expansion, let's first define some helper functions.

The following list shows the function prototypes for our helper functions, all of which we'll define in the

wordexp.c

char *wordlist_to_str(struct word_s *word);
void delete_char_at(char *str, size_t index);
int is_name(char *str);
size_t find_closing_quote(char *data);
size_t find_closing_brace(char *data);
char *substitute_str(char *s1, char *s2, size_t start, size_t end);
int substitute_word(char **pstart, char **p, size_t len, char *(func)(char *), int add_quotes);

Here's a breakdown of what these functions do:

```
wordlist_to_str()
```
=> converts a linked list of expanded words to a single string.
```
delete_char_at()
```
=> removes the character at the given index from a string.
```
is_name()
```
=> checks if a string represents a valid variable name (refer back to The Word Expansion Process section above).
```
find_closing_quote()
```
=> when a word expansion contains an opening quote character (
```
"
```
,
```
'
```
, or
```
`
```
), we need to find the matching closing quote character that encloses the quoted string (more on quoting below). This function returns the zero-based index of the closing quote character in the word.
```
find_closing_brace()
```
=> similar to the above, except that it finds the matching closing brace. That is, if the opening brace is
```
{
```
,
```
(
```
, or
```
[
```
, this function returns the zero-based index of the matching
```
}
```
,
```
)
```
, or
```
]
```
character, respectively. Finding pairs of quotes is important for processing parameter expansion, arithmetic expansion, and command substitution.
```
substitute_str()
```
=> substitutes the substring of
```
s1
```
, from the character at position
```
start
```
to that at position
```
end
```
(both positions are zero-based), with the
```
s2
```
string. This is useful when the word expansion is part of a longer word, such as
```
${PATH}/ls
```
, in which case we need only expand
```
${PATH}
```
, then attach
```
/ls
```
to the end of the expanded string.
```
substitute_word()
```
=> a helper function that calls the other word expansion functions, which we'll define in the following sections.

Additionally, we'll define some functions to help us work with strings. We'll define all of these functions in the

strings.c

char   *strchr_any(char *string, char *chars);
char   *quote_val(char *val, int add_quotes);
int     check_buffer_bounds(int *count, int *len, char ***buf);
void    free_buffer(int len, char **buf);

Here's what these functions do:

```
strchr_any()
```
=> similar to , except that it searches the given string for any of the given characters.
```
quote_val()
```
=> does the reverse of quote removal, that is, converts a string to a quoted string (this might seem trivial at first, but quoting might become complicated when we have to escape quotes and backslash characters, for example).
The
```
check_buffer_bounds()
```
and
```
free_buffer()
```
functions will allow our backend executor to support variable number of command arguments, instead of the hard limit we've set in Part II, which was 255.

Now let's write the functions to handle each type of word expansion.

Tilde Expansion

During , the shell substitutes a tilde character (followed by an optional username) with the pathname of the user's home directory. For example,

and

~/

are tilde-expanded to the current user's home directory, while

~john

is tilde-expanded to user John's home directory, and so on. The tilde character, in addition to all the following characters up to the first unquoted forward slash, are known as the tilde prefix (we'll deal with quoting later on in this lession).

To perform tilde expansion, we'll define the

tilde_expand()

function, which has the following prototype:

char *tilde_expand(char *s);

The function accepts a single argument: the tilde prefix we want to expand. If the expansion succeeds, the function returns an 'd string representing the tilde-expanded prefix. Otherwise, it returns

NULL

. Here's a quick breakdown of what the function does in order to expand a tilde prefix:

If the prefix is
```
~
```
, get the value of the
```
$HOME
```
shell variable. If
```
$HOME
```
is defined and is not
```
NULL
```
, return its value. Otherwise, get the current User ID (UID) by calling , and pass the UID to to get the corresponding to the current user. The
```
pw_dir
```
field of the password database entry contains the pathname of the home directory, which the function returns.
If the prefix contains other characters (in addition to the leading
```
~
```
), we take these letters as the name of the user whose home directory we want to get. We call , passing it the username, and return the value of the
```
pw_dir
```
field.
If we can't retrieve the home directory (for example, if the username is invalid), we return
```
NULL
```
. Otherwise, we return an 'd copy of the home directory path.

Take a minute to read the

tilde_expand()

function's code in our .

Parameter Expansion

In , the shell replaces the name of a shell variable with the variable's value (hence the other name, variable expansion). Parameter Expansion is what allows the shell to execute a command such as

echo $PATH

. In this example, the shell performs parameter expansion on the

$PATH

variable, replacing it with the actual executable path (something like

/bin:/sbin:/usr/bin:/usr/sbin

). What the

echo

command sees is not the

$PATH

word, but its expanded value (which can be

NULL

, of course).

To signal to the shell that we want to expand a shell variable, we precede the variable name with a

sign. That is, to expand the

PATH

USER

, and

SHELL

variables, we need to pass the

$PATH

$USER

, and

$SHELL

words to the shell, respectively (alternatively, we can pass these variable expansions to shell shell as

${PATH}

${USER}

, and

${SHELL}

). Shell variable names can contain any combination of letters, digits, and underscores. Names can contain capital or small letters although, by convention, capitalized names are reserved for standard shell variables, such as those .

We can control how the shell performs parameter expansion by using a parameter expansion modifier, which tells the shell which part of the value we want expanded, as well as what to do if there is no shell variable with the given name. The table below summarizes the parameter expansion modifiers (the ones are marked by the POSIX word in the Description column). Most shells support other modifiers (found in the bottom half of the table), which we won't discuss here. Refer to your shell's manual page for more information on the non-POSIX modifiers.

To perform parameter (or variable) expansion, we'll define the

var_expand()

function, which has the following prototype:

char *var_expand(char *orig_var_name);

The function accepts a single argument: the parameter (i.e. the variable name) we want to expand. If the expansion succeeds, the function returns an 'd string containing the expanded value. Otherwise, it returns

NULL

. Here's a quick breakdown of what the function does in order to expand a variable name to get its value:

If the variable name is surrounded by curly braces (for example,
```
${PATH}
```
), remove the braces, as they are not part of the variable name itself.
If the name begins with
```
#
```
, we need to get the variable name's length (the
```
${#parameter}
```
expansion, which is seen in the 5th row in the table above).
If the variable name contains a colon, we'll use it to separate the name from the (optional) word or pattern. The word or pattern is used as indicated in the table above.
Get the symbol table entry with the given variable name (we implemented the symbol table stack in Part IV). Get the symbol table entry's value.
If the value is empty or null, use the alternative word supplied in the expansion, if any.
If the value isn't empty, use the value as the expanded result. To enable the shell to perform pattern matching (the
```
${parameter#word}
```
and
```
${parameter%word}
```
expansions), we'll need two helper functions:
```
match_suffix()
```
and match_prefix()
. We won't discuss these functions here, but you can read their code from .
If the expansion modifier is
```
${parameter:=word}
```
, we need to set the value of the symbol table entry to the value we've just expanded.
If the expansion starts with
```
#
```
, get the length of the expanded value and use it as the final result.
Return an 'd copy of the expanded value, or its length (in string format), as appropriate.

Take a minute to read the

var_expand()

function's code in our .

Command Substitution

In , the shell forks a process to run a command, then replaces the command substitution expansion with the command's output. For example, in the following loop:

for i in $(ls); do echo $i; done

the shell forks a process, in which the

ls

command is run. The output of this command is the list of files in the current directory. The shell takes that output, splits it into a list of words, then feeds those words, one at a time, to the loop. In each iteration of the loop, variable

$i

is assigned the name of the next file from the list.

This name is passed to the

echo

command, which outputs the name on a separate line (Practically, it would be better to execute

ls

directly, but this example is just to show how command substitution can be used in the shell).

Command substitution can be written as

$(command)

, or

`command`

. To perform command substitution, we'll define the

command_substitute()

function, which has the following prototype:

char *command_substitute(char *orig_cmd);

The function accepts a single argument: the command we want to execute. If the expansion succeeds, the function returns an 'd string representing the command's output. If the expansion fails, or if the command outputs nothing, the function returns

NULL

Here's a quick breakdown of what the function does in order to expand a command substitution:

Depending on the format used, we start by removing the
```
$()
```
or the backquotes
```
``
```
. This leaves us with the command we need to execute.
Call to create a pipe. We pass the command to be executed to
```
popen()
```
, and we get a pointer to a
```
FILE
```
stream from which we'll read the command's output.
Call to read the command's output from the pipe. Store the string read in a buffer.
Remove any trailing newline characters.
Close the pipe and return the buffer with the command output.

Take a minute to read the

command_substitute()

function's code in our .

Arithmetic Expansion

Using , we can let the shell perform different arithmetic operations and use the result in performing other commands. While POSIX requires the shell to only support signed long integer arithmetic, many shells (e.g. and ) support floating-point arithmetic. Also, the shell is not required to support any mathematical functions, although most shells do. For our simple shell, we'll only support signed long integer arithmetic with no math function support.

Arithmetic expansion is written as

$((expression))

To perform the expansion, we'll define the

arithm_expand()

function, which has the following prototype:

char *arithm_expand(char *expr);

The

arithm_expand()

function receives a string containing the arithmetic expression, performs the necessary calculations, and returns the result in the form of an 'd string (or

NULL

in case of error, e.g. an invalid arithmetic operation). The function and its associated helper functions are complex and lengthy, but here's the main highlights:

The arithmetic expression is converted to the (RPN), which is easier to parse and calculate. An RPN consists of a series of arithmetic operations, where the operator follows (i.e. comes after) its operands. For example, the RPN of
```
x - y
```
is
```
x y -
```
, and that of
```
3 + 4 × (2 − 1)
```
is
```
3 4 2 1 − × +
```
.
During the conversion, arithmetic operators are pushed on an operator stack, from which we'll pop each operator and perform its operation later on. Similarly, operands are added to their own stack.
Operators are popped off the stack, one at a time, and the operator is checked. One or two operands are popped off the stack, depending on the type of operator. The rules that govern this process are those of the shunting yard algorithm, which you can read about .
The result is converted to a string, which is returned to the caller.

Take a minute to read the

arithm_expand()

function's code in our .

Field Splitting

During , the shell takes the result(s) of parameter expansion, command substitution, and arithmetic expansion, and splits them into one or more parts, which we call fields (hence the name, field splitting). The process depends on the value of the

$IFS

shell variable. IFS is a historical term that stands for Internal (or Input) Field Separators, and it originates from the time when Unix shells didn't have a builtin array type.

As a workaround, early Unix shells had to find another way of representing multi-member arrays. The shell would join the array members in a single string, separated by spaces. When the shell needs to retrieve the array members, it breaks down the string into one or more fields. The

$IFS

variable tells the shell where exactly to break that string.

The shell interprets

$IFS

characters as follows (this description is ):

If the value of
```
$IFS
```
is a space, tab, and newline, or if the variable is not set, any sequence of space, tab, or newline characters at the beginning or end of the input shall be ignored and any sequence of those characters within the input shall delimit a field.
If the value of
```
$IFS
```
is null, no field splitting shall be performed.
Otherwise, the following rules shall be applied in sequence: (a)
```
$IFS
```
white space shall be ignored at the beginning and end of the input. (b) Each occurrence in the input of an
```
$IFS
```
character that is not
```
$IFS
```
white space, along with any adjacent
```
$IFS
```
white space, shall delimit a field, as described previously. (c) Non-zero-length
```
$IFS
```
white space shall delimit a field.

To perform the expansion, we'll define the

field_split()

function, which has the following prototype:

struct word_s *field_split(char *str);

Take a minute to read the

field_split()

function's code in our .

Pathname Expansion

During (also known as filename globbing), the shell matches one or more file names with a given pattern. The pattern can contain normal characters (which are matched to themselves), in addition to the special characters

, and

[]

, which are also known as the globbing characters.

The asterisk

matches any length of characters (including zero characters), the

matches one character, and the brackets introduce a . The result of the expansion is the list of files whose names match the pattern.

To perform the expansion, we'll define the

pathnames_expand()

function, which has the following prototype:

struct word_s *pathnames_expand(struct word_s *words);

This function accepts a single argument: pointer to the first word in the linked list of words we want to pathname-expand. For each word, the function does the following:

Check if the word contains any globbing characters (
```
*
```
,
```
?
```
, and
```
[]
```
), by calling the helper function
```
has_glob_chars()
```
, which we'll define in the source file . If the word contains globbing characters, we treat it as the pattern we need to match; otherwise, we move to the next word (if any).
Get the list of files whose names match the pattern, excluding the special names
```
.
```
and
```
..
```
(which point to the current directory and the parent directory, respectively). We delegate pattern matching to another helper function,
```
get_filename_matches()
```
, which we'll define in the same source file .
Add the matched file names to the final list.
Move on to the next word and loop.
Return the list of the filenames that matched all the given words.

Take a minute to read the

pathnames_expand()

function's code in our .

Quote Removal

The final step in the word expansion process is . Quoting is used to remove the special meaning of certain characters to the shell. The shell treats some characters, such as the backslash and the quotes in a special way. To inhibit this behavior, we need to quote those characters to force the shell to treat them as normal characters (you can read more about quoting in ).We can quote characters in one of three way: using the backslash, single quotes, or double quotes. The backslash character is used to preserve the literal meaning (that is, to escape) the character following the backslash. This is similar to the way we escape characters in the C language.Single quotes preserve the literal meaning of all characters inside the quotes, that is, the shell doesn't attempt word expansion on single-quoted strings.

Double quotes as similar to single quotes, except that the shell recognizes the special meaning of the backquote, backslash, and

sign. That is, the shell can perform word expansion inside double-quoted strings.

To perform quote removal, we'll define the

remove_quotes()

function, which has the following prototype:

void remove_quotes(struct word_s *wordlist);

Take a minute to read the

remove_quotes()

function's code in our .

Putting It All Together

Now that we have our word expansion functions, its time to tie it all together. In this section, we'll write our main word expansion function, the one we'll call to perform word expansion. This function, in turn, will call the other functions to perform the various steps of word expansion.

Our main function is

word_expand()

, which we'll define in the source file :

struct word_s *word_expand(char *orig_word);

Here's what the function does in order to perform word expansion on the word we pass as its sole argument:

Create a copy of the original word. We'll perform our word expansions on this copy, so that if anything goes wrong, we'll have our original word intact.
Scan the word, character by character, looking for the special characters
```
~
```
,
```
"
```
,
```
'
```
,
```
`
```
,
```
=
```
,
```
\
```
, and
```
$
```
.
If one of the above characters is found, call substitute_word()
, which in turn will call the appropriate word expansion function.
Skip any characters that has no special meaning.
After finishing with the word expansion, perform field splitting by calling field_split()
.
Perform pathname expansion by calling pathnames_expand()
.
Perform quote removal by calling remove_quotes()
.
Returns the list of expanded words.

Take a minute to read the

word_expand()

function's code in our .

Updating the Scanner

In Part II of this tutorial, we wrote our

tokenize()

function, which we used to get input tokens. So far, our tokenize()
function doesn't know how to handle quoted strings and escaped characters. To add this functionality, we need to update our code. Open the scanner.c
file and add the following code to the tokenize()
function, right after the

switch

statement's opening brace:

            case  '"':
            case '\'':
            case  '`':
                add_to_buf(nc);
                i = find_closing_quote(src->buffer+src->curpos);
                if(!i)
                {
                    src->curpos = src->bufsize;
                    fprintf(stderr, "error: missing closing quote '%c'\n", nc);
                    return &eof_token;
                }
                while(i--)
                {
                    add_to_buf(next_char(src));
                }
                break;

            case '\\':
                nc2 = next_char(src);
                if(nc2 == '\n')
                {
                    break;
                }

                add_to_buf(nc);

                if(nc2 > 0)
                {
                    add_to_buf(nc2);
                }
                break;
                
            case '$':
                add_to_buf(nc);
                nc = peek_char(src);

                if(nc == '{' || nc == '(')
                {
                    i = find_closing_brace(src->buffer+src->curpos+1);
                    if(!i)
                    {
                        src->curpos = src->bufsize;
                        fprintf(stderr, "error: missing closing brace '%c'\n", nc);
                        return &eof_token;
                    }

                    while(i--)
                    {
                        add_to_buf(next_char(src));
                    }
                }
                else if(isalnum(nc) || nc == '*' || nc == '@' || nc == '#' ||
                                       nc == '!' || nc == '?' || nc == '$')
                {
                    add_to_buf(next_char(src));
                }
                break;

Now our lexical scanner knows how to recognize and skip over quoted strings, escaped characters, and other word expansion constructs. Check out the updated lexical scanner code in .

Updating the Executor

Lastly, we need to update our backend executor so that it can:

perform word expansion on the arguments of a command before executing that command.
support more than 255 arguments per command (we had this limit set in Part II).

Open the

executor.c

file, navigate to the beginning of the do_simple_command()
function and find the following lines (I reduced the loop to

...

to preserve space):

    int argc = 0;
    long max_args = 255;
    char *argv[max_args+1];
    char *str;

    while(child)
    {
        ...
    }
    argv[argc] = NULL;

and replace them with the following code:

    int argc = 0;
    int targc = 0;
    char **argv = NULL;
    char *str;

    while(child)
    {
        str = child->val.str;
        struct word_s *w = word_expand(str);
        
        if(!w)
        {
            child = child->next_sibling;
            continue;
        }

        struct word_s *w2 = w;
        while(w2)
        {
            if(check_buffer_bounds(&argc, &targc, &argv))
            {
                str = malloc(strlen(w2->data)+1);
                if(str)
                {
                    strcpy(str, w2->data);
                    argv[argc++] = str;
                }
            }
            w2 = w2->next;
        }
        
        free_all_words(w);
        
        child = child->next_sibling;
    }

    if(check_buffer_bounds(&argc, &targc, &argv))
    {
        argv[argc] = NULL;
    }

With this code, the executor calls

word_expand()

on each command argument and adds the expanded word(s) to the arguments list, which we'll eventually pass on to the command. The list can grow as much as needed, thanks to our check_buffer_bounds()
function, which allocates memory to the buffer as needed.

All that remains now is to free our arguments list once we've finished executing the command, before we return to the caller. We do this by calling:

free_buffer(argc, argv);

in three different places: after executing a builtin utility, if

fork()

returns error status (which means we can't execute an external command), and after

waitpid()

has returned. Check out the updated executor code in .

Compiling the Shell

Let’s compile our shell. Open your favorite terminal emulator, navigate to your source directory and make sure you have 19 files and 2 subdirectories in there:

Now compile the shell using the following command (you need to download the source project from our ):

make

If everything goes well,

gcc

should not output anything (apart from a few harmless warnings), and there should be an executable file named

shell

in the current directory:

Now invoke the shell by running

./shell

, and try using our word expansion functions and checking the results:

$ echo *
Makefile build builtins executor.c executor.h initsh.c main.c node.c node.h parser.c parser.h pattern.c prompt.c scanner.c scanner.h shell shell.h shunt.c source.c source.h strings.c symtab wordexp.c
$ echo '*'
*
$ echo ~
/home/user
$ echo ~/Downloads
/home/user/Downloads
$ echo ${A=value}
value
$ echo $A
value
$ echo $((2+7))
9

And that's it for today. Our shell can now handle all kinds of word expansions (the ones defined by POSIX, to be specific). Toy around with the shell and see what results you can get from the different types of word expansion. Compare the results with the ones you get from your default shell.

What's Next

We've taken a huge leap in this lesson, with lots of code, most of which we didn't have the time or space to examine in detail. You might want to spend some time reading through the code in our GitHub repository to get yourself familiar with the word expansion process.

In the next part, we’ll talk about positional parameters and special parameters, what they are, why we need them, and how to implement them in our shell.

Stay tuned!

Previously published at

L O A D I N G
. . . comments & more!