In the first part of this tutorial, we built a simple Linux shell that prints a prompt string, reads input, then echoes that input back to the screen. Not very impressive, is it?
In this part, we'll update our code to give our shell the ability to parse and execute simple commands. Let's first have a look at what simple commands are.
NOTE: You can download the complete source code for Part II & III of this tutorial from .
A simple command consists of a list of words, separated by whitespace characters (space, tab, newline). The first word is the command name, and is mandatory (otherwise, the shell won't have a command to parse and execute!). The second and subsequent words are optional. If present, they form the arguments we want the shell to pass to the executed command.
For example, the following command:
ls -l
consists of two words: ls (the command's name) and -l (the first and sole argument). Similarly, the command:

gcc -o shell main.c prompt.c

(which we used in part I to compile our shell) consists of 5 words: a command name, and a list of 4 arguments.

Now, to be able to execute simple commands, our shell needs to perform the following steps:

1. Scan the input to extract the next token.
2. Repeat step 1 until we've collected all the tokens of the command, then parse the tokens to build the command.
3. Execute the command.

In order to get the next token, we need to be able to scan our input, one character at a time, so that we can identify the characters that can be part of the token, and those that are delimiter characters. A delimiter character is one that marks the end of a token (and possibly the beginning of another token). Typically, delimiters are whitespace characters (space, tab, newline), but they can also include other characters, such as ; and &.

In general, scanning input means we should be able to:

- Retrieve the next character from input.
- Return (unget) the last character to the input, so we can retrieve it again later.
- Peek at the next character without actually retrieving it.
- Skip over whitespace characters.

We'll define the functions to perform all of these tasks in a minute. But first, let's have a word about abstracting input.
Remember the read_cmd() function, which we defined in part I of this tutorial? That was the function we used to read the user's input and return it as a malloc'd string. We could pass this string directly to our scanner, but that would make the scanning process a bit cumbersome. In particular, it would be very difficult for the scanner to remember the last character it gave us, so that it can move past that character and give us the character that follows.

To make the scanner's job easy, we'll abstract our input by passing the input string as part of a struct source_s structure, which we'll define in the header file source.h. Go ahead and create that file in your source directory, then open it in your favorite text editor and add the following code:

#ifndef SOURCE_H
#define SOURCE_H
#define EOF          (-1)    /* end of file/input */
#define ERRCHAR      ( 0)    /* character returned on error */
#define INIT_SRC_POS (-2)    /* initial position (no character read yet) */
struct source_s
{
char *buffer; /* the input text */
long bufsize; /* size of the input text */
long curpos; /* absolute char position in source */
};
char next_char(struct source_s *src);
void unget_char(struct source_s *src);
char peek_char(struct source_s *src);
void skip_white_spaces(struct source_s *src);
#endif
Focusing on the structure's definition, you can see that our struct source_s contains a pointer to the input string, in addition to two long fields that hold information about the string's length and our current position in the string (where we'll get the next character from).

Now create another file named source.c, to which you should add the following code:

#include <errno.h>
#include "shell.h"
#include "source.h"
void unget_char(struct source_s *src)
{
if(src->curpos < 0)
{
return;
}
src->curpos--;
}
char next_char(struct source_s *src)
{
if(!src || !src->buffer)
{
errno = ENODATA;
return ERRCHAR;
}
if(src->curpos == INIT_SRC_POS)
{
src->curpos = -1;
}
if(++src->curpos >= src->bufsize)
{
src->curpos = src->bufsize;
return EOF;
}
return src->buffer[src->curpos];
}
char peek_char(struct source_s *src)
{
if(!src || !src->buffer)
{
errno = ENODATA;
return ERRCHAR;
}
long pos = src->curpos;
if(pos == INIT_SRC_POS)
{
pos++;
}
pos++;
if(pos >= src->bufsize)
{
return EOF;
}
return src->buffer[pos];
}
void skip_white_spaces(struct source_s *src)
{
char c;
if(!src || !src->buffer)
{
return;
}
while(((c = peek_char(src)) != EOF) && (c == ' ' || c == '\t'))
{
next_char(src);
}
}
The unget_char() function returns (or ungets) the last character we retrieved from input back to the input source. It does this by simply manipulating the source structure's pointers. You'll see the benefit of this function later in this series.

The next_char() function returns the next character of input and updates the source pointer, so that the next call to next_char() returns the following input character. When we reach the last character of input, the function returns the special character EOF, which we defined as -1 in source.h above.

The peek_char() function is similar to next_char() in that it returns the next character of input. The only difference is that peek_char() doesn't update the source pointer, so that the next call to next_char() returns the same input character we've just peeked at. You'll see the benefit of input peeking later in this series.

Finally, the skip_white_spaces() function skips all whitespace characters. This will help us when we've finished reading a token, and we want to skip delimiter whitespace before we read the next token.

Go ahead and create a file named scanner.h in your source directory, then open it and add the following code:

#ifndef SCANNER_H
#define SCANNER_H
struct token_s
{
struct source_s *src; /* source of input */
int text_len; /* length of token text */
char *text; /* token text */
};
/* the special EOF token, which indicates the end of input */
extern struct token_s eof_token;
struct token_s *tokenize(struct source_s *src);
void free_token(struct token_s *tok);
#endif
Focusing on the structure definition, our struct token_s contains a pointer to the struct source_s that holds our input. The structure also contains a pointer to the token's text, and a field that tells us the length of that text (so that we don't need to repeatedly call strlen() on the token's text).

Next, we'll write the tokenize() function, which will retrieve the next token from input. We'll also write some helper functions that will help us work with input tokens.

In the source directory, create a file named scanner.c, and enter the following code:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include "shell.h"
#include "scanner.h"
#include "source.h"
char *tok_buf = NULL;
int tok_bufsize = 0;
int tok_bufindex = -1;
/* special token to indicate end of input */
struct token_s eof_token =
{
.text_len = 0,
};
void add_to_buf(char c)
{
/* extend the buffer before writing, so a failed realloc
* can't lead to an out-of-bounds write on a later call
*/
if(tok_bufindex >= tok_bufsize)
{
char *tmp = realloc(tok_buf, tok_bufsize*2);
if(!tmp)
{
errno = ENOMEM;
return;
}
tok_buf = tmp;
tok_bufsize *= 2;
}
tok_buf[tok_bufindex++] = c;
}
struct token_s *create_token(char *str)
{
struct token_s *tok = malloc(sizeof(struct token_s));
if(!tok)
{
return NULL;
}
memset(tok, 0, sizeof(struct token_s));
tok->text_len = strlen(str);
char *nstr = malloc(tok->text_len+1);
if(!nstr)
{
free(tok);
return NULL;
}
strcpy(nstr, str);
tok->text = nstr;
return tok;
}
void free_token(struct token_s *tok)
{
/* guard against NULL and the statically allocated eof_token */
if(!tok || tok == &eof_token)
{
return;
}
if(tok->text)
{
free(tok->text);
}
free(tok);
}
struct token_s *tokenize(struct source_s *src)
{
int endloop = 0;
if(!src || !src->buffer || !src->bufsize)
{
errno = ENODATA;
return &eof_token;
}
if(!tok_buf)
{
tok_bufsize = 1024;
tok_buf = malloc(tok_bufsize);
if(!tok_buf)
{
errno = ENOMEM;
return &eof_token;
}
}
tok_bufindex = 0;
tok_buf[0] = '\0';
char nc = next_char(src);
if(nc == ERRCHAR || nc == EOF)
{
return &eof_token;
}
do
{
switch(nc)
{
case ' ':
case '\t':
if(tok_bufindex > 0)
{
endloop = 1;
}
break;
case '\n':
if(tok_bufindex > 0)
{
unget_char(src);
}
else
{
add_to_buf(nc);
}
endloop = 1;
break;
default:
add_to_buf(nc);
break;
}
if(endloop)
{
break;
}
} while((nc = next_char(src)) != EOF);
if(tok_bufindex == 0)
{
return &eof_token;
}
if(tok_bufindex >= tok_bufsize)
{
tok_bufindex--;
}
tok_buf[tok_bufindex] = '\0';
struct token_s *tok = create_token(tok_buf);
if(!tok)
{
fprintf(stderr, "error: failed to alloc buffer: %s\n", strerror(errno));
return &eof_token;
}
tok->src = src;
return tok;
}
We start by defining a few global variables:

- tok_buf is a pointer to the buffer in which we'll store the current token.
- tok_bufsize is the number of bytes we allocate to the buffer.
- tok_bufindex is the current buffer index (i.e. it tells us where to add the next input character in the buffer).
- eof_token is a special token we'll use to signal the end of file/input (EOF).

The add_to_buf() function adds a single character to the token buffer. If the buffer is full, the function takes care of extending it.

The create_token() function takes a string and converts it to a struct token_s structure. It takes care of allocating memory for the token's structure and text, and fills in the structure's member fields.

The free_token() function frees the memory used by a token structure, as well as the memory used to store the token's text.

The tokenize() function is the heart and soul of our lexical scanner, and its code is fairly straightforward. It starts by allocating memory for our token buffer (if not already done), then initializes the token buffer and source pointers. It then calls next_char() to retrieve the next input character. When we reach the end of input, tokenize() returns the special eof_token, which marks the end of input.

The function then loops to read input characters, one at a time. If it encounters a whitespace character, it checks the token buffer to see if it's empty or not. If the buffer is not empty, we delimit the current token and break out of the loop. Otherwise, we skip the whitespace characters and move along to the beginning of the next token.

After we've got our token, tokenize() calls create_token(), passing it the token text (which we stored in the buffer). The token text is converted to a token structure, which tokenize() then returns to the caller.