Shell scripting | awk & sed

FreeBSD gearing-up


Shell scripting covers almost every essential need when creating automated command-line programs. But what about going beyond the basics and extending our arsenal with some external tools? Let's dive into awk and sed.

Write programs to handle text streams, because that is a universal interface. — Doug McIlroy.

What are awk and sed

Both are stream-oriented: they read their input one line at a time and write the result to standard output, so the input file itself is not changed unless we explicitly ask for it.
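A minimal sketch of that behaviour, assuming a throwaway file called demo.txt (a made-up name) inside a scratch directory:

```shell
# Work in a scratch directory so no real file is touched.
cd "$(mktemp -d)" || exit 1
printf 'one\ntwo\n' > demo.txt

# sed sends the edited stream to standard output...
edited=$(sed 's/one/ONE/' demo.txt)
printf 'sed printed:\n%s\n' "$edited"

# ...while the file on disk still holds the original text.
on_disk=$(cat demo.txt)
printf 'demo.txt still contains:\n%s\n' "$on_disk"

# To keep the result, redirect the output to a different file.
sed 's/one/ONE/' demo.txt > demo.new
```

The same holds for awk: unless we redirect the output ourselves, the result only appears on the terminal.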

Although their syntax may look cryptic, awk(1) and sed(1) can solve a lot of complex tasks in a single line of code. Combined with regular expressions, they become a Swiss army knife for anyone working with text files. Since we're working inside a *nix system, this is perfect for us.

One of the most useful cases for awk(1) and sed(1) is parsing files and generating reports. It's hard to explain both tools without seeing them in action. So that we have something to parse throughout this post, create a file named pieces-list and populate it with the following text:

Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"

Working with awk

awk(1) is a full-fledged programming language and a powerful file parser. It offers a more general computational model for processing a file, allowing us to replace an entire shell script with an awk one-liner.

awk(1) programs consist of a series of rules. Rules generally consist of a pattern and a set of actions.

When a file is processed, awk(1) reads it line by line, checks whether the line matches one or more of the patterns in the program, and executes the actions associated with each matching pattern, taking the selected line as its input.

If you've followed the previous workshops, you may remember that we already used awk(1) to configure our panel bar.

— The basic command-line syntax to invoke awk(1) is:

$ awk [options] 'pattern {actions}' inputFile

We've seen how to get an output of a file before, using the cat command.

$ cat pieces-list

We've also seen how to filter that output and print only the lines we want using grep.

$ cat pieces-list | grep Intel

To start working with awk(1) let's use it to print our pieces-list file:

$ awk '{ print }' pieces-list

Both cat and the new awk method should print exactly the same output.

With awk(1) we can use patterns too:

$ awk '/Intel/ { print }' pieces-list

Patterns are regular expressions written between forward slashes.

This is useful, but we still get the complete line containing the pattern we were looking for. One powerful feature of awk(1) is that we can select individual pieces of the line, called fields.

Fields are referenced with a dollar sign and their position number ($N).

$ awk '/Intel/ { print $2 }' pieces-list
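To see how awk numbers the fields, here is a short sketch. NF (the number of fields on the line) and $NF (the last field) are standard awk built-ins that this post has not introduced; pieces-list is recreated in a scratch directory only so the snippet runs on its own.

```shell
cd "$(mktemp -d)" || exit 1
cat > pieces-list <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# For every line containing Intel, print the field count,
# the second field and the last field.
fields=$(awk '/Intel/ { print NF, $2, $NF }' pieces-list)
printf '%s\n' "$fields"
```

Every line has eight fields. On the WiFi-Card line the second field keeps its trailing comma, because by default awk splits only on whitespace.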

Sometimes a line has to meet a condition to be useful for us. We can use boolean expressions as patterns too.

In this example, the condition is that the sixth field has to be greater than one hundred:

$ awk '$6 > 100 { print $2 }' pieces-list
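Conditions can also be combined with the standard awk operators && (and), || (or) and ! (not). A small sketch, assuming the same pieces-list recreated in a scratch directory; the variable names fans and low are only illustrative:

```shell
cd "$(mktemp -d)" || exit 1
cat > pieces-list <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# Lines that mention Fan AND have more than 100 units in field 6.
fans=$(awk '/Fan/ && $6 > 100 { print $2 }' pieces-list)

# Lines that do NOT mention Intel and have fewer than 100 units.
low=$(awk '!/Intel/ && $6 < 100 { print $2 }' pieces-list)

printf 'fans over 100:\n%s\n' "$fans"
printf 'non-Intel under 100:\n%s\n' "$low"
```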

By default, fields are separated by runs of spaces or tabs. If we want to use another character as the field separator, we have to say so with the -F option, which sets awk's FS variable. To split on commas:

$ awk -F, '{ print $1 }' pieces-list
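The separator can be a single character or a regular expression. The sketch below tries both on our file; the bracket expression [ ,]+ is my own choice for this data, not something taken from the post, and the file is recreated so the snippet runs on its own.

```shell
cd "$(mktemp -d)" || exit 1
cat > pieces-list <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# Split on commas only: $1 is everything before the first comma.
first=$(awk -F, '/WiFi/ { print $1 }' pieces-list)
printf '%s\n' "$first"

# Treat any run of spaces and commas as one separator, so the
# stray commas no longer stick to the ID values in field 4.
ids=$(awk -F'[ ,]+' '{ print $4 }' pieces-list)
printf '%s\n' "$ids"
```

With the second form every ID comes out as a clean number, including 19085 and 2941, which carry a trailing comma in the file.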

awk(1) also offers built-in functions and statements, such as length() and printf, to process each field:

$ awk '{ print length($2) }' pieces-list
$ awk '$6 > 1 { printf "%-19s\n", $2 }' pieces-list

Making an AWK command file

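We can go further with awk(1) and store all our commands inside a file, so it's easier to apply the same commands to multiple files.

awk(1) command files can contain two special patterns: BEGIN, whose actions run once before any input is read, and END, whose actions run once after the last line has been processed.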

Let's create a script to store our awk(1) commands:

$ touch steps.awk

Now we can work through some awk(1) examples against our pieces-list file.

— Format output

Our example pieces-list text file is a bit messy. Wouldn't it be great to have each field ordered in nice columns?

First we need to decide how many characters wide each column must be. That width is given by the longest value in each field.

Using the built-in function length($N) we can get those values.
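One possible way to collect those lengths in a single pass is sketched below. The for loop and the max array are standard awk, but the script itself is only an illustration, not part of steps.awk.

```shell
cd "$(mktemp -d)" || exit 1
cat > pieces-list <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

widths=$(awk '
{
    # Even-numbered fields hold the values; remember the longest of each.
    for (i = 2; i <= NF; i += 2)
        if (length($i) > max[i])
            max[i] = length($i)
}
END {
    for (i = 2; i <= 8; i += 2)
        printf "field %d: longest value is %d characters\n", i, max[i]
}' pieces-list)
printf '%s\n' "$widths"
```

The counts include the double quotes and any trailing comma, so the longest name comes out as 19 characters, which is where a width such as %-19s comes from.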

Let's define our main columns with the given values in our BEGIN pattern:

BEGIN{ printf "\n%-15s %-22s %-5s %9s\n", "MANUFACTURER ", "| PIECE NAME ","| ID ","| QUANTITY"
print "-------------------------------------------------------\n"}

In the main body we need a similar line for each one of the products in the list. This time we have to change our printed format in the fields that need to output a number:

{printf "%-16s %-22s %6d %9d\n", $8, $2, $4, $6}

To execute our stored awk(1) commands, we tell awk(1) to read the program from the file with the -f option:

$ awk -f steps.awk pieces-list

Our result should look similar to this:

MANUFACTURER   | PIECE NAME          | ID    | QUANTITY
-------------------------------------------------------

"Bosch"         "Capacitor"            3456         204
"Phillips"      "Battery"              2760           0
"Mitsubishi"    "Fan-Frame"            7864         131
"Intel"         "Bluetooth-Emmiter"   19085         184
"Intel"         "WiFi-Card"            2941         115
"OEM"           "Fan"                  4512          98

Just as we formatted our messy example file, we could format web server IP traffic, username and password databases… the possibilities are endless.

— Process command-line arguments

We can take input from the user and pass it as a variable to perform actions with our data.

Let's say we want to ask the user for a product's ID and report back the manufacturer's name, the product's name and its available quantity.

First, we create a search.awk script containing the following instructions:

BEGIN{ print "Search results:\n" }

{if ( id == $4 ) print "Item ID " $4 "\n\tManufacturer: " $8 "\n\tPiece Name: " $2 "\n\tStock Amount: " $6}

END{ print "\n---------------------------------\n"}

In this case we compare a variable named id against our ID field. The variable is not defined inside the script, so we have to pass a value for it with the -v option when we run it:

$ awk -v id=3456 -f search.awk pieces-list

Search results:

Item ID 3456
     Manufacturer: "Bosch"
     Piece Name: "Capacitor" 
     Stock Amount: 204

---------------------------------
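One caveat with this file: two of the IDs are followed by a comma, so a plain comparison will not find them. The sketch below shows one possible workaround, forcing a numeric comparison by adding 0; this is my suggestion and not part of search.awk.

```shell
cd "$(mktemp -d)" || exit 1
cat > pieces-list <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# Field 4 is "19085," here, which awk treats as text, so the
# plain comparison finds nothing.
plain=$(awk -v id=19085 'id == $4 { print $2 }' pieces-list)

# Adding 0 forces a numeric comparison: awk reads the leading
# digits of "19085," and ignores the trailing comma.
numeric=$(awk -v id=19085 '$4 + 0 == id + 0 { print $2 }' pieces-list)

printf 'plain match:   [%s]\n' "$plain"
printf 'numeric match: [%s]\n' "$numeric"
```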

— Arithmetic and string operators

As in almost every programming language, we can perform arithmetic operations inside awk(1), using fields as the values to operate on:

$ awk '{result += $6} END{printf "total amount of items: %d\n", result}' pieces-list 

total amount of items: 732
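The heading also mentions string operators. In awk, placing two values next to each other concatenates them. Here is a brief sketch that mixes both kinds of operation; NR, the number of lines read so far, is a standard awk variable, and the message wording is made up for the example.

```shell
cd "$(mktemp -d)" || exit 1
cat > pieces-list <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# Arithmetic: add up field 6, then divide by the number of lines.
summary=$(awk '{ total += $6 }
END { printf "total: %d, average per line: %.1f\n", total, total / NR }' pieces-list)

# Strings: adjacent values are joined without any operator.
restock=$(awk '$6 + 0 == 0 { print $2 " from " $8 " needs restocking" }' pieces-list)

printf '%s\n%s\n' "$summary" "$restock"
```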

Working with sed

sed(1) automates actions that feel like a natural extension of interactive text editing. Most of these actions (replacing text, deleting lines, inserting new text, removing words…) could be done manually in a text editor.

Gathering all the editing instructions in one place and executing them in a single pass can turn hours of manual work into minutes of automated computing.

The command-line syntax to invoke sed(1) is:

$ sed [options] instructions inputFile

If we pass an empty instruction, sed(1) simply prints our file to the command line unchanged:

$ sed '' pieces-list

As you can see, the structure of calling sed is similar to calling awk.

Using sed instructions

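These are a few of the instructions we can combine:

/patternA/,/patternB/
s/orig/new/

The first one is an address range: it selects the lines from a match of patternA through the next match of patternB. The second one substitutes the first match of orig on each line with new.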

We can restrict the replacement to a single line by putting its line number before the s character.

2s/orig/new/

This will replace orig with new in the second line of the file.

We can also delete a pattern altogether by leaving the replacement empty:

$ sed 's/orig//g' inputFile

The trailing g is a flag that can be added to any substitution:

s/orig/new/g

This will replace every occurrence of the orig pattern found in the document with the new pattern, not just the first one on each line.

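Other common instructions write the processed lines to a file or delete them:

w /path/to/output_file
$ sed '1d' inputFile
$ sed '1!d' inputFile

The w instruction writes the lines it processes to the given file, 1d deletes the first line, and 1!d deletes every line except the first.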
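To compare these instructions side by side, the sketch below runs each of them against a three-line sample. The file names words.txt and copy.txt are made up for the example.

```shell
cd "$(mktemp -d)" || exit 1
printf 'alpha\nbeta\ngamma\n' > words.txt

second=$(sed '2s/a/X/' words.txt)     # line 2 only, first match only
every=$(sed 's/a/X/g' words.txt)      # every match on every line
no_first=$(sed '1d' words.txt)        # drop line 1
only_first=$(sed '1!d' words.txt)     # drop everything except line 1
sed 'w copy.txt' words.txt > /dev/null   # w saves each line to copy.txt

printf '2s/a/X/ :\n%s\n\n' "$second"
printf 's/a/X/g :\n%s\n\n' "$every"
printf '1d      :\n%s\n\n' "$no_first"
printf '1!d     :\n%s\n' "$only_first"
```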

Combining SED commands

Running multiple commands with sed(1) can be achieved by separating them with semicolons (;) inside the single quotes on the command line, or by writing all the commands into a file (by convention with the extension .sed) and passing it with the -f option.
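As a rough sketch, here are three equivalent ways to run the same two commands. The -e and -f options are standard sed options; the file name rename.sed is only an example, and pieces-list is recreated so the snippet runs on its own.

```shell
cd "$(mktemp -d)" || exit 1
cat > pieces-list <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# 1. Semicolons inside a single quoted string.
inline=$(sed 's/quant\./Quantity/; s/Man\./Manufacturer/' pieces-list)

# 2. One -e option per command.
with_e=$(sed -e 's/quant\./Quantity/' -e 's/Man\./Manufacturer/' pieces-list)

# 3. One command per line in a script file, loaded with -f.
cat > rename.sed <<'EOF'
s/quant\./Quantity/
s/Man\./Manufacturer/
EOF
from_file=$(sed -f rename.sed pieces-list)

[ "$inline" = "$with_e" ] && [ "$inline" = "$from_file" ] &&
    echo "all three forms produce the same output"
```

The semicolons have to stay inside the quotes. Outside them the shell would treat each one as the end of the sed command.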

Let's use some sed power to work inside our pieces-list file:

— Find and replace

$ sed 's/quant\./Quantity/' pieces-list
$ sed 's/Man\./Manufacturer/' pieces-list

Each command replaces the first match of the original pattern (quant. and Man.) on every line of pieces-list with the new text (Quantity and Manufacturer). The dot is escaped with a backslash because an unescaped dot in a regular expression matches any character.

We can also delete a pattern, here the double quote ("), by leaving the replacement empty:

$ sed 's/"//g' pieces-list

Now let's combine all three instructions at once:

$ sed 's/quant\./Quantity/; s/Man\./Manufacturer/; s/"//g' pieces-list

so the output for our pieces-list looks like this:

Name= Capacitor ID= 3456 Quantity= 204 Manufacturer= Bosch
Name= Battery ID= 2760 Quantity= 0 Manufacturer= Phillips
Name= Fan-Frame ID= 7864 Quantity= 131 Manufacturer= Mitsubishi
Name= Bluetooth-Emmiter ID= 19085, Quantity= 184 Manufacturer= Intel
Name= WiFi-Card, ID= 2941, Quantity= 115, Manufacturer= Intel
Name= Fan ID= 4512 Quantity= 98 Manufacturer= OEM

— Extract and edit

Another powerful feature of sed(1) is the ability to extract information from a file, edit it in memory and write the edited data to another file, without using pipelines.

Let's use a file for storing the instructions for sed(1).

$ vim extract.sed

We want to inspect a whole file, and we're not going to know the number of lines. We need to search from pattern one through to pattern two:

/Name=/,/Man.=/ 

so we work with the text contained between the start and the end pattern.

For the lines selected by that range we can open a block with curly brackets, much like the body of a function, to hold the commands we want to execute.

/Name=/,/Man.=/ {
s/"//g
s/.*Man.= *//g
w manufacturer_list
}

Now we can run sed(1) with the -f option, pointing it at this script file. It prints the result and also creates our manufacturer_list output file:

$ sed -f extract.sed pieces-list

Bosch
Phillips
Mitsubishi
Intel
Intel
OEM

Of course all of this can be scripted through pipelines but using just sed(1) we've achieved the same in fewer lines and less time.
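If we only care about the output file, a possible variation is sketched below. The -n option is a standard sed flag that suppresses the automatic printing; the post does not use it, and this script drops the address range for brevity.

```shell
cd "$(mktemp -d)" || exit 1
cat > pieces-list <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# Strip the quotes, drop everything up to the manufacturer name,
# then write what is left of each line to manufacturer_list.
cat > extract.sed <<'EOF'
s/"//g
s/.*Man\.= *//
w manufacturer_list
EOF

# With -n nothing reaches the terminal, but w still fills the file.
printed=$(sed -n -f extract.sed pieces-list)
saved=$(cat manufacturer_list)

printf 'printed to the terminal: [%s]\n' "$printed"
printf 'saved in manufacturer_list:\n%s\n' "$saved"
```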

Combining Awk and Sed

We've seen that we can clean our text data with sed(1) and format it with awk(1). Let's go a step further and combine both powers to get a better report.

An example of this combination can be:

$ sed 's/"//g' pieces-list | awk -f steps.awk

This way we remove the double quotes from all names and get a clean result.

We can sort the results using any field as the sort key. In this case we are going to use the manufacturer's name to produce a sorted list of the items:

$ sed 's/"//g' pieces-list | awk '{ print $8 " " $0 }' | sort | awk -f steps.awk

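We know what the sed(1) line does. Let's analyze the first awk(1) one: it prints the eighth field (the manufacturer) followed by the whole line ($0), so every line now starts with the manufacturer's name and sort can order the lines by it.

Our result is going to look a bit weird, though. The formatted list may look like this: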

MANUFACTURER   | PIECE NAME          | ID    | QUANTITY
-------------------------------------------------------

Man.=           Name=                     0           0
Man.=           Name=                     0           0
Man.=           Name=                     0           0
Man.=           Name=                     0           0
Man.=           Name=                     0           0
Man.=           Name=                     0           0

---- End of report. Time: 06:44 | Date: 2020-04-01 ----

Since we prepended the eighth field as a sort key, every line now has one extra field at the front, so each field number printed inside our steps.awk file has to be increased by one ($8 becomes $9, $2 becomes $3, and so on).
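One alternative, which this post does not use, is to leave the lines untouched and tell sort(1) which field to sort on with its -k option. A minimal sketch under that assumption, with LC_ALL=C set only to make the ordering predictable:

```shell
cd "$(mktemp -d)" || exit 1
cat > pieces-list <<'EOF'
Name= "Capacitor" ID= 3456 quant.= 204 Man.= "Bosch"
Name= "Battery" ID= 2760 quant.= 0 Man.= "Phillips"
Name= "Fan-Frame" ID= 7864 quant.= 131 Man.= "Mitsubishi"
Name= "Bluetooth-Emmiter" ID= 19085, quant.= 184 Man.= "Intel"
Name= "WiFi-Card", ID= 2941, quant.= 115, Man.= "Intel"
Name= "Fan" ID= 4512 quant.= 98 Man.= "OEM"
EOF

# -k 8,8 sorts on the eighth whitespace-separated field only, so
# the field numbers seen by awk afterwards stay the same.
sorted=$(sed 's/"//g' pieces-list | LC_ALL=C sort -k 8,8 | awk '{ print $8 }')
printf '%s\n' "$sorted"
```

Because no field is prepended, a formatting script that prints $8, $2, $4 and $6 would keep working without renumbering.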

Having to track all these steps individually and in different files is not practical at all. That's why writing shell scripts for multiple tasks is so handy (yes, we can call sed(1) and awk(1) from within a shell script!).

— Create a script named format-report.sh and open it.

#!/bin/sh
awk '{ print $8 " " $0 }' "$@" | sort |
awk 'BEGIN{ printf "\n%-15s %-22s %-5s %9s\n", "MANUFACTURER ", "| PIECE NAME ","| ID ","| QUANTITY"
print "-------------------------------------------------------\n"}
{printf "%-18s %-23s %6d %9d\n", $9, $3, $5, $7}
{ if ($7 < 1) printf "\nWarning! Item %d is out of stock.(%s from %s)\n", $5, $3, $9}
END {"date +'%Y-%m-%d'"|getline d; "date +'%H:%M'"|getline t; print "\n---- End of report. Time: " t " | Date: " d " ----"}' | 
sed 's/"//g'

To run the script, save it, make it executable (chmod +x format-report.sh), and pass pieces-list as the first argument:

$ ./format-report.sh pieces-list

We should see something similar to this:

MANUFACTURER   | PIECE NAME          | ID |    QUANTITY
-------------------------------------------------------

Bosch           Capacitor              3456         204
Intel           Bluetooth-Emmiter     19085         184
Intel           WiFi-Card              2941         115
Mitsubishi      Fan-Frame              7864         131
OEM             Fan                    4512          98
Phillips        Battery                2760           0

Warning! Item 2760 is out of stock.(Battery from Phillips)

---- End of report. Time: 07:40 | Date: 2020-04-01 ----

Summing up

A fundamental part of the power of *nix systems is pipes, and the ability to use them to combine programs as building blocks to create automated workflows.

We've seen several ways to manage text data without touching a text editor by hand, so now we can bring awk(1) and sed(1) into our pipe workflows and gain a new level of flexibility.