Shell scripting | Regular expressions

FreeBSD gearing-up



When working with text data, something that stands out in scripting is using regular expressions (aka regex). Regular expressions are symbolic notations we can use to identify patterns in text data.

Although regex are available in almost every scripting and programming language (if not all of them), the syntax varies slightly from language to language.

Regex 101

Regular expressions combine literals and meta-characters to form patterns. They can be used both inside shell scripts and alongside command line tools.

With regular expressions we can:

To practice regex on the command line in this episode, let's create a document containing some sample data.

Content of data.txt:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqurtuvwxyz

0123456789

big bag bug 
love live like

.[{()\^$|?*+    

codeberg.org
user@domain.com
admin@other.co.uk

432-111-7654   
234.111.4567

Sir Camelot James
Sir James
Mr. Watson
Mr John

A car should not fly
A plane cannot sail
A boat can float

The grep(1) tool searches one or more input files for lines containing a match to a specified pattern. By default, grep(1) prints the whole line that contains the regex match. There are ways to limit the output to the isolated match, for example by combining it with other command line tools. We are going to use regex with grep(1), but they also work with tools like ed(1), awk(1) or sed(1).
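One way to isolate the match is the -o flag. Both FreeBSD and GNU grep(1) provide it, although POSIX does not require it, so treat it as an assumption about the grep in use. A minimal sketch that feeds two sample lines from data.txt through a pipe so it can run on its own:

```shell
# two sample lines taken from data.txt
sample=$(printf '%s\n' 'Sir Camelot James' 'Mr John')

# default: grep prints every line that contains a match
whole=$(printf '%s\n' "$sample" | grep 'James')

# -o (an extension, not required by POSIX): print only the matched text
only=$(printf '%s\n' "$sample" | grep -o 'James')

printf 'whole line: %s\n' "$whole"
printf 'match only: %s\n' "$only"
```

The first printf shows the full line, the second only the text that the pattern matched.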

Let's find James in this example text file:

$ grep 'James' data.txt

Sir Camelot James
Sir James

The example above takes a literal ('James') and prints any line that contains the five characters J, a, m, e, s in that order.

$ grep 'J[oa]' data.txt

Sir Camelot James
Sir James
Mr John

In this case we're passing a meta-character pattern as regex. The regular expression will match any instance of the character J followed by either the character o or a.

Note that in both cases we enclosed the regex with quotes. This is important since the majority of the meta-characters used by regex are meaningful to the shell too.

The accepted meta-characters in regular expressions are:

$ ^ . - * ? + ( ) { } [ ] \ |

Let's take a more in-depth look at them. POSIX distinguishes between basic regex and extended regex.

Basic regex

Basic regex (BRE) treats only a small set of characters as meta-characters by default. Others, such as ( ) and { }, must be preceded by a backslash if we want to use them as meta-characters.

In basic regex we find anchors (^ and $), the any-character meta-character (.), character sets ([ ]), and the zero-or-more meta-character (*).
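The backslash rule is easier to see side by side. The following is a small sketch, not taken from the original text, that counts matching lines with grep -c for the same interval written in BRE, in ERE (the -E flag), and in BRE without the backslashes. The sample line comes from data.txt; the variable names are only illustrative.

```shell
line='432-111-7654'

# BRE: \{3\} is the interval "exactly three times"
bre=$(printf '%s\n' "$line" | grep -c '[0-9]\{3\}-' || true)

# ERE (-E): the same interval is written without backslashes
ere=$(printf '%s\n' "$line" | grep -E -c '[0-9]{3}-' || true)

# BRE without backslashes: { and } are literals, so nothing matches
literal=$(printf '%s\n' "$line" | grep -c '[0-9]{3}-' || true)

printf 'BRE escaped: %s, ERE: %s, BRE unescaped: %s\n' "$bre" "$ere" "$literal"
```

The `|| true` only keeps the sketch from failing when grep finds nothing and returns a non-zero exit status.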

Anchors

# print all directories and files in the home directory whose names begin with "Do"
$ cd
$ printf "%s\n" * | grep "^Do"
Documents
Downloads
# print all directories and files in the home directory whose names end with "s"
$ cd
$ printf "%s\n" * | grep "s$"
Documents
Downloads
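The two anchors can also be combined so that the pattern has to cover the whole line. A short sketch using the names from data.txt (the variable names are just for illustration):

```shell
# sample lines taken from data.txt
titles=$(printf '%s\n' 'Sir Camelot James' 'Sir James' 'Mr. Watson' 'Mr John')

# ^ anchors the pattern to the start of the line
starts=$(printf '%s\n' "$titles" | grep '^Mr')

# $ anchors the pattern to the end of the line
ends=$(printf '%s\n' "$titles" | grep 'James$')

# both anchors together: the whole line must be exactly "Sir James"
exact=$(printf '%s\n' "$titles" | grep '^Sir James$')

printf '%s\n' "$exact"
```

Without the anchors, 'Sir James' would still only match the second line here, but '^Sir James$' also rejects lines with anything before or after those words.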

Any Character

$ grep '.co' data.txt
user@domain.com
admin@other.co.uk

$ grep 'b.g' data.txt
big bag bug

$ grep 'l.v' data.txt
love live like

$ grep 'l..e' data.txt
love live like
A plane cannot sail

Character sets

# character set [ABC]

# negate set [^ABC]

# range [A-Z]

# combined range [A-Za-gh0-9]

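# [^\n\r] behaves roughly like dot (.) in flavors that accept escapes inside brackets; POSIX bracket expressions treat the backslash literally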
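To see the sets above in action with grep(1), here is a small sketch. The sample words b9g and bAg are made up for the example and are not in data.txt. LC_ALL=C is set as a precaution, on the assumption that ranges like [A-Z] may follow locale collation rules in other locales.

```shell
# use the C locale so ranges such as [A-Z] mean plain ASCII ranges
LC_ALL=C
export LC_ALL

words=$(printf '%s\n' 'big' 'bag' 'bug' 'b9g' 'bAg')

# character set: b, then a or i, then g
in_set=$(printf '%s\n' "$words" | grep 'b[ai]g')

# negated set: b, then anything except a or i, then g
negated=$(printf '%s\n' "$words" | grep 'b[^ai]g')

# range: b, then one uppercase letter, then g
range=$(printf '%s\n' "$words" | grep 'b[A-Z]g')

# combined range: b, then an uppercase letter or a digit, then g
combined=$(printf '%s\n' "$words" | grep 'b[A-Z0-9]g')

printf '%s\n' "$combined"
```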

Match zero or more

# look for the character b followed by the character e zero or more times
# since zero times is allowed, every line containing a b matches

$ grep 'be*' data.txt
abcdefghijklmnopqurtuvwxyz
big bag bug
codeberg.org
A boat can float
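Because the star also accepts zero repetitions, a common trick is to write the item once and then again with the star when at least one occurrence is required. A minimal sketch with a few made-up sample words:

```shell
words=$(printf '%s\n' 'big' 'berg' 'beet' 'car')

# be* : b followed by zero or more e, so any word with a b matches
zero_or_more=$(printf '%s\n' "$words" | grep 'be*')

# bee* : b, one mandatory e, then zero or more extra e
at_least_one=$(printf '%s\n' "$words" | grep 'bee*')

printf '%s\n' "$at_least_one"
```

Here 'big' matches the first pattern but not the second, since it has no e after the b.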

Extended regex

Extended regular expressions (ERE) are available in grep(1) through the -E command line flag. Other Unix utilities like awk(1) or egrep(1) use them by default.

Let's take a look at the meta-character functionality that extended regex offers to us.

Alternation

# two
$ grep -E 'car|plane' data.txt

A car should not fly
A plane cannot sail

# more than two
$ grep -E 'car|plane|boat' data.txt

A car should not fly
A plane cannot sail
A boat can float

Escape

# search for dot (.) and don't treat it as a metachar.
$ grep '\.' data.txt
.[{()\^$|?*+
codeberg.org
user@domain.com
admin@other.co.uk
234.111.4567
Mr. Watson
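The difference between an escaped and an unescaped dot shows up when the separator changes. A short sketch using the two phone numbers from data.txt:

```shell
numbers=$(printf '%s\n' '234.111.4567' '432-111-7654')

# unescaped dot: any character may follow 111, so both lines match
any_char=$(printf '%s\n' "$numbers" | grep '111.')

# escaped dot: only a literal dot may follow 111
literal_dot=$(printf '%s\n' "$numbers" | grep '111\.')

printf '%s\n' "$literal_dot"
```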

To represent non printable characters we can use the following:

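Combined with some specific letters after it, the backslash gives us more functionality: \s matches a whitespace character, \d a digit, and \w a word character (a letter, a digit, or the underscore). These shorthands are not part of POSIX regex, so support varies from tool to tool.

Typing the uppercase of s, d, and w (\S, \D, \W) makes the expression match the opposite of the lowercase version.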

Match zero or once
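In extended regex the question mark (?) makes the preceding item optional: it may appear once or not at all. A minimal sketch, where 'Mrs Smith' is a made-up line added only to show a non-match:

```shell
titles=$(printf '%s\n' 'Mr. Watson' 'Mr John' 'Mrs Smith')

# \.? : an optional literal dot, then a mandatory space
optional_dot=$(printf '%s\n' "$titles" | grep -E '^Mr\.? ')

printf '%s\n' "$optional_dot"
```

'Mrs Smith' is skipped because the character after 'Mr' is neither a dot nor a space.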

Match once or more
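In extended regex the plus sign (+) requires the preceding item at least once, which is the main difference from the star. A small sketch comparing both on made-up sample words:

```shell
words=$(printf '%s\n' 'bg' 'bug' 'buug')

# u+ : at least one u between b and g
one_or_more=$(printf '%s\n' "$words" | grep -E 'bu+g')

# u* : zero or more u between b and g, so bg matches too
zero_or_more=$(printf '%s\n' "$words" | grep -E 'bu*g')

printf '%s\n' "$one_or_more"
```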

Match n specific times

# {n} matches exactly n times the preceding item.
$ egrep '[0-9]{3}' data.txt
0123456789
432-111-7654  
234.111.4567

# {n,} matches at least n times
$ egrep '[2-6]{4,}' data.txt
0123456789

# {n,m} matches at least n, but no more than m times
$ egrep 'n{1,2}' data.txt
abcdefghijklmnopqurtuvwxyz
user@domain.com
admin@other.co.uk
Mr. Watson
Mr John
A car should not fly
A plane cannot sail
A boat can float

# {,m} matches at most m times (POSIX only defines {n}, {n,} and {n,m}, so {0,m} is the portable form)
# since zero repetitions also count, on its own it matches every line and prints the whole document.

We can take advantage of this to avoid long regular expression syntax. As an example, if we would like to find every phone number inside a clients_data file, we could do it as follows:

# American format 234.555.6789
\d{3}\.\d{3}\.\d{4}

# British format 7222 555 555
\d{4}\s?\d{3}\s?\d{3}
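Since \d is not part of POSIX extended regex, a version that should work with grep -E replaces it with the [0-9] range. The following is a sketch under that assumption, run against the numbers from data.txt rather than a real clients_data file:

```shell
numbers=$(printf '%s\n' '432-111-7654' '234.111.4567' '0123456789')

# dots only: three digits, dot, three digits, dot, four digits
dotted=$(printf '%s\n' "$numbers" | grep -E '^[0-9]{3}\.[0-9]{3}\.[0-9]{4}$')

# [.-] accepts either a dot or a hyphen as the separator
either=$(printf '%s\n' "$numbers" | grep -E '^[0-9]{3}[.-][0-9]{3}[.-][0-9]{4}$')

printf '%s\n' "$either"
```

The anchors keep longer digit runs such as 0123456789 from matching by accident.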

Grouping

# the optional group (Camelot ) matches whether Camelot is present or not
$ grep -E 'Sir (Camelot )?James' data.txt
Sir Camelot James
Sir James
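Grouping also limits how far an alternation reaches. In the sketch below, 'Row the boat' is a made-up line added to show the difference; the other lines come from data.txt.

```shell
lines=$(printf '%s\n' 'A car should not fly' 'A plane cannot sail' 'A boat can float' 'Row the boat')

# grouped: "A " at the start, then car or boat, then a space
grouped=$(printf '%s\n' "$lines" | grep -E '^A (car|boat) ')

# ungrouped: either "A car" at the start, or "boat" anywhere in the line
ungrouped=$(printf '%s\n' "$lines" | grep -E '^A car|boat')

printf '%s\n' "$grouped"
```

Without the parentheses the alternation splits the whole pattern in two, so 'Row the boat' matches as well.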