Shell scripting | Regular expressions

Index

Regex 101
Basic regex
Extended regex

When working with text data, something that stands out in scripting is using regular expressions (aka regex). Regular expressions are symbolic notations we can use to identify patterns in text data.

Although regular expressions are available in almost every scripting and programming language, their syntax varies slightly from one language to another.

Regex 101

Regular expressions combine literals and meta-characters to form patterns. They can be used both inside shell scripts and alongside command-line tools.

With regular expressions we can:

Search in huge text files to find specific words.
Validate input to match what the program may require (e.g., don't accept letters when asking for a number in input).
Replace particular words or letters automatically in a document (e.g., uppercase words' first letter after a dot).
Coordinate actions inside the command line tools (e.g., redirecting some parameters only if a regex condition is met).
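
As a minimal sketch of the input validation case, the helper below accepts a value only when it is made entirely of digits. The name is_number is hypothetical and not part of any tool; the sketch assumes single-line values and a grep(1) that supports the -E and -q flags.

```shell
# is_number: hypothetical helper that succeeds only when its first
# argument is a single line consisting of one or more digits.
is_number() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+$'
}

if is_number "42"; then
    echo "42 is accepted"
fi

if ! is_number "4x2"; then
    echo "4x2 is rejected"
fi
```

The -q flag keeps grep(1) silent, so only its exit status is used to drive the if statement.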

To work with regex on the command line in this episode, let's create a document containing some random data.

Content of data.txt:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqurtuvwxyz

0123456789

big bag bug 
love live like

.[{()\^$|?*+    

codeberg.org
user@domain.com
admin@other.co.uk

432-111-7654   
234.111.4567

Sir Camelot James
Sir James
Mr. Watson
Mr John

A car should not fly
A plane cannot sail
A boat can float

The grep(1) tool searches one or more input files for lines containing a match to a specified pattern. By default grep(1) prints the whole line that contains the regex match; it is possible to narrow the output down to the isolated match by combining it with other command-line tools. We are going to use regex with grep(1), but they also work with tools like ed(1), awk(1) or sed(1).
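
One way to get only the matched text, sketched here under the assumption that the installed grep(1) has the -o flag (GNU and BSD grep provide it, but POSIX does not require it), is shown below. The input lines are fed with printf so the example does not depend on data.txt.

```shell
# -o prints each match on its own line instead of the whole line.
# Assumption: -o is available; it is a common extension, not POSIX.
only_match=$(printf '%s\n' 'Sir Camelot James' 'Mr John' | grep -o 'J[a-z]*')
printf '%s\n' "$only_match"
```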

Let's find James in this example text file:

$ grep 'James' data.txt

Sir Camelot James
Sir James

The example above takes a literal ('James') and prints any line that contains the five characters J, a, m, e, s in that order, one right after another.

$ grep 'J[oa]' data.txt

Sir Camelot James
Sir James
Mr John

In this case we're passing a meta-character pattern as regex. The regular expression will match any instance of the character J followed by either the character o or a.

Note that in both cases we enclosed the regex with quotes. This is important since the majority of the meta-characters used by regex are meaningful to the shell too.
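
The following sketch illustrates why the quotes matter. It assumes mktemp(1) with the -d flag is available, and it creates a file named Jo in a scratch directory so that the shell has something to expand the unquoted pattern to.

```shell
# Scratch directory holding a file whose name matches the glob J[oa].
scratch=$(mktemp -d)
touch "$scratch/Jo"

# Unquoted: the shell expands J[oa] to the file name Jo before grep runs,
# so grep looks for the literal text "Jo" and misses "James".
unquoted=$(cd "$scratch" && printf '%s\n' 'Sir James' 'Mr John' | grep J[oa])

# Quoted: grep receives the pattern J[oa] untouched.
quoted=$(cd "$scratch" && printf '%s\n' 'Sir James' 'Mr John' | grep 'J[oa]')

rm -r "$scratch"

printf 'unquoted:\n%s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
```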

The accepted meta-characters in regular expressions are:

$ ^ . - * ? + ( ) { } [ ] \ |

Let's take a more in-depth look at them. POSIX distinguishes between basic regex and extended regex.

Basic regex includes ^ $ . [ ] * and the backslash \ as escape character.
Extended regex adds { } ( ) ? + |

Basic regex

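Basic regex (BRE) treats any character outside its defined set of meta-characters as a literal. To use { } or ( ) as meta-characters in BRE, we have to put a backslash in front of them; some implementations, such as GNU grep, extend this to \? \+ and \|.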
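
A small sketch of that backslash rule, using two sample lines invented for the illustration, compares the same interval in BRE and in ERE (grep -E):

```shell
# Two hypothetical sample lines: a doubled letter and the literal text a{2}.
sample=$(printf '%s\n' 'aa' 'a{2}')

# BRE: unescaped braces are literal, so this finds the text "a{2}".
bre_literal=$(printf '%s\n' "$sample" | grep 'a{2}')

# BRE: escaped braces act as the interval operator, so this finds "aa".
bre_interval=$(printf '%s\n' "$sample" | grep 'a\{2\}')

# ERE: braces are the interval operator without any backslash.
ere_interval=$(printf '%s\n' "$sample" | grep -E 'a{2}')

printf 'BRE a{2}     -> %s\n' "$bre_literal"
printf 'BRE a\\{2\\}   -> %s\n' "$bre_interval"
printf 'ERE a{2}     -> %s\n' "$ere_interval"
```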

In basic regex we can find anchors, the any-character meta-character, character sets, and the zero-or-more meta-character.

Anchors

Caret ^ matches the beginning of a line. This matches a position, not a character.

# print the files and directories in the home directory whose names begin with "Do"
$ cd
$ printf "%s\n" * | grep "^Do"
Documents
Downloads

Dollar sign $ matches the end of a line. This matches a position, not a character.

# print the files and directories in the home directory whose names end with "s"
$ cd
$ printf "%s\n" * | grep "s$"
Documents
Downloads
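
Both anchors can also be combined to require that the whole line matches. The sketch below uses a few sample names fed through printf instead of a directory listing, so the lines themselves are an assumption made for the illustration.

```shell
names=$(printf '%s\n' 'Sir Camelot James' 'Sir James' 'James Sir')

# Lines that start with "Sir".
starts=$(printf '%s\n' "$names" | grep '^Sir')

# Lines that end with "James".
ends=$(printf '%s\n' "$names" | grep 'James$')

# Both anchors: the line must be exactly "Sir James".
exact=$(printf '%s\n' "$names" | grep '^Sir James$')

printf 'exact match: %s\n' "$exact"
```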

Any Character

Period (or dot) . matches any single character except line breaks. Inside a regular expression it increases the length of the required match.

$ grep '.co' data.txt
user@domain.com
admin@other.co.uk

$ grep 'b.g' data.txt
big bag bug

$ grep 'l.v' data.txt
love live like

$ grep 'l..e' data.txt
love live like

Character sets

Square brackets [] match any one of the characters listed between them. The list can be made of individual characters or a range, and it can be a positive match or a negated one.

# character set [ABC]

# negate set [^ABC]

# range [A-Z]

# combined range [A-Za-gh0-9]

# in PCRE-style flavors, an equivalent to dot (.) is [^\n\r];
# inside a POSIX bracket expression the backslash is a literal character
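
The set forms listed above can be tried out with a quick sketch. The sample words are invented for the illustration, and LC_ALL=C is set for the letter range because the meaning of ranges such as [A-Z] can depend on the locale.

```shell
words=$(printf '%s\n' 'big' 'bag' 'bug' 'b9g' 'bEg')

# Character set: b, then a or i, then g.
set_match=$(printf '%s\n' "$words" | grep 'b[ai]g')

# Negated set: b, then anything except a or i, then g.
negated=$(printf '%s\n' "$words" | grep 'b[^ai]g')

# Digit range.
digit_range=$(printf '%s\n' "$words" | grep 'b[0-9]g')

# Upper-case range, evaluated in the C locale.
upper_range=$(printf '%s\n' "$words" | LC_ALL=C grep 'b[A-Z]g')

printf '%s\n' "$set_match"
```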

Match zero or more

Asterisk * matches zero or more occurrences of the previous character.

# look for the character b followed by the character e zero or more times

$ grep 'be*' data.txt
abcdefghijklmnopqurtuvwxyz
big bag bug
codeberg.org
A boat can float
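
Since zero occurrences also count, 'be*' matches every line that contains a b. A sketch with a character after the starred item makes the effect easier to see; the sample words are made up for the illustration.

```shell
lines=$(printf '%s\n' 'bt' 'bet' 'beet' 'bat')

# b, then zero or more e, then t: "bat" is excluded because of the a.
star=$(printf '%s\n' "$lines" | grep 'be*t')

# .* means "any character, zero or more times", so every line
# that starts with b and ends with t matches.
any_chars=$(printf '%s\n' "$lines" | grep '^b.*t$')

printf '%s\n' "$star"
```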

Extended regex

Extended regular expressions (ERE) are available in *nix utilities like grep(1) through the command-line flag -E. Other Unix utilities like awk(1) use ERE by default, as does egrep(1), an older equivalent of grep -E that current grep implementations treat as obsolescent.

Let's take a look at the meta-character functionality that extended regex offers to us.

Alternation

Vertical bar | acts like a logical OR operator: it matches the expression before or after the vertical bar. The alternatives are tested in order.

# two
$ grep -E 'car|plane' data.txt

A car should not fly
A plane cannot sail

# more than two
$ grep -E 'car|plane|boat' data.txt

A car should not fly
A plane cannot sail
A boat can float

Escape

Backslash \ removes or adds special meaning to the next character. (Handy when we want to look for a character that is actually a meta-character).

# search for a dot (.) and don't treat it as a meta-character.
$ grep '\.' data.txt
.[{()\^$|?*+
codeberg.org
user@domain.com
admin@other.co.uk
234.111.4567
Mr. Watson

To represent non-printable characters, many regex flavors (awk(1) among them) accept the following escapes; POSIX grep(1) does not define them:

\t matches a tab.
\r matches a carriage return.
\n matches a newline.

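Combined with some specific letters after it, the backslash gives us more functionality. These shorthands come from Perl-style regex and are not part of POSIX: GNU grep accepts \s and \w as extensions, while \d needs a Perl-compatible mode such as grep -P.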

\s matches anything considered a white space, like tabs, line breaks, etc.
\d matches any digit. A handy alternative to [0-9].
\w matches anything considered a word character.

Typing s, d, or w in uppercase makes the expression search for the opposite of the lowercase version.

\S matches anything not considered a white space.
\D matches anything not considered a digit.
\W matches anything not considered a word character.
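
Where those shorthands are not available, POSIX character classes inside a bracket expression offer a portable alternative. The sketch below assumes [[:space:]], [[:digit:]] and [^[:alnum:]_] as rough stand-ins for \s, \d and \W; the sample lines are invented for the illustration.

```shell
sample=$(printf '%s\n' 'abc' '123' 'a 1' 'x_y')

# Lines containing a digit (portable stand-in for \d).
digits=$(printf '%s\n' "$sample" | grep '[[:digit:]]')

# Lines containing white space (portable stand-in for \s).
spaces=$(printf '%s\n' "$sample" | grep '[[:space:]]')

# Lines containing a non-word character (portable stand-in for \W).
non_word=$(printf '%s\n' "$sample" | grep '[^[:alnum:]_]')

printf '%s\n' "$digits"
```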

Match zero or once

Question mark ? matches the preceding element (a character or a group) zero or one time.
- ab?c matches either ac or abc.
- (ab)? matches '' or ab.
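
The two cases above can be checked with a short sketch. The candidate strings are made up, and anchors are added so that the whole line has to fit the pattern.

```shell
# Optional single character: b may appear once or not at all.
optional_char=$(printf '%s\n' 'ac' 'abc' 'abbc' | grep -E '^ab?c$')

# Optional group: the whole group "ab" may appear once or not at all.
optional_group=$(printf '%s\n' 'c' 'abc' 'ababc' | grep -E '^(ab)?c$')

printf '%s\n' "$optional_char"
printf '%s\n' "$optional_group"
```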

Match once or more

Plus sign + matches one or more occurrences of the preceding character.
- ab+c matches abc, abbbc but not ac.
- [abc]+ matches a, b, c, ca, cba, abccb, etc.
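
A similar sketch for the plus sign, again with invented candidate strings and anchored patterns:

```shell
# At least one b is required between a and c.
one_or_more=$(printf '%s\n' 'ac' 'abc' 'abbbc' | grep -E '^ab+c$')

# One or more characters, each taken from the set a, b, c.
set_plus=$(printf '%s\n' 'cba' 'abd' 'xyz' | grep -E '^[abc]+$')

printf '%s\n' "$one_or_more"
printf '%s\n' "$set_plus"
```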

Match `n` specific times

Curly brackets {} match the preceding element the number of times defined inside them. The count can be fixed (n) or cover a range between n and m.

# {n} matches the preceding item exactly n times.
# Without anchors, a longer run of digits still contains a run of three.
$ grep -E '[0-9]{3}' data.txt
0123456789
432-111-7654
234.111.4567

# {n,} matches at least n times
$ grep -E '[2-6]{4,}' data.txt
0123456789

# {n,m} matches at least n, but no more than m times
$ grep -E 'n{1,2}' data.txt
abcdefghijklmnopqurtuvwxyz
user@domain.com
admin@other.co.uk
Mr. Watson
Mr John
A car should not fly
A plane cannot sail
A boat can float

# {,m} matches at most m times (a GNU extension; POSIX does not define it)
# Because zero occurrences also count as a match, using it on its own
# prints the whole document.

We can take advantage of this to avoid long regular expressions. For example, to find every phone number inside a clients_data file we could use patterns like these:

# American format 234.555.6789
\d{3}\.\d{3}\.\d{4}

# British format 7222 555 555
\d{4}\s?\d{3}\s?\d{3}
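
Because \d and \s are not understood by every grep(1), a portable sketch of the same two patterns can use [0-9] and [[:space:]] with grep -E. The phone numbers below are sample input invented for the illustration, and the anchors are an added assumption that each number sits alone on its line.

```shell
phones=$(printf '%s\n' '234.555.6789' '234x555y6789' '7222 555 555' '7222555555' '12 34')

# American format: three digits, dot, three digits, dot, four digits.
american=$(printf '%s\n' "$phones" |
    grep -E '^[0-9]{3}\.[0-9]{3}\.[0-9]{4}$')

# British format: groups of 4, 3 and 3 digits with optional white space.
british=$(printf '%s\n' "$phones" |
    grep -E '^[0-9]{4}[[:space:]]?[0-9]{3}[[:space:]]?[0-9]{3}$')

printf 'American: %s\n' "$american"
printf 'British:\n%s\n' "$british"
```

Escaping the dots matters here: with a bare dot, the line 234x555y6789 would also pass as an American number.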

Grouping

Parentheses () group several characters of a regular expression together, so that a following operator applies to the whole group.

# this will match whether Camelot is present or not
$ grep -E 'Sir (Camelot )?James' data.txt
Sir Camelot James
Sir James
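
Groups also combine with alternation. As a sketch, the pattern below keeps the lines that start with one of two titles; the line 'Ms Smith' is not in data.txt and is added only to show a line that gets filtered out.

```shell
people=$(printf '%s\n' 'Sir Camelot James' 'Sir James' 'Mr. Watson' 'Mr John' 'Ms Smith')

# The group limits the alternation to the title: either "Sir" or "Mr"
# with an optional literal dot, followed by a space.
titled=$(printf '%s\n' "$people" | grep -E '^(Sir|Mr\.?) ')

printf '%s\n' "$titled"
```

Without the parentheses, the alternation would split the whole pattern into '^Sir' and 'Mr\.? ' instead of just the title.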

vertex farm

n0mad coder's blog
