Regular Expression
To demonstrate various examples, we will use a large text file
A Room with a View by E. M. Forster
downloaded from Project Gutenberg. We will use egrep
program to find specific matches in the file. egrep is grep -E which allows
interpretation of pattern as an extended regular expression. We will wrap our
the regular expression in ' (single quotes, they are not part of the regular
expression) e.g.,
egrep '^cat' A_Room_with_a_View_by_E_M_Forster.txt
Similarly, when regular expressions are processed by certain languages, we need
to provide the regular expression enclosed between /, they are not part of
regular expression but required by the parser to identify regex.
grep (g/re/p) stands for Global Regular Expression Print.
Simple character match
egrep 'cat' A_Room_with_a_View_by_E_M_Forster.txt
It will match all instances of cat such as: delicate, indicated, catching, located etc.
Meta-characters
Wild card: . (dot) is the wild card meta-character, it matches any
character including spaces except the newline character. E.g., /h.t/ will
match hot, hit, hat, that, with the, wish to etc. Similarly, /.a.a.a/ will
match banana or papaya.
Escape meta-character: /9.00/ will match 9.00, 9500, 9:00. What do we do
if we want to look for 9.00? Escape character comes to rescue: /9\.00/. In
case we want a literal backslash, we can use /\\/. Do not use escape
characters for characters that are not meta-character such as quotation marks.
Tab character: \t, newline: \n, \r or \r\n.
Anchors
| Meta-character | Meaning |
|---|---|
^ | Start of line |
$ | End of line |
\A | start of string, never the end of line |
\Z | end of string, never the end of line |
Line start: Regular expression contains various meta-characters such as
^ in the above example. It means the beginning of the line, i.e., /^cat/
matches lines that start with cat.
Line ending: $ represents the matches at the end of the line. Note that if
the line ending is CRLF (\r\n) instead of LF (\n), the UNIX/Linux grep might
not work as expected. Convert the line ending from DOS to Unix:
dos2unix filename.txt
# if dos2unix is not installed, below is the command for Ubuntu/ Debian
sudo apt update
sudo apt install dos2unix
Let's say we want to validate a list of email addresses. We assume that valid
TLD is either 2 or 3 characters long (e.g., .in, .com, etc.):
/^\w+@\w+\.[a-z]{2,3}$/
Line start and line ending might not work as we expect in single-line mode, to
enable multiline mode, we may need to add m after the regular
expression: /^\w+@\w+\.[a-z]{2,3}$/mgi, g for global, and i for case
insensitive.
Character class
Character set: [ea] can be used to match either e or a, e.g., to match
grey or gray, /gr[ea]y/ or equivalently, /gr[ae]y/, however it will not
match great. If we need to find any line starting with cat or Cat we could
use: /^[Cc]at/.
Character range: Say we are looking for HTML heading tags, we could use:
<h[123456]> it will match either of <h1>, <h2>, ..., <h6>. We could
simplify the above expression by using range meta-character: <h[1-6]>. [0-9]
matches any number, [a-z] matches any lowercase English letters. We could use
31[./-]10[./-]2022 to match a certain date with any one of the usual
separators. The range not necessarily can have numeric or uppercase/lowercase,
they are the ASCII character ranges.
Negative character set: in the above example, if we used /<h^[7-9]>/, it
means matches that are not <h7>, <h8> or <h9>. Similarly, we could try to
look for words that have q followed by another letter that is not u: /q[^u]/.
/see[^mn]/ will match seek or seed but not seen, seem or see. However, it will
also match "see." or "see ". /[^a-z0-9A-Z]/ will negate all three ranges, not
just the first one. Notice that - is a meta-character inside a character set,
otherwise it is just a regular character. On the other hand ^ has different
meanings: inside a character it makes a negative character set, otherwise, it is
an anchor for the line beginning. Characters (except, ], -, ^, and \)
inside a character set are regular characters, we do not need to escape them.
E.g., . is not a wild card inside a character set but a literal dot.
/[[(][0-9][)\]]/ will match either a number between parentheses or square
brackets. Notice how we needed to escape the closing square bracket. Match
virtue but not virtues: /virtue[^s\s]/.
Matching any of several sub-expressions: ^cat|ion$ will match any line
starting with cat or ending with ion. ^cat and ion$ are separate
expressions. In our example gr[ea]y could be written as grey|gray. We can
match 1st or First by (1|Fir)st.
Case insensitive: We can use the -i flag with egrep to make our lookup
case insensitive.
Shorthand character sets:
| Shorthand | Meaning | Equivalent |
|---|---|---|
\d | Digit | [0-9] |
\w | Word character | [a-zA-Z0-9_] |
\s | Whitespace | [\r\n\t] |
\D | Non digit | [^0-9] |
\W | Non word char | [^a-zA-Z0-9_] |
\S | Not whitespace | [^\r\n\t] |
/\d\d\d\d/ matches 1947, /\w\w\w/ matches cat, Car, 123, a_1. /[\w\-]/
will incorporate - with the word character. Note that /^\d\s/ is not the
same as /\D\S/, it is the same as /\D\s/. Note that this shorthand might not
be supported by all regex engines.
Word boundaries
Match any word that starts with cat:
egrep '\<cat' A_Room_with_a_View_by_E_M_Forster.txt
Match any word that ends with ion:
egrep 'ion\>' A_Room_with_a_View_by_E_M_Forster.txt
Word boundaries will act whenever there is a word character followed by a non-
word character. Word characters are [a-z][A-Z][0-9]_. Any word boundaries are
denoted by \b. Find the standalone letter a's: /\ba\b/.
Optional
How to match either color or colour: colou?r. July or Jul: July?. January or
Jan: Jan(uary)?. Note that the expression inside the parenthesis could be more
complex, but here it is used for simple grouping for the optional.
Repetition
| Meta-character | Meaning |
|---|---|
* | Preceding item zero, or more times |
+ | Preceding item one or more times |
? | Preceding item one or zero times |
Match <h3 size=14>:
egrep <h3 +size *= *14 *> filename.txt
/apples*/ will match apple, apples, or applesss. /apples+/ will match
apples, or applesss but not apple. /apples?/ will match apple and apples, but
not applesss.
Match any size value above:
egrep <h3 +size *= *[0-9]+ *> filename.txt
Match any tag with or without size attribute:
egrep <h3( +size *= *[0-9]+)? *> filename.txt
Range quantifiers
Match 8 to 10 digit numbers: [0-9]{8,10}. {0,1} is same as ?; {1,} is
same as +; {0,} same as *. Validate US phone numbers:
/\d{3}-\d{3}-\d{4}/ matches 234-456-6789.
Grouping
We can use () to group. For example, /(abc)+/ will match abc, or abcabc,
or abcabcabc, etc.
Resources
- Online Regex Playground: https://regexr.com
- Online Visual Regex Playground: https://extendsclass.com/regex-tester.html