Regular Expressions (Regex) Tutorial


The command egrep is like grep -E and interprets the pattern as an extended regular expression.
The following command shows all the lines that begin with From or Subject in the Inbox file:
cat Inbox|egrep '^(From|Subject)'

^ matches at the beginning of a line (caret symbol)
$ matches at the end of a line (dollar symbol)

Command                                   Matching lines
cat file.txt | egrep '^sorgonet.com'      sorgonet.com is nice
cat file.txt | egrep 'sorgonet.com$'      I like sorgonet.com
cat file.txt | egrep '^sorgonet.com$'     sorgonet.com
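
The three anchored patterns can also be tried without a file. This is only a small sketch: the sample text is made up to stand in for file.txt, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical sample lines standing in for file.txt.
sample='sorgonet.com is nice
I like sorgonet.com
sorgonet.com'

# ^ anchors the pattern to the start of the line (lines 1 and 3 match).
starts=$(printf '%s\n' "$sample" | grep -E '^sorgonet.com')

# $ anchors the pattern to the end of the line (lines 2 and 3 match).
ends=$(printf '%s\n' "$sample" | grep -E 'sorgonet.com$')

# Both anchors together: the whole line must be the pattern (line 3 only).
whole=$(printf '%s\n' "$sample" | grep -E '^sorgonet.com$')

printf 'starts with:\n%s\n\n' "$starts"
printf 'ends with:\n%s\n\n' "$ends"
printf 'whole line:\n%s\n' "$whole"
```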

You can express a list of characters by using [...]
The [...] is called a character class.

cat file.txt | egrep 'I have [Aa] website'
cat file.txt | egrep 'try in gr[ae]y color'

A hyphen inside a character class gives a range of characters, such as [a-z]

cat file.txt | egrep 'the three digit code is [0-9][0-9][0-9]'

Multiple ranges are also allowed:
[a-zA-Z0-9] will match any letter or number in that position.
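
As a rough illustration of character classes and ranges, the patterns above can be run against a few invented lines (the sample text is an assumption, not part of the original file.txt):

```shell
# Hypothetical sample lines standing in for file.txt.
sample='I have a website
I have A website
try in grey color
try in groy color
the three digit code is 042
the three digit code is 4x2'

# [Aa] accepts either case of the letter a in that position.
website=$(printf '%s\n' "$sample" | grep -E 'I have [Aa] website')

# [ae] accepts grey or gray, but not groy.
grey=$(printf '%s\n' "$sample" | grep -E 'try in gr[ae]y color')

# [0-9] is a range; three of them in a row require three digits.
code=$(printf '%s\n' "$sample" | grep -E 'the three digit code is [0-9][0-9][0-9]')

printf '%s\n' "$website" "$grey" "$code"
```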

Be careful: a ^ at the start of a character class indicates negation, so [^a-z] will match any character outside the a-z range. For example:
cat file.txt | egrep '^[^a-zA-Z]'
will match lines whose first character is not a letter.
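
A short sketch of the two meanings of the caret, using invented sample lines: outside the brackets it anchors to the start of the line, inside them it negates the class.

```shell
# Hypothetical sample lines standing in for file.txt.
sample='42 apples
apples
#comment
Zebra'

# The first ^ is the line anchor, the second ^ negates the class,
# so only lines starting with a non-letter are kept.
not_letter=$(printf '%s\n' "$sample" | grep -E '^[^a-zA-Z]')

printf '%s\n' "$not_letter"
```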

The dot . matches any single character (when used outside a character class; inside one it is a literal dot)
egrep '.a' file.txt
Will match lines like: 1234a, jjjjjjaaa, uuauuuu
But not: ammm, because no character comes before the a.
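
The four example lines can be checked directly. This sketch feeds them to grep -E (used here in place of egrep) instead of reading file.txt:

```shell
# The four example lines from the text above.
sample='1234a
jjjjjjaaa
uuauuuu
ammm'

# '.a' needs some character followed by an a, so ammm is left out:
# its only a is the first character of the line.
dot_matches=$(printf '%s\n' "$sample" | grep -E '.a')

printf '%s\n' "$dot_matches"
```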

The symbol | means or
egrep '^(david|john|philip)' file.txt
Matches any line beginning with david or john or philip.
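
The parentheses matter here. As a small sketch with invented lines, compare the grouped pattern with the same pattern without parentheses, where the ^ only applies to the first alternative:

```shell
# Hypothetical sample lines standing in for file.txt.
sample='david went home
john is here
philip too
I met john'

# Grouped: the ^ applies to all three names.
grouped=$(printf '%s\n' "$sample" | grep -E '^(david|john|philip)')

# Ungrouped: this reads as "^david" or "john" or "philip",
# so a john anywhere in the line is enough.
ungrouped=$(printf '%s\n' "$sample" | grep -E '^david|john|philip')

printf 'grouped:\n%s\n\n' "$grouped"
printf 'ungrouped:\n%s\n' "$ungrouped"
```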

When you want to match a whole word, the \< \> symbols come in handy: they detect "automagically" a single word by checking the boundary characters.
egrep '\<[dD]avid\>' file.txt


Quantifiers

The metacharacters +, * and ? are called quantifiers
+ will match the preceding item one or more times
* will match the preceding item zero or more times

The character ? means optional and is used after the character that may or may not be there.
egrep 'encyclopa?edia' file.txt
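
To see the three quantifiers side by side, here is a minimal sketch with made-up input. The patterns for + and * are anchored with ^ and $ so that the whole line has to fit.

```shell
# Hypothetical sample lines standing in for file.txt.
sample='encyclopedia
encyclopaedia
encyclopaaedia
ab
abbb
a'

# ? makes the a optional: zero or one a, but not two.
optional=$(printf '%s\n' "$sample" | grep -E 'encyclopa?edia')

# + needs at least one b after the a.
plus=$(printf '%s\n' "$sample" | grep -E '^ab+$')

# * also accepts no b at all, so the lone a matches too.
star=$(printf '%s\n' "$sample" | grep -E '^ab*$')

printf 'optional:\n%s\n\n' "$optional"
printf 'plus:\n%s\n\n' "$plus"
printf 'star:\n%s\n' "$star"
```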

egrep -i ignores case. This is not part of the regular expression language, but it is handy to know.

Using Perl with regular expressions.

Perl allows you to use much more complex regular expressions (regex) than egrep, and there are slight differences in notation.
Sample code in Perl:

if ($answer =~ m/^[a-zA-Z]+$/) {
    print "only letters\n";
} else {
    print "not only letters\n";
}

The surrounding m/..../ means to attempt a regular expression match, and the slashes delimit the regular expression itself.
The operator =~ links the string to be searched with the regular expression. You can read the operator =~ as "matches".

$1, $2, $3 and so on are special variables in Perl that hold the parts of the string matched by each pair of parentheses () in the regex, for example:
if ($result =~ m/([a-h][0-9])(a|p)/) {
    print $1; # prints the letter and the number matched by the first parentheses
    print $2; # prints what the last parentheses matched, the letter a or the letter p
}

The operator =~ means match.
The operator !~ means does not match.

Character classes are slightly different in egrep and Perl.
Remember that character classes are enclosed between []
[\t] matches a TAB
[\n] matches a newline
[\b] matches a backspace
\b means word boundary in a regex, but that makes no sense within a character class, so inside a character class it represents a backspace character instead.
Perl has the metacharacter \s, which means "whitespace character"; this includes, among others, space, tab, newline and carriage return.
Useful shorthands that Perl provides us:

\w the same as [a-zA-Z0-9_] for plain ASCII text, a word character
\W anything not \w
\d the same as [0-9] a digit
\D anything not a digit [^0-9]

Modifiers.
Modifiers are placed after the m/..../

$result =~ m/[a-z]/i
/i tells Perl to do the match in a case-insensitive manner. It's not part of the regex, but part of the m/.../ syntactic packaging.

Replacing text using regex.
Instead of using m/.../ we can use s/.../.../ that is:
s/stringtosearchandreplace/stringtoreplacewith/

$result =~ s/Sorgo/Sorgonet/;
That will search for the string Sorgo and replace the first occurrence with the string Sorgonet.
Adding /g means a global match, which changes every occurrence in the text instead of only the first one.
$result =~ s/Sorgo/Sorgonet/g;

A tricky example is:
$result =~ s/SoRGo/Sorgonet/ig;
That will replace all strings like sorgo, sorGO, SoRgo or SoRgO with the string Sorgonet exactly as written. Case doesn't matter when matching the first string, but the replacement is always exactly Sorgonet, with only its first letter S in uppercase. I added /ig to show that modifiers can be combined: /ig matches globally and case-insensitively.

We can replace a string in a file easily with only one line in Perl:
% perl -p -i -e 's/Sorgo/Sorgonet/g' file
-e indicates that the Perl code is given on the command line, -p runs that code on every line of the file and prints the result, and -i writes the changes back to the file in place.

Intervals

Intervals are like a "counting quantifier" where you specify the minimum number of matches you need and the maximum number to allow.
[a-z]{3} matches exactly three lowercase letters
[a-z]{1,5} matches at least one and at most five lowercase letters
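
Intervals also work in extended regular expressions, so they can be tried with grep -E. In this sketch the sample words are invented, and the patterns are anchored with ^ and $ so that the count applies to the whole line.

```shell
# Hypothetical sample words, one per line.
sample='abc
ab
abcdef
hello'

# Exactly three lowercase letters on the line.
three=$(printf '%s\n' "$sample" | grep -E '^[a-z]{3}$')

# Between one and five lowercase letters on the line.
one_to_five=$(printf '%s\n' "$sample" | grep -E '^[a-z]{1,5}$')

printf 'exactly 3:\n%s\n\n' "$three"
printf '1 to 5:\n%s\n' "$one_to_five"
```

Without the anchors, [a-z]{3} would also match abcdef, since three lowercase letters can be found somewhere inside it.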

The use of parentheses

To match a From: line in an email you can use this regex: m/^From: /
But if you want to use the sender later in your program, you had better use m/^From: (.*)/ because then the Perl variable $1 will contain the string that comes after From: (the dot means any character and the star means 0 or more times).
print $1; # in Perl, prints who is sending the email.

Regular Expressions Examples:

To match an IP address, we need 4 numbers separated by dots, each from 0 to 255. Each number can have one, two or three digits.
\d|\d\d|[01]\d\d|2[0-4]\d|25[0-5]
or we can write it shorter:
[01]?\d\d?|2[0-4]\d|25[0-5]
This line matches a number from 0 to 255, and you need it 4 times, each in parentheses, to have a complete IP address:
^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$
The caret at the beginning means that the string must start with the first number, and the dollar at the end means that it must finish with the last number. The dot must be escaped as \. or else it will mean any character.
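
The same idea can be sketched for grep -E. Since \d is a Perl shorthand, [0-9] is used instead, and the repeated part is kept in a variable. The names octet, ip_regex and is_ip are made up for this example.

```shell
# One number from 0 to 255, written with [0-9] instead of \d.
octet='([01]?[0-9][0-9]?|2[0-4][0-9]|25[0-5])'

# Four of them separated by escaped dots, anchored at both ends.
ip_regex="^$octet\\.$octet\\.$octet\\.$octet\$"

# Hypothetical helper: succeeds when its argument looks like an IP address.
is_ip() {
    printf '%s\n' "$1" | grep -Eq "$ip_regex"
}

for candidate in 192.168.0.1 255.255.255.255 256.1.1.1 1.2.3 1a2a3a4; do
    if is_ip "$candidate"; then
        printf '%s looks like an IP address\n' "$candidate"
    else
        printf '%s does not look like an IP address\n' "$candidate"
    fi
done
```

The pattern also accepts numbers with leading zeros such as 010.001.1.01, which may or may not be what you want.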

by DrDoom at Sorgonet.com 
www.sorgonet.com