Start Using Regular Expressions Right Now
If you are not majoring in a field related to computer science and would not spare more than half an hour of your life on any programming language, then spend that half hour on regular expressions. They consist of a few dozen symbols that let you search and substitute text in amazingly powerful ways, often far more easily than you could in a general-purpose programming language. You won’t master them in half an hour, but you will get to know their potential, and you will see why you may want to spare another “half an hour” every few months or years, at your own pace, to gradually build up REGEX muscles that will benefit you immensely for the rest of your life, regardless of your career. If your major is related to CS, you certainly have to learn them, although few, if any, of your teachers and professors may ever have told you about them.
Sec 1. The Basics
So head over to this page and copy its content into the lower (much larger) text input area of regex101 or regexr or any other “online regex” practice site. It doesn’t matter that the content may look meaningless to you. No worries. We are only interested in witnessing the power of regex on any text file. In the upper (one-line) text input area, please type the following slowly, one character at a time:
[A-Z][a-z][a-z]\b
and watch how the highlighted parts in the lower input box change as you type. You see that all the month names are highlighted, and you are right if you guess that the regex you typed is meant to match any 3-letter abbreviation that begins with a capital letter. That “\b” at the end insists on seeing a boundary between an alphanumerical character and a non-alphanumerical character. Unwanted strings like “Apa” or “Ubu” would be erroneously matched if “\b” were removed. BTW, “[A-Za-z0-9_]” is the precise definition of an “alphanumerical character” (most references call it a “word character”), i.e., any one of the 52 letters, the 10 digits, and the underscore character. It can also be abbreviated as “\w”. A “non-alphanumerical character” is precisely defined as “[^A-Za-z0-9_]”, short-handed as “\W”: any single character that is not an alphanumerical character. It may be a punctuation mark, a space, or a tab. Congratulations! You have learned about 1/10 of the most frequently used symbols in regular expressions.
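If you would rather try this in a script than on a web page, here is a minimal sketch of the same experiment. It assumes Python and its built-in re module, which this article does not otherwise use, and the sample line is made up to resemble one entry of a web server log.

```python
import re

# A made-up line resembling one entry of a web server log (hypothetical data).
sample = '20.181.108.77 - - [07/Sep/2022:06:41:31 +0800] "GET / HTTP/1.1" 200 "Apache Ubuntu"'

# With \b: a capital letter, two lowercase letters, then a word boundary.
with_boundary = re.findall(r'[A-Z][a-z][a-z]\b', sample)

# Without \b: the first three letters of longer words are matched as well.
without_boundary = re.findall(r'[A-Z][a-z][a-z]', sample)

print(with_boundary)
print(without_boundary)
```

Comparing the two printed lists shows what the trailing “\b” filters out.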
Suppose we would like to highlight all the timestamps, i.e. things like “06:41:31”. We can match a single digit using “[0-9]”, which can also be abbreviated as “\d”. So let’s type
\d\d:\d\d:\d\d
in place of the previous regex. Oops! It begins the match too early, at the last two digits of the year. You see, regex engines are impatient: they want to begin matching as early as possible. Add one more space at the end of the regex and the matches become correct. BTW, the opposite of “\d” is “\D”, meaning “[^0-9]”, i.e., any single character that is not a digit.
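The “too early” match can be reproduced in a few lines. This is only a sketch, assuming Python’s re module and a made-up log fragment in which the year is immediately followed by the time.

```python
import re

# A made-up fragment of a log line (hypothetical data).
sample = '[07/Sep/2022:06:41:31 +0800]'

# The engine takes the leftmost possible match, which starts inside the year.
too_early = re.findall(r'\d\d:\d\d:\d\d', sample)

# Requiring a trailing space leaves only the real timestamp as a candidate.
with_space = re.findall(r'\d\d:\d\d:\d\d ', sample)

print(too_early)
print(with_space)
```

Note that the second match includes the trailing space itself, because the space is part of the pattern.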
Now let’s delete everything in the regex input box again and type a single period “.” in it. Look! Every single character is highlighted. You can understand “.” as “[\w\W]” or as “[\d\D]” (except that, by default, “.” does not match the newline character at the end of each line). Note that adjacent matches are colored slightly differently so that it is obvious that they belong to separate matches. What do I mean, you ask? Let’s replace the “.” with “\d” and then with “\d+”. In the first case, adjacent digits are colored light blue and not-so-light blue alternately because each digit is a separate match. In the second case, adjacent digits are colored the same because “\d+” means “any number (at least one) of digit characters” and therefore adjacent digits together form a single match. Question: how would the matching text be colored if your regex were “\w\w\w”?
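The difference between separate matches and a single longer match is easy to see as lists. Here is a small sketch, again assuming Python’s re module and made-up sample text.

```python
import re

sample = 'HTTP/1.1" 200 5316'   # made-up text

# \d : every digit is a match of its own.
single_digits = re.findall(r'\d', sample)

# \d+ : each run of adjacent digits forms one match.
digit_runs = re.findall(r'\d+', sample)

# "." matches letters, digits, spaces and punctuation,
# but by default it does not match the newline character.
any_char = re.findall(r'.', 'a b\n')

print(single_digits)
print(digit_runs)
print(any_char)
```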
So let’s try to find all the numerical IP addresses such as “20.181.108.77”. Wait, how do I insist on matching “.”? You would use “\.” to tell the regex engine to search for the period character literally, without giving it any special interpretation. Now that you have all the special symbols you need, try to solve the question! You will feel tremendously empowered. I bet most programmer friends around you do not know about regex and would take far more time to do the same thing using their favorite programming language. Regex is what I believe should have been part of elementary computer education in any country, and yet sadly it is mostly missing from our present computer education, which is mostly driven and even controlled by commercial products and companies. I am using regex as an example to express 1/10 or 1/100 of Alan Kay’s frustration (transcript).
Now you can jump to the last section if you have almost used up your half an hour. You can always come back for the next two sections whenever you feel like it, be it 5 weeks later or 5 years later.
Sec 2. The Power of the Command Line
Highlighting the results you are looking for is useful but far from satisfactory if one has to process each highlighted result manually from that point on. A friend proficient in the GNU/Linux command line would most likely be able to completely automate the remaining steps for you as long as you can describe precisely what you want. This section is written for such a friend but is still worth browsing even if you, mostly a computer muggle, don’t plan to follow the hands-on exercises.
So let’s open a Linux terminal and get the data file:
wget https://ckhung.github.io/a/m/22/access.log
Suppose we are interested in counting the number of times each IP address appears in that file. This command does it:
perl -ne 'print "$1\n" if m/(\d+\.\d+\.\d+\.\d+)/' access.log | sort | uniq -c | sort -n
Don’t worry about the perl command. We will get to that in a few paragraphs. Just pay attention to the regular expression, which you already understand. Oh, and $1 means “the substring corresponding to the first pair of parentheses in each matching line we have found”.
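For a friend who prefers a script to a pipeline, the same counting idea can be sketched as follows. This assumes Python’s re and collections modules rather than perl, and uses three made-up log lines in place of access.log.

```python
import re
from collections import Counter

# Three made-up log lines (hypothetical data standing in for access.log).
lines = [
    '20.181.108.77 - - [07/Sep/2022:06:41:31 +0800] "GET / HTTP/1.1" 200 5316',
    '66.249.66.1 - - [07/Sep/2022:06:42:02 +0800] "GET /a HTTP/1.1" 200 812',
    '20.181.108.77 - - [07/Sep/2022:06:43:15 +0800] "GET /b HTTP/1.1" 404 196',
]

ip_pattern = re.compile(r'(\d+\.\d+\.\d+\.\d+)')

counts = Counter()
for line in lines:
    m = ip_pattern.search(line)      # first match on the line, like m/.../ in perl
    if m:
        counts[m.group(1)] += 1      # group(1) plays the role of $1

# Print in ascending order of count, similar to `sort | uniq -c | sort -n`.
for ip, n in sorted(counts.items(), key=lambda item: item[1]):
    print(n, ip)
```

The regular expression is exactly the one in the perl command; only the surrounding plumbing differs.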
As another example, suppose I have transferred a lot of pictures and videos from my phone to my computer. Let’s get the imaginary listing of files first:
wget https://www.cyut.edu.tw/~ckhung/b/re/dcim-listing.txt
and take a look at its content. Suppose I would like to organize the image files (IMG_*.jpg) into separate directories (folders) according to their years and then their months, like this:
IMG_20200907_112437.jpg => /archive/picture/20/09/07_112437.jpg
IMG_20200911_090510.jpg => /archive/picture/20/09/11_090510.jpg
…
IMG_20210413_134820.jpg => /archive/picture/21/04/13_134820.jpg
IMG_20210413_140753.jpg => /archive/picture/21/04/13_140753.jpg
Let’s first print just the names of all the image files:
perl -ne 'print "$1\n" if m/(IMG_20(\d\d)(\d\d)\d{2}_\d{6}\.jpg)/' dcim-listing.txt
Now let’s create a listing of commands from the list of files:
perl -ne 'print "cp $1 /archive/picture/$2/$3/$4\n" if m#(IMG_20(\d\d)(\d\d)(\d{2}_\d{6}\.jpg))#' dcim-listing.txt
This may look lengthy but really it just has a longer print statement and more pairs of parentheses. The # is used in place of / so as to avoid confusion with the / in paths. Most other punctuation characters can also be used as the “delimiter” for our regular expression. OK, so we have printed all the commands we would like to give. If the commands look correct, then just pipe them to bash (append “| bash” to the command) and all the files will be copied into their respective directories, provided those directories already exist (cp does not create them). How much time would it take if you were using a file manager? What if there were 2000 files?
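The “generate commands from a listing” idea carries over to other tools too. Below is a minimal sketch assuming Python’s re module and a made-up three-entry listing in place of dcim-listing.txt. It only builds and prints the cp commands; it does not run them.

```python
import re

# Made-up entries standing in for dcim-listing.txt (hypothetical data).
listing = [
    'IMG_20200907_112437.jpg',
    'VID_20200910_183000.mp4',
    'IMG_20210413_134820.jpg',
]

# Same pattern as in the perl command: four pairs of parentheses.
pattern = re.compile(r'(IMG_20(\d\d)(\d\d)(\d{2}_\d{6}\.jpg))')

commands = []
for name in listing:
    m = pattern.search(name)
    if m:
        whole, year, month, rest = m.groups()   # the roles of $1, $2, $3, $4
        commands.append(f'cp {whole} /archive/picture/{year}/{month}/{rest}')

print('\n'.join(commands))
```

The video entry produces no command because it does not match the pattern, just as non-matching lines are skipped by the perl one-liner.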
Better yet, it can do many more interesting things that file managers cannot do. Download this file and extract its content. These are 800x600 wallpapers containing the “Subject” metadata. Let’s print that using exiftool, along with the name of each file:
exiftool -p '$FileName $Subject' *.jpg
Suppose we would like to paint each subject line into the picture and put the result in an existing subdirectory “captioned/”. With a single file, one would do this:
convert -pointsize 24 -stroke "#0008" -undercolor "#ffc8" -gravity South -annotate +0+5 "reddish autumn forest" autumn.jpg captioned/autumn.jpg
So let’s repeat the previous exiftool command and redirect the result into a text file, say, “list.txt”. Then we can generate the required commands with this command:
perl -ne 'print qq(convert -pointsize 24 -stroke "#0008" -undercolor "#ffc8" -gravity South -annotate +0+5 "$2" $1 captioned/$1\n) if m/(.*?\.jpg)\s+(.*)/' list.txt
If the printed commands look OK, you can then pipe them to bash to execute those commands. Now take a look at the generated pictures, and think how much time this would save if there were thousands of image files to be processed.
Note 1: qq(…) is just perl’s fancy way of writing “…” without the double quotes, so that you can use double quote characters inside your printed string. Note 2: Regex engines are normally greedy, meaning that they are eager to match as much as possible. If the $Subject string happens to contain “.jpg”, then $1 would erroneously contain too long a string. Using “*?” instead of “*” inhibits the greediness, making that part of the match stop at the first point where the rest of the pattern can succeed.
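The greedy versus non-greedy difference in Note 2 can be checked with a short sketch. It assumes Python’s re module, whose “*” and “*?” behave the same way here, and a made-up line of list.txt whose subject happens to mention “.jpg”.

```python
import re

# A made-up line of list.txt: file name, then a subject that mentions ".jpg".
line = 'autumn.jpg converted from old.jpg scan'

# Greedy: ".*" grabs as much as it can, so group 1 runs to the LAST ".jpg".
greedy = re.match(r'(.*\.jpg)\s+(.*)', line)

# Non-greedy: ".*?" grabs as little as it can, so group 1 ends at the FIRST ".jpg".
lazy = re.match(r'(.*?\.jpg)\s+(.*)', line)

print(greedy.groups())
print(lazy.groups())
```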
The gist of the previous example is not the lengthy exiftool and convert commands. For that part, you can just find and copy what you want to accomplish by googling “exiftool” or “imagemagick” along with your desired actions. The key point to take away is the idea of generating a table-like text file containing all the variable parts of repetitive commands, and then generating commands from that text file using regular expressions. Think about a spreadsheet with a column of input file names and a few columns of different parameters/arguments corresponding to each file name. If you would like to tag and/or read tags from mp3 files, for example, you would study how to process a single file using the commands “id3v2” and/or “mediainfo”, and then use regex to generate hundreds or thousands of mostly-identical commands to process lots of files in one batch. You can batch process many video files using the “ffmpeg” command in the same way. Almost anything that can be accomplished on the command line can be automated like this. And now you see why hackers of all colors (white-, gray-, and black-hat) prefer the command line. It saves time.
Sec 3. The Most Frequently Used Rules Summarized
There are mainly three groups of special symbols in regex.
A. [Matching a single character]
- “[...]” Any single one of the enclosed characters
- “[^...]” Any single character that is not one of the enclosed characters
- “.” Any single character
- …
B. [Anchor] These symbols do not consume input but insist that only the desired patterns occurring at certain spots count as matches.
- “^” pattern must occur at the beginning of a line
- “$” pattern must occur at the end of a line
- “\b” pattern must occur at the boundary between an alphanumerical character and a non-alphanumerical character, or at the beginning or the end of a line
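To see how the anchors narrow down matches, here is a small sketch that records where each match starts. It assumes Python’s re module, where “^” and “$” only work line by line when the re.MULTILINE flag is given (line-oriented tools such as perl -ne see one line at a time anyway). The text is made up.

```python
import re

text = 'cat concat\ncatalog cat'   # two made-up lines

def starts(pattern, flags=0):
    """Return the start position of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text, flags)]

anywhere = starts(r'cat')                       # no anchor at all
at_line_start = starts(r'^cat', re.MULTILINE)   # only at the beginning of a line
at_line_end = starts(r'cat$', re.MULTILINE)     # only at the end of a line
whole_word = starts(r'\bcat\b')                 # only as a stand-alone word

print(anywhere, at_line_start, at_line_end, whole_word)
```

Each anchored pattern keeps a different subset of the four unanchored matches.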
C. [Quantifier] These symbols do not match any input by themselves but rather repeat the previous pattern. (Think of loops in a programming language.)
- “{5}” Repeat 5 times
- “{3,5}” Repeat 3 to 5 times
- “{3,}” Repeat at least 3 times
- “?” Repeat 0 times or once, equivalent to “{0,1}”
- “*” Repeat any number of times, including none, equivalent to “{0,}”
- “+” Repeat any number of times, at least once, equivalent to “{1,}”
Append “?” to a quantifier to make it non-greedy, e.g. “*?” or “+?”.
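The quantifiers can be tried out side by side. The following is only a sketch, assuming Python’s re module and made-up sample strings. The “\b” on both sides forces each match to cover a whole run of letters.

```python
import re

sample = 'a aa aaa aaaa aaaaaa'   # made-up text: runs of 1, 2, 3, 4 and 6 letters

exactly_3 = re.findall(r'\ba{3}\b', sample)      # {3}   : exactly 3 times
three_to_5 = re.findall(r'\ba{3,5}\b', sample)   # {3,5} : 3 to 5 times
at_least_3 = re.findall(r'\ba{3,}\b', sample)    # {3,}  : at least 3 times

optional_u = re.findall(r'colou?r', 'color colour colouur')   # ? : zero or one "u"
zero_or_more = re.findall(r'ab*', 'a ab abbb')                # * : zero or more "b"
one_or_more = re.findall(r'ab+', 'a ab abbb')                 # + : one or more "b"

print(exactly_3, three_to_5, at_least_3)
print(optional_u, zero_or_more, one_or_more)
```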
You may want to use parentheses in one of the following situations:
- to make the following quantifier apply to the parenthesized expression as a whole, e.g.
\d+(\.\d+){3}
- to express “one of the alternative strings or patterns”, e.g.
(Mon|Tue|Wed|Thu|Fri)
- to save the matched string so that you can refer to it later
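The three uses of parentheses listed above can be sketched as follows, assuming Python’s re module and made-up strings. In Python, a saved group is read with group(n), or written as \1, \2 and so on inside a replacement string, where perl uses $1, $2.

```python
import re

# 1. Grouping: the quantifier {3} applies to the whole "(\.\d+)" unit.
dotted = re.search(r'\d+(\.\d+){3}', 'from 20.181.108.77 port 22')

# 2. Alternation: any one of the listed strings.
days = re.findall(r'(Mon|Tue|Wed|Thu|Fri)', 'Mon Sat Wed Sun')

# 3. Capturing: save parts of the match and refer to them later.
pages = re.search(r'(\d+)-(\d+)', 'pages 10-25')
swapped = re.sub(r'(\d+)-(\d+)', r'\2-\1', 'pages 10-25')

print(dotted.group(0))
print(days)
print(pages.groups(), swapped)
```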
The most popular variant of all regex engines is PCRE, perl-compatible regular expressions. In PCRE, any punctuation mark prefixed with “\” is guaranteed to match that literal punctuation mark, regardless of whether it has any special meaning when standing alone, e.g. “\.” matches the period character and “\/” matches the slash character. PCRE also supports the following shorthands:
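- “\d” is equivalent to “[0-9]”
- “\D” is equivalent to “[^0-9]”
- “\w” is equivalent to “[0-9a-zA-Z_]”
- “\W” is equivalent to “[^0-9a-zA-Z_]”
- “\s” matches any whitespace character, roughly “[ \t\n\r\f]”
- “\S” matches any non-whitespace character, roughly “[^ \t\n\r\f]”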
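Here is a quick sketch of escaping and of the shorthands in action. It assumes Python’s re module, which shares these shorthands with PCRE for plain ASCII text such as the made-up strings below (in Python 3, “\d” and “\w” also accept non-ASCII digits and letters unless the re.ASCII flag is given).

```python
import re

# "." is a wildcard, while "\." matches only a literal period.
wildcard = re.findall(r'1.1', 'HTTP/1.1 and 1x1')
literal = re.findall(r'1\.1', 'HTTP/1.1 and 1x1')

# \s+ matches a run of whitespace, so it can split on spaces, tabs and newlines.
fields = re.split(r'\s+', 'alpha \t beta\ngamma')

# \w+ picks out runs of letters, digits and underscores.
words = re.findall(r'\w+', 'file_name-01.jpg')

print(wildcard, literal)
print(fields)
print(words)
```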
You don’t need to learn the perl language. Here are the three commands I use most frequently. Learn to recognize when to copy and edit which one of them to suit your needs. Being able to do that alone will make you more powerful than spending 5 hours learning any programming language.
- The first command
perl -ne 'print if m/.../' …
prints all lines containing a certain pattern. It is equivalent to grep -P … .
- The second command
perl -ne 'print "$1\n" if m/..(...)../' …
prints desired (parenthesized) parts in the pattern, possibly along with any additional texts. It is a more powerful version of grep -Po … .
- The third command
perl -pe 's/.../.../' …
replaces matching patterns. It is similar to sed 's/.../.../' … except that sed has a somewhat different syntax than PCRE.
One can append a number of options after the final “/”. For example, a trailing “i” means ignoring case, and 's/.../.../ig' means replacing all matches (instead of just the first one, which is the default) on each matching line, ignoring case. Remember, the “/” can be replaced by any other punctuation character not occurring in the pattern.
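For readers who do script, the three perl commands have rough counterparts elsewhere. The sketch below assumes Python’s re module and three made-up lines of text; re.IGNORECASE plays the role of the “i” option, and re.sub replaces all matches unless count=1 is given, which is the reverse of perl’s default.

```python
import re

# Made-up lines of text (hypothetical data).
lines = ['Error: disk full', 'ok', 'error: disk FULL again, Error twice']

# 1. Like perl -ne 'print if m/error/i': keep the lines containing the pattern.
matching = [ln for ln in lines if re.search(r'error', ln, flags=re.IGNORECASE)]

# 2. Like perl -ne 'print "$1\n" if m/disk (\w+)/i': keep the parenthesized part.
captured = []
for ln in lines:
    m = re.search(r'disk (\w+)', ln, flags=re.IGNORECASE)
    if m:
        captured.append(m.group(1))

# 3. Like perl -pe 's/error/WARNING/ig': replace every match on every line.
replaced_all = [re.sub(r'error', 'WARNING', ln, flags=re.IGNORECASE) for ln in lines]

# Like perl -pe 's/error/WARNING/i': replace only the first match on each line.
replaced_first = [re.sub(r'error', 'WARNING', ln, count=1, flags=re.IGNORECASE)
                  for ln in lines]

print(matching)
print(captured)
print(replaced_all)
print(replaced_first)
```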
Sec 4. Next Steps
You can learn much more about regex from this very comprehensive tutorial. But to be honest I never really studied it. Just like learning a natural language, the best way to learn regex is to use it in your daily life. I started using regex 30 years ago. Since then I have learned new tricks only very infrequently, and only by a small step each time. But I use it dozens of times every week, just like I talk to people every day. Regex is supported by every major programming language I know of. But let me emphasize again: you don’t need to be a programmer. You can even use it in office applications. Regex barely changes over time. There are no commercial interests pushing for unnecessary changes so that people have to buy a newer version of the software and enroll in yet another class for a new certificate exam once every few years. Regex is more akin to mathematics than to commercial software products. You learn the most essential part of it once and early, and use it for the rest of your life. It’s good if you learn one or two new tricks every few years, but you don’t have to. Ten years from now, you will be grateful to your present-day self for having taken my advice to start using it today.