Start Using Regular Expressions Right Now

If you are not majoring in computer science or a related field, and you wouldn’t spare more than half an hour of your life to learn any programming language, then learn regular expressions. Regular expressions (regex for short) consist of a few dozen symbols that help you search and substitute text in amazingly powerful ways, ten times more easily than any programming language could. Half an hour will not make you a master, but it will show you the potential of regex and why you may want to spare another “half an hour” every few months or years, at your own pace, to gradually build up regex muscles that will benefit you immensely for the rest of your life, regardless of your career. If you are a CS-related major, you certainly have to learn regex, even though few, if any, of your teachers and professors ever told you about it.

Sec 1. The Basics

So head over to this page and copy its content into the lower (much larger) text input area of regex101, regexr, or any other “online regex” practice site. It doesn’t matter if the content looks meaningless to you. No worries. We are only interested in witnessing the power of regex on an arbitrary text file. In the upper (one-line) text input area, slowly type the following, one character at a time: [A-Z][a-z][a-z]\b and watch how the highlighted parts in the lower input box change as you type.

You should see every “May” highlighted, and you are right if you guess that this regex is meant to match any 3-letter abbreviation that begins with a capital letter. The “\b” at the end insists on a boundary between an alphanumeric character and a non-alphanumeric character. Without it, unwanted strings such as “Apa” or “Ubu”, taken from longer words, would be matched erroneously.

BTW, [A-Za-z0-9_] is the precise definition of an “alphanumeric character” here, i.e., any one of the 52 letters, the 10 digits, or the underscore character. It can be abbreviated as \w . A “non-alphanumeric character” is precisely defined as [^A-Za-z0-9_], abbreviated as \W . That is, any single character that is not an alphanumeric character, such as a punctuation mark, a space, or a tab. Congratulations! You have learned about 1/10 of the most frequently used symbols in regular expressions.
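If you would rather try this outside a browser, here is a minimal Python sketch of the same experiment. The sample sentence is made up for illustration and is not the page mentioned above. Python's `re` module treats `\b` the same way for plain English text like this.

```python
import re

# Hypothetical sample text; any text with capitalized words would do.
text = "Released in May on Ubuntu with Apache, updated Tue 12 Jan."

# A capital letter, two lowercase letters, then a word boundary.
with_boundary = re.findall(r"[A-Z][a-z][a-z]\b", text)

# The same pattern without \b also matches the first three letters of longer words.
without_boundary = re.findall(r"[A-Z][a-z][a-z]", text)

print(with_boundary)     # ['May', 'Tue', 'Jan']
print(without_boundary)  # ['Rel', 'May', 'Ubu', 'Apa', 'Tue', 'Jan']
```

The second list shows why the trailing \b matters: "Rel", "Ubu" and "Apa" are fragments of "Released", "Ubuntu" and "Apache".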

Sec 2. The Power of the Command Line

Highlighting the matches is useful but far from satisfactory if you then have to process each highlighted result by hand. A friend proficient in the GNU/Linux command line would most likely be able to automate the remaining steps completely for you, as long as you can describe precisely what you want. This section is written for such a friend, but it is still worth browsing even if you, mostly a computer muggle, don’t plan to follow the hands-on exercises.

IMG_20200907_112437.jpg => /archive/picture/20/09/07_112437.jpg
IMG_20200911_090510.jpg => /archive/picture/20/09/11_090510.jpg

IMG_20210413_134820.jpg => /archive/picture/21/04/13_134820.jpg
IMG_20210413_140753.jpg => /archive/picture/21/04/13_140753.jpg
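The mapping listed above can be expressed as one regex substitution. What follows is only a Python sketch of the idea, not the command line used in this section. The function name archive_path is made up for illustration, and the code only computes the new path; it does not move any file.

```python
import re

# Names like IMG_20200907_112437.jpg: "IMG_20", then a two-digit year,
# month and day, an underscore, and a six-digit time of day.
PATTERN = re.compile(r"^IMG_20(\d\d)(\d\d)(\d\d)_(\d{6})\.jpg$")

def archive_path(filename):
    """Return the archive path for a camera file name, or None if it does not match."""
    m = PATTERN.match(filename)
    if m is None:
        return None
    year, month, day, time = m.groups()  # the four parenthesized parts
    return f"/archive/picture/{year}/{month}/{day}_{time}.jpg"

for name in ["IMG_20200907_112437.jpg", "IMG_20210413_140753.jpg", "notes.txt"]:
    print(name, "=>", archive_path(name))
```

Each pair of parentheses saves one piece of the file name (year, month, day, time), and the replacement simply rearranges those pieces. A name that does not fit the pattern, such as notes.txt, is left alone.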

Sec 3. The Most Frequently Used Rules Summarized

There are mainly three groups of special symbols in regex.

Symbols that match a single character:
  1. [^...] Any single character that is not one of the enclosed characters
  2. . Any single character (by default, any character except a newline)

Symbols that anchor the pattern to a position:
  1. $ pattern must occur at the end of a line
  2. \b pattern must occur at the boundary between an alphanumeric character and a non-alphanumeric character, or at the beginning or the end of a line

Symbols that repeat the preceding item:
  1. {3,5} Repeat 3 to 5 times
  2. {3,} Repeat at least 3 times
  3. ? Repeat 0 times or once, equivalent to {0,1}
  4. * Repeat any number of times, including none, equivalent to {0,}
  5. + Repeat any number of times, at least once, equivalent to {1,}

Parentheses ( ) serve two purposes:
  1. to express “one of the alternative strings or patterns”, e.g. (Mon|Tue|Wed|Thu|Fri)
  2. to save the matched string so that you can refer to it later

Shorthands for common character classes:
  1. \D is equivalent to [^0-9]
  2. \w is equivalent to [0-9a-zA-Z_]
  3. \W is equivalent to [^0-9a-zA-Z_]
  4. \s is roughly equivalent to [ \t\n] (it also matches a few other whitespace characters, such as \r)
  5. \S is roughly equivalent to [^ \t\n]
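A few of these rules in action, as a small Python sketch. All the sample strings are made up for illustration. The sketch also uses \d, the counterpart of \D that stands for [0-9].

```python
import re

# {3,5}: runs of 3 to 5 digits, delimited by word boundaries on both sides.
runs = re.findall(r"\b\d{3,5}\b", "12 345 67890 1234567")

# ?: the preceding "u" is optional, so both spellings match.
spellings = re.findall(r"colou?r", "color and colour")

# Parentheses: alternatives inside, and each group saves what it matched.
m = re.search(r"(Mon|Tue|Wed|Thu|Fri) (\d+)", "Meeting on Wed 14 at noon")
weekday, day = m.group(1), m.group(2)

# $: the pattern must sit at the end of the line; \s* tolerates trailing spaces.
ends_with_jpg = bool(re.search(r"\.jpg\s*$", "photo.jpg  "))
backup_matches = bool(re.search(r"\.jpg$", "photo.jpg.bak"))

print(runs)            # ['345', '67890']
print(spellings)       # ['color', 'colour']
print(weekday, day)    # Wed 14
print(ends_with_jpg, backup_matches)  # True False
```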
Three perl commands to copy.
  1. The second command perl -ne 'print "$1\n" if m/..(...)../' … prints the desired (parenthesized) part of the pattern, possibly along with additional text. It is a more powerful version of grep -Po … .
  2. The third command perl -pe 's/.../.../' … replaces matching patterns. It is similar to sed 's/.../.../' … except that the syntax of sed differs somewhat from PCRE.
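If perl is not at hand, the two idioms above (print the parenthesized part of matching lines, and substitute on every line) can be approximated in Python. This is only a sketch under assumptions: the input lines and both patterns are invented for illustration and are not the patterns elided above. It needs Python 3.8 or later.

```python
import re

# Hypothetical input lines, standing in for a file read line by line.
lines = [
    "2021-04-13 login alice",
    "no date on this line",
    "2021-04-14 logout bob",
]

# Like perl -ne 'print "$1\n" if m/.../':
# keep only the parenthesized part of each line that matches.
users = [m.group(1) for line in lines if (m := re.search(r"log[a-z]+ (\w+)", line))]

# Like perl -pe 's/.../.../':
# every line is passed through, changed only where the pattern matches.
rewritten = [re.sub(r"(\d{4})-(\d\d)-(\d\d)", r"\3/\2/\1", line) for line in lines]

print(users)      # ['alice', 'bob']
print(rewritten)  # ['13/04/2021 login alice', 'no date on this line', '14/04/2021 logout bob']
```

In the replacement string, \1, \2 and \3 refer back to the first, second and third parenthesized parts; in perl the same references are written $1, $2 and $3.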

Sec 4. Next Steps

You can learn much more about regex from this very comprehensive tutorial. But to be honest, I never really studied it. As with a natural language, the best way to learn regex is to use it in your daily life. I started using regex 30 years ago. Since then I have learned new tricks only very infrequently, and only a small step at a time. But I use it dozens of times every week, just as I talk to people every day.

Regex is supported by every major programming language I know of. But let me emphasize again: you don’t need to be a programmer. You can even use it in office applications.

Regex barely changes over time. There are no commercial interests pushing for unnecessary changes so that people have to buy a newer version of the software and enroll in yet another class for yet another certificate exam every few years. Regex is more akin to mathematics than to commercial software products. You learn the most essential part of it once and early, and use it for the rest of your life. It’s good if you learn one or two new tricks every few years, but you don’t have to. Ten years from now, you will be grateful to your present-day self for having taken my advice to start using it today.

Greg teaches at CYUT