Regular Expressions

Regular expressions are quite daunting, but they are also very important to learn. They are like a separate mini-language that takes a lot of patience and attention to nuance to grasp.

First off, what are regular expressions? According to Wikipedia, “In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.”

When you encounter them, they look like this – “(?:(?:\r\n)?[ \t])(?:(?:(?:[^()<>@,;:\”.[] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[[“()<>@,;:\”.[]]))|”(?:[^\”\r\]|\.|(?:(?:\r\n)?[ \t]))“(?:(?: \r\n)?[ \t]))(?:.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\”.[] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[[“()<>@,;:\”.[]]))|”(?:[^\”\r\]|\.|(?:(?:\r\n)?[ \t]))“(?:(?:\r\n)?[ \t])))@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\”.[] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[[“()<>@,;:\”.[]]))|[([^[]\r\]|\.)*\”

To get to the nuts and bolts, we need to know that the parts of an expression are interpreted in an order of precedence.

Detailed explanations of some of these elements follow.

First, atom expressions:

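\d any digit, equivalent to [0-9]
\D any non-digit, equivalent to [^0-9]
\w any word character, equivalent to [a-zA-Z0-9_]
\W any non-word character, equivalent to [^a-zA-Z0-9_]
\s any whitespace character, equivalent to [ \f\n\r\t\v]
\S any non-whitespace character, equivalent to [^ \f\n\r\t\v]
. any character other than a line break, equivalent to [^\r\n] (in Python's re module, only \n is excluded by default)

These equivalences hold for ASCII text. In Python 3, \d, \w, and \s also match Unicode digits, letters, and whitespace unless the re.ASCII flag is set.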

For example:

import re
text = '<dl>(843) 542-4256</dl> <dl>(431) 270-9664</dl>'
pttn = r'\d\d\d\-'
re.findall(pttn, text)

The result is ['542-', '270-'].
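As a minimal sketch of how the class atoms behave, the snippet below runs four of them through re.findall. The sample strings and variable names are made up for illustration.

```python
import re

sample = "Room 42, floor_3: call (843) 542-4256"

# \d+ finds each run of digits
digits = re.findall(r"\d+", sample)
print(digits)  # ['42', '3', '843', '542', '4256']

# \D+ finds each run of characters that are not digits
non_digits = re.findall(r"\D+", "a1b22c")
print(non_digits)  # ['a', 'b', 'c']

# \w+ finds runs of letters, digits, and underscores
words = re.findall(r"\w+", "floor_3: ok!")
print(words)  # ['floor_3', 'ok']

# \s finds each individual whitespace character
spaces = re.findall(r"\s", "a b\tc\nd")
print(spaces)  # [' ', '\t', '\n']
```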

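There is a shortcut to memorize them. The class atoms are easy to remember if you know which word each letter stands for:

  • d is for digits
  • w is for word characters
  • s is for spaces

In addition, in the whitespace set [ \f\n\r\t\v]: \f is the form feed (page break); \n and \r are line breaks; \t is the tab; \v is the vertical tab (rarely used). The whitespace escapes are just as easy to remember if you know which word each letter stands for:

  • f is for form feed (think of flipping to a new page)
  • n is for new line
  • r is for return (carriage return)
  • t is for tab
  • v is for vertical tab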
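A short sketch of the whitespace set in action, assuming Python's re module. It checks that \s matches every character in [ \f\n\r\t\v] and shows one common use, collapsing runs of whitespace.

```python
import re

# One string containing every character in the set [ \f\n\r\t\v]
whitespace = " \f\n\r\t\v"

# \s matches each of them, so all six characters are found
matches = re.findall(r"\s", whitespace)
print(len(matches))  # 6

# \S matches none of them
non_matches = re.findall(r"\S", whitespace)
print(non_matches)  # []

# A common use: collapse any run of whitespace into a single space
collapsed = re.sub(r"\s+", " ", "a \t b\n\nc")
print(collapsed)  # a b c
```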

In practice, regular expressions (regex) are most often used to find a pattern in a large body of text.

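import re
text = 'The quick brown fox jumps over the lazy dog'
pttn = re.compile(r'\wo\w')    # an "o" with a word character on each side
re.findall(pttn, text)         # ['row', 'fox', 'dog']

# .string is the string that was passed to match(); .group() is the matched text
re.compile('[a-z]+').match('tempo').string    # 'tempo'
# "T" is uppercase, so the first match is 'he'
re.findall(re.compile('[a-z]+'), 'The quick brown fox jumps over the lazy dog')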
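The calls above differ in where they look for the pattern. The sketch below, using a shortened version of the same sentence, contrasts match(), search(), and findall() on one compiled pattern.

```python
import re

pattern = re.compile(r"[a-z]+")
text = "The quick brown fox"

# match() only tries at the very start of the string; "T" is uppercase, so it fails
at_start = pattern.match(text)
print(at_start)  # None

# search() scans forward and returns the first match anywhere in the string
first = pattern.search(text)
print(first.group())  # he

# findall() returns every non-overlapping match as a list of strings
all_matches = pattern.findall(text)
print(all_matches)  # ['he', 'quick', 'brown', 'fox']
```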

Referring to this link, we can see eight common scenarios to write a pattern for:

  1. Matching a Username, for example "/^[a-z0-9_-]{3,16}$/". This tells the engine to start at the beginning of the string, accept any lowercase letter (a to z), digit (0 to 9), underscore, or hyphen, require at least 3 and no more than 16 of those characters (the {3,16} part), and then expect the end of the string. As a result, the string "my-us3r_n4m3" is a match, while "th1s1s-wayt00_l0ngt0beausername" (too long) is not.

Similarly, we can apply regular expressions to the following:
  2. Matching a Password
  3. Matching a Hex Value
  4. Matching a Slug
  5. Matching an Email
  6. Matching a URL
  7. Matching an IP Address
  8. Matching an HTML Tag
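The username pattern from the first scenario can be tried in Python roughly as follows. This is a sketch, not a definitive validator: the helper name is_valid_username is hypothetical, and the surrounding slashes are dropped because they are pattern delimiters in languages such as JavaScript, not part of the pattern itself.

```python
import re

# Same pattern as in the text, without the surrounding / delimiters
USERNAME = re.compile(r"^[a-z0-9_-]{3,16}$")

def is_valid_username(name):
    """Return True if name is 3 to 16 characters drawn from a-z, 0-9, _ and -."""
    # fullmatch() requires the whole string to match, including any trailing newline
    return USERNAME.fullmatch(name) is not None

print(is_valid_username("my-us3r_n4m3"))                     # True
print(is_valid_username("th1s1s-wayt00_l0ngt0beausername"))  # False (too long)
print(is_valid_username("ab"))                               # False (too short)
```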

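To elaborate a little on the email case: in the pattern "/^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$/", the dot before the top-level domain is escaped with a backslash, so it matches a literal dot rather than any character. The top-level domain is limited to 2 to 6 characters, so a string such as john@doe.something is not a match (the TLD is too long).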
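A small sketch of the email pattern in Python, again without the / delimiters. The address john@doe.com is an assumed example of a matching input. This simplified pattern is for illustration only and rejects many valid addresses.

```python
import re

EMAIL = re.compile(r"^([a-z0-9_\.-]+)@([\da-z\.-]+)\.([a-z\.]{2,6})$")

# The three groups capture the user name, the domain name, and the TLD
ok = EMAIL.match("john@doe.com")
print(ok.groups())  # ('john', 'doe', 'com')

# "something" has 9 characters, more than the {2,6} limit allows
too_long = EMAIL.match("john@doe.something")
print(too_long)  # None

# Why the escape matters: an unescaped dot matches any character
unescaped = re.match(r"doe.com", "doeXcom") is not None
escaped = re.match(r"doe\.com", "doeXcom") is not None
print(unescaped, escaped)  # True False
```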

A very convincing example of why regular expressions are worth the considerable effort comes from Xiaolai Li's work on editing his book. Instead of resorting to a tedious, error-prone manual process, he wrote a Python script to automate it:

# Source: https://github.com/selfteaching/the-craft-of-selfteaching/blob/master/Part.3.D.indispensable-illusion.ipynb
import re
import os

# List the regular files in the current directory, in sorted order
files = [f for f in os.listdir('.') if os.path.isfile(f)]
files.sort()
for f in files:
    if '.ipynb' in f:
        with open(f, 'r') as file:
            text = file.read()
            pttn = r'"# (.*)"\n'
            r = re.findall(pttn, text)
            if len(r) > 0:
                print(f'> - [{f.replace(".ipynb", "")}(**{r[0]}**)]({f})')  # generate markdown

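The key regular expression here is "# (.*)"\n. It matches a double quote, a # followed by a space, any run of characters (captured as the title), a closing double quote, and a newline. In other words, it picks out a top-level Markdown heading as it is stored in the notebook file.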
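To see what the pattern r'"# (.*)"\n' from the script captures, here is a minimal sketch run against a made-up excerpt of a notebook's raw text. The excerpt is an assumption for illustration; a real .ipynb file contains many more fields.

```python
import re

# Hypothetical excerpt of a notebook's raw JSON text
raw = '''  "source": [
    "# Regular Expressions"
  ]
'''

pttn = r'"# (.*)"\n'

# The group (.*) captures the heading text between "# " and the closing quote
titles = re.findall(pttn, raw)
print(titles)  # ['Regular Expressions']

# A second-level heading is skipped: the quote is followed by "##", not "# "
sub_titles = re.findall(pttn, '"## Details"\n')
print(sub_titles)  # []
```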
