3.1 RegEx

Q1: Patterns

  • grep "[mM].*[Kk]" file.txt Matches any line containing m or M followed, anywhere later on the line, by k or K. The match can be a substring of a single word or span several words.
  • grep "\<[mM].*[Kk]\>" file.txt Similar to the above, but the m/M must be at the beginning of a word and the k/K at the end of a word (no partial-word match).
  • grep "[mM].\{2\}[Kk]" file.txt Any four-character substring that starts with m/M and ends with k/K. Without -E the braces must be escaped; with grep -E the pattern is "[mM].{2}[Kk]".
  • grep -v "[mM].\{2\}[Kk]" file.txt The inverse of the previous pattern: -v inverts the match, so only the non-matching lines are printed. This is also a good time to remind them about the man command: run man grep and search with /-v.
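The four patterns above can be compared side by side on a small file. The following is only a sketch: the sample lines are made up for illustration, and the word anchors \< and \> assume GNU grep.

```shell
# Hypothetical sample file; the lines are chosen only to exercise each pattern.
sample=$(mktemp)
printf '%s\n' 'mask' 'Mark went home' 'bookmark it' 'make a kite' 'apple' > "$sample"

# Any line with m/M followed somewhere later by k/K.
q1_any=$(grep '[mM].*[Kk]' "$sample")

# The m/M must start a word and the k/K must end one.
q1_word=$(grep '\<[mM].*[Kk]\>' "$sample")

# Exactly two characters between m/M and k/K.
q1_four=$(grep '[mM].\{2\}[Kk]' "$sample")

# Inverted match: lines WITHOUT such a four-character substring.
q1_inverse=$(grep -v '[mM].\{2\}[Kk]' "$sample")

printf 'any:\n%s\n\nword:\n%s\n\nfour:\n%s\n\ninverse:\n%s\n' \
  "$q1_any" "$q1_word" "$q1_four" "$q1_inverse"
rm -f "$sample"
```

"bookmark it" is the interesting line: it passes the first and third patterns, but not the second, because its m is in the middle of a word.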

Q1.b: Extended RegEx

  • [Jj]effer?(y|ie)s?
  • [hH]itch[ei]ng?
  • [hH][ei]a?rd
  • [dD]i(x|ck)s?(on)?
  • [mM][ac]gh?ee
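One way to let them check an extended pattern is to feed it a list of candidate spellings. The sketch below does this for the fourth pattern; the candidate names are made up for illustration and are not taken from the question. The -x option makes grep accept a line only if the whole line matches.

```shell
# Hypothetical candidate spellings, one per line.
names=$(printf '%s\n' 'Dixon' 'Dickson' 'Dicks' 'dix' 'Dickens')

# Whole-line match: every character of the name must be covered by the pattern.
q1b_full=$(printf '%s\n' "$names" | grep -E -x '[dD]i(x|ck)s?(on)?')

# Substring match: "Dickens" is also counted, because it contains "Dick".
q1b_sub_count=$(printf '%s\n' "$names" | grep -E -c '[dD]i(x|ck)s?(on)?')

printf 'whole-line matches:\n%s\nlines with a substring match: %s\n' \
  "$q1b_full" "$q1b_sub_count"
```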

Q1.c

This question requires two additional commands, date and cut, plus a pipe to pass the output of one command to the next. The date command retrieves the current date, while cut parses the result. Let them research these commands via man. Since there is no sample text file, they can create their own. Let's say the text looks like this:

Michael:12/03/2000
Josh:25/12/2010
Jennifer:10/05/2018

where each line starts with a first name (for simplicity), followed by a ':' and a date in the format "dd/mm/yyyy". A suitable command for this text file is:

grep "$(date +%d/%m/%Y)" text.txt | cut -d ":" -f 1

This command matches the lines containing the current date and pipes them to cut, which splits each line on the delimiter ":" and keeps the first field (the name of the person). Note the capital %Y: it prints the four-digit year that the "yyyy" format needs, whereas %y prints only the last two digits. For further reading about these commands, see man date and man cut.
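Because the result depends on the day the command is run, it helps to wrap the same grep | cut pipeline in a small function that takes the date as an argument. This is only a sketch: the function name names_on and the extra "Today" entry are made up for illustration, and %Y is used so the year has four digits like the dates in the file.

```shell
# names_on DATE FILE: print the names on lines that contain DATE.
names_on() {
  grep "$1" "$2" | cut -d ':' -f 1
}

people=$(mktemp)
printf '%s\n' 'Michael:12/03/2000' 'Josh:25/12/2010' 'Jennifer:10/05/2018' > "$people"

# Hypothetical extra entry dated today, so the date-driven call finds something.
today=$(date +%d/%m/%Y)
printf 'Today:%s\n' "$today" >> "$people"

q1c_fixed=$(names_on '25/12/2010' "$people")
q1c_today=$(names_on "$today" "$people")

printf 'fixed date: %s\ncurrent date: %s\n' "$q1c_fixed" "$q1c_today"
rm -f "$people"
```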

Q1.d

Any pattern match that depends on an earlier part of the match needs backreferencing. In this question we need to match three consecutive identical characters, so the regex has to capture the first character and then require that same character two more times. The simplest regex for this question, in extended syntax (grep -E), is: (.)\1\1
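A quick way to try the backreference is shown below. The input lines are invented for illustration. Backreferences in extended syntax are assumed to be available (they are in GNU grep); the basic-syntax form \(.\)\1\1 is shown next to it for comparison.

```shell
# Hypothetical test lines: three contain a tripled character, two do not.
lines=$(printf '%s\n' 'aaa' 'bookkeeper' 'zzz top' 'www.example.com' 'abcabc')

# Extended syntax: plain parentheses form the group.
q1d_ere=$(printf '%s\n' "$lines" | grep -E '(.)\1\1')

# Basic syntax: the parentheses are escaped.
q1d_bre=$(printf '%s\n' "$lines" | grep '\(.\)\1\1')

printf '%s\n' "$q1d_ere"
```

"bookkeeper" is not matched: it has three doubled letters in a row, but no single character repeated three times.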

Q1.e

This question requires regex to find any string containing at least 50 characters. The simplest answer is:

grep -E ".{50,}" text.txt

This might lead them to ask "How about fewer than 50 characters?". Keep in mind that the following regex is NOT the answer:

grep -E ".{0,49}" text.txt

This is because this regex still matches a substring inside a longer line (it even matches the empty string), so every line is selected. The correct answer is:

grep -E "^.{0,49}$" text.txt

This works because the anchors force the match to span the whole line, from its start (^) to its end ($), so only lines of at most 49 characters match. Any line containing 50 characters or more is not matched by this regex.
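The difference between the three commands shows up clearly on a file with lines of known length. The sketch below generates lines of 49, 50, and 60 characters (zero-padded numbers, chosen only for convenience) and counts the matching lines with grep -c.

```shell
lens=$(mktemp)
{
  printf '%049d\n' 0   # a 49-character line
  printf '%050d\n' 0   # a 50-character line
  printf '%060d\n' 0   # a 60-character line
} > "$lens"

q1e_long=$(grep -E -c '.{50,}' "$lens")       # lines with at least 50 characters
q1e_naive=$(grep -E -c '.{0,49}' "$lens")     # unanchored: matches every line
q1e_short=$(grep -E -c '^.{0,49}$' "$lens")   # anchored: lines under 50 characters

printf 'long=%s naive=%s short=%s\n' "$q1e_long" "$q1e_naive" "$q1e_short"
rm -f "$lens"
```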

Q2.a

  • s/ */ /g

Because " *" also matches the empty string, a single space is inserted at positions where there is no space (for example, between adjacent letters), and each run of consecutive spaces is replaced by a single space. The 'g' (global) flag at the end applies the rule to all occurrences on the line, not only the first. Without a range, 's' only affects the current line, where the cursor is.

  • s/^/ /

Adds a single space character at the beginning of the line. Keep in mind that '^' means the beginning of the line.

  • %s/^[0-9][0-9]* //

Matches a number at the beginning of the line (one or more digits) together with the single space after it, and removes both. One good application is removing line numbers. The "%" prefix is a range meaning every line in the file, so "%s" substitutes on all lines.

  • s/b[aeio]g/bug/g

Replaces every bag, beg, big, or bog with bug.

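  • s/t\([aou]\)g/h\1t/g

The group \( \) together with \1 means this involves backreferencing; in the editor's default (basic) pattern syntax the grouping parentheses must be escaped with backslashes. This substitutes tag with hat, tog with hot, and tug with hut.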
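Outside the editor, several of these substitutions can be tried with sed, which accepts the same s/pattern/replacement/flags form. This is a stand-in for the editor commands rather than the editor itself, and the input strings are made up for illustration. In sed's basic syntax the group is written \( \).

```shell
# bag, beg, big, bog all become bug.
q2a_bug=$(printf 'bag beg big bog\n' | sed 's/b[aeio]g/bug/g')

# Backreference: the captured vowel is reused in the replacement ("tig" is untouched).
q2a_hat=$(printf 'tag tog tug tig\n' | sed 's/t\([aou]\)g/h\1t/g')

# Strip a leading number and the single space after it.
q2a_num=$(printf '12 first line\n' | sed 's/^[0-9][0-9]* //')

# Insert one space at the start of the line.
q2a_lead=$(printf 'text\n' | sed 's/^/ /')

printf '%s\n%s\n%s\n[%s]\n' "$q2a_bug" "$q2a_hat" "$q2a_num" "$q2a_lead"
```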

Q2.b

Assume we only need to remove everything from the double forward slash "//" to the end of the line. We can use the command:

%s/\/\/.*//

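The ".*" matches all the text after "//" up to the end of the line, so the whole comment is removed. Each forward slash in the pattern is escaped as \/ because / is also the delimiter of the substitute command.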
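The same substitution can be tried on a pipeline with sed as a stand-in for the editor command. The input lines are invented for illustration. The second call shows the limit of the simplifying assumption: a "//" inside a URL is treated as a comment start as well.

```shell
# Strip everything from // to the end of each line.
q2b=$(printf '%s\n' 'int x = 1; // counter' '// whole-line comment' 'return x;' \
  | sed 's/\/\/.*//')

# Caveat of the assumption: the // inside a URL is cut off too.
q2b_url=$(printf 'see http://example.com\n' | sed 's/\/\/.*//')

printf '%s\n---\n%s\n' "$q2b" "$q2b_url"
```

The text before the comment, including its trailing space, is left in place, and a whole-line comment becomes an empty line.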

Q2.c

In order to capture whole words only, they need to use the word-boundary anchors "\<" (beginning of word) and "\>" (end of word). One example from the lecture note is:

%s/\<UNIX\>/Linux/gc

This will replace the whole word "UNIX" with "Linux". The "%" applies it to every line in the file, 'g' to every occurrence on a line, and 'c' asks the user to confirm each occurrence (letter 'c' for confirmation).
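The effect of the word anchors can be seen by running the substitution with and without them. The sketch uses sed as a non-interactive stand-in (so there is no 'c' confirmation), assumes GNU sed for \< and \>, and uses an invented input line.

```shell
line='UNIX and UNIXes, not MYUNIX'

# With anchors: only the standalone word is replaced.
q2c_word=$(printf '%s\n' "$line" | sed 's/\<UNIX\>/Linux/g')

# Without anchors: every occurrence is replaced, even inside longer words.
q2c_any=$(printf '%s\n' "$line" | sed 's/UNIX/Linux/g')

printf '%s\n%s\n' "$q2c_word" "$q2c_any"
```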

Q3

The sed command is a stream editor that transforms text from an input stream (a file or a pipeline). Regex is very useful for matching text during the transformation. Feel free to read this online resource: https://www.grymoire.com/Unix/Sed.html

Q3.a

You can refer to question 1(e) for a similar pattern. One possible answer is:

cat text.txt | sed -r -n 's/^.{0,49}$/&/p'

-r enables extended regular expressions. -n suppresses the default printing of every line. The trailing 'p' prints a line only when a substitution was made on it. The character '&' refers to the matched text. In this example the line is replaced by itself, so nothing is modified; we only print the lines that match.
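A short check of this command on generated input is sketched below (one short line and one 60-character line, both made up for illustration; sed -r is assumed to be available, as in GNU sed). The second call uses an address with the p command, which selects the same lines without a substitution.

```shell
input=$(printf 'short line\n%060d\n' 0)

# Substitute each short line with itself and print only where that succeeded.
q3a_subst=$(printf '%s\n' "$input" | sed -r -n 's/^.{0,49}$/&/p')

# Same selection written as an address followed by p.
q3a_addr=$(printf '%s\n' "$input" | sed -r -n '/^.{0,49}$/p')

printf '%s\n' "$q3a_subst"
```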

Q3.b

We can use backreferencing to find the consecutive identical characters. Since we need to count the occurrences, and grep -c counts matching lines rather than matches, the strategy is to put each occurrence on its own line and then use grep with the -c option to count the lines:

cat text.txt | sed -r -n 's/(.)\1\1/&\n/gp' | grep -E '(.)\1\1' -c

Notice that &\n appends a newline after every occurrence, so each occurrence ends up on its own line.
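The reason for splitting the lines first becomes visible when a line holds more than one occurrence. In the sketch below (invented input, GNU sed assumed for \n in the replacement), the first line contains two triples, so counting lines directly gives a smaller number than counting occurrences.

```shell
input=$(printf '%s\n' 'aaa bbb' 'no triples here' 'xxxx')

# Counting matching LINES only: the two triples on line 1 count once.
q3b_lines=$(printf '%s\n' "$input" | grep -E -c '(.)\1\1')

# Counting OCCURRENCES: split after each match, then count lines.
q3b_occ=$(printf '%s\n' "$input" \
  | sed -r -n 's/(.)\1\1/&\n/gp' | grep -E -c '(.)\1\1')

printf 'lines=%s occurrences=%s\n' "$q3b_lines" "$q3b_occ"
```

Note that "xxxx" counts as one occurrence, because matches do not overlap.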

Q3.c

The first issue is that we need a semicolon ';' between the substitutions. The second issue is that the order matters: if we substitute "compute" with "calculate" first, every "computer" becomes "calculater" and the second substitution never finds it. A better solution is:

sed -r 's/computer/host/g ; s/compute/calculate/g' file.txt

-r is to enable Extended Regular Expression.
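The ordering problem can be reproduced on a single line. The sentence below is made up for illustration; the two commands differ only in the order of the substitutions.

```shell
text='a computer can compute'

# Wrong order: "compute" is rewritten first, so "computer" turns into "calculater".
q3c_wrong=$(printf '%s\n' "$text" \
  | sed -r 's/compute/calculate/g ; s/computer/host/g')

# Right order: the longer word is handled before its prefix.
q3c_right=$(printf '%s\n' "$text" \
  | sed -r 's/computer/host/g ; s/compute/calculate/g')

printf '%s\n%s\n' "$q3c_wrong" "$q3c_right"
```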

Q3.d

We can use a strategy similar to the one in question 3(b):

cat text.txt | sed -r -n 's/encryption/&\n/gp' | grep -E 'encryption' -c
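As a sanity check, the pipeline can be run on a few invented lines where the word appears twice on one line. The sketch assumes GNU sed for \n in the replacement.

```shell
input=$(printf '%s\n' 'encryption beats no encryption' 'plain text' 'encryption')

# Lines containing the word (two of the three lines).
q3d_lines=$(printf '%s\n' "$input" | grep -E -c 'encryption')

# Occurrences of the word (three in total).
q3d_occ=$(printf '%s\n' "$input" \
  | sed -r -n 's/encryption/&\n/gp' | grep -E -c 'encryption')

printf 'lines=%s occurrences=%s\n' "$q3d_lines" "$q3d_occ"
```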

Q4

This exercise is about applying regex match in C programming. The code has been pre-written, so they can experiment with it.

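#include <regex.h>

#define TRUE 1
#define FALSE 0

/* Returns TRUE if string matches the extended regex pattern, FALSE otherwise. */
int match(const char *string, const char *pattern)
{
    int match = FALSE;
    regex_t re;
    /* REG_NOSUB: only a yes/no answer is needed, not the match offsets. */
    if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) == 0)
    {
        if (regexec(&re, string, (size_t)0, NULL, 0) == 0)
        {
            match = TRUE;
        }
        /* Free the compiled regex whenever regcomp succeeded, not only on a match. */
        regfree(&re);
    }
    return match;
}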