A regular expression is a pattern that describes a set of strings. They are constructed analagously to arithmetic expressions, by using various operators and constructs to make large expressions from smaller ones.

There's actually quite a few different flavors of regular expressions, or regex for short. In practise, the main features and feel between the different implementations is pretty standard. So if you know one, you know almost all. But it does cause enough confusion that most people play regex by feel...

If we restrict ourselves to GNU regex, that is, regex used by the grep
family, sed, gawk, vim and others, there are still two flavors: basic (which
is mostly classic regex from before Linux) and extended, which *for the
most part* is implemented by all relatively new software including GNU
stuff and scripting languages like Perl and Tcl.

AMatches A.

MiSsIsSiPpI-5==Matches MiSsIsSiPpI-5==.

Cat.MatchesCatfollowed by any single letter, character or number, likeCats,Cat1orCat`.

The^HelloMatches any line that begins with the wordHello.

Goodbye$Matches any line that ends with the wordGoodbye.

The^AB*$Matches anAon a line by itself. It also matches a line beginning with an A followed by any number ofB's.

^AB+$Matches anAfollowed by one or moreB's.

Thex{1,5}A string of x's: at least one, and up to 5.

Finally,x{5,}A string of 5 or more x's

x{10}A string of exactly 10 x's

The[a5Y]Matches ana,5orY

When a[0-9]{3}A three digit number.

[^a-z24680]Matches a single character which isn't a lowercase letter or even number.

^\...A line beginning with a period followed by any two characters.

\(.\)\1Matches any double letter, like aa, bb, etc.

You can tag more than one string. The tags are refered to by^\(.\).*\1$Matches lines which end with the same letter they started with.

\(.\)\(.\)\2\1Matches all 4 letter palindromes like deed, noon and peep.

This will print out all 4 letter palindromes listed in /usr/share/dict/words. Can you see whygrep -e '^\(.\)\(.\)\2\1$' /usr/share/dict/words

We need a new regex for each palindrome of a different length. Can we construct a regex that will match a palindrom of arbitrary letter length? The answer turns out to be no.grep -e '^\(.\)\(.\)\(.\)\3\2\1$' /usr/share/dict/words

A DFA is much like a Turing machine. The machine has a one-way read-only tape, which contains the word to be recognized. It has a read head, a finite number of states, and a transition function. At any given step, the read head reads a letter, and the machine switches states according to the transition function (given current state, and the letter). If, after reading the word, the DFA ends up in one of the pre-determined "accept" states, then the DFA accepts the word. Otherwise, it rejects the word.

There is a duality between regex and Deterministic Finite Automata (DFA). Any language (i.e. collection of words) which can be recognized by one can be recognized by the other.

One can think of the states roughly as some sort of memory for the DFA.

Now suppose that a DFA has N states. Then after some thinking, you should be able to convince yourself that it would not be able to recognize all palindromes of length 2N+1, because the DFA does not have enough "memory" to handle all possible cases.