
Posted by: tcpip (高级草包.org), Board: Linux
Subject: A nice article: an introduction to egrep and regular expressions
Posting site: 哈工大紫丁香 (Sat Nov 11 23:04:37 2000), forwarded

Fun with Regular Expressions
============================
by Adrian J. Chung (c) 2000 (ajchung@email.com)

Many of the text processing GNU tools include a powerful pattern matching
mechanism called Regular Expressions. A more or less complete implementation
is supported by a utility called "egrep". Other text utilities such as
"gawk" also support regexes. Some tools like "sed" and "grep" support an
older, less powerful regex syntax. Perl implements a regex variant that has
proven so popular that Python's regexes use much the same syntax.

For the rest of this article we'll be using the POSIX standard regular
expressions as supported by "egrep".

"egrep" is a tool that searches for substrings. The following command will
output the lines of the /etc/inittab file that contain the string "ini".
Note that "ini" does not have to appear as a word by itself.

$ egrep ini /etc/inittab

We can search for digit strings also:

$ egrep 321 /etc/termcap

One may be interested only in the lines that start with the given target
string. We use the special character ^ which will match the beginning of a
line. To search a dictionary for words that start with "rege":

$ egrep ^rege /usr/dict/words

Similarly, special character $ can be used to match the end of a line. We
can now answer the age old question -- what words end in "gry"?

$ egrep 'gry$' /usr/dict/words

The single quotes are needed to prevent the command shell (bash in this
case) from interpreting the special characters before passing them to
egrep. To match one of these special characters literally, so that it loses
its special meaning within egrep, precede it with a backslash:

$ egrep '\^Q' /etc/termcap

This matches a two character substring of ^ followed by "Q".
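
To try the anchors and the backslash escape without a dictionary file, here
is a minimal sketch. It assumes a POSIX shell and uses "grep -E", the
standard spelling of "egrep"; the word list and the variable names are made
up for illustration.

```shell
# Hypothetical sample input standing in for /usr/dict/words.
words='angry
hungry
regent
unregenerate
gryphon'

# ^ anchors the pattern at the start of a line, $ at the end.
starts=$(printf '%s\n' "$words" | grep -E '^rege')
ends=$(printf '%s\n' "$words" | grep -E 'gry$')

# A backslash turns ^ into an ordinary character to match.
caret=$(printf '%s\n' 'press ^Q to quit' 'Quit' | grep -E '\^Q')

printf 'starts with rege:\n%s\n' "$starts"
printf 'ends in gry:\n%s\n' "$ends"
printf 'contains a literal ^Q:\n%s\n' "$caret"
```

"unregenerate" and "gryphon" contain the target strings but not at the
anchored position, so neither is reported.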

A single period matches any single character. To find all three letter words
starting with "p" and ending with "n":

$ egrep '^p.n$' /usr/dict/words

Note that this pattern matches both the start and end of the line in order
to force an exact match rather than just a substring. "egrep" has an option
to enable this behavior so that the ^ and $ become unnecessary:

$ egrep -x 'p.n' /usr/dict/words
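
The equivalence between explicit anchors and "-x" can be checked on a small
input. This is only a sketch: the word list and variable names are
hypothetical, and "grep -E" stands in for "egrep".

```shell
# Hypothetical sample input; the names below are for illustration only.
words='pan
pin
pun
spin
pint'

anchored=$(printf '%s\n' "$words" | grep -E '^p.n$')   # explicit anchors
whole=$(printf '%s\n' "$words" | grep -E -x 'p.n')     # -x: whole line must match
substring=$(printf '%s\n' "$words" | grep -E 'p.n')    # no anchors: substring match

printf 'anchored:\n%s\n' "$anchored"
printf 'with -x:\n%s\n' "$whole"
printf 'substring:\n%s\n' "$substring"
```

The first two commands should report the same three words, while the
unanchored pattern also reports "spin" and "pint".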

Sometimes the period is too general and one needs to match a more restricted
range of characters. This command:

$ egrep 't[aeiou]p$' /usr/dict/words

outputs all words ending in "tap", "tep", "tip", "top", or "tup". Instead
of enumerating all matching characters, a range can be specified:

$ egrep ':[3-5][0-9]:' /etc/termcap

This extracts lines containing time stamps whose minutes field falls in the
latter half of the hour (30 through 59). Ranges also work with letters:

$ egrep '^[n-t].[aeiou]$' /usr/dict/words

This finds all three letter words beginning with the letter "n", "o", "p",
..., or "t", and ending with a vowel. Ranges and enumeration may be mixed:

$ egrep -x '.[aeit-z]' /usr/dict/words

lists all two letter words ending in "a", "e", "i", or any letter from "t"
through to "z". If the leftmost character between the [] is a ^ then the
match is negated:

$ egrep -ix '[^n-t][^aeiou].' /usr/dict/words

finds all three letter words beginning with any character other than the
letters "n" through "t", whose second character is not a vowel. The "-i"
option makes the match case insensitive, so words like "Sri" and "DEC" are
omitted.
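
The bracket forms above can be tried on a handful of words. This sketch
assumes a POSIX shell with "grep -E" in place of "egrep"; the word list is
hypothetical, and the C locale is selected so that a range like [n-t]
follows plain ASCII order.

```shell
# Use the C locale so that ranges such as [n-t] follow plain ASCII order.
export LC_ALL=C

# Hypothetical sample input standing in for /usr/dict/words.
words='tap
tip
stop
type
Sri
DEC
cry
why'

# Enumeration: "t", one vowel, then "p" at the end of the line.
vowel_p=$(printf '%s\n' "$words" | grep -E 't[aeiou]p$')

# Negation plus -i: first character outside n-t, second not a vowel.
negated=$(printf '%s\n' "$words" | grep -E -ix '[^n-t][^aeiou].')

printf 't[aeiou]p$:\n%s\n' "$vowel_p"
printf '[^n-t][^aeiou].:\n%s\n' "$negated"
```

With "-i" in effect, "Sri" is rejected because "S" counts as a letter in the
n-t range, and "DEC" is rejected because "E" counts as a vowel.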

The '\<' and '\>' combinations match the beginning and end of complete words
respectively:

$ egrep '\<r'

matches lines that contain words beginning with "r".

$ egrep 't\>' /etc/inittab

matches lines with words ending in "t".
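
A small sketch of the word anchors follows. Note that '\<' and '\>' are
extensions rather than part of the POSIX syntax; GNU grep supports them, and
that is what this example assumes. The sample lines and variable names are
invented for illustration.

```shell
# Hypothetical sample lines.
lines='red car
a bright day
start here
trip'

# \< matches at the start of a word, \> at the end of one.
r_words=$(printf '%s\n' "$lines" | grep -E '\<r')
t_words=$(printf '%s\n' "$lines" | grep -E 't\>')

printf 'a word starts with r:\n%s\n' "$r_words"
printf 'a word ends with t:\n%s\n' "$t_words"
```

"trip" contains both letters but neither sits at the required edge of the
word, so it is not reported by either command.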

The * is a repetition operator. The pattern that immediately precedes it
can match zero or more times:

$ egrep 'ho*t$' /usr/dict/words

matches words such as "fight", "hot", and "hoot". It would even match
"hooooot" if that were in the dictionary.

$ egrep '^s.*ho*t$' /usr/dict/words

shows the effect of * on special patterns like the single period. It is the
equivalent of having zero or more periods in the regular expression. Words
like "sleight", "snapshot", and "sharpshoot" match.

$ egrep -ix '[^e]*e[^e]*' /usr/dict/words

finds all words that use exactly one "e".

$ egrep -ix '[a-ep]*' /usr/dict/words

finds all words spelt using only the letters "a", "b", "c", "d", "e", and
"p". Similarly, + makes the preceding pattern match one or more times.
The expression 'ot+o$' is equivalent to 'ott*o$'.

$ egrep '^s.*ho+t$' /usr/dict/words

matches "shot" and "shoot" but not "sleight".

$ egrep -i 'f.+f' /usr/dict/words

lists words that contain two non-adjacent f's.
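
The difference between * and + is easiest to see side by side. The sketch
below assumes "grep -E" as a stand-in for "egrep" and uses a made-up word
list; the variable names are hypothetical too.

```shell
# Hypothetical sample input standing in for /usr/dict/words.
words='fight
hat
shot
shoot
sleight
snapshot'

# o* allows zero o's, so "ht" at the end of a line is enough.
loose=$(printf '%s\n' "$words" | grep -E 'ho*t$')

# Same idea, but the word must also start with "s".
star=$(printf '%s\n' "$words" | grep -E '^s.*ho*t$')

# o+ demands at least one "o" between the "h" and the final "t".
plus=$(printf '%s\n' "$words" | grep -E '^s.*ho+t$')

printf 'ho*t$:\n%s\n' "$loose"
printf '^s.*ho*t$:\n%s\n' "$star"
printf '^s.*ho+t$:\n%s\n' "$plus"
```

Only the last command drops "sleight", since it has no "o" between its "h"
and "t".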

Here are some more repetition operators. List all three letter words:

$ egrep -x '.{3}' /usr/dict/words

All words at least 19 letters long:

$ egrep -x '.{19,}' /usr/dict/words

And words between 11 and 14 letters in length, inclusive:

$ egrep -x '.{11,14}' /usr/dict/words

Find words with at least 4 consecutive vowels:

$ egrep '[aeiuo]{4}' /usr/dict/words

Find a word with six consecutive consonants, excluding "y":

$ egrep -i '[^aeiouy]{6}' /usr/dict/words

(People that frequent central London will know this.)

The ? is equivalent to {0,1}:

$ egrep -x 'po?l.' /usr/dict/words

matches "ply", "pole", "poll" and "polo", but not "pools".
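
The counted repetitions and the ? operator can be exercised together. This
is a sketch under the usual assumptions: "grep -E" replaces "egrep", and the
word list and variable names are hypothetical.

```shell
# Hypothetical sample input standing in for /usr/dict/words.
words='cat
dog
queue
sequoia
rhythms
catchphrase
ply
pole
pools'

three=$(printf '%s\n' "$words" | grep -E -x '.{3}')            # exactly 3 characters
long=$(printf '%s\n' "$words" | grep -E -x '.{11,}')           # 11 or more characters
vowels=$(printf '%s\n' "$words" | grep -E '[aeiou]{4}')        # 4 vowels in a row
cons=$(printf '%s\n' "$words" | grep -E -i '[^aeiouy]{6}')     # 6 in a row, no vowel or y
optional=$(printf '%s\n' "$words" | grep -E -x 'po?l.')        # the "o" may be absent

printf '.{3}:\n%s\n' "$three"
printf '.{11,}:\n%s\n' "$long"
printf '[aeiou]{4}:\n%s\n' "$vowels"
printf '[^aeiouy]{6}:\n%s\n' "$cons"
printf 'po?l.:\n%s\n' "$optional"
```

"rhythms" has no vowels at all, yet its "y" breaks the run, so it does not
satisfy the six-consonant pattern.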

A | in the regular expression acts like a boolean OR:

$ egrep -ix 'p.n|b.+ght' /usr/dict/words

outputs words matching either 'p.n' ("pin", "pan", etc.) or 'b.+ght'
("brought", "blight", etc.).

$ egrep -x 'b(ea|oo).' /usr/dict/words

matches "bead" and also "book". Patterns joined with | need not be the same
length:

$ egrep '^s(ha|o)p' /usr/dict/words

matches both "shape" and "soprano". Note the use of parentheses to
delimit the reach of the | operator. The ( ) can also define the scope of
the repetition operators:

$ egrep -i '([aeiou][^aeiou]){7}' /usr/dict/words

lists words containing 7 consecutive pairs of a vowel followed by a
non-vowel.

$ egrep -ix '[^s]*s([^s]+s){2,}[^s]*' /usr/dict/words

finds all words containing at least 3 occurrences of the letter "s", none of
which are adjacent to each other.
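
Alternation and grouping can be checked on a short list as well. The sketch
assumes "grep -E" in place of "egrep"; the words and variable names are
chosen only for illustration.

```shell
# Hypothetical sample input standing in for /usr/dict/words.
words='pin
pan
brought
bead
book
shape
soprano
shop
suspends
sicknesses'

# Top-level | : the whole line matches either alternative.
either=$(printf '%s\n' "$words" | grep -E -ix 'p.n|b.+ght')

# Parentheses limit the | to the middle of the pattern.
grouped=$(printf '%s\n' "$words" | grep -E -x 'b(ea|oo).')
prefix=$(printf '%s\n' "$words" | grep -E '^s(ha|o)p')

# Parentheses also scope {2,}: at least three separated s's.
three_s=$(printf '%s\n' "$words" | grep -E -ix '[^s]*s([^s]+s){2,}[^s]*')

printf 'p.n|b.+ght:\n%s\n' "$either"
printf 'b(ea|oo).:\n%s\n' "$grouped"
printf '^s(ha|o)p:\n%s\n' "$prefix"
printf 'three separated s:\n%s\n' "$three_s"
```

"shop" fails '^s(ha|o)p' because "s" is followed by "ho", and "sicknesses"
fails the last pattern because two of its s's are adjacent.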

Parentheses have another important use. Any text that matches the pattern
enclosed in the () is stored temporarily. This text can then be referred to
later in the same expression. In the following command the parentheses
enclose a pattern matching any single vowel:

$ egrep '([aeiou])\1' /usr/dict/words

The \1 now refers to whichever vowel was matched, hence this regex
matches words containing "aa", "ee", "ii", "oo", or "uu". When there is more
than one pattern in parentheses, the matched text is referenced by \1, \2,
etc.

$ egrep -x '(.)(.)\2\1' /usr/dict/words

matches words like "deed", "noon", etc. Each (.) matches a single letter,
and the \2\1 must match these same letters in the reverse of the order in
which they first appeared.

$ egrep -ix '(.)(.)(.).*\3\2\1' /usr/dict/words

lists words whose last three letters are the same as the first three letters
reversed.

A more complicated example:

$ egrep '^(.).+\1\1.+\1$' /usr/dict/words

What does it do? It returns words like "enfeeble", "gagging", and
"sicknesses". The parenthesized pattern matches the first letter in the
word. Any matching text must also end in this letter, and must also contain
this same letter doubled somewhere in the middle.
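
The back-reference patterns so far can be compared on one small input. A
caveat: back-references in extended regular expressions are not required by
POSIX, so this sketch assumes GNU grep (invoked as "grep -E"). The word list
and variable names are hypothetical.

```shell
# Hypothetical sample input standing in for /usr/dict/words.
words='deed
noon
book
enfeeble
gagging
detected'

# \1 repeats whichever vowel the parentheses captured.
doubled=$(printf '%s\n' "$words" | grep -E '([aeiou])\1')

# Two captures replayed in reverse order: a four letter palindrome.
mirror=$(printf '%s\n' "$words" | grep -E -x '(.)(.)\2\1')

# First letter, doubled in the middle, and again at the end.
framed=$(printf '%s\n' "$words" | grep -E '^(.).+\1\1.+\1$')

printf 'doubled vowel:\n%s\n' "$doubled"
printf 'abba shape:\n%s\n' "$mirror"
printf 'first letter doubled inside:\n%s\n' "$framed"
```

"detected" begins and ends with "d" but never doubles it, so the last
pattern rejects it.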

Patterns within parentheses can be of any length:

$ egrep -i '(.{5}).+\1' /usr/dict/words

lists words in which the same run of 5 characters appears twice, separated
by at least one other character.

And finally a really advanced example:

$ egrep -ix '(.).*(\1.+).*\2' /usr/dict/words

finds words whose last few letters are also found, adjacent and in the same
order, somewhere in the middle of the word; the initial letter of this group
is also identical to the first letter of the word.
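
One way to see this pattern at work is to feed it a word that fits and two
that do not. As before, this is a sketch that assumes GNU grep for the
back-references; the three words and the variable name are picked purely for
illustration.

```shell
# Hypothetical sample input.
words='senselessness
sensible
restless'

# (.)       captures the first letter
# (\1.+)    a later group that starts with that same letter
# \2        the word must end with a repeat of that group
hits=$(printf '%s\n' "$words" | grep -E -ix '(.).*(\1.+).*\2')

printf 'matches:\n%s\n' "$hits"
```

In "senselessness" the first letter is "s", the group is the "ss" in the
middle, and the word ends with "ss" again. The other two words have no such
repeated group.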

For more details see the "regex" info page or the regex(7) man page (type
"man 7 regex"). The "awk" and "egrep" documentation is also worth checking
out.


※ Source:·哈工大紫丁香 bbs.hit.edu.cn·[FROM: diamond.hit.edu.cn]
