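Linux board (Digest section)
Posted by: tcpip (高级草包.org), Board: Linux
Subject: A nice article: an introduction to egrep and regular expressions
Posting site: HIT Lilac BBS (哈工大紫丁香) (Sat Nov 11 23:04:37 2000), forwarded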
Fun with Regular Expressions
============================
by Adrian J. Chung (c) 2000 (ajchung@email.com)
Many of the GNU text-processing tools include a powerful pattern-matching
mechanism called regular expressions. A more or less complete implementation
is supported by a utility called "egrep". Other text utilities such as
"gawk" also support regexes. Tools like "sed" and "grep" use by default an
older, less powerful syntax (POSIX basic regular expressions). Perl implements
a regex variant that has proven so popular that much the same syntax is used
in Python's regexes.
For the rest of this article we'll be using the POSIX extended regular
expressions as supported by "egrep", along with a few GNU extensions.
"egrep" is a tool that searches for substrings. The following command will
output the lines of the /etc/inittab file that contain the string "ini".
Note that "ini" does not have to appear as a word by itself.
$ egrep ini /etc/inittab
We can search for digit strings also:
$ egrep 321 /etc/termcap
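As a minimal sketch of the substring behaviour described above, the following feeds three made-up lines (invented here, loosely in the style of /etc/inittab) to `grep -E`, which accepts the same patterns as egrep:

```shell
# Made-up sample lines; a real /etc/inittab will differ.
lines=$(printf '%s\n' \
  'id:3:initdefault:' \
  'si::sysinit:/etc/rc.d/rc.sysinit' \
  'l3:3:wait:/etc/rc.d/rc 3')

# "ini" occurs inside "initdefault" and "sysinit", so those two lines match.
matches=$(printf '%s\n' "$lines" | grep -E ini)
printf '%s\n' "$matches"
```

The third line is dropped because it contains no "ini" anywhere.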
One may be interested only in the lines that start with the given target
string. We use the special character ^ which will match the beginning of a
line. To search a dictionary for words that start with "rege":
$ egrep ^rege /usr/dict/words
Similarly, the special character $ can be used to match the end of a line. We
can now answer the age-old question: what words end in "gry"?
$ egrep 'gry$' /usr/dict/words
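Not every system has /usr/dict/words, so here is a small sketch of both anchors run against a hypothetical six-word list (the list is an assumption for illustration; `grep -E` is used as the equivalent of egrep):

```shell
# Hypothetical mini dictionary, one word per line.
words=$(printf '%s\n' angry hungry regent regex forego gryphon)

# ^ ties the match to the start of the line.
starts=$(printf '%s\n' "$words" | grep -E '^rege')
# $ ties the match to the end of the line.
ends=$(printf '%s\n' "$words" | grep -E 'gry$')

printf 'starts with rege:\n%s\n' "$starts"
printf 'ends in gry:\n%s\n' "$ends"
```

"gryphon" contains "gry" but not at the end of the line, so it is not listed among the words ending in "gry".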
The single quotes are needed to prevent the command shell (bash in this
case) from interpreting the special characters before passing them to
egrep. To match one of egrep's own special characters literally, so that it
loses its special meaning within egrep, precede it with a backslash:
$ egrep '\^Q' /etc/termcap
This matches a two character substring of ^ followed by "Q".
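The difference between the anchor and the escaped caret can be sketched with three invented lines (assumed sample text, not taken from /etc/termcap), again using `grep -E` in place of egrep:

```shell
# Invented sample lines for illustration.
lines=$(printf '%s\n' 'Quit with ^Q' 'Question' 'press ^C')

# Unescaped: ^ is an anchor, so any line that begins with Q matches.
anchored=$(printf '%s\n' "$lines" | grep -E '^Q')
# Escaped: \^ is a literal caret, so only a line containing ^Q matches.
literal=$(printf '%s\n' "$lines" | grep -E '\^Q')

printf 'anchor:\n%s\n' "$anchored"
printf 'literal:\n%s\n' "$literal"
```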
A single period matches any single character. To find all three letter words
starting with "p" and ending with "n":
$ egrep '^p.n$' /usr/dict/words
Note that this pattern matches both the start and end of the line in order
to force an exact match rather than just a substring. "egrep" has an option
to enable this behavior so that the ^ and $ become unnecessary:
$ egrep -x 'p.n' /usr/dict/words
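A short sketch of why the anchors (or -x) matter, using a hypothetical word list chosen so that the unanchored pattern picks up extra words:

```shell
# Hypothetical word list.
words=$(printf '%s\n' pan pin pun pain spin open)

anchored=$(printf '%s\n' "$words" | grep -E '^p.n$')  # whole line must be p?n
exact=$(printf '%s\n' "$words" | grep -E -x 'p.n')    # -x does the same job
loose=$(printf '%s\n' "$words" | grep -E 'p.n')       # substring match only

printf 'anchored: %s\n' "$(printf '%s' "$anchored" | tr '\n' ' ')"
printf 'exact:    %s\n' "$(printf '%s' "$exact" | tr '\n' ' ')"
printf 'loose:    %s\n' "$(printf '%s' "$loose" | tr '\n' ' ')"
```

Without anchoring, "spin" and "open" also match because "pin" and "pen" occur inside them.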
Sometimes the period is too general and one needs to match a more restricted
range of characters. This command:
$ egrep 't[aeiou]p$' /usr/dict/words
outputs all words ending in "tap", "tep", "tip", "top", or "tup". Instead
of enumerating all matching characters, a range can be specified:
$ egrep ':[3-5][0-9]:' /etc/termcap
This extracts lines containing a colon-delimited two-digit field from 30 to
59, such as the minutes field of a time stamp in the latter half of the hour.
Ranges also work with letters:
$ egrep '^[n-t].[aeiou]$' /usr/dict/words
This finds all three letter words beginning with the letter "n", "o", "p",
..., or "t", and ending with a vowel. Ranges and enumeration may be mixed:
$ egrep -x '.[aeit-z]' /usr/dict/words
lists all two letter words ending in "a", "e", "i", or any letter from "t"
through to "z". If the leftmost character between the [] is a ^ then the
match is negated:
$ egrep -ix '[^n-t][^aeiou].' /usr/dict/words
finds all three-letter words beginning with any character other than the
letters "n" through "t", with a non-vowel as the second character. The "-i"
option makes the match case insensitive, so words like "Sri" (initial "S")
and "DEC" (second letter "E") are omitted too.
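The three kinds of bracket expression (enumeration, range, negation) can be tried on a small invented word list. This is only a sketch: the words are assumptions, `grep -E` stands in for egrep, and the locale is pinned to C so that a range such as [n-t] means the plain ASCII letters.

```shell
# Pin the locale so that ranges cover exactly the ASCII letters named.
LC_ALL=C; export LC_ALL

# Invented sample words.
words=$(printf '%s\n' tap step tips sea Sri DEC ash elm pry)

# Enumeration: a vowel between "t" and a final "p".
enum=$(printf '%s\n' "$words" | grep -E 't[aeiou]p$')
# Range: three-letter words starting with n..t and ending in a vowel.
range=$(printf '%s\n' "$words" | grep -E '^[n-t].[aeiou]$')
# Negation, case-insensitive: first letter outside n..t, second not a vowel.
negated=$(printf '%s\n' "$words" | grep -E -ix '[^n-t][^aeiou].')

printf 'enum:    %s\n' "$(printf '%s' "$enum" | tr '\n' ' ')"
printf 'range:   %s\n' "$(printf '%s' "$range" | tr '\n' ' ')"
printf 'negated: %s\n' "$(printf '%s' "$negated" | tr '\n' ' ')"
```

"tips" fails the first pattern because the "p" is not at the end of the line. "Sri" and "DEC" fail the last one for the reasons given above.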
The '\<' and '\>' combinations match the beginning and end of complete words
respectively:
$ egrep '\<r' FILE
matches lines that contain words beginning with "r".
$ egrep 't\>' /etc/inittab
matches lines with words ending in "t".
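A sketch of both word-boundary operators on four invented lines. It assumes GNU grep (or another grep that supports \< and \>, which are extensions rather than part of POSIX), with `grep -E` standing in for egrep:

```shell
# Invented sample lines.
lines=$(printf '%s\n' \
  'run level three' \
  'current default' \
  'halt the boot' \
  'start rc scripts')

# \<r : an "r" at the start of a word ("run", "rc").
begin=$(printf '%s\n' "$lines" | grep -E '\<r')
# t\> : a "t" at the end of a word ("current", "halt", "start").
finish=$(printf '%s\n' "$lines" | grep -E 't\>')

printf 'word starts with r:\n%s\n' "$begin"
printf 'word ends with t:\n%s\n' "$finish"
```

"current default" has an "r" only in the middle of a word, and "three" has its "t" only at the start, so neither counts for the respective pattern.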
The * is a repetition operator. The pattern that immediately precedes it
can match zero or more times:
$ egrep 'ho*t$' /usr/dict/words
matches words such as "fight", "hot", and "hoot". It would even match
"hooooot" if that were in the dictionary.
$ egrep '^s.*ho*t$' /usr/dict/words
shows the effect of * on special patterns like the single period. It is the
equivalent of having zero or more periods in the regular expression. Words
like "sleight", "snapshot", and "sharpshoot" match.
$ egrep -ix '[^e]*e[^e]*' /usr/dict/words
finds all words that contain exactly one "e".
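The three uses of * shown so far can be checked against a hypothetical word list (the words are assumptions chosen to include near misses, and `grep -E` is used in place of egrep):

```shell
# Hypothetical word list with some deliberate non-matches.
words=$(printf '%s\n' fight hot hoot photo sleight snapshot eel pen)

# "h", then zero or more "o", then a final "t".
star=$(printf '%s\n' "$words" | grep -E 'ho*t$')
# Same ending, but the word must start with "s"; .* soaks up the middle.
dotstar=$(printf '%s\n' "$words" | grep -E '^s.*ho*t$')
# Exactly one "e": non-e characters, one e, non-e characters.
one_e=$(printf '%s\n' "$words" | grep -E -ix '[^e]*e[^e]*')

printf 'ho*t$:   %s\n' "$(printf '%s' "$star" | tr '\n' ' ')"
printf 's.*ho*t: %s\n' "$(printf '%s' "$dotstar" | tr '\n' ' ')"
printf 'one e:   %s\n' "$(printf '%s' "$one_e" | tr '\n' ' ')"
```

"photo" is rejected by the first pattern because its "hot" is not at the end of the line, and "eel" is rejected by the last one because it has two e's.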
$ egrep -ix '[a-ep]*' /usr/dict/words
finds all words spelt using only the letters "a", "b", "c", "d", "e", and
"p". Similarly, + makes the preceding pattern match one or more times.
The expression 'ot+o$' is equivalent to 'ott*o$'.
$ egrep '^s.*ho+t$' /usr/dict/words
matches "shot" and "shoot" but not "sleight".
$ egrep -i 'f.+f' /usr/dict/words
lists words that contain two f's separated by at least one other character.
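To see how + differs from *, and to check the claimed equivalence of 'ot+o$' and 'ott*o$', here is a sketch over an invented word list (`grep -E` is the equivalent of egrep):

```shell
# Invented sample words.
words=$(printf '%s\n' shot shoot sleight motto photo fluff staff fifth)

# At least one "o" is now required between "h" and "t".
plus=$(printf '%s\n' "$words" | grep -E '^s.*ho+t$')
# t+ and tt* describe the same set of strings.
with_plus=$(printf '%s\n' "$words" | grep -E 'ot+o$')
with_star=$(printf '%s\n' "$words" | grep -E 'ott*o$')
# Two f's with at least one character in between.
two_f=$(printf '%s\n' "$words" | grep -E -i 'f.+f')

printf 'ho+t:  %s\n' "$(printf '%s' "$plus" | tr '\n' ' ')"
printf 'ot+o:  %s\n' "$(printf '%s' "$with_plus" | tr '\n' ' ')"
printf 'ott*o: %s\n' "$(printf '%s' "$with_star" | tr '\n' ' ')"
printf 'f.+f:  %s\n' "$(printf '%s' "$two_f" | tr '\n' ' ')"
```

"staff" is not listed by the last pattern: its only two f's are adjacent, so nothing is left for .+ to match.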
Here are some more repetition operators. List all three letter words:
$ egrep -x '.{3}' /usr/dict/words
All words at least 19 letters long:
$ egrep -x '.{19,}' /usr/dict/words
And words between 11 and 14 letters in length, inclusive:
$ egrep -x '.{11,14}' /usr/dict/words
Find words with at least 4 consecutive vowels:
$ egrep '[aeiuo]{4}' /usr/dict/words
Find a word with six consecutive consonants, excluding "y":
$ egrep -i '[^aeiouy]{6}' /usr/dict/words
(People that frequent central London will know this.)
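The interval operators can be exercised on a handful of words of known length. The list below is an assumption for illustration, and `grep -E` is used as the equivalent of egrep:

```shell
# Sample words of 3, 5, 8, 13 and 22 letters.
words=$(printf '%s\n' cat queue queueing Knightsbridge counterrevolutionaries)

three=$(printf '%s\n' "$words" | grep -E -x '.{3}')        # exactly 3
long=$(printf '%s\n' "$words" | grep -E -x '.{19,}')       # 19 or more
mid=$(printf '%s\n' "$words" | grep -E -x '.{11,14}')      # 11 to 14
vowels=$(printf '%s\n' "$words" | grep -E '[aeiuo]{4}')    # 4 vowels in a row
cons=$(printf '%s\n' "$words" | grep -E -i '[^aeiouy]{6}') # 6 non-vowels in a row

printf '%s\n' "$three" "$long" "$mid" "$vowels" "$cons"
```

An interval counts repetitions of whatever precedes it, so '.{3}' counts arbitrary characters while '[aeiuo]{4}' counts vowels only.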
The ? is equivalent to {0,1}:
$ egrep -x 'po?l.' /usr/dict/words
matches "ply", "pole", "poll" and "polo", but not "pools".
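A quick check of the optional "o", using the words named above plus one extra non-match ("plea" is an added assumption), with `grep -E` in place of egrep:

```shell
words=$(printf '%s\n' ply pole poll polo pools plea)

# "p", an optional "o", "l", then exactly one more character.
opt=$(printf '%s\n' "$words" | grep -E -x 'po?l.')
printf '%s\n' "$opt"
```

"pools" has two o's, and "plea" has one character too many after the "l", so -x rejects both.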
A | in the regular expression acts like a boolean OR:
$ egrep -ix 'p.n|b.+ght' /usr/dict/words
outputs words matching either 'p.n' ("pin", "pan", etc.) or 'b.+ght'
("brought", "blight", etc.).
$ egrep -x 'b(ea|oo).' /usr/dict/words
matches "bead" and also "book". Patterns joined with | need not be the same
length:
$ egrep '^s(ha|o)p' /usr/dict/words
matches both "shape" and "soprano". Note the use of parentheses to
delimit the reach of the | operator. The ( ) can also define the scope of
the repetition operators:
$ egrep -i '([aeiou][^aeiou]){7}' /usr/dict/words
lists words in which a vowel followed by a non-vowel occurs seven times in a
row.
$ egrep -ix '[^s]*s([^s]+s){2,}[^s]*' /usr/dict/words
finds all words containing at least 3 occurrences of the letter "s", none of
which are adjacent to each other.
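Alternation and grouping can be sketched together on one invented word list (the words are assumptions, and `grep -E` stands in for egrep):

```shell
# Invented sample words.
words=$(printf '%s\n' pin pan brought blight bead book shape soprano stop sisters possess)

# Top-level |: either whole alternative may match the line.
alt=$(printf '%s\n' "$words" | grep -E -ix 'p.n|b.+ght')
# Parentheses limit | to the part between "b" and the last character.
group=$(printf '%s\n' "$words" | grep -E -x 'b(ea|oo).')
# Alternatives of different lengths inside one group.
prefix=$(printf '%s\n' "$words" | grep -E '^s(ha|o)p')
# Parentheses as the scope of {2,}: "non-s run, then s", at least twice.
three_s=$(printf '%s\n' "$words" | grep -E -ix '[^s]*s([^s]+s){2,}[^s]*')

printf '%s\n' "$alt" "$group" "$prefix" "$three_s"
```

"possess" has enough s's but they come in adjacent pairs, so only "sisters" satisfies the last pattern.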
Parentheses have another important use. Any text that matches the pattern
enclosed in the () is stored temporarily. This text can then be referred to
later in the same expression. In the following command the parentheses
enclose a pattern matching any single vowel:
$ egrep '([aeiou])\1' /usr/dict/words
The \1 now refers to whichever vowel was matched, hence this regex
matches words containing "aa", "ee", "ii", "oo", or "uu". When there is more
than one parenthesized pattern, the matched text is referenced by \1, \2,
etc., numbered by the order of the opening parentheses.
$ egrep -x '(.)(.)\2\1' /usr/dict/words
matches words like "deed", "noon", etc. Each (.) matches a single letter,
and the \2\1 must match these same letters in the reverse of the order in
which they first appeared.
$ egrep -ix '(.)(.)(.).*\3\2\1' /usr/dict/words
lists words whose last three letters are the same as the first three letters
reversed.
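The three back-reference patterns can be tried on a small hypothetical word list. Back-references in extended regular expressions are an extension rather than part of POSIX, so this sketch assumes GNU grep or a compatible `grep -E`:

```shell
# Hypothetical sample words.
words=$(printf '%s\n' book deed noon redder beet abba level)

# \1 repeats whichever vowel the group matched: a doubled vowel.
double=$(printf '%s\n' "$words" | grep -E '([aeiou])\1')
# Two letters followed by the same two in reverse: 4-letter palindromes.
mirror=$(printf '%s\n' "$words" | grep -E -x '(.)(.)\2\1')
# Last three letters are the first three reversed.
ends=$(printf '%s\n' "$words" | grep -E -ix '(.)(.)(.).*\3\2\1')

printf 'double vowel: %s\n' "$(printf '%s' "$double" | tr '\n' ' ')"
printf 'mirror:       %s\n' "$(printf '%s' "$mirror" | tr '\n' ' ')"
printf 'ends:         %s\n' "$(printf '%s' "$ends" | tr '\n' ' ')"
```

"level" is a palindrome but has only five letters, so the first three and the last three would have to overlap, and the last pattern does not allow that.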
A more complicated example:
$ egrep '^(.).+\1\1.+\1$' /usr/dict/words
What does it do? It returns words like "enfeeble", "gagging", and
"sicknesses". The parenthesized pattern matches the first letter in the
word. Any matching text must also end in this letter, and must also contain
this same letter doubled somewhere in the middle.
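Reading the pattern piece by piece may help. The sketch below runs it over the three words named above plus two near misses ("eagle" and "success" are added assumptions), using `grep -E` with GNU-style back-references:

```shell
words=$(printf '%s\n' enfeeble gagging sicknesses eagle success)

# ^(.)    first letter, remembered as \1
# .+      at least one character
# \1\1    the first letter, doubled
# .+      at least one more character
# \1$     the first letter again, at the end of the line
hits=$(printf '%s\n' "$words" | grep -E '^(.).+\1\1.+\1$')
printf '%s\n' "$hits"
```

"success" comes close, but its doubled "s" sits at the very end, leaving no room for the final '.+\1'.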
Patterns within parentheses can be of any length:
$ egrep -i '(.{5}).+\1' /usr/dict/words
lists words in which the same run of 5 characters occurs twice, with at
least one character in between.
And finally a really advanced example:
$ egrep -ix '(.).*(\1.+).*\2' /usr/dict/words
finds words whose last few letters also appear, adjacent and in the same
order, somewhere earlier in the word, where the first letter of this group
is the same as the first letter of the word.
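One word that satisfies this pattern is "senselessness": it starts with "s", and its final "ss" also occurs earlier in the word. The sketch below checks it alongside two near misses. The word choices are assumptions for illustration, and GNU-style back-references in `grep -E` are assumed:

```shell
words=$(printf '%s\n' senselessness sicknesses tartar)

# (.)      first letter, remembered as \1
# .*       anything
# (\1.+)   a group that starts with that same letter, remembered as \2
# .*       anything
# \2       the group again, ending the line (because of -x)
hits=$(printf '%s\n' "$words" | grep -E -ix '(.).*(\1.+).*\2')
printf '%s\n' "$hits"
```

"tartar" fails because the repeated "tar" would have to include the first letter itself, which group 1 has already consumed.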
For more details see the "regex" info page and the regex(7) man page (type
"man 7 regex"). The "awk" and "egrep" documentation is also worth checking
out.
※ Source: HIT Lilac BBS (哈工大紫丁香) bbs.hit.edu.cn [FROM: diamond.hit.edu.cn]