[wellylug] Procmail recipe problem

Ewen McNeill ewen at naos.co.nz
Tue Jun 17 22:42:37 NZST 2003


In message <1055845256.551.26.camel at spleen>, Stephen Judd writes:
>- I think you might have fallen into the first one. If you come from the
>DOS world, you are used to thinking of * as a wildcard that matches
>anything. In standard Unix regexes, * is a quantifier that says "zero or
>more times". ".*" is what matches anything.

If only it were that simple.

Filename patterns use "*" as a wildcard matching any run of characters,
including none (anywhere in the filename/pathname in Unix/Linux; in the
DOS world it used to have to be at the end -- but that's changed somewhat
more recently).

Regular expressions use "*" to mean "zero or more of the previous thing".

Filename patterns use "?" as a single character wildcard.  Regular
expressions use ".".  Regular expressions use "?" to mean "the previous
thing is optional".
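
To make the difference concrete, here is a minimal sketch in Python (my
choice purely for illustration; the file names are made up), using the
standard fnmatch module for filename patterns and the re module for
regular expressions:

```python
import fnmatch
import re

# Hypothetical file names, just for illustration.
names = ["a.txt", "ab.txt", "abc.txt", "b.txt"]

# Filename patterns: "*" is any run of characters, "?" is exactly one.
glob_star = [n for n in names if fnmatch.fnmatchcase(n, "a*.txt")]
glob_qmark = [n for n in names if fnmatch.fnmatchcase(n, "a?.txt")]

# Regular expressions: ".*" is any run, "." is exactly one character,
# and "?" makes the previous thing (here "b") optional.
re_any = [n for n in names if re.fullmatch(r"a.*\.txt", n)]
re_one = [n for n in names if re.fullmatch(r"a.\.txt", n)]
re_opt = [n for n in names if re.fullmatch(r"ab?\.txt", n)]

# The DOS-habit mistake: as a regex, "a*" means "zero or more a's",
# not "a followed by anything".
re_trap = [n for n in names if re.fullmatch(r"a*\.txt", n)]

print(glob_star, glob_qmark, re_any, re_one, re_opt, re_trap)
```

Note that the last pattern only matches "a.txt", which is the trap the
original poster seems to have hit.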

>- Perl has very much extended the core regex syntax 

IME, _everything_ has its own extended regex syntax.  If you're
particularly lucky some of these extensions happen to be the same as
other extensions you've come across, but sometimes they're gratuitously
different.

"+" (match one or more of the previous thing) is a common extension
in more modern things (egrep-style extended regexes have had it since
well before Perl), but by no means is it everywhere.  Using parentheses
for grouping, or with alternatives, is common, but I don't think that's
everywhere either.  Using parentheses for extracting matched patterns is
common, but in two different syntaxes (some use \( ... \) to extract
matched text and use ( ) as literals, and others use ( ... ) and require
\( etc to match a literal '('; see my comments about things being
gratuitously different.)
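
As a sketch of the second convention, Python's re module (again just my
choice for illustration, with made-up input strings) uses bare ( ... )
to extract text and needs \( and \) to match literal parentheses.  Tools
following the first convention would want the backslashes the other way
around.

```python
import re

# Bare parentheses capture: pull out the two sides of a key=value pair.
pair = re.search(r"(\w+)=(\w+)", "set colour=blue now")
captured = pair.groups()

# Escaped parentheses are literals; the inner bare pair still captures.
paren = re.search(r"\((\d+)\)", "error (42) seen")
whole_match = paren.group(0)   # includes the literal parentheses
inner_digits = paren.group(1)  # just the captured digits

print(captured, whole_match, inner_digits)
```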

>- poor font choices make it easy to accidentally mix up \w and \W, \s and \S.

And these special patterns pretty much only work in Perl anyway.  Some
programs expose POSIX style special patterns (eg, sed, awk, on some
platforms), such as [:digit:] (which goes inside a bracket expression,
ie [[:digit:]]), and yet others do their own thing (and a few don't
even bother at all).

All of those special patterns are just shortcuts, and you can specify
them yourself (at the expense of a bit of additional length) if needed.
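
For example, here is a rough Python sketch (the sample text is made up)
showing the shortcuts next to their spelled-out equivalents.  I'm
assuming ASCII-only matching, hence the re.ASCII flag; without it
Python's \d, \w and \s also match non-ASCII characters, so the
equivalence below would not hold exactly.

```python
import re

text = "order 66, item_7 shipped"

# \d is shorthand for a digit class.
digits_short = re.findall(r"\d+", text, re.ASCII)
digits_long = re.findall(r"[0-9]+", text)

# \w is shorthand for letters, digits and underscore.
words_short = re.findall(r"\w+", text, re.ASCII)
words_long = re.findall(r"[A-Za-z0-9_]+", text)

# \s is shorthand for the usual whitespace characters.
spaces_short = re.findall(r"\s+", text, re.ASCII)
spaces_long = re.findall(r"[ \t\n\r\f\v]+", text)

print(digits_short, words_short, len(spaces_short))
```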

>- "greediness" seems to be the default in most regex engines, but very
>often you want the smallest pattern that matches - learn about the "?"
>flag, which stipulates non-greedy matching.

AFAIK greediness is the default in all regex engines (ie, off the top of
my head I can't think of one that isn't greedy by default; I suspect
that's because a greedy match with backtracking is the cheaper/easier
thing to implement).

The "?" flag as a suffix to mean "don't be greedy" is a Perl-specific
extension AFAIK; where it's not available you have to write a better
(read: more complicated) regex instead, to prevent the greedy match from
working (eg, I often use [^>] when matching SGML/HTML/XML like tags,
to prevent matching outside the tag boundary).
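
A small Python sketch of the three options (Python happens to support
the Perl-style "?" suffix; the HTML snippet is made up):

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" runs to the last ">" on the line, swallowing everything.
greedy = re.findall(r"<.*>", html)

# Non-greedy: the "?" suffix stops at the first ">" that works.
lazy = re.findall(r"<.*?>", html)

# Portable alternative: "[^>]*" simply cannot cross a ">".
negated = re.findall(r"<[^>]*>", html)

print(greedy)
print(lazy)
print(negated)
```

The last two give the same four tags here, and the [^>] version should
work even where the non-greedy suffix isn't available.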

Fortunately, despite all of this, the really core regex syntax is
standard to all regex engines, including ., *, [ ], etc (and ? nearly
everywhere, though basic grep/sed patterns treat it as a literal), but
mainly as a de facto standard.  (POSIX specifies that stuff and more,
and not everything is POSIX compliant.)

Ewen


