Pattern Matching Hyphen-Minus Sign in Bash
I was trying to use the sed command to make some changes to a text and ran into an interesting “problem”: pattern matching the hyphen-minus (-) symbol.
Assume we have the following text:
something
SoMeThiNg
some-thing
soMe_thing
and we want to match all the different versions of the word with one expression (one by one).
My initial idea was to use this regular expression:
's/[a-zA-Z\-\_]*/matched/'
Naturally, I tried to escape the - sign. As you can see from the output, this doesn’t work:
$ sed 's/[a-zA-Z\-\_]*/matched/' test
matched
matched
matched-thing
matched
The minus sign is not matched, because inside square brackets it keeps its special meaning (setting ranges) even when preceded by a backslash. To make the expression work, you need to move the “-” to either the beginning or the end of the bracket expression:
$ sed 's/[a-zA-Z\_-]*/matched/' test
matched
matched
matched
matched
$ sed 's/[-a-zA-Z\_]*/matched/' test
matched
matched
matched
matched
and leave it un-escaped!
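The behaviour above can be reproduced without a separate input file. The following is a minimal sketch, assuming a POSIX shell and a sed that treats a backslash inside square brackets as an ordinary character, which is consistent with the output shown above; the variable names are made up for illustration. It also drops the backslash before the underscore, since the underscore has no special meaning inside brackets.

```shell
# Four-line sample input, matching the words used in the post.
input='something
SoMeThiNg
some-thing
soMe_thing'

# Escaped hyphen in the middle of the brackets: the hyphen is not matched,
# so the third line comes out as "matched-thing".
escaped=$(printf '%s\n' "$input" | sed 's/[a-zA-Z\-\_]*/matched/')

# Unescaped hyphen placed last or first: it is taken literally.
hyphen_last=$(printf '%s\n' "$input" | sed 's/[a-zA-Z_-]*/matched/')
hyphen_first=$(printf '%s\n' "$input" | sed 's/[-a-zA-Z_]*/matched/')

printf '%s\n' "--- escaped ---" "$escaped"
printf '%s\n' "--- hyphen last ---" "$hyphen_last"
printf '%s\n' "--- hyphen first ---" "$hyphen_first"
```

One side effect worth knowing: in the working expressions from the post, the \_ adds a literal backslash to the set as well, so those patterns would also match a backslash in the input.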
That must be a bug in sed. Leaving the hyphen unescaped at the beginning or end of the square brackets should be optional. In Perl you can escape the hyphen wherever it appears:
$ perl -ne 's/[a-zA-Z\-\_]*/matched/;print' test
$ perl -ne 's/[a-zA-Z\_\-]*/matched/;print' test
$ perl -ne 's/[\-a-zA-Z\_]*/matched/;print' test
$ perl -ne 's/[-a-zA-Z\_]*/matched/;print' test
$ perl -ne 's/[a-zA-Z\_-]*/matched/;print' test
All of the above printed:
matched
matched
matched
matched
$ perl -v | head -2 | tail -1
This is perl 5, version 12, subversion 3 (v5.12.3) built for i686-linux-thread-multi
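This is a good point. grep has the same issue as sed, though.
What’s actually happening is that the regexp tries to match a range that starts at the character ‘\’ and ends at the next ‘\’. Inside a bracket expression a backslash is a literal character rather than an escape, so the hyphen between the two backslashes is read as a range operator, and the ‘_’ that follows is just another member of the set.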
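A quick way to check which characters a bracket expression really contains is to count matching lines with grep. This is a small sketch, assuming a POSIX grep in which a backslash inside square brackets is an ordinary set member; the variable names are illustrative only.

```shell
# grep -c prints the number of matching lines. It exits non-zero when the
# count is 0, so "|| true" keeps the script going under "set -e".
hyphen_hit=$(printf '%s\n' 'some-thing' | grep -c '[\-\_]' || true)
backslash_hit=$(printf '%s\n' 'some\thing' | grep -c '[\-\_]' || true)
underscore_hit=$(printf '%s\n' 'some_thing' | grep -c '[\-\_]' || true)

# With the assumption above, this prints:
# hyphen: 0, backslash: 1, underscore: 1
echo "hyphen: $hyphen_hit, backslash: $backslash_hit, underscore: $underscore_hit"
```

The set matches a backslash and an underscore but not a hyphen, which is why the escaped form fails on some-thing.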
I ran into this bug in GNU sed today when trying to pattern match the hyphen in my Markdown-to-HTML blogging script. I had to put it at the beginning of the bracket expression for it to work! Thanks for the explanation here; it was driving me crazy.