Opinionated Software

… we have opinions about everything!

What’s in a word? (\w regexp shorthand class) —

Well not just letters of the alphabet it seems.

Take the case of the logstash pattern WORD:

WORD \b\w+\b

but the shorthand character class \w matches [a-zA-Z0-9_] – notice the digits and underscore! So WORD is not really a WORD!

REALWORD \b[a-zA-Z]+\b

would be better … although I suppose things might be different in Unicode. But generally log files may be Unicode but frequently the data itself is still effectively ASCII.

Categorised as: Hints and Tips | Logstash | Regular Expressions | Ruby

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.