Matching Enumerated Types Using Regular Expressions

Regular expressions are a very useful tool. Among the uses I’ve found for them are validating user input, performing simple HTML manipulation (although in general this is a bad idea — one should prefer a real HTML parser), and parsing textual data in custom formats from numerous sources.

Naturally, regular expressions have downsides as well. They are virtually a write-only language (although Perl’s x flag combined with copious comments largely alleviates this), some regular expressions have ghastly performance characteristics, and learning their syntax takes quite a bit of time. Far too many developers seem to be unaware of their existence, and different regular expression implementations have different features. One also needs to get intimately familiar with the escaping rules for both regular expressions and the programming language (e.g., to create a regular expression which matches a single backslash character in C/C++, one needs to write “\\\\”).
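
The same double escaping shows up in C#. As a small sketch (the EscapingDemo class and its member names are made up for illustration), a regular string literal needs four backslashes to match one, while a verbatim string literal needs only two:

```csharp
using System.Text.RegularExpressions;

static class EscapingDemo
{
    // Regular string literal: the compiler turns "\\\\" into the two
    // characters \\, which the regex engine reads as one escaped backslash.
    public const string Escaped = "\\\\";

    // Verbatim string literal: the compiler leaves backslashes alone,
    // so the regex engine sees exactly what is written here.
    public const string Verbatim = @"\\";

    // Returns true if the input contains at least one backslash character.
    public static bool ContainsBackslash(string input)
    {
        return Regex.IsMatch(input, Verbatim);
    }
}
```

Both constants hold the same two-character pattern, so either spelling can be passed to Regex; the verbatim form is simply easier to read.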

One task I often need to perform is to create a regular expression which matches any one of a number of values, e.g., matching an enumerated type [1]. Consider creating a regular expression which matches any two-letter U.S. state code. Most people will write something like (greatly simplified):

regex = "(AK|AL|AR|AZ|...|WA|WI|WV|WY)"

This will work fine, but as ( ) defines a capturing group I prefer to use the non-capturing (?: ) unless otherwise required:

regex = "(?:AK|AL|AR|AZ|...|WA|WI|WV|WY)"
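
To make the difference between the two forms concrete, here is a minimal sketch using .NET's Regex class. GroupDemo and CountGroups are hypothetical names, and the two-code patterns are shortened for illustration:

```csharp
using System.Text.RegularExpressions;

static class GroupDemo
{
    // Returns the number of groups on a successful match. Group 0 is always
    // the whole match, so a pattern with no capturing groups reports 1.
    public static int CountGroups(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Groups.Count : 0;
    }
}
```

Calling CountGroups("(AK|AL)", "AL") reports two groups (the whole match plus the captured state code), while CountGroups("(?:AK|AL)", "AL") reports only one. Both patterns match the same text; the non-capturing form just avoids recording a group that nobody asked for.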

Furthermore, since I don’t know the rules for operator precedence in regular expressions very well, I prefer to encase each allowed value in its own non-capturing group. This will also allow me to use any regular expression as an allowed value, even those which include | characters:

regex = "(?:(?:AK)|(?:AL)|(?:AR)|(?:AZ)|...|(?:WA)|(?:WI)|(?:WV)|(?:WY))"
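
One case where the per-value groups appear to matter in .NET is an allowed value that carries its own inline option. The sketch below is an illustration under that assumption (ScopeDemo and its patterns are invented for this example): without its own group, the (?i) in the first value stays in effect for the second value too.

```csharp
using System.Text.RegularExpressions;

static class ScopeDemo
{
    // Values joined without per-value groups: the inline (?i) in the first
    // value remains active until the enclosing group ends, so it also
    // affects "AL".
    public const string Unwrapped = "^(?:(?i)ak|AL)$";

    // Each value in its own non-capturing group: (?i) stops applying when
    // the group that contains it closes.
    public const string Wrapped = "^(?:(?:(?i)ak)|(?:AL))$";

    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }
}
```

With these patterns, Matches(ScopeDemo.Unwrapped, "al") is true even though only the first value was meant to be case-insensitive, while Matches(ScopeDemo.Wrapped, "al") is false.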

One can easily write a function to perform this enumerated type regular expression generation. Here’s one implementation in C#:

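using System.Diagnostics;
using System.Text;

class RegexUtils
{
    // Builds a non-capturing group which matches any one of the given
    // regular expressions.
    public static string CreateEnumeration(string[] regexs)
    {
        Debug.Assert(regexs != null);
        Debug.Assert(regexs.Length >= 2);

        StringBuilder sb = new StringBuilder();
        sb.Append("(?:");

        foreach (string regex in regexs)
        {
            // Wrap each allowed value in its own non-capturing group.
            sb.Append("(?:");
            sb.Append(regex);
            sb.Append(")|");
        }

        // Drop the trailing '|' left behind by the last iteration.
        sb.Remove(sb.Length - 1, 1);
        sb.Append(")");
        return sb.ToString();
    }
}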

The function is used as follows:

string[] stateCodeRegexs = new string[] { "AK", "AL", "AR", "AZ", ..., "WA", "WI", "WV", "WY" };
string anyStateCodeRegex = RegexUtils.CreateEnumeration(stateCodeRegexs);
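
The generated string is just a pattern, so it still has to be handed to a Regex to do anything. As a rough sketch, assuming a shortened list of only four codes (StateCodeValidator is a hypothetical name, and the constant below is what CreateEnumeration would return for that list):

```csharp
using System.Text.RegularExpressions;

static class StateCodeValidator
{
    // What CreateEnumeration would return for the shortened, illustrative
    // list { "AK", "AL", "AR", "AZ" }.
    private const string AnyStateCode = "(?:(?:AK)|(?:AL)|(?:AR)|(?:AZ))";

    // \A and \z anchor the pattern to the start and end of the input, so
    // the whole string must be a state code rather than merely contain one.
    private static readonly Regex WholeInput =
        new Regex(@"\A" + AnyStateCode + @"\z");

    public static bool IsStateCode(string input)
    {
        return WholeInput.IsMatch(input);
    }
}
```

Because the enumeration is already wrapped in its own group, the anchors can be concatenated on either side without changing what the alternation means.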

Please note that the contents of stateCodeRegexs (AK, AL, etc.) are themselves regular expressions and not simple character strings. This means that one can use the full set of regular expression features, but one must also beware of escaping issues.
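
When the allowed values really are plain text, one way to sidestep the escaping issues is to run each value through .NET's Regex.Escape before combining them. The helper below is a hypothetical variant written for this illustration, not part of the RegexUtils class above:

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class LiteralEnumeration
{
    // Treats every value as literal text: Regex.Escape neutralizes
    // metacharacters such as '+' and '.' before the values are joined
    // into a non-capturing alternation.
    public static string Create(string[] literals)
    {
        var alternatives = literals.Select(l => "(?:" + Regex.Escape(l) + ")");
        return "(?:" + string.Join("|", alternatives) + ")";
    }
}
```

For the values "C++" and "a.b" this produces (?:(?:C\+\+)|(?:a\.b)), which matches those two strings literally and no longer treats the dot as "any character". The trade-off is that callers give up the ability to pass real regular expressions as values.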

In general, one must be very careful when combining regular expressions together. Typically, copious use of non-capturing groups is required in order to ensure correct behavior; blind string concatenation is just asking for bugs.
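
A small sketch of the kind of bug blind concatenation invites (ConcatenationDemo and the two-code alternation are made up for illustration): because | binds more loosely than anything else, anchors glued onto an ungrouped alternation end up attached to only the first and last alternatives.

```csharp
using System.Text.RegularExpressions;

static class ConcatenationDemo
{
    private const string States = "AK|AL";

    // Blind concatenation: the engine reads this as (^AK) or (AL$),
    // so each anchor applies to only one alternative.
    public const string Naive = "^" + States + "$";

    // Grouped: both anchors apply to the alternation as a whole.
    public const string Grouped = "^(?:" + States + ")$";

    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }
}
```

With the naive pattern, "AKRON" and "PORTAL" both match, since one starts with AK and the other ends with AL. The grouped pattern accepts only "AK" and "AL".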

[1] For single characters one can use the [ ] construct, but that doesn’t work for more complicated enumerated types.