Table of Contents for
Regular Expressions Cookbook, 2nd Edition

Regular Expressions Cookbook, 2nd Edition, by Steven Levithan. Published by O'Reilly Media, Inc., 2012.

Discussion

Extracting the folder from a Windows path is a bit tricky if we want to support UNC paths, because we can’t just grab the part of the path between backslashes. If we did, we’d be grabbing the server and share from UNC paths too.

The first part of the regex, ‹^([a-z]:|\\\\[a-z0-9_.$●-]+\\[a-z0-9_.$●-]+)?›, skips over the drive letter or the network server and network share names at the start of the path. This piece of the regex consists of a capturing group with two alternatives. The first alternative matches the drive letter, as in Recipe 8.20, and the second alternative matches the server and share in UNC paths, as in Recipe 8.21. Recipe 2.8 explains the alternation operator.
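
As a rough illustration of this first part on its own, the following Python sketch applies just the prefix group to a few paths. It assumes that ‹●› stands for a literal space inside the character classes and that the regex is applied case-insensitively. The name `prefix_of` and the sample paths are made up for the example.

```python
import re

# Assumptions: "●" in the printed regex is a literal space, and matching is
# case-insensitive. The trailing "?" makes the whole group optional.
PREFIX = re.compile(
    r'^([a-z]:|\\\\[a-z0-9_.$ -]+\\[a-z0-9_.$ -]+)?',
    re.IGNORECASE,
)

def prefix_of(path):
    # Hypothetical helper. group(1) is None when the optional group
    # did not take part in the match.
    return PREFIX.match(path).group(1)

print(prefix_of('c:\\folder\\file.txt'))                 # c:
print(prefix_of('\\\\server\\share\\folder\\file.txt'))  # \\server\share
print(prefix_of('folder\\file.txt'))                     # None
```

Because everything in this piece is optional, it always matches. For a relative path it simply matches nothing, and the group is left unset.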

The question mark after the group makes it optional. This allows us to support relative paths, which don’t have a drive letter or network share.

The folders are easily matched with ‹(?:[^\\/:*?"<>|\r\n]+\\)+›. The character class matches a folder name. The noncapturing group matches a folder name followed by a literal backslash that delimits the folders from each other and from the filename. We repeat this group one or more times. This means our regular expression will match only those paths that actually specify a folder. Paths that specify only a filename, drive, or network share won’t be matched.
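
To see the folder part in isolation, here is a minimal Python sketch that tries the noncapturing group at the start of a few strings. The helper name `leading_folders` and the test strings are illustrative only.

```python
import re

# One or more "name\" units. The negated class excludes the characters that
# are not allowed in Windows file and folder names, plus line breaks.
FOLDERS = re.compile(r'(?:[^\\/:*?"<>|\r\n]+\\)+')

def leading_folders(text):
    # Hypothetical helper: the run of folders at the start of text, or None.
    m = FOLDERS.match(text)
    return m.group(0) if m else None

print(leading_folders('folder\\subfolder\\file.txt'))  # folder\subfolder\
print(leading_folders('file.txt'))                     # None
print(leading_folders('c:'))                           # None
```

The filename is left out of the match because it is not followed by a backslash, and a bare filename or drive letter gives no match at all because the group has to succeed at least once.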

If the path begins with a drive letter or network share, that must be followed by a backslash. A relative path may or may not begin with a backslash. Thus, we need to add an optional backslash to the start of the group that matches the folder part of the path. Since we will only use our regex on paths known to be valid, we don’t have to be strict about requiring the backslash in case of a drive letter or network share. We only have to allow for it.

Because we require the regex to match at least one folder, we have to make sure that our regex doesn’t match e\ as the folder in \\server\share\. That’s why we use ‹(\\|^)› rather than ‹\\?› to add the optional backslash at the start of the capturing group for the folder.

If you’re wondering why \\server\shar might be matched as the drive and e\ as the folder, review Recipe 2.13. Regular expression engines backtrack. Imagine this regex:

^([a-z]:|\\\\[a-z0-9_.$●-]+\\[a-z0-9_.$●-]+)?↵
(\\?(?:[^\\/:*?"<>|\r\n]+\\)+)

This regex, just like the regex in the solution, requires at least one nonbackslash character and one backslash for the folder part of the path. If the regex has matched \\server\share for the drive in \\server\share\ and then fails to match the folder group, it doesn’t just give up; it tries different permutations of the regex.

In this case, the engine has remembered that the character class ‹[a-z0-9_.$●-]+›, which matches the network share, doesn’t have to match all available characters. One character is enough to satisfy the ‹+›. The engine backtracks by forcing the character class to give up one character, and then it tries to continue.

When the engine continues, it has two remaining characters in the subject string to match the folder: e\. These two characters are enough to satisfy ‹(?:[^\\/:*?"<>|\r\n]+\\)+›, and we have an overall match for the regex. But it’s not the match we wanted.
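
The unwanted match is easy to reproduce. The Python sketch below runs the ‹\\?› variant against \\server\share\ under a few assumptions: ‹●› is a literal space, matching is case-insensitive, and the second group is written with balanced parentheses. The name `NAIVE` is ours.

```python
import re

# The variant that uses \\? for the optional backslash.
NAIVE = re.compile(
    r'^([a-z]:|\\\\[a-z0-9_.$ -]+\\[a-z0-9_.$ -]+)?'
    r'(\\?(?:[^\\/:*?"<>|\r\n]+\\)+)',
    re.IGNORECASE,
)

m = NAIVE.match('\\\\server\\share\\')  # the path \\server\share\
print(m.group(1))  # \\server\shar
print(m.group(2))  # e\
```

The share name has been cut short by one character so that e\ can serve as the folder.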

Using ‹(\\|^)› instead of ‹\\?› solves this. It still allows for an optional backslash, but when the backslash is missing, it requires the folder to begin at the start of the string. This means that if a drive has been matched, and thus the regex engine has proceeded beyond the start of the string, the backslash is required.

The regex engine will still try to backtrack if it can’t match any folders, but it will do so in vain because ‹(\\|^)› will fail to match. The regex engine will backtrack until it is back at the start of the string. The capturing group for the drive letter and network share is optional, so the regex engine is welcome to try to match the folder at the start of the string. Although ‹(\\|^)› will match there, the rest of the regex will not, because ‹(?:[^\\/:*?"<>|\r\n]+\\)+› does not allow the colon that follows the drive letter or the double backslash of the network share.
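
A quick check of this behavior, again as a sketch rather than a definitive implementation: the regex below is assembled from the pieces discussed here, with ‹●› taken to be a literal space, case-insensitive matching, and the alternation written as a noncapturing group so that the folder stays in group 2. The variable names are illustrative.

```python
import re

# Assumed full regex: prefix group, then (?:\\|^), then the folder group.
FIXED = re.compile(
    r'^([a-z]:|\\\\[a-z0-9_.$ -]+\\[a-z0-9_.$ -]+)?'
    r'((?:\\|^)(?:[^\\/:*?"<>|\r\n]+\\)+)',
    re.IGNORECASE,
)

share_only = FIXED.match('\\\\server\\share\\')           # \\server\share\
drive_only = FIXED.match('c:\\')                          # c:\
with_folder = FIXED.match('\\\\server\\share\\folder\\')  # \\server\share\folder\

print(share_only)            # None
print(drive_only)            # None
print(with_folder.group(2))  # \folder\
```

No amount of backtracking produces a folder for the first two paths, while a path that really has a folder still matches.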

If you’re wondering why we don’t use this technique in Recipes 8.18 and 8.19, that’s because those regular expressions don’t require a folder. Since everything after the part that matches the drive in those regexes is optional, the regex engine never does any backtracking. Of course, making things optional can lead to different problems, as discussed in Recipe 8.19.

When this regular expression finds a match, the first capturing group will hold the drive letter or network share, and the second capturing group will hold the folder. The first capturing group will be empty in case of a relative path. The second capturing group will always contain at least one folder. If you use this regex on a path that doesn’t specify a folder, the regex won’t find a match at all.
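
Reading the two groups back in Python might look like the following sketch. The regex is assembled from the pieces discussed above, assuming ‹●› is a literal space and case-insensitive matching. `split_folder` is a hypothetical helper, not part of any library.

```python
import re

FOLDER_RE = re.compile(
    r'^([a-z]:|\\\\[a-z0-9_.$ -]+\\[a-z0-9_.$ -]+)?'
    r'((?:\\|^)(?:[^\\/:*?"<>|\r\n]+\\)+)',
    re.IGNORECASE,
)

def split_folder(path):
    # Return (drive_or_share, folder), or None when path has no folder.
    m = FOLDER_RE.match(path)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(split_folder('c:\\folder\\sub\\file.txt'))  # ('c:', '\\folder\\sub\\')
print(split_folder('folder\\file.txt'))           # (None, 'folder\\')
print(split_folder('file.txt'))                   # None
```

Note that Python reports a group that did not take part in the match as None rather than as an empty string, so the relative path yields None for the first group.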

8.22. Extract the Folder from a Windows Path

Problem

Solution

Discussion

See Also
