Python: Fix a bug in glob conversion

If you have a filter like `**/foo/**` in the `paths-ignore` section of
your config file, the following currently happens:

- First, the CodeQL CLI observes that this string ends in `/**` and
  strips off the `**`, leaving `**/foo/`.
- Then the Python extractor strips off leading and trailing `/`
  characters and proceeds to convert `**/foo` into a regex that is
  matched against files to (potentially) extract.
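
The two stripping steps can be sketched in a few lines of Python. This is
only an illustration: `cli_strip_trailing_globstar` is a hypothetical
stand-in for the CLI's behaviour, and `extractor_normalize` simply repeats
the `glob.strip().strip("/")` call from `glob_to_regex`. The point is that
a glob with a trailing slash and one without end up as the same string.

```python
def cli_strip_trailing_globstar(pattern):
    # Hypothetical model of the CLI step: drop a trailing `**` that follows a `/`.
    return pattern[:-2] if pattern.endswith("/**") else pattern


def extractor_normalize(glob):
    # Same normalization as in the extractor's glob_to_regex.
    return glob.strip().strip("/")


after_cli = cli_strip_trailing_globstar("**/foo/**")
after_extractor = extractor_normalize(after_cli)

# The trailing slash is gone, so this is indistinguishable from a plain `**/foo`.
same_as_plain = after_extractor == extractor_normalize("**/foo")
```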

The trouble with this is that it leaves us unable to distinguish
between, say, a file `foo.py` and a file `foo/bar.py`. In other words,
we have lost the ability to exclude only the _folder_ `foo` and not any
files that happen to start with `foo`.

To fix this, we instead make a note of whether the glob ends in a
forward slash or not, and adjust the regex correspondingly.
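
As a rough sketch of the idea (not the extractor's actual code),
the hypothetical `toy_glob_to_regex` below records the trailing slash
before stripping it and picks the tail of the regex accordingly. It
assumes `/` as the separator, supports only literal path parts and `**`,
and assumes that a trailing slash means "directory only"; all of these are
simplifications made for illustration.

```python
import re


def toy_glob_to_regex(glob):
    """Hypothetical, simplified converter: literal parts and `**` only."""
    # Remember the trailing slash *before* it is stripped away.
    dir_only = glob.endswith("/")
    parts = glob.strip().strip("/").split("/")
    regex_parts = [".*" if p == "**" else re.escape(p) for p in parts]
    body = "/".join(regex_parts)
    # Assumed semantics: a directory-only glob must be followed by a
    # separator; otherwise the path may also end right here.
    tail = "/.*" if dir_only else "(?:/.*|$)"
    return re.compile(body + tail)


folder_only = toy_glob_to_regex("**/foo/")
file_or_folder = toy_glob_to_regex("**/foo")
```

With these assumptions, `folder_only` matches `/repo/foo/bar.py` but
neither `/repo/foo.py` nor a plain file `/repo/foo`, while
`file_or_folder` also matches `/repo/foo`.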

Author: Taus
Date: 2025-05-15 14:48:06 +00:00
Parent: 2ded42c285
Commit: 61719cf448

@@ -41,6 +41,9 @@ def glob_part_to_regex(glob, add_sep):
 
 def glob_to_regex(glob, prefix=""):
     '''Convert entire glob to a compiled regex'''
+    # When the glob ends in `/`, we need to remember this so that we don't accidentally add an
+    # extra separator to the final regex.
+    end_sep = "" if glob.endswith("/") else SEP
     glob = glob.strip().strip("/")
     parts = glob.split("/")
     #Trailing '**' is redundant, so strip it off.
@@ -53,7 +56,7 @@ def glob_to_regex(glob, prefix=""):
     # something like `C:\\folder\\subfolder\\` and without escaping the
     # backslash-path-separators will get interpreted as regex escapes (which might be
     # invalid sequences, causing the extractor to crash)
-    full_pattern = escape(prefix) + ''.join(parts) + "(?:" + SEP + ".*|$)"
+    full_pattern = escape(prefix) + ''.join(parts) + "(?:" + end_sep + ".*|$)"
     return re.compile(full_pattern)
 
 def filter_from_pattern(pattern, prev_filter, prefix):