35 Commits

Author SHA1 Message Date
Taus
f55ff96674 Python: Bump extractor version and add change note 2025-11-27 13:52:37 +00:00
Nora Dimitrijević
c749607db8 Bump python extractor version to 7.1.5 2025-10-07 11:22:16 +02:00
Nora Dimitrijević
1a9683f986 Add @top database type 2025-10-06 11:47:14 +02:00
Nora Dimitrijević
6f208e9dec Write overlay metadata at end of extraction. 2025-10-06 11:47:12 +02:00
Nora Dimitrijević
49b18db044 Python extractor: in overlay mode, traverse only changed files
- fall back to full extraction on overlay changes json read error
- we filter both root modules and (transitive) imports against the overlay-changes json.
2025-10-06 11:47:09 +02:00
Nora Dimitrijević
e0cf719cb9 Path transformer: handle Windows-style paths
And don't add slash to start of path patterns on Windows.
2025-10-06 11:37:04 +02:00
Nora Dimitrijević
29b1a7403b Support CODEQL_PATH_TRANSFORMER env var in python path renamer
The new name is required by overlay support.
2025-10-06 11:37:02 +02:00
Nora Dimitrijević
a88d3397cd Add overlay builtins to python dbscheme 2025-10-06 11:36:56 +02:00
Taus
f6732a927b Python: Bump extractor version 2025-09-03 11:56:54 +00:00
Taus
2158eaa34c Python: Fix a bug in glob regex creation
The previous version was tested on a version of the code where we had
temporarily removed the `glob.strip("/")` bit, and so the bug didn't
trigger then.

We now correctly remember if the glob ends in `/`, and add an extra part
in that case. This way, if the path ends with multiple slashes, they
effectively get consolidated into a single one, which results in the
correct semantics.
2025-05-15 15:34:11 +00:00
Taus
c8cca126a1 Python: Bump extractor version 2025-05-15 14:59:33 +00:00
Taus
98388be25c Python: Remove special casing of hidden files
If it is necessary to exclude hidden files, then adding
```
paths-ignore: ['**/.*/**']
```
to the relevant config file is recommended instead.
2025-05-15 14:49:17 +00:00
Taus
61719cf448 Python: Fix a bug in glob conversion
If you have a filter like `**/foo/**` set in the `paths-ignore` bit of
your config file, then currently the following happens:

- First, the CodeQL CLI observes that this string ends in `/**` and
  strips off the `**` leaving `**/foo/`
- Then the Python extractor strips off leading and trailing `/`
  characters and proceeds to convert `**/foo` into a regex that is
  matched against files to (potentially) extract.

The trouble with this is that it leaves us unable to distinguish
between, say, a file `foo.py` and a file `foo/bar.py`. In other words,
we have lost the ability to exclude only the _folder_ `foo` and not any
files that happen to start with `foo`.

To fix this, we instead make a note of whether the glob ends in a
forward slash or not, and adjust the regex correspondingly.
2025-05-15 14:48:06 +00:00
Taus
0c1b379ac1 Python: Extract files in hidden dirs by default
Changes the default behaviour of the Python extractor so files inside
hidden directories are extracted by default.

Also adds an extractor option, `skip_hidden_directories`, which can be
set to `true` in order to revert to the old behaviour.

Finally, I made the logic surrounding what is logged in various cases a
bit more obvious.

Technically this changes the behaviour of the extractor (in that hidden
excluded files will now be logged as `(excluded)`, but I think this
makes more sense anyway.
2025-05-02 12:44:05 +00:00
Taus
918c05c538 Python: Don't prune any MatchLiteralPatterns
Extends the mechanism introduced in
https://github.com/github/codeql/pull/18030
to behave the same for _all_ `MatchLiteralPattern`s, not just the ones
that happen to be the constant `True` or `False`.

Co-authored-by: yoff <yoff@github.com>
2025-02-11 12:58:52 +00:00
Taus
131ec8d22f Python: Handle loop constructs outside of loops
Observed on some test files in Nuitka/Nuitka, having `break` and
`continue` outside of loops in Python is (to Python) a syntax error, but
our parser happily accepted this broken syntax.

This then caused issues further downstream in the control-flow
construction, as it broke some invariants.

To fix this we now skip the code that would previously fail when the
invariants are broken.

Co-authored-by: yoff <yoff@github.com>
2025-02-06 14:30:16 +00:00
Taus
60d97e0e16 Python: Print file path when logging context errors
This makes it _much_ easier to find the offending bit of syntax.
2025-02-05 13:13:39 +00:00
Taus
d779ae5c3e Python: Add change note for CFG pruning fix
... And also bump the extractor version.
2024-11-26 15:39:15 +00:00
Taus
a4ccda5fe3 Python: Fix pruning of literals in match pattern
Co-authored-by: yoff <lerchedahl@gmail.com>
2024-11-19 13:48:13 +00:00
Taus
2ef3ae9860 Python: Improve parser logging/timing/customisability
Does a bunch of things, unfortunately all in the same place, so my
apologies in advance for a slightly complicated commit.

As for the changes themselves, this commit

- Adds timers for the old and new parsers. This means we get the overall
time spent on these parts of the extractor if the extractor is run with
`DEBUG` output shown.
- Adds logging information (at the `DEBUG` level) to show which
invocations of the parsers happen when, and whether they succeed or not.
- Adds support for using an environment variable named
`CODEQL_PYTHON_DISABLE_OLD_PARSER` to disable using the old parser
entirely. This makes it easier to test the new parser in isolation.
- Fixes a bug where we did not check whether a parse with the new parser
had already succeeded, and so would do a superfluous second parse.
2024-10-30 13:58:46 +00:00
Taus
f75615b913 Merge pull request #17822 from github/tausbn/python-more-parser-fixes
Python: A few more parser fixes
2024-10-30 13:47:10 +01:00
Taus
fcec8e0256 Python: Fail tests when errors/warnings are logged
This is primarily useful for ensuring that errors where a node does not
have an appropriate context set in `python.tsg` actually have an effect
on the pass/fail status of the parser tests. Previously, these would
just be logged to stdout, but test could still succeed when there were
errors present.

Also fixes one of the logging lines in `tsg_parser.py` to be more
consistent with the others.
2024-10-22 15:11:51 +00:00
Taus
cc39ae57dc Python: Fix dataset check error for string encoding
Here's an example of one of these errors:
```
INVALID_KEY predicate py_cobjectnames(@py_cobject obj, string name)

The key set {obj} does not functionally determine all fields. Here is a
pair of tuples that agree on the key set but differ at index 1: Tuple 1
in row 63874: (72088,"u'<X>'") Tuple 2 in row 63875: (72088,"u'<?>'")
```
(Here, the substring `X` should really be the Unicode character U+FFFD,
but for some reason I'm not allowed to put that in this commit message.)

Inside the extractor, we assign IDs based on the string type (bytestring
or Unicode) and a hash of the UTF-8 encoded content of the string. In
this case, however, certain _different_ strings were receiving the same
hash, due to replacement characters in the encoding process.

In particular, we were converting unencodable characters to question
marks in one place, and to U+FFFD in another place. This caused a
discrepancy that lead to the dataset check error.

To fix this, we put in a custom error handler that always puts the
U+FFFD character in place of unencodable characters. With this, the
strings now agree, and hence there is no clash.
2024-10-21 15:31:16 +00:00
Taus
417e60a466 Python: Update extractor version 2024-10-15 11:22:54 +00:00
Taus
2af0d78435 Python: Add default field to the relevant AST nodes 2024-10-15 11:22:32 +00:00
Rasmus Lerchedahl Petersen
e2eb08b543 Python: improve messaging 2024-10-11 15:36:44 +02:00
Rasmus Lerchedahl Petersen
22588c9f85 Python: update ectractor version 2024-10-11 15:36:44 +02:00
Rasmus Lerchedahl Petersen
4a291147e0 Python: only look for the py2 stdlib if we extract std lib 2024-10-11 15:36:44 +02:00
Rasmus Lerchedahl Petersen
e91efaa92e python: do not extract stdlib by default 2024-10-11 15:36:44 +02:00
Rasmus Wriedt Larsen
354394d4c2 Python: Don't use fake locations in diagnostics
Some of the internal tooling would not be too happy about this :D
2024-07-12 13:36:41 +02:00
Rasmus Wriedt Larsen
60d1dc8af8 Python: Bump extractor version 2024-07-09 14:15:52 +02:00
Rasmus Wriedt Larsen
6b3625e24e Python: Handle diagnostics writing for BuiltinModuleExtractable 2024-07-09 14:15:52 +02:00
Rasmus Wriedt Larsen
c1da2c1d2f Python: Gracefully handle exceptions in diagnostics writing 2024-07-09 14:15:51 +02:00
Rasmus Wriedt Larsen
a8b976b389 Python: Always log errors before writing diagnostics
So we have the info in the logs if the diagnostics processing fails
2024-07-09 13:47:53 +02:00
Taus
6dec323cfc Python: Copy Python extractor to codeql repo 2024-03-07 13:59:16 +00:00