Commit Graph

13 Commits

Author SHA1 Message Date
Taus
cc39ae57dc Python: Fix dataset check error for string encoding
Here's an example of one of these errors:
```
INVALID_KEY predicate py_cobjectnames(@py_cobject obj, string name)

The key set {obj} does not functionally determine all fields. Here is a
pair of tuples that agree on the key set but differ at index 1: Tuple 1
in row 63874: (72088,"u'<X>'") Tuple 2 in row 63875: (72088,"u'<?>'")
```
(Here, the substring `X` should really be the Unicode character U+FFFD,
but for some reason I'm not allowed to put that in this commit message.)

Inside the extractor, we assign IDs based on the string type (bytestring
or Unicode) and a hash of the UTF-8 encoded content of the string. In
this case, however, certain _different_ strings were receiving the same
hash, due to replacement characters in the encoding process.

In particular, we were converting unencodable characters to question
marks in one place, and to U+FFFD in another place. This caused a
discrepancy that lead to the dataset check error.

To fix this, we put in a custom error handler that always puts the
U+FFFD character in place of unencodable characters. With this, the
strings now agree, and hence there is no clash.
2024-10-21 15:31:16 +00:00
Taus
417e60a466 Python: Update extractor version 2024-10-15 11:22:54 +00:00
Taus
2af0d78435 Python: Add default field to the relevant AST nodes 2024-10-15 11:22:32 +00:00
Rasmus Lerchedahl Petersen
e2eb08b543 Python: improve messaging 2024-10-11 15:36:44 +02:00
Rasmus Lerchedahl Petersen
22588c9f85 Python: update ectractor version 2024-10-11 15:36:44 +02:00
Rasmus Lerchedahl Petersen
4a291147e0 Python: only look for the py2 stdlib if we extract std lib 2024-10-11 15:36:44 +02:00
Rasmus Lerchedahl Petersen
e91efaa92e python: do not extract stdlib by default 2024-10-11 15:36:44 +02:00
Rasmus Wriedt Larsen
354394d4c2 Python: Don't use fake locations in diagnostics
Some of the internal tooling would not be too happy about this :D
2024-07-12 13:36:41 +02:00
Rasmus Wriedt Larsen
60d1dc8af8 Python: Bump extractor version 2024-07-09 14:15:52 +02:00
Rasmus Wriedt Larsen
6b3625e24e Python: Handle diagnostics writing for BuiltinModuleExtractable 2024-07-09 14:15:52 +02:00
Rasmus Wriedt Larsen
c1da2c1d2f Python: Gracefully handle exceptions in diagnostics writing 2024-07-09 14:15:51 +02:00
Rasmus Wriedt Larsen
a8b976b389 Python: Always log errors before writing diagnostics
So we have the info in the logs if the diagnostics processing fails
2024-07-09 13:47:53 +02:00
Taus
6dec323cfc Python: Copy Python extractor to codeql repo 2024-03-07 13:59:16 +00:00