tiferet
ced7a33419
Add a negative characteristic that indicates that an endpoint was manually modeled as a neutral model.
2023-03-14 12:49:31 -07:00
tiferet
084f0ee57a
Add an endpoint filter that indicates that an endpoint is not the "to" node of any known taint step. Such a node cannot be tainted, because taint can't flow into it.
2023-03-14 12:49:31 -07:00
tiferet
b48c6badba
Add an endpoint filter to filter out MaD-modeled taint steps.
...
This filter currently has some overlap with `CreatePathSinkCharacteristic`. We add a flag to `erroneousEndpoints` such that these known modeling errors can optionally be ignored.
We turn the flag off when extracting prompt examples, to ensure the prompt contains only examples we're highly certain about.
If there are errors even with this flag turned on, the query that extracts positive examples returns an error message, so that we don't accidentally run it when there's a Codex-generated data extension file in `java/ql/lib/ext`.
2023-03-14 12:49:31 -07:00
tiferet
25f103a010
Cleanup of EndpointCharacteristics, to get rid of historical naming such as "endpoint filters" and of classes that are used nowhere.
2023-03-14 12:49:31 -07:00
tiferet
d47007a930
Rename AtmConfig to AtmConfigs and fix some imports.
2023-03-14 12:49:31 -07:00
tiferet
e9da1f3751
Rename isEffectiveSink to isSinkCandidate
2023-03-14 12:49:30 -07:00
tiferet
dbb4fa0b1c
Replace EndpointType with either SinkType or SourceType wherever possible.
2023-03-14 12:49:30 -07:00
tiferet
f5833ffc3d
Simplify AtmConfig:
...
- We no longer create new configs for each query we want to boost with ATM.
- Instead, the `AtmConfig` module imports the configs for the Java queries where it can, and copies the configs for the ones that are defined in a ql file.
- The predicates that used to be defined in the `AtmConfig` class are now defined either in the candidate extraction query or (in the case of `isKnownSink`, which is used in more than one file) in `EndpointCharacteristic.qll`.
- Delete all the derived classes of AtmConfig.
- Surface all candidates that pass the endpoint filters, regardless of flow from a source.
2023-03-14 12:49:30 -07:00
tiferet
efb6522656
EndpointType.getKind is final and just returns this. The name of the endpoint type is its MaD kind. Human-readable descriptions of these kinds are encoded only in Python, not in CodeQL.
2023-03-14 12:49:30 -07:00
tiferet
1d5afaec0e
Get rid of EndpointType.getDescription
2023-03-14 12:49:30 -07:00
tiferet
43db83a28f
Delete some commented-out code that was copied directly from JS
2023-03-14 12:49:30 -07:00
tiferet
2e4cc7efd0
Delete EndpointType.getEncoding, which is not needed anywhere.
...
If we need this down the line for model training, we can add it back in then.
2023-03-14 12:49:30 -07:00
tiferet
bcd1ac1bb0
Delete EndpointType.getEncoding, which is not needed anywhere.
...
If we need this down the line for model training, we can add it back in then.
2023-03-14 12:49:30 -07:00
tiferet
10b81eebb7
Improve EndpointTypes:
...
- Create two derived classes for EndpointType: SinkType and SourceType.
- EndpointTypes don't use a `newtype`, but rather extend string, with their characteristic predicate replacing the current getDescription predicate.
2023-03-14 12:49:30 -07:00
tiferet
91109c826d
List the MaD provenance as "ai-generated" rather than "manual"
...
See https://github.com/github/codeql/pull/12228
2023-03-14 12:49:30 -07:00
tiferet
abe3a2dae1
Improve positive prompt examples:
...
Include only sinks that are arguments to an external API call, because these are the sinks we are most interested in.
2023-03-14 12:49:30 -07:00
tiferet
4db03cf4ae
Remove IsMaDTaintStepCharacteristic for now because it's catching all our known sinks as well as taint steps
2023-03-14 12:49:30 -07:00
tiferet
f73b3e0d97
Add endpoint filters:
...
- Filter out MaD taint steps
2023-03-14 12:49:30 -07:00
tiferet
3b508f7879
Remove redundancy from ExceptionCharacteristic
2023-03-14 12:49:30 -07:00
tiferet
9b028476b8
Add endpoint filters:
...
- Filter out exceptions
- Filter out endpoints in test files
2023-03-14 12:49:30 -07:00
tiferet
24e01104a2
As part of the metadata extraction predicate, surface whether or not the argument is being passed to an external API
2023-03-14 12:49:29 -07:00
tiferet
8f6db6b244
Switch back to one sink type per supported query, rather than existing MaD kinds.
2023-03-14 12:49:29 -07:00
tiferet
d6c897c9fd
Small bug fix for handling queries with multiple sink types:
...
`getAReasonSinkExcluded` excludes endpoints that, _for every sink type relevant to this query_, have a characteristic implying they're not sinks of that sink type.
2023-03-14 12:49:29 -07:00
tiferet
8d8a21b100
Fix a bug that allowed some known sinks to end up as sink candidates for Codex
2023-03-14 12:49:29 -07:00
tiferet
a27ae27101
In the MaD data, set the subtypes field to false for final classes / methods.
2023-03-14 12:49:29 -07:00
tiferet
4b6d1f7b78
Create a new sink type for "other" sinks:
...
See https://github.com/github/atm-codex/pull/3
- Add a sink type `OtherMaDSinkType`, and corresponding characteristic `OtherMaDSinkCharacteristic`, for other sinks modeled by a MaD `kind` but not belonging to any of the existing sink types.
- Extract positive prompt examples for the new sink type, together with the corresponding MaD `kind`.
2023-03-14 12:49:29 -07:00
tiferet
66c77e890c
Bug fix
2023-03-14 12:49:29 -07:00
tiferet
be9c6500b8
In the MaD data, extract the argument index as an int rather than a string wrapped up in "Argument[]"
2023-03-14 12:49:29 -07:00
tiferet
831830831c
Fix the MaD signature to the correct format
2023-03-14 12:49:29 -07:00
tiferet
ae69a2bcd9
Separate out the sink types to align with the MaD kinds that currently exist, adding a sink type for all sinks of a given query that are not currently mapped in the MaD kinds.
2023-03-14 12:49:29 -07:00
tiferet
65923ed2c1
Add support for multiple sink types per query
2023-03-14 12:49:29 -07:00
tiferet
a7269075e2
As part of the metadata extraction predicate, surface whether or not the callee is a public method
2023-03-14 12:49:29 -07:00
tiferet
d3a5ee53c6
Refactor the CodeQL code that extracts metadata for methods presented to Codex, to make it easy to add another field
2023-03-14 12:49:29 -07:00
tiferet
f32bb65c54
Refactor the CodeQL code that extracts metadata for methods presented to Codex, to make it easy to add another field
2023-03-14 12:49:29 -07:00
tiferet
633bfdba28
Broaden the Java endpoint filter that filters out flow steps, and document it
2023-03-14 12:49:28 -07:00
tiferet
db9cec6ea6
Add an endpoint filter to filter out flow steps
2023-03-14 12:49:28 -07:00
tiferet
ec5425d952
When extracting positive and negative examples for the Java prompt, extract the data used in the MaD extensible predicate.
...
This will enable the Codex prompt to optionally use this data in additional columns.
2023-03-14 12:49:28 -07:00
tiferet
7666843316
Resolve two TODO items
2023-03-14 12:49:28 -07:00
tiferet
e06bcc3112
Exclude negative examples that are type access nodes.
...
These will never be on a flow path so they're not useful negative examples.
2023-03-14 12:49:28 -07:00
tiferet
3229b37436
Increase diversity of negative prompt examples by creating finer sub-types
2023-03-14 12:49:28 -07:00
tiferet
559570419d
If a node satisfies the logic for both isSink and isSanitizer, don't include it as a positive or negative example in the prompt, because it's too ambiguous and will confuse the model.
2023-03-14 12:49:28 -07:00
tiferet
844171a28e
Simplify the definition of ExtractPositiveExamples.ql
2023-03-14 12:49:28 -07:00
tiferet
ecf4d4dc02
Avoid accidentally extracting positive prompt examples when there is a Codex-generated data extension file in `java/ql/lib/ext`
2023-03-14 12:49:28 -07:00
tiferet
0d4e85ff93
Add a predicate that finds endpoints with logically inconsistent characteristics, and exclude such endpoints from both positive and negative examples extracted for the Codex prompt.
2023-03-14 12:49:28 -07:00
tiferet
1211197914
Fix codeql-pack.lock.yml so it's not looking for an ML model
2023-03-14 12:49:28 -07:00
tiferet
41df8df182
Typo fix
2023-03-14 12:49:28 -07:00
tiferet
125245aa62
Delete TODO items that are done
2023-03-14 12:49:28 -07:00
tiferet
8bb2b2eaea
Have each EndpointType keep track of the sink/source kind for this endpoint type as used in Models as Data
2023-03-14 12:49:28 -07:00
tiferet
27efe524da
Fix the extraction of data for the data extension YML file.
2023-03-14 12:49:28 -07:00
tiferet
ae4668c488
Add data needed for the data extension YML file to ExtractSinkCandidatesWithFlow.ql: first pass.
2023-03-14 12:49:28 -07:00