Before, results from `dca` would look something like
## + py/meta/alerts/remote-flow-sources-reach
- django/django@c2250cf_cb8f: tests/messages_tests/urls.py:38:16:38:48
reachable with taint-tracking from RemoteFlowSource
- django/django@c2250cf_cb8f: tests/messages_tests/urls.py:38:9:38:12
reachable with taint-tracking from RemoteFlowSource
now it should make it easier to spot _what_ it is that actually changed,
since we pretty-print the node.
along with tests, but no implementations (to ease reviewing).
---
I've put quite some thinking into what to call our concept for this.
[JS has `CookieDefinition`](581f4ed757/javascript/ql/src/semmle/javascript/frameworks/HTTP.qll (L148-L187)), but I couldn't find a matching concept in any other languages.
We used to call this [`CookieSet`](f07a7bf8cf/python/ql/src/semmle/python/web/Http.qll (L76)) (and had a corresponding `CookieGet`).
But for headers, [Go calls this `HeaderWrite`](cd1e14ed09/ql/src/semmle/go/concepts/HTTP.qll (L97-L131)) and [JS calls this `HeaderDefinition`](581f4ed757/javascript/ql/src/semmle/javascript/frameworks/HTTP.qll (L23-L46))
I think it would be really cool if we have a naming scheme that means the name for getting the value of a header on a incoming request is obvious. I think `HeaderWrite`/`HeaderRead` fulfils this best. We could go with `HeaderSet`/`HeaderGet`, but they feel a bit too vague to me. For me, I'm so used to talking about def-use, that I would immediately go for `HeaderDefinition` and `HeaderUse`, which could work, but is kinda strange.
So in the end that means I went with `CookieWrite`, since that allows using a consistent naming scheme for the future :)
I made `FileSystemWriteAccess` be a subclass of `FileSystemAccess` (like in [JS](64001cc02c/javascript/ql/src/semmle/javascript/Concepts.qll (L68-L74))), but then I started wondering about how I could give a good result for `getAPathArgument`, and what would a good result even be? The argument to the `open` call, or the object that the `write` method is called on? I can't see how doing either of these enables us to do anything useful...
So I looked closer at how JS uses `FileSystemWriteAccess`:
1. as sink for zip-slip: 7c51dff0f7/javascript/ql/src/semmle/javascript/security/dataflow/ZipSlipCustomizations.qll (L121)
2. as sink for downloading unsafe files (identified through their extension) through non-secure connections: 89ef6ea4eb/javascript/ql/src/semmle/javascript/security/dataflow/InsecureDownloadCustomizations.qll (L134-L150)
3. as sink for writing untrusted data to a local file 93b1e59d62/javascript/ql/src/semmle/javascript/security/dataflow/HttpToFileAccessCustomizations.qll (L43-L46)
for the 2 first sinks, it's important that `getAPathArgument` has a proper result... so that solves the problem, and highlights that it _can_ be important to give proper results for `getAPathArgument` (if possible).
So I'm trying to do best effort for `f = open(...); f.write(...)`, but with this current code we won't always be able to give a result (as highlighted by the tests). It will also be the case that there are multiple `FileSystemAccess` with the same path-argument, which could be a little strange.
overall, I'm not super confident about the way this new concept and implementation turned out, but it also seems like the best I could come up with right now...
The obvious alternative solution is to NOT make `FileSystemWriteAccess` a subclass of `FileSystemAccess`, but I'm not very tempted to go down this path, given the examples of this being useful above, and just the general notion that we should be able to model writes as being a specialized kind of `FileSystemAccess`.
This required quite some changes in the expected output. I think it's much more
clear what the selected nodes are now 👍 (but it was a bit boring work to fix
this up)
I considered using `getInput` like in JS, but things like signature verification
has multiple inputs (message and signature).
Using getAnInput also aligns better with Decoding/Encoding.
The meat of this PR is described in the new python/ql/test/experimental/meta/InlineTaintTest.qll file:
> Defines a InlineExpectationsTest for checking whether any arguments in
> `ensure_tainted` and `ensure_not_tainted` calls are tainted.
>
> Also defines query predicates to ensure that:
> - if any arguments to `ensure_not_tainted` are tainted, their annotation is marked with `SPURIOUS`.
> - if any arguments to `ensure_tainted` are not tainted, their annotation is marked with `MISSING`.
>
> The functionality of this module is tested in `ql/test/experimental/meta/inline-taint-test-demo`.
I did spend some time to figure out how to best write `minimumSecureKeySize`
predicate. I wanted to write once and for all the recommended sizes for each
cryptosystem.
I considered making the predicate such as
```codeql
int minimumSecureKeySize() {
this.getName() = "RSA" and result = 2048
or
this.getName() = "DSA" and result = 2048
or
this.getName() = "ECC" and result = 244
}
```
but then it would be impossible to add a new model without also being able to
modify the body of this predicate -- which seems like a bad way to start off a
brand new way of modeling things.
So I considered if we could add it to the non-range class, such as
```codeql
class RSAKeyGeneration extends KeyGeneration {
RSAKeyGeneration() { this.getName() = "RSA" }
override int minimumSecureKeySize() { result = 2048 }
}
```
This has the major problem that when you're writing the models for a new
API (and therefore extending KeyGeneration::Range), there is no way for you to
see that you need to take this extra step :| (also problem about how we should
define `minimumSecureKeySize` on `KeyGeneration` class then, since if we make it
abstract, we effectively disable the ability to refine `KeyGeneration` since any
subclass must provide an implementation.)
So, therefore I ended up with this solution ;)
Since I really want to use our existing infrastructure to model that we can
recognize something as a request handler without it having a route, we need this
as a separate concept. All tests have been adjusted.
The early modeling was based on flask, where all request-handling is based on
handling requests from a specific route. But with the standard library handling
and handlers without routes, the naming had to change.
I'm not even trying to model it properly right now, and don't have a specific
use-case for it RIGHT NOW. I think we could want this in the future, but I think
it's probably better to model it when we know what we want to use it for.
We might need to rework this a bit when we also start to handle redirects. I
could see a world where we simply allow http redirects to be subclasses of http
responses, and need to manually exclude them from queries (or create
HttpContentResponse to model the HttpResponses that will contain a body). Let us
see where the wind will take us.
I looked through JS and Go libraries, but I didn't feel their modeling would map
very well to Python.
Overall idea is that `test/experimental/meta/ConceptsTest.qll` will set up
inline expectation tests for all the classes defined in `Concepts.qll`, so any
time you model a new instance of Concepts, you simply just import that
file. That makes the tests a little verbose, but allows us to share test-setup
between all the different frameworks we model.
Note that since the definitions of SystemCommandExecution subclasses are
scattered across multieple framework modeling qll files, it think it makes the
most sense to have the tests for each framework in one location.
I'm not 100% convinced about if this is the right choice or not (especially when
we want to write tests for sanitizers), but for now I'm going to try it out at
least.