Now that `TInstruction` is shared between IR stages, several of the per-stage IR construction predicates can now be moved into the `Raw` interface exposed only by the initial construction of IR from the ASTs. This also removed a couple predicates that were not used previously at all.
Each stage of the IR reuses the majority of the instructions from previous stages. Previously, we've been wrapping each reused old instruction in a branch of the `TInstruction` type for the next stage. This causes use to create roughly three times as many `TInstruction` objects as we actually need.
Now that IPA union types are supported in the compiler, we can share a single `TInstruction` IPA type across stages. We create a single `TInstruction` IPA type, with individual branches of this type for instructions created directly from the AST (`TRawInstruction`) and for instructions added by each stage of SSA construction (`T*PhiInstruction`, `T*ChiInstruction`, `T*UnreachedInstruction`). Each stage then defines a `TStageInstruction` type that is a union of all of the branches that can appear in that particular stage. The public `Instruction` class for each phase extends the `TStageInstruction` type for that stage.
The interface that each stage exposes to the pyrameterized modules in the IR is now split into three pieces:
- The `Raw` module, exposed only by the original IR construction stage. This module identifies which functions have IR, which `TRawInstruction`s exist, and which `IRVariable`s exist.
- The `SSA` module, exposed only by the two SSA construction stages. This identifiers which `Phi`, `Chi`, and `Unreached` instructions exist.
- The global module, exposed by all three stages. This module has all of the predicates whose implementation is different for each stage, like gathering definitions of `MemoryOperand`s.
Similarly, there is now a single `TIRFunction` IPA type that is shared across all three stages. There is a single `IRFunctionBase` class that exposes the stage-indepdendent predicates; the `IRFunction` class for each stage extends `IRFunctionBase`.
Most of the other changes are largely mechanical.
- Added experimental/Security/CWE/CWE-094/SpelInjection.ql
and a couple of libraries
- Added a qhelp file with a few examples
- Added tests and stubs for Spring
This QL
```codeql
import python
import semmle.python.dataflow.TaintTracking
import semmle.python.security.strings.Untrusted
from CollectionKind ck
where
ck.(DictKind).getMember() instanceof StringKind
or
ck.getMember().(DictKind).getMember() instanceof StringKind
select ck, ck.getAQlClass(), ck.getMember().getAQlClass()
```
generates these 6 results.
```
1 {externally controlled string} ExternalStringDictKind UntrustedStringKind
2 {externally controlled string} StringDictKind UntrustedStringKind
3 [{externally controlled string}] SequenceKind ExternalStringDictKind
4 [{externally controlled string}] SequenceKind StringDictKind
5 {{externally controlled string}} DictKind ExternalStringDictKind
6 {{externally controlled string}} DictKind StringDictKind
```
StringDictKind was only used in *one* place in our library code. As illustrated
above, it pollutes our set of TaintKinds. Effectively, every time we make a
flow-step for dictionaries with tainted strings as values, we do it TWICE --
once for ExternalStringDictKind, and once for StringDictKind... that is just a
waste.
If I wanted to use my own TaintKind and not have any interaction with
`UntrustedStringKind` that wouldn't be possible today since these standard http
libraries import it directly. (also, I wouldn't get any sources of my custom
TaintKind from turbogears or bottle). I changed them to use the same pattern of
`ExternalStringKind` as everything else does.