This commit implements the language-neutral IR type system for C#. It mostly follows the same pattern as C++, modified to fit the C# type system. All object references, pointers, and lvalues are represented as `IRAddress` types. All structs and generic parameters are implemented as `IRBlobType`. Function addresses get a single `IRFunctionAddressType`.
I had to fix a couple places in the original IR type system where I didn't realize I was still depending on language-specific types. As part of this, `CSharpType` and `CppType` now have a `hasUnspecifiedType()` predicate, which is equivalent to `hasType()`, except that it holds only for the unspecified version of the type. This predicate can go away once we remove the IR's references to the underlying `Type` objects.
All C# IR tests pass without modification, but only because this commit continues to print the name of `IRUnknownType` as `null`, and `IRFunctionAddressType` as `glval<null>`. These will be fixed separately in a subsequent commit in this PR.
The C++ IR currently has a very clunky way of specifying the type of an IR entity (`Instruction`, `Operand`, `IRVariable`, etc.). There are three separate predicates: `getType()`, `isGLValue()`, and `getSize()`. All three are necessary, rather than just having a `getType()` predicate, because some IR entities have types that are not represented via an existing `Type` object in the AST. Examples include the type for an lvalue returned from a `VariableAddress` instruction, the type for an array slice being zero-initialized in a variable initializer, and several others. It is very easy for QL code to just check the `getType()` predicate, while forgetting to use `isGLValue()` to determine if that type is the actual type of the entity (the prvalue case) or the type referred to by a glvalue entity. Furthermore, the C++ type system creates potentially many different `Type` objects for the same underlying type (e.g. typedefs, using declarations, `const`/`volatile` qualifiers, etc.), making it more difficult to tell when two entities have semantically equivalent types.
In addition, other languages for which we want to enable the IR have somewhat different type systems. The various language type systems differ in their structure, although they tend to share the basic building blocks necessary for the IR.
To address all of the above problems, I've introduced a new class hierarchy, rooted at the class `IRType`, that represents a bare-bones type system that is independent of source language (at least across C/C++/C#/Java). A type's identity is based on its kind (signed integer, unsigned integer, floating-point, Boolean, blob, etc.), size and in the case of blob types, a "tag" to differentiate between different classes and structs. No distinction is made between, say `signed int` and plain `int`, or between different language integer types that have the same signedness and size (e.g. `unsigned int` vs. `wchar_t` on Linux). `IRType` is intended for use by language-agnostic IR-based analyses, including range analysis, dataflow, SSA construction, and alias analysis. The set of available `IRType`s is determined by predicate provided by the language library implementation (e.g. `hasSignedIntegerType(int byteSize)`.
In addition to `IRType`, each language now defines a type alias named `LanguageType`, representing the type of an IR entity in more language-specific terms. The only predicate requried on `LanguageType` is `getIRType()`, which returns the single `IRType` object for the language-neutral representation of that `LanguageType`. All other predicates on and subclasses of `LanguageType` are language-specific. There may be many instances of `LanguageType` that map to a given `IRType`, to allow for typedefs, etc.
Most of the changes are mechanical changes in the IR construction code, to return the correct type for each IR entity. SSA construction has also been updated to avoid dependencies on language-specific types.
I have not yet removed the original `getType()` predicates that just return `Type`. These can be removed once we move the remaining existing libraries to use `IRType`.
Test results are, by design, pretty much unchanged. Once case changed for inline asm, because the previously IR generation for it played a little fast and loose with the input/output expressions. The test case now includes both input and output variables. The generated IR for `Conditional_LValue` is now more correct, because we now have a way to represent an lvalue of an lvalue. `syntax-zoo` is still a hot mess. Most of the changed outputs are due to wobble from having multiple functions with the same name, but with a slightly different order of evaluation due to the type changes. Others are wobble from already-invalid IR. A couple non-wobbly places have improved slightly, though.
The C# part of this change is waiting for #2005 to be merged, since that has some of the necessary C# implementation.
Generation of IDs for namespace members has been fixed to generate
unique IDs for variables of the same name but in different namespaces.
Update the same_name test to validate this.
The key insight here is that `HC_FieldCons` and `HC_Array` are
functionally determined by the things that arise in another
recursive call. Lifting them to their own predicate, therefore,
reduces nonlinearity and constrains the join order in a way that
cannot be asymptotically bad -- and, indeed, makes quite a big
difference in practice.
Switching `security.TaintTracking` to use `DefaultTaintTracking` causes
us to lose a result from `UnboundedWrite.ql`, while this commit restores
it:
diff --git a/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-120/CERT/STR35-C/UnboundedWrite.expected b/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-120/CERT/STR35-C/UnboundedWrite.expected
index 1eba0e52f0e..d947b33b9d9 100644
--- a/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-120/CERT/STR35-C/UnboundedWrite.expected
+++ b/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-120/CERT/STR35-C/UnboundedWrite.expected
@@ -1,2 +1,3 @@
+| main.c:54:7:54:12 | call to strcat | This 'call to strcat' with input from $@ may overflow the destination. | main.c:93:15:93:18 | argv | argv |
| main.c:99:9:99:12 | call to gets | This 'call to gets' with input from $@ may overflow the destination. | main.c:99:9:99:12 | call to gets | call to gets |
| main.c:213:17:213:19 | buf | This 'scanf string argument' with input from $@ may overflow the destination. | main.c:213:17:213:19 | buf | buf |
This is one step toward implementing the taint-tracking wrapper in terms
of `Instruction` rather than `Expr`.
This leads to a few duplicate results in `TaintedAllocationSize.ql`
because the library now considers `sizeof(int)` to be just as
predictable as `4`, whereas the `security.TaintTracking` library does
not consider `sizeof` to be predictable. I think it's simpler to accept
the duplicate results since they are ultimately a quirk of the query,
not the library.
The following is the diff between (a) replacing `TaintTracking.qll` with
a link to `DefaultTaintTracking.qll` and (b) additionally applying this
commit.
diff --git a b
--- a/cpp/ql/test/query-tests/Security/CWE/CWE-190/semmle/TaintedAllocationSize/TaintedAllocationSize.expected
+++ b/cpp/ql/test/query-tests/Security/CWE/CWE-190/semmle/TaintedAllocationSize/TaintedAllocationSize.expected
@@ -1,5 +1,8 @@
| test.cpp:42:31:42:36 | call to malloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
+| test.cpp:43:31:43:36 | call to malloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
| test.cpp:43:38:43:63 | ... * ... | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
+| test.cpp:45:31:45:36 | call to malloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
| test.cpp:48:25:48:30 | call to malloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
| test.cpp:49:17:49:30 | new[] | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
+| test.cpp:52:21:52:27 | call to realloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
| test.cpp:52:35:52:60 | ... * ... | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
--- a/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-190/CERT/INT04-C/int04.expected
+++ b/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-190/CERT/INT04-C/int04.expected
@@ -1 +1,2 @@
| int04c.c:21:29:21:51 | ... * ... | This allocation size is derived from $@ and might overflow | int04c.c:14:30:14:35 | call to getenv | user input (getenv) |
+| int04c.c:22:33:22:38 | call to malloc | This allocation size is derived from $@ and might overflow | int04c.c:14:30:14:35 | call to getenv | user input (getenv) |
The user knows that an expression functionally determines its
hashCons value, and that an expression functionally determines
its number of children, but this is not provable from the
definitions, and so not usable by the optimiser. By storing
the result of those known-functional calls in a variable,
rather than repeating the call, we enable better join orders.