Commit Graph

4510 Commits

Author SHA1 Message Date
Ian Lynagh
142e1cb9fb C++: Implement AggregateLiteral.mayBeImpure() 2019-09-25 10:34:30 +01:00
Ziemowit Laski
a6d619cfe1 [zlaski/what-buffer-function] Rename CustomModels to Models 2019-09-24 18:17:34 -07:00
Ziemowit Laski
7e14e2a950 [zlaski/what-buffer-function] Rename references to BufferFunction to ArrayFunction. 2019-09-24 18:02:14 -07:00
Dave Bartolomeo
300e580874 C++: Implement language-neutral IR type system
The C++ IR currently has a very clunky way of specifying the type of an IR entity (`Instruction`, `Operand`, `IRVariable`, etc.). There are three separate predicates: `getType()`, `isGLValue()`, and `getSize()`. All three are necessary, rather than just having a `getType()` predicate, because some IR entities have types that are not represented via an existing `Type` object in the AST. Examples include the type for an lvalue returned from a `VariableAddress` instruction, the type for an array slice being zero-initialized in a variable initializer, and several others. It is very easy for QL code to just check the `getType()` predicate, while forgetting to use `isGLValue()` to determine if that type is the actual type of the entity (the prvalue case) or the type referred to by a glvalue entity. Furthermore, the C++ type system creates potentially many different `Type` objects for the same underlying type (e.g. typedefs, using declarations, `const`/`volatile` qualifiers, etc.), making it more difficult to tell when two entities have semantically equivalent types.

In addition, other languages for which we want to enable the IR have somewhat different type systems. The various language type systems differ in their structure, although they tend to share the basic building blocks necessary for the IR.

To address all of the above problems, I've introduced a new class hierarchy, rooted at the class `IRType`, that represents a bare-bones type system that is independent of source language (at least across C/C++/C#/Java). A type's identity is based on its kind (signed integer, unsigned integer, floating-point, Boolean, blob, etc.), size and in the case of blob types, a "tag" to differentiate between different classes and structs. No distinction is made between, say `signed int` and plain `int`, or between different language integer types that have the same signedness and size (e.g. `unsigned int` vs. `wchar_t` on Linux). `IRType` is intended for use by language-agnostic IR-based analyses, including range analysis, dataflow, SSA construction, and alias analysis. The set of available `IRType`s is determined by predicate provided by the language library implementation (e.g. `hasSignedIntegerType(int byteSize)`.

In addition to `IRType`, each language now defines a type alias named `LanguageType`, representing the type of an IR entity in more language-specific terms. The only predicate requried on `LanguageType` is `getIRType()`, which returns the single `IRType` object for the language-neutral representation of that `LanguageType`. All other predicates on and subclasses of `LanguageType` are language-specific. There may be many instances of `LanguageType` that map to a given `IRType`, to allow for typedefs, etc.

Most of the changes are mechanical changes in the IR construction code, to return the correct type for each IR entity. SSA construction has also been updated to avoid dependencies on language-specific types.

I have not yet removed the original `getType()` predicates that just return `Type`. These can be removed once we move the remaining existing libraries to use `IRType`.

Test results are, by design, pretty much unchanged. Once case changed for inline asm, because the previously IR generation for it played a little fast and loose with the input/output expressions. The test case now includes both input and output variables. The generated IR for `Conditional_LValue` is now more correct, because we now have a way to represent an lvalue of an lvalue. `syntax-zoo` is still a hot mess. Most of the changed outputs are due to wobble from having multiple functions with the same name, but with a slightly different order of evaluation due to the type changes. Others are wobble from already-invalid IR. A couple non-wobbly places have improved slightly, though.

The C# part of this change is waiting for #2005 to be merged, since that has some of the necessary C# implementation.
2019-09-23 16:14:00 -07:00
Jonas Jensen
22e57a6559 Merge pull request #1860 from matt-gretton-dann/add-using-aliases
Add support for using aliases
2019-09-23 16:53:51 +02:00
Jonas Jensen
898976121b Merge pull request #1987 from geoffw0/toomanyformat
CPP: WrongNumberOfFormatArguments.ql Fix
2019-09-23 16:05:11 +02:00
Jonas Jensen
a34c0d4200 C++: Autoformat TranslatedExpr.qll 2019-09-23 15:39:32 +02:00
Robert Marsh
90c91a78f8 Merge pull request #1976 from pavgust/fix/hashcons-perf
C++: HashCons: Further performance improvements
2019-09-23 06:37:03 -07:00
Jonas Jensen
cd5f3b84a8 C++: Make sure there's a Instruction for each Expr
This change ensures that all `Expr`s (except parentheses) have a
`TranslatedExpr` with a `getResult` that's one of its own instructions,
not an instruction from one of its operands. This means that when we
translate back and forth between `Expr` and `Instruction`, like in
`DataFlow::exprNode`, we will not conflate `e` with `&e` or `... = e`.
2019-09-23 15:23:31 +02:00
Matthew Gretton-Dann
4606587fe8 C++: Apply style guide to TypedefType.qll 2019-09-23 13:57:50 +01:00
Matthew Gretton-Dann
af3b0d9e73 C++: Update stats. 2019-09-23 13:57:50 +01:00
Matthew Gretton-Dann
5468b8def7 C++: Add support for C++ using aliases
Previously these were identified as typedefs.
2019-09-23 13:57:50 +01:00
Geoffrey White
9100ab9360 CPP: Autoformat. 2019-09-20 15:30:59 +01:00
Geoffrey White
f7607313e7 CPP: Fix FPs. 2019-09-20 15:12:55 +01:00
Pavel Avgustinov
1c971d3f88 HashCons: Further performance improvements
The key insight here is that `HC_FieldCons` and `HC_Array` are
functionally determined by the things that arise in another
recursive call. Lifting them to their own predicate, therefore,
reduces nonlinearity and constrains the join order in a way that
cannot be asymptotically bad -- and, indeed, makes quite a big
difference in practice.
2019-09-20 12:00:33 +01:00
Robert Marsh
d3f2d8169e Merge pull request #1967 from jbj/tainttracking-ir-2
C++: DefaultTaintTracking flow from a to a[i]
2019-09-19 15:00:29 -07:00
Robert Marsh
9c6a0ffc48 Merge pull request #1979 from nickrolfe/wrong_type_uninstantiated
C++: ignore uninstantiated templates in WrongTypeFormatArguments.ql
2019-09-19 14:51:45 -07:00
Nick Rolfe
56f4f86921 C++: ignore uninstantiated templates in WrongTypeFormatArguments.ql 2019-09-19 21:18:47 +01:00
Robert Marsh
fd88f7a3ce Merge pull request #1884 from jbj/dataflow-addressof
C++: Data flow through address-of operator (&)
2019-09-19 09:15:43 -07:00
Jonas Jensen
29c93488bc C++: DefaultTaintTracking flow from a to a[i]
Switching `security.TaintTracking` to use `DefaultTaintTracking` causes
us to lose a result from `UnboundedWrite.ql`, while this commit restores
it:

diff --git a/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-120/CERT/STR35-C/UnboundedWrite.expected b/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-120/CERT/STR35-C/UnboundedWrite.expected
index 1eba0e52f0e..d947b33b9d9 100644
--- a/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-120/CERT/STR35-C/UnboundedWrite.expected
+++ b/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-120/CERT/STR35-C/UnboundedWrite.expected
@@ -1,2 +1,3 @@
+| main.c:54:7:54:12 | call to strcat | This 'call to strcat' with input from $@ may overflow the destination. | main.c:93:15:93:18 | argv | argv |
 | main.c:99:9:99:12 | call to gets | This 'call to gets' with input from $@ may overflow the destination. | main.c:99:9:99:12 | call to gets | call to gets |
 | main.c:213:17:213:19 | buf | This 'scanf string argument' with input from $@ may overflow the destination. | main.c:213:17:213:19 | buf | buf |
2019-09-19 14:52:40 +02:00
Jonas Jensen
34a5368101 C++: Ignore templates in AmbiguouslySignedBitField
If it's possible that the type is not fully resolved, it's better to
avoid giving an alert.

This fixes a FP in https://github.com/heremaps/flatdata.
2019-09-19 14:21:53 +02:00
Jonas Jensen
30d1c327cf C++: Implement predictableInstruction without Expr
This is one step toward implementing the taint-tracking wrapper in terms
of `Instruction` rather than `Expr`.

This leads to a few duplicate results in `TaintedAllocationSize.ql`
because the library now considers `sizeof(int)` to be just as
predictable as `4`, whereas the `security.TaintTracking` library does
not consider `sizeof` to be predictable. I think it's simpler to accept
the duplicate results since they are ultimately a quirk of the query,
not the library.

The following is the diff between (a) replacing `TaintTracking.qll` with
a link to `DefaultTaintTracking.qll` and (b) additionally applying this
commit.

diff --git a b
--- a/cpp/ql/test/query-tests/Security/CWE/CWE-190/semmle/TaintedAllocationSize/TaintedAllocationSize.expected
+++ b/cpp/ql/test/query-tests/Security/CWE/CWE-190/semmle/TaintedAllocationSize/TaintedAllocationSize.expected
@@ -1,5 +1,8 @@
 | test.cpp:42:31:42:36 | call to malloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
+| test.cpp:43:31:43:36 | call to malloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
 | test.cpp:43:38:43:63 | ... * ... | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
+| test.cpp:45:31:45:36 | call to malloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
 | test.cpp:48:25:48:30 | call to malloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
 | test.cpp:49:17:49:30 | new[] | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
+| test.cpp:52:21:52:27 | call to realloc | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
 | test.cpp:52:35:52:60 | ... * ... | This allocation size is derived from $@ and might overflow | test.cpp:39:21:39:24 | argv | user input (argv) |
--- a/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-190/CERT/INT04-C/int04.expected
+++ b/semmlecode-cpp-tests/DO_NOT_DISTRIBUTE/security-tests/CWE-190/CERT/INT04-C/int04.expected
@@ -1 +1,2 @@
 | int04c.c:21:29:21:51 | ... * ... | This allocation size is derived from $@ and might overflow | int04c.c:14:30:14:35 | call to getenv | user input (getenv) |
+| int04c.c:22:33:22:38 | call to malloc | This allocation size is derived from $@ and might overflow | int04c.c:14:30:14:35 | call to getenv | user input (getenv) |
2019-09-19 13:11:27 +02:00
Jonas Jensen
307b92feed C++: Unknown template literals are constant 2019-09-19 10:23:26 +02:00
Jonas Jensen
9b805c01cc Merge pull request #1951 from pavgust/fix/hashcons-perf
C++: Fix HashCons library performance
2019-09-19 08:10:34 +02:00
Jonas Jensen
7d8396fa65 C++: Constant template pointer-to-member literals 2019-09-18 14:44:25 +02:00
Jonas Jensen
c90fd32a78 C++: Pointer-to-member-function is constant 2019-09-18 13:55:56 +02:00
Pavel Avgustinov
eca31908ab HashCons: Make some functionality apparent.
The user knows that an expression functionally determines its
hashCons value, and that an expression functionally determines
its number of children, but this is not provable from the
definitions, and so not usable by the optimiser. By storing
the result of those known-functional calls in a variable,
rather than repeating the call, we enable better join orders.
2019-09-18 12:54:48 +01:00
Pavel Avgustinov
03502863cf Distribute a recursive call into a recursive disjunction.
As the linearity of the disjuncts is different, this enables us to
pick better join orders for each disjunct separately.
2019-09-18 12:54:48 +01:00
Tom Hvitved
d8074ddfa6 Sync files 2019-09-18 13:36:15 +02:00
Jonas Jensen
571c96bb2f C++: Autoformat five files
These files have come out of autoformat since the big commit that
autoformatted everything.
2019-09-18 11:55:19 +02:00
Jonas Jensen
fd6d06fe6f C++: Data flow through address-of operator (&)
The data flow library conflates pointers and their objects in some
places but not others. For example, a member function call `x.f()` will
cause flow from `x` of type `T` to `this` of type `T*` inside `f`. It
might be ideal to avoid that conflation, but that's not realistic
without using the IR.

We've had good experience in the taint tracking library with conflating
pointers and objects, and it improves results for field flow, so perhaps
it's time to try it out for all data flow.
2019-09-17 13:16:34 +02:00
Dave Bartolomeo
21f6ab787d C++: Rename predicates in FunctionInputsAndOutputs.qll and add QLDoc 2019-09-16 12:06:06 -07:00
Geoffrey White
3df31e6ccf CPP: Tiny qldoc fixes. 2019-09-16 16:52:48 +01:00
Dave Bartolomeo
553238a9e8 Merge pull request #1922 from jbj/qlcfg-const-pointer-to-member
C++: Add PointerToFieldLiteral class
2019-09-13 10:44:52 -07:00
Jonas Jensen
7cfbe88e7b C++: IR DataFlow::Node.toString consistency
The `toString` for IR data-flow nodes are now similar to AST data-flow
nodes. This should make it easier to use the IR as a drop-in replacement
in the future. There are still differences because the IR data flow
library takes conversions into account.

I did not attempt to align the new nodes we use for field flow. That can
come later, when we add field flow to IR data flow.
2019-09-13 14:33:31 +02:00
Jonas Jensen
562bffe710 C++: Simplify toString of ImplicitParameterNode
This string looked out of place compared to `ExplicitParameterNode`,
whose string is simply the name of the parameter and therefore
indistinguishable from an access to the parameter without looking at the
location also. This has not been a problem so far, and if we want to
distinguish more clearly between initial values and accesses at some
point, we should do it for `ExplicitParameterNode` and
`UninitializedNode` too.
2019-09-13 14:33:26 +02:00
Tom Hvitved
f5cae9b6ea Merge pull request #1881 from aschackmull/java/pathgraph-nodes
Java/C++/C#: Add nodes predicate to PathGraph.
2019-09-13 10:32:47 +02:00
Dave Bartolomeo
e8cf3f876e Merge pull request #1660 from zlaski-semmle/zlaski/builtin-va-list
Add a `__builtin_va_list` type, to complement `__builtin_va_*`
2019-09-12 14:04:55 -07:00
Dave Bartolomeo
9072f6231f Merge pull request #1928 from jbj/autoformat-ssa
C++: Autoformat IR SSA files
2019-09-12 14:03:20 -07:00
zlaski-semmle
45640395a9 Merge pull request #1803 from geoffw0/qldoceg9
CPP: Add syntax examples to QLDoc in Variable.qll
2019-09-12 12:32:58 -07:00
Jonas Jensen
0c092e21b0 C++: Autoformat IR SSA files
One autoformat omission had also slipped into
`DefaultTaintTracking.qll`.
2019-09-12 15:45:08 +02:00
Jonas Jensen
10270cb36d C++: Turn a comment into QLDoc 2019-09-12 15:44:04 +02:00
Jonas Jensen
c7e6081079 C++: Add DataFlow::instructionNode
This is for symmetry with `exprNode` etc., and it should be handy for
the same reasons. I found one caller of `asInstruction` that got simpler
by using the new predicate instead.
2019-09-12 11:44:17 +02:00
Anders Schack-Mulligen
95e2f162d9 Java/C++/C#: Adjust toString of empty accesspath. 2019-09-12 11:00:49 +02:00
Anders Schack-Mulligen
0a4b15d40b Java/C++/C#: Add nodes predicate to PathGraph. 2019-09-12 11:00:49 +02:00
semmle-qlci
10076a6b2b Merge pull request #1886 from jbj/ir-taint-shared
Approved by rdmarsh2
2019-09-12 06:48:24 +01:00
Robert Marsh
e71a39f6b6 Merge pull request #1912 from jbj/tainttracking-ir-1
C++: Stub replacement for security.TaintTracking
2019-09-11 13:44:39 -07:00
Geoffrey White
d1cc28e253 CPP: Address review comments. 2019-09-11 17:14:05 +01:00
Geoffrey White
ee07c705a4 CPP: More review suggestions. 2019-09-11 17:14:05 +01:00
Geoffrey White
8134d80c46 CPP: Review suggestions. 2019-09-11 17:14:05 +01:00