From b8eb2ab10077af3cb2041940ee4ef8cdc59f4da9 Mon Sep 17 00:00:00 2001 From: Jeroen Ketema Date: Fri, 4 Oct 2024 15:37:22 +0200 Subject: [PATCH 1/2] C++: Add some documentation on the printed IR --- .../ir/implementation/aliased_ssa/PrintIR.qll | 106 ++++++++++++++++++ .../cpp/ir/implementation/raw/PrintIR.qll | 106 ++++++++++++++++++ .../implementation/unaliased_ssa/PrintIR.qll | 106 ++++++++++++++++++ 3 files changed, 318 insertions(+) diff --git a/cpp/ql/lib/semmle/code/cpp/ir/implementation/aliased_ssa/PrintIR.qll b/cpp/ql/lib/semmle/code/cpp/ir/implementation/aliased_ssa/PrintIR.qll index c4b18d9cb61..5e634a7c322 100644 --- a/cpp/ql/lib/semmle/code/cpp/ir/implementation/aliased_ssa/PrintIR.qll +++ b/cpp/ql/lib/semmle/code/cpp/ir/implementation/aliased_ssa/PrintIR.qll @@ -6,6 +6,112 @@ * uses, however, it is better to write a query that imports `PrintIR.qll`, extends * `PrintIRConfiguration`, and overrides `shouldPrintDeclaration()` to select a subset of declarations * to dump. + * + * Anatomy of a printed IR instruction + * + * An instruction: + * + * ``` + * # 2281| v2281_19(void) = Call[~String] : func:r2281_18, this:r2281_17 + * ``` + * + * The prefix `# 2281|` specifies that this instruction was generated by the C++ source code on line 2281. + * Scrolling up in the printed output, one will eventually find the name of the file to which the line + * belongs. + * + * `v2281_19(void)` is the result of the instruction. Here, `v` means this is a void result or operand (so + * there should be no later uses of the result; see below for other possible values). The `2281_19` is a + * unique ID for the result. This is usually just the line number plus a small integer suffix to make it + * unique within the function. The type of the result is `void`. In this case, it is `void`, because + * `~String` returns `void`. 
The type of the result is usually just the name of the appropriate C++ type, + * but it will sometimes be a type like `glval<int>`, which means the result holds a glvalue, which at the + * IR level works like a pointer. In other words, in the source code the type was `int`, but it is really + * more like an `int*`. We see this, for example, in `x = y;`, where `x` is a glvalue. + * + * `Call` is the opcode of the instruction. Common opcodes include: + * + * * Arithmetic operations: `Add`, `Sub`, `Mul`, etc. + * * Memory access operations: `Load`, `Store`. + * * Function calls: `Call`. + * * Literals: `Constant`. + * * Variable addresses: `VariableAddress`. + * * Function entry points: `EnterFunction`. + * * Return form a function: `Return`, `ReturnVoid`. Note that the value being returned is set separately by a + * `Store` to a special `#return` variable. + * * Stack unwinding for C++ function that throw and where the exception escapes the function: `Unwind`. + * * Common exit point for `Unwind` and `Return`: `ExitFunction`. + * * SSA-related opcodes: `Phi`, `Chi`. + * + * `[~String]` denotes additional information. The information might be present earlier in the IR, as is the case + * for `Call`, where it is the name of the called function. This is also the case for `Load` and `Store`, where it + * is the name of the variable that is loaded or stored (if known). In the case of `Constant`, `FieldAddress`, and + * `VariableAddress`, the information between brackets does not occur earlier. + * + * `func:r2281_18` and `this:r2281_17` are the operands of the instruction. The `func:` prefix denotes the operand + * that holds the address of the called function. The `this:` prefix denotes the argument to the special `this` + * parameter of an instance member function. `r2281_18` and `r2281_17` are the unique IDs of the operands. Each of these + * matches the ID of a previously seen result, showing where that value came from. 
The `r` means that these are + * "register" operands (see below). + * + * Result and operand kinds: + * + * Every result and operand is one of these three kinds: + * + * * `r` "register". These operands are not stored in any particular memory location. We can think of them as + * temporary values created during the evaluation of an expression. A register operand almost always has one + * use, often in the same block as its definition. + * * `m` "memory". These operands represent accesses to a specific memory location. The location could be a + * local variable, a global variable, a field of an object, an element of an array, or any memory that we happen + * to have a pointer to. These only occur as the result of a `Store`, the source operand of a `Load` or on the + * SSA instructions (`Phi`, `Chi`). + * * `v` "void". Really just a register operand, but we mark register operands of type void with this special prefix + * so we know that there is no actual value there. + * + * Branches in the IR: + * + * The IR is divided into basic blocks. At the end of each block, there are one or more edges showing the possible + * control flow successors of the block. + * + * ``` + * # 44| v44_3(void) = ConditionalBranch : r44_2 + * #-----| False -> Block 4 + * #-----| True -> Block 3 + * ``` + * Here we have a block that ends with a conditional branch. The two edges show where the control flows to depending + * on whether the condition is true or false. + * + * SSA instructions: + * + * We use `Phi` instructions in SSA to create a single definition for a variable that might be assigned on multiple + * control flow paths. The `Phi` instruction merges the potential values of that variable from each predecessor edge, + * and the resulting definition is then used wherever that variable is accessed later on. 
+ * + * When dealing with aliased memory, we use the `Chi` instruction to create a single definition for memory that might + * or might not have been updated by a store, depending on the actual address that was written to. For example, take: + * + * ```cpp + * int x = 5; + * int y = 7; + * int* p = condition ? &x : &y; + * p = 6; + * return x; + * ``` + * + * At the point where we store to `*p`, we do not know whether `p` points to `x` or `y`. Thus, we do not know whether + * `return x;` is going to return the value that `x` was originally initialized to (5), or whether it will return 6, + * because it was overwritten by `*p = 6;`. We insert a `Chi` instruction immediately after the store to `*p`: + * + * ``` + * r2(int) = Constant[6] + * r3(int*) = <> + * m4(int) = Store : &r3, r2 // Stores the constant 6 to *p + * m5(unknown) = Chi : total:m1, partial:m4 + * ``` + * The `partial:` operand represents the memory that was just stored. The `total:` operand represents the previous + * contents of all of the memory that `p` might have pointed to (in this case, both `x` and `y`). The result of the + * `Chi` represents the new contents of whatever memory the `total:` operand referred to. We usually do not know exactly + * which parts of that memory were overwritten, but it does model that any of that memory could have been modified, so + * that later instructions do not assume that the memory was unchanged. */ private import internal.IRInternal diff --git a/cpp/ql/lib/semmle/code/cpp/ir/implementation/raw/PrintIR.qll b/cpp/ql/lib/semmle/code/cpp/ir/implementation/raw/PrintIR.qll index c4b18d9cb61..5e634a7c322 100644 --- a/cpp/ql/lib/semmle/code/cpp/ir/implementation/raw/PrintIR.qll +++ b/cpp/ql/lib/semmle/code/cpp/ir/implementation/raw/PrintIR.qll @@ -6,6 +6,112 @@ * uses, however, it is better to write a query that imports `PrintIR.qll`, extends * `PrintIRConfiguration`, and overrides `shouldPrintDeclaration()` to select a subset of declarations * to dump. 
+ * + * Anatomy of a printed IR instruction + * + * An instruction: + * + * ``` + * # 2281| v2281_19(void) = Call[~String] : func:r2281_18, this:r2281_17 + * ``` + * + * The prefix `# 2281|` specifies that this instruction was generated by the C++ source code on line 2281. + * Scrolling up in the printed output, one will eventually find the name of the file to which the line + * belongs. + * + * `v2281_19(void)` is the result of the instruction. Here, `v` means this is a void result or operand (so + * there should be no later uses of the result; see below for other possible values). The `2281_19` is a + * unique ID for the result. This is usually just the line number plus a small integer suffix to make it + * unique within the function. The type of the result is `void`. In this case, it is `void`, because + * `~String` returns `void`. The type of the result is usually just the name of the appropriate C++ type, + * but it will sometimes be a type like `glval<int>`, which means the result holds a glvalue, which at the + * IR level works like a pointer. In other words, in the source code the type was `int`, but it is really + * more like an `int*`. We see this, for example, in `x = y;`, where `x` is a glvalue. + * + * `Call` is the opcode of the instruction. Common opcodes include: + * + * * Arithmetic operations: `Add`, `Sub`, `Mul`, etc. + * * Memory access operations: `Load`, `Store`. + * * Function calls: `Call`. + * * Literals: `Constant`. + * * Variable addresses: `VariableAddress`. + * * Function entry points: `EnterFunction`. + * * Return form a function: `Return`, `ReturnVoid`. Note that the value being returned is set separately by a + * `Store` to a special `#return` variable. + * * Stack unwinding for C++ function that throw and where the exception escapes the function: `Unwind`. + * * Common exit point for `Unwind` and `Return`: `ExitFunction`. + * * SSA-related opcodes: `Phi`, `Chi`. + * + * `[~String]` denotes additional information. 
The information might be present earlier in the IR, as is the case + * for `Call`, where it is the name of the called function. This is also the case for `Load` and `Store`, where it + * is the name of the variable that is loaded or stored (if known). In the case of `Constant`, `FieldAddress`, and + * `VariableAddress`, the information between brackets does not occur earlier. + * + * `func:r2281_18` and `this:r2281_17` are the operands of the instruction. The `func:` prefix denotes the operand + * that holds the address of the called function. The `this:` prefix denotes the argument to the special `this` + * parameter of an instance member function. `r2281_18` and `r2281_17` are the unique IDs of the operands. Each of these + * matches the ID of a previously seen result, showing where that value came from. The `r` means that these are + * "register" operands (see below). + * + * Result and operand kinds: + * + * Every result and operand is one of these three kinds: + * + * * `r` "register". These operands are not stored in any particular memory location. We can think of them as + * temporary values created during the evaluation of an expression. A register operand almost always has one + * use, often in the same block as its definition. + * * `m` "memory". These operands represent accesses to a specific memory location. The location could be a + * local variable, a global variable, a field of an object, an element of an array, or any memory that we happen + * to have a pointer to. These only occur as the result of a `Store`, the source operand of a `Load` or on the + * SSA instructions (`Phi`, `Chi`). + * * `v` "void". Really just a register operand, but we mark register operands of type void with this special prefix + * so we know that there is no actual value there. + * + * Branches in the IR: + * + * The IR is divided into basic blocks. At the end of each block, there are one or more edges showing the possible + * control flow successors of the block. 
+ * + * ``` + * # 44| v44_3(void) = ConditionalBranch : r44_2 + * #-----| False -> Block 4 + * #-----| True -> Block 3 + * ``` + * Here we have a block that ends with a conditional branch. The two edges show where the control flows to depending + * on whether the condition is true or false. + * + * SSA instructions: + * + * We use `Phi` instructions in SSA to create a single definition for a variable that might be assigned on multiple + * control flow paths. The `Phi` instruction merges the potential values of that variable from each predecessor edge, + * and the resulting definition is then used wherever that variable is accessed later on. + * + * When dealing with aliased memory, we use the `Chi` instruction to create a single definition for memory that might + * or might not have been updated by a store, depending on the actual address that was written to. For example, take: + * + * ```cpp + * int x = 5; + * int y = 7; + * int* p = condition ? &x : &y; + * p = 6; + * return x; + * ``` + * + * At the point where we store to `*p`, we do not know whether `p` points to `x` or `y`. Thus, we do not know whether + * `return x;` is going to return the value that `x` was originally initialized to (5), or whether it will return 6, + * because it was overwritten by `*p = 6;`. We insert a `Chi` instruction immediately after the store to `*p`: + * + * ``` + * r2(int) = Constant[6] + * r3(int*) = <> + * m4(int) = Store : &r3, r2 // Stores the constant 6 to *p + * m5(unknown) = Chi : total:m1, partial:m4 + * ``` + * The `partial:` operand represents the memory that was just stored. The `total:` operand represents the previous + * contents of all of the memory that `p` might have pointed to (in this case, both `x` and `y`). The result of the + * `Chi` represents the new contents of whatever memory the `total:` operand referred to. 
We usually do not know exactly + * which parts of that memory were overwritten, but it does model that any of that memory could have been modified, so + * that later instructions do not assume that the memory was unchanged. */ private import internal.IRInternal diff --git a/cpp/ql/lib/semmle/code/cpp/ir/implementation/unaliased_ssa/PrintIR.qll b/cpp/ql/lib/semmle/code/cpp/ir/implementation/unaliased_ssa/PrintIR.qll index c4b18d9cb61..5e634a7c322 100644 --- a/cpp/ql/lib/semmle/code/cpp/ir/implementation/unaliased_ssa/PrintIR.qll +++ b/cpp/ql/lib/semmle/code/cpp/ir/implementation/unaliased_ssa/PrintIR.qll @@ -6,6 +6,112 @@ * uses, however, it is better to write a query that imports `PrintIR.qll`, extends * `PrintIRConfiguration`, and overrides `shouldPrintDeclaration()` to select a subset of declarations * to dump. + * + * Anatomy of a printed IR instruction + * + * An instruction: + * + * ``` + * # 2281| v2281_19(void) = Call[~String] : func:r2281_18, this:r2281_17 + * ``` + * + * The prefix `# 2281|` specifies that this instruction was generated by the C++ source code on line 2281. + * Scrolling up in the printed output, one will eventually find the name of the file to which the line + * belongs. + * + * `v2281_19(void)` is the result of the instruction. Here, `v` means this is a void result or operand (so + * there should be no later uses of the result; see below for other possible values). The `2281_19` is a + * unique ID for the result. This is usually just the line number plus a small integer suffix to make it + * unique within the function. The type of the result is `void`. In this case, it is `void`, because + * `~String` returns `void`. The type of the result is usually just the name of the appropriate C++ type, + * but it will sometimes be a type like `glval<int>`, which means the result holds a glvalue, which at the + * IR level works like a pointer. In other words, in the source code the type was `int`, but it is really + * more like an `int*`. 
We see this, for example, in `x = y;`, where `x` is a glvalue. + * + * `Call` is the opcode of the instruction. Common opcodes include: + * + * * Arithmetic operations: `Add`, `Sub`, `Mul`, etc. + * * Memory access operations: `Load`, `Store`. + * * Function calls: `Call`. + * * Literals: `Constant`. + * * Variable addresses: `VariableAddress`. + * * Function entry points: `EnterFunction`. + * * Return form a function: `Return`, `ReturnVoid`. Note that the value being returned is set separately by a + * `Store` to a special `#return` variable. + * * Stack unwinding for C++ function that throw and where the exception escapes the function: `Unwind`. + * * Common exit point for `Unwind` and `Return`: `ExitFunction`. + * * SSA-related opcodes: `Phi`, `Chi`. + * + * `[~String]` denotes additional information. The information might be present earlier in the IR, as is the case + * for `Call`, where it is the name of the called function. This is also the case for `Load` and `Store`, where it + * is the name of the variable that is loaded or stored (if known). In the case of `Constant`, `FieldAddress`, and + * `VariableAddress`, the information between brackets does not occur earlier. + * + * `func:r2281_18` and `this:r2281_17` are the operands of the instruction. The `func:` prefix denotes the operand + * that holds the address of the called function. The `this:` prefix denotes the argument to the special `this` + * parameter of an instance member function. `r2281_18` and `r2281_17` are the unique IDs of the operands. Each of these + * matches the ID of a previously seen result, showing where that value came from. The `r` means that these are + * "register" operands (see below). + * + * Result and operand kinds: + * + * Every result and operand is one of these three kinds: + * + * * `r` "register". These operands are not stored in any particular memory location. We can think of them as + * temporary values created during the evaluation of an expression. 
A register operand almost always has one + * use, often in the same block as its definition. + * * `m` "memory". These operands represent accesses to a specific memory location. The location could be a + * local variable, a global variable, a field of an object, an element of an array, or any memory that we happen + * to have a pointer to. These only occur as the result of a `Store`, the source operand of a `Load` or on the + * SSA instructions (`Phi`, `Chi`). + * * `v` "void". Really just a register operand, but we mark register operands of type void with this special prefix + * so we know that there is no actual value there. + * + * Branches in the IR: + * + * The IR is divided into basic blocks. At the end of each block, there are one or more edges showing the possible + * control flow successors of the block. + * + * ``` + * # 44| v44_3(void) = ConditionalBranch : r44_2 + * #-----| False -> Block 4 + * #-----| True -> Block 3 + * ``` + * Here we have a block that ends with a conditional branch. The two edges show where the control flows to depending + * on whether the condition is true or false. + * + * SSA instructions: + * + * We use `Phi` instructions in SSA to create a single definition for a variable that might be assigned on multiple + * control flow paths. The `Phi` instruction merges the potential values of that variable from each predecessor edge, + * and the resulting definition is then used wherever that variable is accessed later on. + * + * When dealing with aliased memory, we use the `Chi` instruction to create a single definition for memory that might + * or might not have been updated by a store, depending on the actual address that was written to. For example, take: + * + * ```cpp + * int x = 5; + * int y = 7; + * int* p = condition ? &x : &y; + * p = 6; + * return x; + * ``` + * + * At the point where we store to `*p`, we do not know whether `p` points to `x` or `y`. 
Thus, we do not know whether + * `return x;` is going to return the value that `x` was originally initialized to (5), or whether it will return 6, + * because it was overwritten by `*p = 6;`. We insert a `Chi` instruction immediately after the store to `*p`: + * + * ``` + * r2(int) = Constant[6] + * r3(int*) = <> + * m4(int) = Store : &r3, r2 // Stores the constant 6 to *p + * m5(unknown) = Chi : total:m1, partial:m4 + * ``` + * The `partial:` operand represents the memory that was just stored. The `total:` operand represents the previous + * contents of all of the memory that `p` might have pointed to (in this case, both `x` and `y`). The result of the + * `Chi` represents the new contents of whatever memory the `total:` operand referred to. We usually do not know exactly + * which parts of that memory were overwritten, but it does model that any of that memory could have been modified, so + * that later instructions do not assume that the memory was unchanged. */ private import internal.IRInternal From ed266dac5f535e18cb7668a346d47bbae137bc04 Mon Sep 17 00:00:00 2001 From: Jeroen Ketema Date: Mon, 7 Oct 2024 22:42:18 +0200 Subject: [PATCH 2/2] C++: Address review comments --- .../semmle/code/cpp/ir/implementation/aliased_ssa/PrintIR.qll | 4 ++-- cpp/ql/lib/semmle/code/cpp/ir/implementation/raw/PrintIR.qll | 4 ++-- .../code/cpp/ir/implementation/unaliased_ssa/PrintIR.qll | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/cpp/ql/lib/semmle/code/cpp/ir/implementation/aliased_ssa/PrintIR.qll b/cpp/ql/lib/semmle/code/cpp/ir/implementation/aliased_ssa/PrintIR.qll index 5e634a7c322..7fd66ba8441 100644 --- a/cpp/ql/lib/semmle/code/cpp/ir/implementation/aliased_ssa/PrintIR.qll +++ b/cpp/ql/lib/semmle/code/cpp/ir/implementation/aliased_ssa/PrintIR.qll @@ -36,7 +36,7 @@ * * Literals: `Constant`. * * Variable addresses: `VariableAddress`. * * Function entry points: `EnterFunction`. - * * Return form a function: `Return`, `ReturnVoid`. 
Note that the value being returned is set separately by a + * * Return from a function: `Return`, `ReturnVoid`. Note that the value being returned is set separately by a * `Store` to a special `#return` variable. * * Stack unwinding for C++ function that throw and where the exception escapes the function: `Unwind`. * * Common exit point for `Unwind` and `Return`: `ExitFunction`. @@ -93,7 +93,7 @@ * int x = 5; * int y = 7; * int* p = condition ? &x : &y; - * p = 6; + * *p = 6; * return x; * ``` * diff --git a/cpp/ql/lib/semmle/code/cpp/ir/implementation/raw/PrintIR.qll b/cpp/ql/lib/semmle/code/cpp/ir/implementation/raw/PrintIR.qll index 5e634a7c322..7fd66ba8441 100644 --- a/cpp/ql/lib/semmle/code/cpp/ir/implementation/raw/PrintIR.qll +++ b/cpp/ql/lib/semmle/code/cpp/ir/implementation/raw/PrintIR.qll @@ -36,7 +36,7 @@ * * Literals: `Constant`. * * Variable addresses: `VariableAddress`. * * Function entry points: `EnterFunction`. - * * Return form a function: `Return`, `ReturnVoid`. Note that the value being returned is set separately by a + * * Return from a function: `Return`, `ReturnVoid`. Note that the value being returned is set separately by a * `Store` to a special `#return` variable. * * Stack unwinding for C++ function that throw and where the exception escapes the function: `Unwind`. * * Common exit point for `Unwind` and `Return`: `ExitFunction`. @@ -93,7 +93,7 @@ * int x = 5; * int y = 7; * int* p = condition ? &x : &y; - * p = 6; + * *p = 6; * return x; * ``` * diff --git a/cpp/ql/lib/semmle/code/cpp/ir/implementation/unaliased_ssa/PrintIR.qll b/cpp/ql/lib/semmle/code/cpp/ir/implementation/unaliased_ssa/PrintIR.qll index 5e634a7c322..7fd66ba8441 100644 --- a/cpp/ql/lib/semmle/code/cpp/ir/implementation/unaliased_ssa/PrintIR.qll +++ b/cpp/ql/lib/semmle/code/cpp/ir/implementation/unaliased_ssa/PrintIR.qll @@ -36,7 +36,7 @@ * * Literals: `Constant`. * * Variable addresses: `VariableAddress`. * * Function entry points: `EnterFunction`. 
- * * Return form a function: `Return`, `ReturnVoid`. Note that the value being returned is set separately by a + * * Return from a function: `Return`, `ReturnVoid`. Note that the value being returned is set separately by a * `Store` to a special `#return` variable. * * Stack unwinding for C++ function that throw and where the exception escapes the function: `Unwind`. * * Common exit point for `Unwind` and `Return`: `ExitFunction`. @@ -93,7 +93,7 @@ * int x = 5; * int y = 7; * int* p = condition ? &x : &y; - * p = 6; + * *p = 6; * return x; * ``` *
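The `Phi` and `Chi` situations described in the documentation added by this series can be reproduced with two small C++ functions. This is only a sketch under stated assumptions: the function names `phi_example` and `chi_example` are hypothetical and not part of the patch, and the exact IR printed for them is not shown here because it may vary. The second function wraps the fragment from the documentation (with the `*p = 6;` correction from patch 2/2) in a function so that `condition` is defined.

```cpp
// Hypothetical example functions; the names are not part of the patch.

// `x` is assigned on two control flow paths, so an SSA-based IR needs a
// single merged definition (a `Phi`) at the join point before the `return`.
int phi_example(bool condition) {
  int x;
  if (condition) {
    x = 5;
  } else {
    x = 7;
  }
  return x;
}

// The store through `p` may or may not overwrite `x`. This is the situation
// the documentation describes as needing a `Chi` right after the store.
int chi_example(bool condition) {
  int x = 5;
  int y = 7;
  int* p = condition ? &x : &y;
  *p = 6;
  return x;  // 6 if `p` pointed to `x`, otherwise still 5
}
```

Calling `chi_example(true)` yields 6 and `chi_example(false)` yields 5, which illustrates why neither value can be assumed at `return x;` without knowing where `p` points.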