The extra flow step

2025-12-16 10:13:04 +01:00 · 2020-07-22 11:52:29 -07:00
parent 12a90e9a54
commit 38bc479725
1 changed files with 102 additions and 77 deletions
--- a/codeql-dataflow-sql-injection.md
+++ b/codeql-dataflow-sql-injection.md
@@ -14,9 +14,8 @@ md_toc github <  codeql-dataflow-sql-injection.md
  - [Data flow overview and illustration](#data-flow-overview-and-illustration)
  - [Tutorial: recap, sources and sinks](#tutorial-recap-sources-and-sinks)
    - [Codeql recap](#codeql-recap)
-    - [Call to SQL query execution (the data sink)](#call-to-sql-query-execution-the-data-sink)
+    - [The Data Sink](#the-data-sink)
-    - [Non-constant query strings and untrusted data (the data source)](#non-constant-query-strings-and-untrusted-data-the-data-source)
+  - [The data flow framework](#the-data-flow-framework)
  - [Data flow overview](#data-flow-overview)
    - [Taint flow configuration](#taint-flow-configuration)
    - [Path problem setup](#path-problem-setup)
    - [Path problem query format](#path-problem-query-format)
@@ -301,114 +300,140 @@ select 1
 We'll assume the `import cpp` is in the header of our query and not rewrite it
 every time.
-Now let's find the function `executeStatement`.  In CodeQL, this uses `Function`
+XX: 
 ### The Data Sink
 Now let's find the function `sqlite3_exec`.  In CodeQL, this uses `Function`
 and a `getName()` attribute.
 ```ql
 from Function f
-where f.getName() = "executeStatement" 
+where f.getName() = "sqlite3_exec" 
 select f
 ```
 This should find one result, 
 ```ql
-void executeStatement(const bsl::string &sQuery);
+SQLITE_API int sqlite3_exec(
  sqlite3*,                                  /* An open database */
  const char *sql,                           /* SQL to be evaluated */
  int (*callback)(void*,int,char**,char**),  /* Callback function */
  void *,                                    /* 1st argument to callback */
  char **errmsg                              /* Error msg written here */
 );
 ```
-on line 5 of `simple.cc`
+in the header `sqlite3.h`.
 ### Call to SQL query execution (the data sink)
 This brings us closer to our SQL statement execution.  This part of our problem is
 to identify the call
 ```c
    executeStatement(sQuery);
 ```
 and choose the argument to `executeStatement()` as the sink.  Let's start with the
 call. 
 We really need the function *call*, not the function *definition*.  Also, a call
 has no name; it does have a *target* (the function), which has a name as we saw
 above. 
 To combine these, use the auto-completion.  After typing `Function<tab>`, we see a
 list including `FunctionCall`; we can start with
 Next, let's find the calls to `sqlite3_exec` using the `FunctionCall` type:
 ```ql
-from FunctionCall fc
+from FunctionCall exec
-where fc.<tab>
+where exec.getTarget().getName() = "sqlite3_exec" 
 select exec
 ```
-Now, we are looking for the call's *target*; completion shows `getTarget()`,
+This finds our call in `add-user.c`, 
 and we can finish that to 
-```ql
+    rc = sqlite3_exec(db, query, NULL, 0, &zErrMsg);
 from FunctionCall fc
 where fc.getTarget().getName() = "executeStatement" 
 select fc
 ```
-Now that we have the function call, let's get the argument to it.  We don't care
+We are interested in the `query` argument, which we can get using `.getArgument`:
 about the exact type of the argument, so an `Expr` is a good choice.  Arguments
 are part of the function *call* and using completion finds `getArgument` and some
 others.  Our query now becomes
 ```ql
-from FunctionCall fc, Expr sink
+from FunctionCall exec, Expr query
 where
-    fc.getTarget().getName() = "executeStatement" and
+    exec.getTarget().getName() = "sqlite3_exec" and
-    fc.getArgument(0) = sink
+    query = exec.getArgument(1)
-select fc, sink
+select exec, query
 ```
 and it finds the call and the argument:
-    1	call to executeStatement	sQuery
+### The Data Source
-For reuse, we can turn this into a predicate.  Contents of `from` become arguments
+The external data enters through the call
-to the predicate, the `where` becomes the body, the `select` is dropped:
+
    count = read(STDIN_FILENO, buf, BUFSIZE);
 We thus want the `buf` argument to the call of the `read` function.  Together, this is 
 ```ql
-predicate sqliSink(FunctionCall fc, Expr sink) {
+from FunctionCall read, Expr buf
-    fc.getTarget().getName() = "executeStatement" and
+where
-    fc.getArgument(0) = sink
+    read.getTarget().getName() = "read" and
-}
+    buf = read.getArgument(1)
-
+select read, buf
 from FunctionCall fc, Expr sink
 where sqliSource(fc, sink)
 select fc, sink
 ```
-This successfully identifies our (potentially) unsafe use of a string in a SQL
+### The extra flow step
-query.
+The CodeQL data flow library traverses *visible* source code fairly well, but flow
 through opaque functions requires additional support.  Functions for which only a
 header is available are opaque, and we have one of these here: the call to
 `snprintf`.  Once we get this call, there are *two* nodes to identify: the inflow
 and outflow.
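
To make the inflow and outflow concrete, here is a minimal C sketch of what the `snprintf` call does with its arguments. It reuses the format string from `add-user.c`; the helper names `build_query` and `query_contains` are hypothetical and not part of the tutorial's source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper mirroring the snprintf call in add-user.c:
   `info` is the inflow, `query` is the outflow. */
int build_query(char *query, size_t bufsize, int id, const char *info)
{
    assert(query != NULL && info != NULL && bufsize > 0);
    return snprintf(query, bufsize,
                    "INSERT INTO users VALUES (%d, '%s')", id, info);
}

/* Returns 1 if `text` appears somewhere in `query`, 0 otherwise. */
int query_contains(const char *query, const char *text)
{
    return strstr(query, text) != NULL;
}
```

Whatever is passed as `info` ends up verbatim inside `query`.  An analysis that cannot see the body of `snprintf` has no way to know this, which is why the step has to be described explicitly.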
-### Non-constant query strings and untrusted data (the data source)
+Let's start with `snprintf`.  If we try
-If we consider what we mean by "non-constant" strings and untrusted data, what we
+```ql
-really care about is whether an attacker can provide (part of) the query string.
+from FunctionCall printf
 where printf.getTarget().getName() = "snprintf"
 select printf
 ```
 we get zero results.  This is puzzling; if we visit the `add-user.c` source and
 follow the definition of `snprintf`, it turns out to be a macro on macOS:
 ```c
 #undef snprintf
 #define snprintf(str, len, ...) \
  __builtin___snprintf_chk (str, len, 0, __darwin_obsz(str), __VA_ARGS__)
 #endif
 ```
-Thus, before we get into the full dataflow details, let's identify the sources
+Fortunately, the underlying function `__builtin___snprintf_chk` has `snprintf` in
-of problematic data.  This part of our problem is to identify (at least) `argv`,
+the name.  So instead of working with C macros from CodeQL, we generalize our
-`iUUID`, and `sObjectName` as *sources*.  For this example, all variables
+query using a name pattern with `.matches`:
-represent values that would ordinarily come from external sources and are thus
+```ql
-untrusted.  This simplifies our query; we can simply identify *uses* of variables as
+from FunctionCall printf
-taint sources.
+where printf.getTarget().getName().matches("%snprintf%")
 select printf
 ```
-A `Variable` refers to a definition; with completion we find `VariableAccess`,
+This identifies our call
-which is what we want.  Further, we don't care about variables in libraries, only
+
-in the main program.  Put together, this query lists 12 results, including
+    snprintf(query, bufsize, "INSERT INTO users VALUES (%d, '%s')", id, info);
-destructor calls for some of the variables:
+    
 and we need the inflow and outflow nodes next.  `query` is the outflow, `info` is
 the inflow.
 In the `snprintf` macro call, those have indices 0 and 4.  In the underlying function
 `__builtin___snprintf_chk`, the indices are 0 and 6.  Using the latter:
 ```ql
 from FunctionCall printf, Expr out, Expr into
 where
    printf.getTarget().getName().matches("%snprintf%") and
    printf.getArgument(0) = out and
    printf.getArgument(6) = into
 select printf, out, into
 ```
 This correctly identifies the call and the extra flow arguments.
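
The index shift from 4 to 6 comes from the two extra arguments the macro inserts after `len`.  The C sketch below reproduces that layout with a hypothetical stand-in, `demo_snprintf_chk`, and a hypothetical macro, `demo_snprintf`; the real builtin and `__darwin_obsz` are not used, and the object-size argument is replaced by a placeholder constant.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for __builtin___snprintf_chk: same argument
   layout, but the flag and object-size arguments are ignored. */
static int demo_snprintf_chk(char *str, size_t len, int flag, size_t objsize,
                             const char *fmt, ...)
{
    va_list ap;
    int n;

    (void)flag;
    (void)objsize;
    assert(str != NULL && fmt != NULL);
    va_start(ap, fmt);
    n = vsnprintf(str, len, fmt, ap);
    va_end(ap);
    return n;
}

/* Hypothetical macro shaped like the macOS snprintf macro: it inserts
   two arguments (a flag and an object size) after `len`. */
#define demo_snprintf(str, len, ...) \
    demo_snprintf_chk(str, len, 0, (size_t)-1, __VA_ARGS__)

int format_user(char *query, size_t bufsize, int id, const char *info)
{
    /* Macro call:      query=0, bufsize=1, format=2, id=3, info=4
       Underlying call: query=0, bufsize=1, flag=2, objsize=3,
                        format=4, id=5, info=6 */
    return demo_snprintf(query, bufsize,
                         "INSERT INTO users VALUES (%d, '%s')", id, info);
}
```

Since the database records the call after macro expansion, the query has to use the indices of the underlying call, 0 and 6.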
 <!-- !-- Practice exercise: !-- Very specific: shifted index for macro.
 Generalize this to consider !-- all trailing arguments as sources.  -->
 Practice exercise: If you are using Linux or Windows, generalize this query for
 the `snprintf` arguments found there.  One way to do this is using `or`:
 ```ql
-from  VariableAccess va
+printf.getTarget().getName().matches("%snprintf%") and
-where va.getLocation().getFile().getShortName() = "simple"
+(
-select va, va.getTarget() as definition
+  // mac version
 or
 // linux version
 or
 // windows version
 )
 ```
 Note that our query structure will extend to more complex cases later on; only the
 source identification will need updating.
 ## Data flow overview
-
+## The data flow framework
-To use global data flow and taint tracking we need to 
+The previous queries identify our source and sink.  To use global data flow and
 taint tracking we need some additional CodeQL setup:
 - a taint flow configuration 
 - use path queries
 - add extra taint steps for taint flow