The extra flow step

2025-12-16 02:03:05 +01:00 · 2020-07-22 11:52:29 -07:00
parent 12a90e9a54
commit 38bc479725
1 changed files with 102 additions and 77 deletions
--- a/codeql-dataflow-sql-injection.md
+++ b/codeql-dataflow-sql-injection.md
@@ -14,9 +14,8 @@ md_toc github <  codeql-dataflow-sql-injection.md
  - [Data flow overview and illustration](#data-flow-overview-and-illustration)
  - [Tutorial: recap, sources and sinks](#tutorial-recap-sources-and-sinks)
    - [Codeql recap](#codeql-recap)
-    - [Call to SQL query execution (the data sink)](#call-to-sql-query-execution-the-data-sink)
-    - [Non-constant query strings and untrusted data (the data source)](#non-constant-query-strings-and-untrusted-data-the-data-source)
-  - [Data flow overview](#data-flow-overview)
+    - [The Data Sink](#the-data-sink)
+  - [The data flow framework](#the-data-flow-framework)
    - [Taint flow configuration](#taint-flow-configuration)
    - [Path problem setup](#path-problem-setup)
    - [Path problem query format](#path-problem-query-format)
@@ -301,114 +300,140 @@ select 1
 We'll assume the `import cpp` is in the header of our query and not rewrite it
 every time.

-Now let's find the function `executeStatement`.  In CodeQL, this uses `Function`
+XX: 
+
+### The Data Sink
+Now let's find the function `sqlite3_exec`.  In CodeQL, this uses `Function`
 and a `getName()` attribute.

 ```ql
 from Function f
-where f.getName() = "executeStatement" 
+where f.getName() = "sqlite3_exec" 
 select f
 ```

 This should find one result, 
 ```ql
-void executeStatement(const bsl::string &sQuery);
+SQLITE_API int sqlite3_exec(
+  sqlite3*,                                  /* An open database */
+  const char *sql,                           /* SQL to be evaluated */
+  int (*callback)(void*,int,char**,char**),  /* Callback function */
+  void *,                                    /* 1st argument to callback */
+  char **errmsg                              /* Error msg written here */
+);
 ```
-on line 5 of `simple.cc`
-
-### Call to SQL query execution (the data sink)
-
-The brings us closer to our sql statement execution.  This part of our problem is
-to identify the call
-```c
-    executeStatement(sQuery);
-```
-and choosing the argument to `executeStatement()` as sink.  Let's start with the
-call. 
-
-We really need the function *call*, not the function *definition*.  Also, a call
-has no name; it does have a *target* (the function), which has a name as we saw
-above. 
-
-To combine these, use the auto-completion.  After typing `Function<tab>`, we see a
-list including `FunctionCall`; we can start with
+in the header `sqlite3.h`.

+Next, let's find the calls to `sqlite3_exec` using the `FunctionCall` type
 ```ql
-from FunctionCall fc
-where fc.<tab>
+from FunctionCall exec
+where exec.getTarget().getName() = "sqlite3_exec" 
+select exec
 ```

-Now, we are looking for the call's *target*; completion shows `getTarget()`,
-and we can finish that to 
+This finds our call in `add-user.c`, 

-```ql
-from FunctionCall fc
-where fc.getTarget().getName() = "executeStatement" 
-select fc
-```
+    rc = sqlite3_exec(db, query, NULL, 0, &zErrMsg);

-Now that we have the function call, let's get the argument to it.  We don't care
-about the exact type of the argument, so an `Expr` is a good choice.  Arguments
-are part of the function *call* and using completion finds `getArgument` and some
-others.  Our query now becomes
+We are interested in the `query` argument, which we can get using `.getArgument`:
 ```ql
-from FunctionCall fc, Expr sink
+from FunctionCall exec, Expr query
 where
-    fc.getTarget().getName() = "executeStatement" and
-    fc.getArgument(0) = sink
-select fc, sink
+    exec.getTarget().getName() = "sqlite3_exec" and
+    query = exec.getArgument(1)
+select exec, query
 ```
-and it finds the call and the argument:

-    1	call to executeStatement	sQuery
+### The Data Source

-For reuse, we can turn this into a predicate.  Contents of `from` become arguments
-to the predicate, the `where` becomes the body, the `select` is dropped:
+The external data enters through the call
+
+    count = read(STDIN_FILENO, buf, BUFSIZE);
+
+We thus want the `buf` argument to the call of the `read` function.  Together, this is 

 ```ql
-predicate sqliSink(FunctionCall fc, Expr sink) {
-    fc.getTarget().getName() = "executeStatement" and
-    fc.getArgument(0) = sink
-}
-
-from FunctionCall fc, Expr sink
-where sqliSource(fc, sink)
-select fc, sink
+from FunctionCall read, Expr buf
+where
+    read.getTarget().getName() = "read" and
+    buf = read.getArgument(1)
+select read, buf
 ```

-This successfully identifies our (potentially) unsafe use of a string in a SQL
-query.
+### The extra flow step
+The codeql data flow library traverses *visible* source code fairly well, but flow
+through opaque functions requires additional support.  Functions for which only a
+headers is available are opaque, and we have one of these here: the call to
+`snprintf`.  Once we get this call, there are *two* nodes to identify: the inflow
+and outflow.

-### Non-constant query strings and untrusted data (the data source)
-If we consider what we mean by "non-constant" strings and untrusted data, what we
-really care about is whether an attacker can provide (part of) the query string.
+Let's start with `snprintf`.  If we try
+```ql
+from FunctionCall printf
+where printf.getTarget().getName() = "snprintf"
+select printf
+```
+we get zero results.  This is puzzling; if we visit the `add-user.c` source and
+follow the definition of `snprintf`, it turns out to be a macro on MacOS:
+```c
+#undef snprintf
+#define snprintf(str, len, ...) \
+  __builtin___snprintf_chk (str, len, 0, __darwin_obsz(str), __VA_ARGS__)
+#endif
+```

-Thus, before we get into the the full dataflow details, let's identify the sources
-of problematic data.  This part of our problem is to identify (at least) `argv`,
-`iUUID`, and `sObjectName` as *sources*.  For this example, all variables
-represent values that would ordinarily come from external sources and are thus
-untrusted.  This simplifies our query; we can simply identify *uses* of variables as
-taint sources.
+Fortunately, the underlying function `__builtin___snprintf_chk` has `snprintf` in
+the name.  So instead of working with C macros from codeql, we generalize our
+query using a name pattern with `.matches`:
+```ql
+from FunctionCall printf
+where printf.getTarget().getName().matches("%snprintf%")
+select printf
+```

-A `Variable` refers to a definition; with completion we find `VariableAccess`,
-which is what we want.  Further, we don't care about variables in libraries, only
-in the main program.  Put together, this query lists 12 results, including
-destructor calls for some of the variables:
+This identifies our call
+
+    snprintf(query, bufsize, "INSERT INTO users VALUES (%d, '%s')", id, info);
+    
+and we need the inflow and outflow nodes next.  `query` is the outflow, `info` is
+the inflow.
+
+In the `snprintf` macro call, those have indices 0 and 4.  In the underlying function
+`__builtin___snprintf_chk`, the indices are 0 and 6.  Using the latter:
+```ql
+from FunctionCall printf, Expr out, Expr into
+where
+    printf.getTarget().getName().matches("%snprintf%") and
+    printf.getArgument(0) = out and
+    printf.getArgument(6) = into
+select printf, out, into
+```
+
+This correctly identifies the call and the extra flow arguments.
+
+<!-- !-- Practice exercise: !-- Very specific: shifted index for macro.
+ Generalize this to consider !-- all trailing arguments as sources.  -->
+
+
+Practice exercise: If you are using linux or windows, generalize this query for
+the `snprintf` arguments found there.  One way to do this is using `or`:

 ```ql
-from  VariableAccess va
-where va.getLocation().getFile().getShortName() = "simple"
-select va, va.getTarget() as definition
+printf.getTarget().getName().matches("%snprintf%") and
+(
+  // mac version
+or
+ // linux version
+or
+ // windows version
+)
 ```

-Note that our query structure will extend to more complex cases lateron; only the
-source identification will need updating.
-
-## Data flow overview


-
-To use global data flow and taint tracking we need to 
+## The data flow framework
+The previous queries identify our source and sink.  To use global data flow and
+taint tracking we need some additional codeql setup:
 - a taint flow configuration 
 - use path queries
 - add extra taint steps for taint flow