diff --git a/README.org b/README.org
index 836b67e..fb8b2ff 100644
--- a/README.org
+++ b/README.org
@@ -51,7 +51,6 @@
   CodeQL’s query language and type system more intuitive. See overview of
   [[https://en.wikipedia.org/wiki/Functional_programming][functional programming]] for related context.
 
-
 * Repository Layout
 ** Core Structure
   - Repository is based on: https://github.com/github/vscode-codeql-starter.git
@@ -69,16 +68,49 @@
 * Possible Reading Orders
 ** Data Flow
+*** Review: SQLite Injection Workshop, Java
+  We begin with a recap of the Java-based injection example, focusing on the
+  vulnerable code in [[./codeql-sqlite-java/AddUser.java][AddUser.java]]. Following that, we examine a fully manual
+  CodeQL query available in [[./codeql-sqlite-java/full-query.ql][full-query.ql]], which was written to explicitly trace
+  tainted data through the program. Next, we explore the out-of-the-box query
+  [[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]] included in the standard CodeQL packs, and conclude with an
+  inspection of the relevant base classes and framework modeling in
+  [[./codeql-sqlite-java/Illustrations.ql][Illustrations.ql]].
+
+  - start with SqlTainted.ql, note that it won't find our injection
+
+  - break / comment the pre-done additions in
+    .github/codeql/extensions/sqlite-db/models/sqlite.model.yml
+
 *** Debugging data flow config (instead of taint flow), Java
   We can illustrate taint-flow debugging in the Java SQL injection sample
   - [[./codeql-sqlite-java/TaintFlowDebugging.ql]]
-  - [[./codeql-sqlite-java/TaintFlowDebugging.md]]
+  - following [[./codeql-sqlite-java/TaintFlowDebugging.md]]
 
 *** TODO Debugging data flow config (instead of taint flow), C
   A corresponding example for C is planned, using a simplified query to trace
   value propagation in
   [[file:~/work-gh/codeql-lab/codeql-dataflow-sql-injection-c/add-user.c]].
   Unlike Java, C may require manual modeling even to visualize basic flows.
 
+  - C detail
+
+    Dataflow node types vs. AST vs. CFG, but more choices for the C versions:
+    after call, pointer.
+  - asDefiningArgument(), asExpr(), asIndirectArgument()
+  - asExpr() in C now may cause the path to fail, even though sink and source
+    are found
+  - getAQlClass() to get precise type
+  - ql/actions/ql/src/Debug/partial.ql
+  - ql/cpp/ql/lib/CHANGELOG.md
+    176:* Deleted the deprecated `explorationLimit` predicate from
+    `DataFlow::Configuration`, use `FlowExploration<explorationLimit>` instead.
+  - codeql-sqlite-java/TaintFlowDebugging.md
+    54:int explorationLimit() { result = 100 }
+    58:module MyPartialFlow = MyFlow::FlowExplorationFwd<explorationLimit/0>;
+
+  - Debugging docs:
+    https://codeql.github.com/docs/writing-codeql-queries/debugging-data-flow-queries-using-partial-flow/#debugging-data-flow-queries-using-partial-flow
+
+
 ** Modeling
   There are two primary approaches to modeling: direct use of CodeQL predicates
   and the models-as-data system. The models-as-data system is implemented in QL
@@ -95,7 +127,34 @@
   flow annotations from documentation or code examples, then generate valid YAML
   model entries automatically.
 
-  As diagram:
+  - *XX* models-as-data is good for simple but large quantity APIs. For anything
+    complicated, use CodeQL
+    - The CodeQL parser is optimized for reading large CodeQL files. E.g., 14,000
+      predicates are no problem.
+    - At this scale, you're generating. The type checking you get from CodeQL is
+      much more extensive than models-as-data. models-as-data is text; CodeQL is a
+      type-checked language.
+
+*** TODO MaD (models as data) resources
+
+  https://codeql.github.com/docs/codeql-language-guides/customizing-library-models-for-cpp/
+  https://docs.github.com/en/code-security/codeql-for-vs-code/using-the-advanced-functionality-of-the-codeql-for-vs-code-extension/using-the-codeql-model-editor#testing-codeql-model-packs-in-vs-code
+  https://docs.github.com/en/code-security/codeql-cli/codeql-cli-manual/database-analyze#--model-packsnamerange
+  examples: https://github.com/github/codeql/blob/main/cpp/ql/lib/ext/Windows.model.yml#L8
+
+  documentation for the specific possible values of MaD columns other than the
+  most generic spec can be found here:
+  https://github.com/github/codeql/blob/main/cpp/ql/lib/semmle/code/cpp/dataflow/ExternalFlow.qll#L35
+  This is covered in more detail in
+  - java workshop [[file:codeql-sqlite-java/README.org::*Supplement CodeQL: Add to models-as-data][Supplement CodeQL: Add to models-as-data]]
+  - c workshop [[file:codeql-dataflow-sql-injection-c/README.org::*supplement codeql: Add to models-as-data][supplement codeql: Add to models-as-data]]
+  - cpp codeql lib [[file:ql/cpp/ql/lib/semmle/code/cpp/dataflow/internal/ExternalFlowExtensions.qll::This module provides extensible predicates for defining MaD models.]]
+  - java codeql lib [[file:ql/java/ql/lib/semmle/code/java/dataflow/internal/ExternalFlowExtensions.qll::This module provides extensible predicates for defining MaD models.]]
+
+  each language has one of these ExternalFlow lib files and each includes more
+  description on what the potential values actually mean
+
+*** Modeling overview as diagram
   #+BEGIN_SRC text
                  +----------------------+
                  |     Modeling in      |
@@ -119,7 +178,7 @@
   +---------v---------+    +-----------v-----------+
   | Java: built-in     |   | Java: Jedis + Console |
   | includes .qll hook |   | GUI modeling examples |
-  +--------------------+   +------------------------+
+  +--------------------+   +-----------------------+
            |                           |
   Manual setup needed for:             v
@@ -142,7 +201,6 @@
   +-------------------------------+
   #+END_SRC
 
-
 *** Review: SQLite Injection Workshop, Java
   We begin with a recap of the Java-based injection example, focusing on the
   vulnerable code in [[./codeql-sqlite-java/AddUser.java][AddUser.java]]. Following that, we examine a fully manual
@@ -152,6 +210,11 @@
   inspection of the relevant base classes and framework modeling in
   [[./codeql-sqlite-java/Illustrations.ql][Illustrations.ql]].
 
+  - start with SqlTainted.ql, note that it won't find our injection
+
+  - break / comment the pre-done additions in
+    .github/codeql/extensions/sqlite-db/models/sqlite.model.yml
+
 *** Customizations via codeql (Java)
   To customize CodeQL for Java, we identify and extend base classes to add
   custom flow sources and sinks. A general explanation of this approach is
@@ -163,6 +226,44 @@
   customization process can be found in
   [[./codeql-dataflow-sql-injection-c/incoming.codeql-customizations-workshop.md][incoming.codeql-customizations-workshop.md]].
 
+  - illustrate what source, sink find using QueryInjectionFlowConfig in
+    SqlInjectionQuery.qll
+    - sink ok
+    - no source
+
+  - find the base class of source, so we know what to extend
+
+  - import gotcha
+    I used
+
+    import semmle.code.java.dataflow.FlowSources as Sources
+
+    class ReadLine extends Sources::RemoteFlowSource {
+
+    Does this work too or is private better?
+
+  - Q: how to run all the CWE* queries against some file?
+
+  - packs at https://github.com/advanced-security/codeql-bundle
+
+  - how to run all the CWE* queries against some file?
+    -- the codeql database analyze command can take several arguments, including a directory or query spec
+    To get the full options, run
+    0:$ codeql database analyze -vvvv -h
+    Usage: codeql database analyze [OPTIONS] -- <database> [<query|dir|suite|pack>...]
+    Analyze a database, producing meaningful results in the context of the source code.
+
+    Run a query suite (or some individual queries) against a CodeQL database, producing results, styled as
+    alerts or paths, in SARIF or another interpreted format.
+
+    This command combines the effect of the codeql database run-queries and codeql database interpret-results
+
+  - How do you install/include the CodeQL bundles with the modified Customizations.qll?
+
+    That part we have not deciphered in detail. The CLI tool at
+    https://github.com/advanced-security/codeql-bundle does this -- but it's a
+    black box
+
 *** Customizations via Model Editor: Jedis Example (Java Redis client)
   The Jedis example is a straightforward case with no unexpected behavior.
   Although the library contains many functions, they follow a simple
@@ -196,6 +297,19 @@
   and predicates -- can be identified by inspecting representative queries like
   [[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]].
 
+  - find existing readline modeling
+    #+BEGIN_SRC text
+    hohn@ghm3 ~/work-gh/codeql-lab
+    0:$ rg -il 'readline' ql/java --type=yaml
+    ql/java/ql/lib/ext/com.google.common.io.model.yml
+    ql/java/ql/lib/ext/org.apache.cxf.helpers.model.yml
+    ql/java/ql/lib/ext/java.io.model.yml
+    ql/java/ql/lib/ext/generated/java.io.model.yml
+    ql/java/ql/lib/ext/generated/kotlinstdlib.model.yml
+    ql/java/ql/lib/ext/generated/jenkins.model.yml
+    ql/java/ql/lib/ext/generated/org.apache.commons.io.model.yml
+    ql/java/ql/lib/ext/experimental/com.google.common.io.model.yml
+    #+END_SRC
 
 *** Review: SQLite Injection Workshop (C)
   This is the C version of the injection workshop, based on
@@ -270,6 +384,16 @@
   in:
   [[./codeql-dataflow-sql-injection-c/README.org]]
 
+  - same workflow as Java: extend RemoteFlowSource, do it in Customizations.qll
+    to affect all queries.
+  - model pack existence has to be explicitly specified
+  - Options to control the model packs to be used
+    #+BEGIN_SRC text
+    --model-packs=<name@range>...
+        A list of CodeQL pack names, each with an optional version range, to be used as model packs to customize the queries that are about to be evaluated.
+    #+END_SRC
+
+
 ** TODO CodeQL Bundling
   This section will provide a detailed walkthrough of the CodeQL bundling process
   using the CLI tool at https://github.com/advanced-security/codeql-bundle. This
@@ -281,6 +405,119 @@
   from source. Notes and scripts will be collected in
   [[file:codeql-bundling/README.org::XX: continue]].
 
+  CodeQL bundle info:
+  - original bundles found at:
+    https://github.com/github/codeql-action/releases/tag/codeql-bundle-v2.22.2
+  - custom bundler found:
+    https://github.com/advanced-security/codeql-bundle?tab=readme-ov-file#codeql-customization-packs
+
+  - use a Customizations.qll pack
+    (https://github.com/advanced-security/codeql-bundle?tab=readme-ov-file#codeql-customization-packs
+    these do get created as a separate pack from the rest of the lib)
+
+  - This part of the custom bundle tool documentation
+    (https://github.com/advanced-security/codeql-bundle/blob/main/codeql_bundle/helpers/bundle.py#L329)
+    explains how the tool reverses dependencies: the built-in libraries are
+    modified to depend on the custom library. The custom bundle must include a
+    file named Customizations (a convention enforced by the bundler), but it can
+    also contain additional libraries with arbitrary names.
+
+  - Generating CodeQL Models for API Endpoints
+
+    To support automatic generation of API endpoint models in a CodeQL workshop
+    (without using the model editor), you can leverage the existing
+    infrastructure used in the =cpp/ql/lib/ext/generated= directory of the CodeQL
+    repo:
+
+    - File Generation ::
+      Individual model files are generated using the following script:
+      https://github.com/github/codeql/blob/main/misc/scripts/models-as-data/generate_mad.py
+
+    - Bulk Generation ::
+      For batch processing, use:
+      https://github.com/github/codeql/blob/main/misc/scripts/models-as-data/bulk_generate_mad.py
+
+      This script requires:
+      - A language-specific YAML config file, for C++:
+        https://github.com/github/codeql/blob/main/cpp/bulk_generation_targets.yml
+      - A DCA run (Data-Collection Analysis) to provide the necessary input data.
+
+    These tools allow you to programmatically produce model files similar to
+    those found in =ql/lib/ext/generated=, making them suitable for automated or
+    instructional use cases.
+
+  - Updated MAD Generator (no more DCA step)
+
+    The script `generate_mad.py` replaces the older DCA-based workflow. It runs a set of
+    language-specific CodeQL queries directly against a database and emits `.model.yml` files.
+
+    - Queries used:
+      - CaptureSummaryModels.ql
+      - CaptureSinkModels.ql
+      - CaptureSourceModels.ql
+      - CaptureNeutralModels.ql
+      - CaptureTypeBasedSummaryModels.ql (optional)
+
+    - These queries are located in:
+      <lang>/ql/src/utils/modelgenerator/
+
+    - Output files are written to:
+      <lang>/ql/lib/ext/generated/<subdir>/*.model.yml
+
+    - Example usage:
+      #+BEGIN_SRC sh
+      python3 generate_mad.py --language cpp /path/to/db --with-sinks --with-sources --with-summaries
+      #+END_SRC
+
+    There is no longer any need for intermediate `.dca.json` files or a "DCA run".
+
+    A compact shell script illustrating the steps is in
+    [[./models-as-data/generate-mad-core]]
+
+    - [ ] A compact shell/csvtk script illustrating the steps is in
+      [[./models-as-data/generate-mad-core.csvtk]]
+      brew install csvtk
+
+    - [ ] A compact shell/[[https://github.com/medialab/xan?tab=readme-ov-file#quick-tour][xan]] script illustrating the steps is in
+      [[./models-as-data/generate-mad-core.xan]]
+      brew install xan
+
+    https://github.com/github/codeql/tree/main/misc/scripts/models-as-data
+
+  - [ ] bundling semantics
+    good
+    - pack a_1
+      - depends b_1
+      - depends b_2
+      - depends java-all
+
+    good
+    - pack a_1
+      - depends b_1
+      - depends b_2
+      - depends java-all
+      - depends my-custom
+
+    cycle, actual current situation. OK for libraries, not packs?
+    Is this import hierarchy
+    - pack a_1
+      - depends b_1
+      - depends b_2
+      - depends java-all
+        - depends my-custom
+          - depends java-all
+
+    turned into?
+    - pack a_1
+      - depends b_1
+      - depends b_2
+      - depends java-all-custom precompiled
+
+    The fundamental distinction: Customizations.qll can *insert under* the stdlib.
+    Other packs are *on top of* the stdlib.
+
+    There is a transitive dependency inserted. See
+    codeql-bundle/codeql_bundle/helpers/bundle.py
 * Tool Setup
   Some scripts are used here, found in [[./bin/]]. To ensure the ones written in
   Python have access to prerequisites, set up a virtual environment via
diff --git a/codeql-dataflow-sql-injection-c/Explore.ql b/codeql-dataflow-sql-injection-c/Explore.ql
new file mode 100644
index 0000000..72c134e
--- /dev/null
+++ b/codeql-dataflow-sql-injection-c/Explore.ql
@@ -0,0 +1,15 @@
+/**
+* @name SQLI Vulnerability
+* @description Using untrusted strings in a sql query allows sql injection attacks.
+* @ kind path-problem
+* @id cpp/sqlivulnerable
+* @problem.severity warning
+*/
+
+import cpp
+// import semmle.code.cpp.dataflow.new.TaintTracking
+
+
+from FunctionCall exec
+where exec.getTarget().getName().matches("%snprintf%")
+select exec, exec.getTarget().getName(), exec.getAnArgument()
diff --git a/codeql-dataflow-sql-injection-c/FlowExploration.ql b/codeql-dataflow-sql-injection-c/FlowExploration.ql
new file mode 100644
index 0000000..0aefc1f
--- /dev/null
+++ b/codeql-dataflow-sql-injection-c/FlowExploration.ql
@@ -0,0 +1,55 @@
+/**
+* @name SQLI Vulnerability
+* @description Using untrusted strings in a sql query allows sql injection attacks.
+* @kind path-problem
+* @id cpp/sqlivulnerable
+* @problem.severity warning
+*/
+
+import cpp
+import semmle.code.cpp.dataflow.new.TaintTracking
+
+module SqliFlowConfig implements DataFlow::ConfigSig {
+
+    predicate isSource(DataFlow::Node source) {
+        // count = read(STDIN_FILENO, buf, BUFSIZE);
+        exists(FunctionCall read |
+            read.getTarget().getName() = "read" and
+            (
+                read.getArgument(1) = source.asDefiningArgument()
+                or
+                read.getArgument(1) = source.asExpr()
+            )
+        )
+    }
+
+    predicate isBarrier(DataFlow::Node sanitizer) { none() }
+
+    predicate isSink(DataFlow::Node sink) {
+        // rc = sqlite3_exec(db, query, NULL, 0, &zErrMsg);
+        exists(FunctionCall exec |
+            exec.getTarget().getName() = "sqlite3_exec" and
+            exec.getArgument(1) = sink.asIndirectArgument()
+        )
+    }
+}
+
+int explorationLimit() { result = 100 }
+
+// We break the flow chain by switching from TaintFlow to DataFlow
+module MyFlow = DataFlow::Global<SqliFlowConfig>;
+
+module MyPartialFlow = MyFlow::FlowExplorationFwd<explorationLimit/0>;
+
+import MyPartialFlow::PartialPathGraph
+
+from MyPartialFlow::PartialPathNode start, MyPartialFlow::PartialPathNode end
+where MyPartialFlow::partialFlow(start, end, _)
+select end, start, end, "Sql injection from $@", start, "here"
+
+// note: using the pathgraph gives a more readable output, in the form
+// 'from here' 'to there'
+
+// This query goes up to add-user.c:80:73.
+// This indicates that the flow is not crossing the snprintf, so this is where
+// further exploration is needed. See Explore.ql
\ No newline at end of file
diff --git a/codeql-dataflow-sql-injection-c/qlpack.yml b/codeql-dataflow-sql-injection-c/qlpack.yml
index 5a30be8..f5d0c89 100644
--- a/codeql-dataflow-sql-injection-c/qlpack.yml
+++ b/codeql-dataflow-sql-injection-c/qlpack.yml
@@ -1,4 +1,4 @@
-name: codeql-workshop/cpp-sql-injection
+name: codeql-workshop/cpp-sql-injection-c
 version: 0.0.1
 dependencies:
   codeql/cpp-all: "*"
diff --git a/codeql-sqlite-java/AddCustomization.ql b/codeql-sqlite-java/AddCustomization.ql
new file mode 100644
index 0000000..db29d7a
--- /dev/null
+++ b/codeql-sqlite-java/AddCustomization.ql
@@ -0,0 +1,30 @@
+import java
+
+// // Find the source
+// class ReadLine extends MethodCall {
+//     ReadLine() {
+//         exists(MethodCall g |
+//             g.getMethod().hasQualifiedName("java.io", "Console", "readLine") and
+//             this = g
+//         )
+//     }
+// }
+// from ReadLine rl
+// select rl
+
+private import semmle.code.java.dataflow.FlowSources
+
+// Find the source
+class ReadLine extends RemoteFlowSource {
+    ReadLine() {
+        exists(MethodCall g |
+            g.getMethod().hasQualifiedName("java.io", "Console", "readLine") and
+            this.asExpr() = g
+        )
+    }
+    override string getSourceType() { result = "readline input parameter" }
+
+}
+from ReadLine rl
+select rl
+
diff --git a/models-as-data/generate-mad-core b/models-as-data/generate-mad-core
new file mode 100644
index 0000000..ff4ee66
--- /dev/null
+++ b/models-as-data/generate-mad-core
@@ -0,0 +1,78 @@
+#!/bin/bash
+# generate_mad_core.sh
+# Minimal MAD generator for a given CodeQL database and language
+
+set -euo pipefail
+
+# --- Config ---
+DB="$1"                    # Path to CodeQL database
+LANG="$2"                  # Language, e.g., cpp, java
+OUT_DIR="$3"               # Output directory, relative to repo root
+CODEQL="$(which codeql)"   # CodeQL CLI
+REPO_ROOT="$(git rev-parse --show-toplevel)"
+
+QUERY_DIR="$REPO_ROOT/$LANG/ql/src/utils/modelgenerator"
+TMP_DIR="$(mktemp -d)"
+BQRS_FILE="$TMP_DIR/out.bqrs"
+
+# Map query name to the extensible predicate its rows are added to
+declare -A QUERIES=(
+  ["CaptureSinkModels.ql"]="sinkModel"
+  ["CaptureSourceModels.ql"]="sourceModel"
+  ["CaptureSummaryModels.ql"]="summaryModel"
+  ["CaptureNeutralModels.ql"]="neutralModel"
+)
+
+# Minimal YAML output template
+write_yaml() {
+  local ns="$1"
+  local pred="$2"
+  local body="$3"
+  local sanitized="${ns//[\/:]/-}"
+  mkdir -p "$REPO_ROOT/$LANG/ql/lib/ext/generated/$OUT_DIR"
+  cat <<EOF > "$REPO_ROOT/$LANG/ql/lib/ext/generated/$OUT_DIR/${sanitized}.model.yml"
+# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
+extensions:
+  - addsTo:
+      pack: codeql/${LANG}-all
+      extensible: $pred
+    data:
+$body
+EOF
+  echo "Wrote: $REPO_ROOT/$LANG/ql/lib/ext/generated/$OUT_DIR/${sanitized}.model.yml"
+}
+
+# Run queries and convert output to addsTo rows
+for query in "${!QUERIES[@]}"; do
+  echo "Running $query..."
+  "$CODEQL" query run \
+    "$QUERY_DIR/$query" \
+    --database "$DB" \
+    --output "$BQRS_FILE"
+
+  # Extract result rows as text (CSV-like)
+  RAW_ROWS=$("$CODEQL" bqrs decode --format=csv --output=- "$BQRS_FILE" | tail -n +2)
+
+  # Group by namespace, format for YAML
+  declare -A ROWS=()
+  while IFS= read -r line; do
+    IFS=';' read -ra FIELDS <<< "$line"
+    ns="${FIELDS[0]}"
+    quoted=()
+    for f in "${FIELDS[@]}"; do
+      if [[ "$f" != "true" && "$f" != "false" ]]; then
+        quoted+=("\"$f\"")
+      else
+        cap="${f^}"  # capitalize
+        quoted+=("$cap")
+      fi
+    done
+    ROWS["$ns"]+=$'\n'"      - [$(IFS=,; echo "${quoted[*]}")]"  # comma-join the fields
+  done <<< "$RAW_ROWS"
+
+  for ns in "${!ROWS[@]}"; do
+    write_yaml "$ns" "${QUERIES[$query]}" "${ROWS[$ns]}"
+  done
+done
+
+rm -rf "$TMP_DIR"
diff --git a/models-as-data/generate-mad-core.csvtk b/models-as-data/generate-mad-core.csvtk
new file mode 100644
index 0000000..771430a
--- /dev/null
+++ b/models-as-data/generate-mad-core.csvtk
@@ -0,0 +1,82 @@
+#!/bin/bash
+# generate_mad_csvtk.sh: full csvtk-based MAD generator
+
+set -euo pipefail
+
+DB="$1"       # Path to CodeQL DB
+LANG="$2"     # e.g. cpp
+OUTDIR="$3"   # e.g. mylib
+CODEQL="$(which codeql)"
+REPO_ROOT="$(git rev-parse --show-toplevel)"
+QUERY_DIR="$REPO_ROOT/$LANG/ql/src/utils/modelgenerator"
+TARGET_ROOT="$REPO_ROOT/$LANG/ql/lib/ext/generated/$OUTDIR"
+TMP_DIR="$(mktemp -d)"
+
+mkdir -p "$TARGET_ROOT"
+
+declare -A QUERIES=(
+  ["CaptureSinkModels.ql"]="sinkModel"
+  ["CaptureSourceModels.ql"]="sourceModel"
+  ["CaptureSummaryModels.ql"]="summaryModel"
+  ["CaptureNeutralModels.ql"]="neutralModel"
+)
+
+# Quoting + capitalization logic as an inline function for csvtk
+quote_expr='
+function q(x) {
+  return (x == "true" || x == "false") ? toupper(substr(x, 1, 1)) substr(x, 2) : "\"" x "\""
+}
+[q($1), q($2), q($3), q($4)]
+'
+
+for query in "${!QUERIES[@]}"; do
+  echo "Running $query..."
+  BQRS_FILE="$TMP_DIR/out.bqrs"
+  CSV_FILE="$TMP_DIR/out.csv"
+
+  "$CODEQL" query run "$QUERY_DIR/$query" \
+    --database "$DB" \
+    --output "$BQRS_FILE"
+
+  "$CODEQL" bqrs decode --format=csv --output="$CSV_FILE" "$BQRS_FILE"
+  tail -n +2 "$CSV_FILE" > "$TMP_DIR/noheader.csv"
+
+  # Add header for csvtk compatibility
+  head -n1 "$CSV_FILE" | grep -q ',' || echo "namespace;f1;f2;f3;f4" > "$TMP_DIR/head.csv"
+  cat "$TMP_DIR/head.csv" "$TMP_DIR/noheader.csv" > "$TMP_DIR/input.csv"
+
+  # Mutate quoted fields
+  csvtk mutate -t -n quoted1,quoted2,quoted3,quoted4 -e '
+    if ($f1=="true" || $f1=="false") ucfirst($f1); else "\"" + $f1 + "\""
+  ' -e '
+    if ($f2=="true" || $f2=="false") ucfirst($f2); else "\"" + $f2 + "\""
+  ' -e '
+    if ($f3=="true" || $f3=="false") ucfirst($f3); else "\"" + $f3 + "\""
+  ' -e '
+    if ($f4=="true" || $f4=="false") ucfirst($f4); else "\"" + $f4 + "\""
+  ' "$TMP_DIR/input.csv" > "$TMP_DIR/quoted.csv"
+
+  # Group by namespace
+  csvtk cut -t -f namespace "$TMP_DIR/quoted.csv" | tail -n +2 | sort -u | while read -r ns; do
+    safe_ns=$(echo "$ns" | tr '/:' '--')
+    out="$TARGET_ROOT/$safe_ns.model.yml"
+
+    echo "# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT." > "$out"
+    echo "extensions:" >> "$out"
+    echo "  - addsTo:" >> "$out"
+    echo "      pack: codeql/$LANG-all" >> "$out"
+    echo "      extensible: ${QUERIES[$query]}" >> "$out"
+    echo "    data:" >> "$out"
+
+    # Extract all quoted fields for this namespace
+    csvtk grep -t -f namespace -p "$ns" "$TMP_DIR/quoted.csv" |
+      csvtk cut -t -f quoted1,quoted2,quoted3,quoted4 |
+      tail -n +2 |  # remove header
+      sed 's/^/      - [/' | sed 's/$/]/' >> "$out"
+
+    echo "Wrote $out"
+  done
+done
+
+rm -rf "$TMP_DIR"
+
diff --git a/models-as-data/generate-mad-core.xan b/models-as-data/generate-mad-core.xan
new file mode 100644
index 0000000..84d3447
--- /dev/null
+++ b/models-as-data/generate-mad-core.xan
@@ -0,0 +1,66 @@
+#!/bin/bash
+# Model generator using `xan` for CSV processing
+
+set -euo pipefail
+
+DB="$1"       # CodeQL database path
+LANG="$2"     # Language (e.g. cpp)
+OUTDIR="$3"   # Output directory name under lib/ext/generated/
+CODEQL="$(which codeql)"
+REPO_ROOT="$(git rev-parse --show-toplevel)"
+QUERY_DIR="$REPO_ROOT/$LANG/ql/src/utils/modelgenerator"
+TARGET_ROOT="$REPO_ROOT/$LANG/ql/lib/ext/generated/$OUTDIR"
+TMP_DIR="$(mktemp -d)"
+
+mkdir -p "$TARGET_ROOT"
+
+declare -A QUERIES=(
+  ["CaptureSinkModels.ql"]="sinkModel"
+  ["CaptureSourceModels.ql"]="sourceModel"
+  ["CaptureSummaryModels.ql"]="summaryModel"
+  ["CaptureNeutralModels.ql"]="neutralModel"
+)
+
+for query in "${!QUERIES[@]}"; do
+  echo "Running $query..."
+  BQRS_FILE="$TMP_DIR/out.bqrs"
+  CSV_FILE="$TMP_DIR/result.csv"
+
+  "$CODEQL" query run "$QUERY_DIR/$query" \
+    --database "$DB" \
+    --output "$BQRS_FILE"
+
+  "$CODEQL" bqrs decode --format=csv --output="$CSV_FILE" "$BQRS_FILE"
+
+  echo "Grouping rows by namespace..."
+
+  xan map '
+    let q = |x| -> if (x == "true" || x == "false") { upper(x) } else { fmt("\"{}\"", x) };
+    fmt("      - [{}]", join(", ", [q(f1), q(f2), q(f3), q(f4)]))
+  ' row "$CSV_FILE" \
+    | xan groupby namespace 'collect(row) as rows' \
+    | xan explode rows \
+    | xan select namespace,row \
+    | xan groupby namespace 'collect(row) as block' \
+    | xan explode block \
+    | while IFS=',' read -r ns row; do
+        safe_ns=$(echo "$ns" | tr '/:' '--' | tr -d '"')
+        out="$TARGET_ROOT/$safe_ns.model.yml"
+        if [[ ! -f "$out" ]]; then
+          cat <<EOF > "$out"
+# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
+extensions:
+  - addsTo:
+      pack: codeql/$LANG-all
+      extensible: ${QUERIES[$query]}
+    data:
+EOF
+        fi
+        echo "$row" >> "$out"
+      done
+
+  echo "Wrote models to: $TARGET_ROOT/"
+done
+
+rm -rf "$TMP_DIR"
+
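The row-formatting step that all three generator scripts share (split a semicolon-separated result row, quote each field, leave booleans as bare `True`/`False`, and emit one YAML list row) can be looked at in isolation. The following is a minimal sketch and not part of the patch: the helper name `format_row` and the sample row are hypothetical, and it assumes fields never contain `;` or `"`.

```shell
# format_row: hypothetical helper, shown only to illustrate the quoting rule.
# Runs in a subshell (parenthesised body) so the IFS and globbing changes
# do not leak into the caller.
format_row() (
  set -f            # keep fields such as "*" literal
  IFS=';'
  set -- $1         # split the row on ';' (empty middle fields are preserved)
  joined=""
  for f in "$@"; do
    case "$f" in
      true)  f=True ;;          # booleans stay unquoted, capitalised
      false) f=False ;;
      *)     f="\"$f\"" ;;      # everything else becomes a quoted string
    esac
    joined="${joined:+$joined, }$f"   # comma-join, no leading separator
  done
  printf '      - [%s]\n' "$joined"
)

# Made-up sample row, for illustration only:
format_row 'java.io;Console;true;readLine;();;ReturnValue;remote;manual'
# prints:       - ["java.io", "Console", True, "readLine", "()", "", "ReturnValue", "remote", "manual"]
```

Joining with an explicit ", " matters here: expanding a bash array with `[*]` under the default `IFS` separates the fields with spaces only, which is not a valid YAML flow sequence.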