wip: many revisions

2025-12-16 18:03:08 +01:00 · 2025-08-06 15:56:48 -07:00
parent 07c9d15a76
commit 269be51b58
8 changed files with 569 additions and 6 deletions
--- a/README.org
+++ b/README.org
@@ -51,7 +51,6 @@
    CodeQL’s query language and type system more intuitive.
    See overview of [[https://en.wikipedia.org/wiki/Functional_programming][functional programming]] for related context.

-
 * Repository Layout
 ** Core Structure
   - Repository is based on: https://github.com/github/vscode-codeql-starter.git
@@ -69,16 +68,49 @@
 * Possible Reading Orders

 ** Data Flow 
+*** Review: SQLite Injection Workshop, Java
+    We begin with a recap of the Java-based injection example, focusing on the
+    vulnerable code in [[./codeql-sqlite-java/AddUser.java][AddUser.java]]. Following that, we examine a fully manual
+    CodeQL query available in [[./codeql-sqlite-java/full-query.ql][full-query.ql]], which was written to explicitly trace
+    tainted data through the program. Next, we explore the out-of-the-box query
+    [[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]] included in the standard CodeQL packs, and conclude with an
+    inspection of the relevant base classes and framework modeling in
+    [[./codeql-sqlite-java/Illustrations.ql][Illustrations.ql]].
+
+    - Start with SqlTainted.ql; note that it does not find our injection.
+
+    - Break or comment out the pre-made additions in
+      .github/codeql/extensions/sqlite-db/models/sqlite.model.yml
+
 *** Debugging data flow config (instead of taint flow), Java
    We can illustrate taint-flow debugging in the Java SQL injection sample
    - [[./codeql-sqlite-java/TaintFlowDebugging.ql]]
-    - [[./codeql-sqlite-java/TaintFlowDebugging.md]]
+    - following [[./codeql-sqlite-java/TaintFlowDebugging.md]]

 *** TODO Debugging data flow config (instead of taint flow), C
    A corresponding example for C is planned, using a simplified query to trace
    value propagation in [[file:~/work-gh/codeql-lab/codeql-dataflow-sql-injection-c/add-user.c]].
    Unlike Java, C may require manual modeling even to visualize basic flows.

+    - C details
+      + Data-flow node types vs. AST vs. CFG; the C library offers more node
+        choices (value after a call, pointer indirection):
+        - asDefiningArgument(), asExpr(), asIndirectArgument()
+        - In C, asExpr() may now cause the path to fail, even though source and
+          sink are both found
+    - Use getAQlClass() to get the precise type
+    - ql/actions/ql/src/Debug/partial.ql
+    - ql/cpp/ql/lib/CHANGELOG.md
+      176:* Deleted the deprecated `explorationLimit` predicate from
+      `DataFlow::Configuration`, use `FlowExploration<explorationLimit>` instead.
+    - codeql-sqlite-java/TaintFlowDebugging.md
+      54:int explorationLimit() { result = 100 }
+      58:module MyPartialFlow = MyFlow::FlowExplorationFwd<explorationLimit/0>;
+
+    - Debugging docs:
+      https://codeql.github.com/docs/writing-codeql-queries/debugging-data-flow-queries-using-partial-flow/#debugging-data-flow-queries-using-partial-flow
+
+
 ** Modeling
   There are two primary approaches to modeling: direct use of CodeQL predicates
   and the models-as-data system. The models-as-data system is implemented in QL
@@ -95,7 +127,34 @@
   flow annotations from documentation or code examples, then generate valid YAML
   model entries automatically.

-   As diagram:
+   - *XX* models-as-data works well for APIs that are simple but numerous.  For
+     anything complicated, use CodeQL.
+   - The CodeQL parser is optimized for reading large CodeQL files; e.g., 14,000
+     predicates are no problem.
+   - At this scale, you are generating code rather than writing it by hand.  The
+     type checking you get from CodeQL is much more extensive than what
+     models-as-data offers: models-as-data is text; CodeQL is a type-checked language.
+
+*** TODO MaD (models as data) resources
+
+    https://codeql.github.com/docs/codeql-language-guides/customizing-library-models-for-cpp/
+    https://docs.github.com/en/code-security/codeql-for-vs-code/using-the-advanced-functionality-of-the-codeql-for-vs-code-extension/using-the-codeql-model-editor#testing-codeql-model-packs-in-vs-code
+    https://docs.github.com/en/code-security/codeql-cli/codeql-cli-manual/database-analyze#--model-packsnamerange
+    examples: https://github.com/github/codeql/blob/main/cpp/ql/lib/ext/Windows.model.yml#L8
+
+    Documentation for the possible values of each MaD column, beyond the most
+    generic spec, can be found here:
+    https://github.com/github/codeql/blob/main/cpp/ql/lib/semmle/code/cpp/dataflow/ExternalFlow.qll#L35
+    This is covered in more detail in
+    - Java workshop [[file:codeql-sqlite-java/README.org::*Supplement CodeQL: Add to models-as-data][Supplement CodeQL: Add to models-as-data]]
+    - C workshop [[file:codeql-dataflow-sql-injection-c/README.org::*supplement codeql: Add to models-as-data][supplement codeql: Add to models-as-data]]
+    - C++ CodeQL library [[file:ql/cpp/ql/lib/semmle/code/cpp/dataflow/internal/ExternalFlowExtensions.qll::This module provides extensible predicates for defining MaD models.]]
+    - Java CodeQL library [[file:ql/java/ql/lib/semmle/code/java/dataflow/internal/ExternalFlowExtensions.qll::This module provides extensible predicates for defining MaD models.]]
+
+    Each language has one of these ExternalFlow library files, and each describes
+    in more detail what the possible values mean.
+
+*** Modeling overview as diagram
   #+BEGIN_SRC text
                                       +----------------------+
                                       |     Modeling in      |
@@ -119,7 +178,7 @@
         +---------v---------+                                       +-----------v-----------+
         | Java: built-in     |                                      | Java: Jedis + Console |
         | includes .qll hook |                                      | GUI modeling examples |
-         +--------------------+                                      +------------------------+
+         +--------------------+                                      +-----------------------+
                   |
                   | Manual setup needed for:
                   v
@@ -142,7 +201,6 @@
     +-------------------------------+
   #+END_SRC

-
 *** Review: SQLite Injection Workshop, Java
    We begin with a recap of the Java-based injection example, focusing on the
    vulnerable code in [[./codeql-sqlite-java/AddUser.java][AddUser.java]]. Following that, we examine a fully manual
@@ -152,6 +210,11 @@
    inspection of the relevant base classes and framework modeling in
    [[./codeql-sqlite-java/Illustrations.ql][Illustrations.ql]].

+    - Start with SqlTainted.ql; note that it does not find our injection.
+
+    - Break or comment out the pre-made additions in
+      .github/codeql/extensions/sqlite-db/models/sqlite.model.yml
+      
 *** Customizations via codeql (Java)
    To customize CodeQL for Java, we identify and extend base classes to add
    custom flow sources and sinks. A general explanation of this approach is
@@ -163,6 +226,44 @@
    customization process can be found in
    [[./codeql-dataflow-sql-injection-c/incoming.codeql-customizations-workshop.md][incoming.codeql-customizations-workshop.md]].

+    - Illustrate which sources and sinks QueryInjectionFlowConfig in
+      SqlInjectionQuery.qll finds:
+      - sink: found
+      - source: none
+
+    - Find the base class of the source, so we know what to extend.
+
+    - Import gotcha:
+      I used
+
+      import semmle.code.java.dataflow.FlowSources as Sources
+
+      class ReadLine extends Sources::RemoteFlowSource {
+
+      Does this work too, or is a private import better?
+
+    - Q: how to run all the CWE* queries against some file?
+
+    - packs at https://github.com/advanced-security/codeql-bundle
+
+     - How to run all the CWE* queries against some file?
+       The codeql database analyze command accepts several kinds of query argument, including a directory or a query suite.
+       To see the full set of options, run
+       0:$  codeql database analyze -vvvv -h
+       Usage: codeql database analyze [OPTIONS] -- <database> [<query|dir|suite|pack>...]
+       Analyze a database, producing meaningful results in the context of the source code.
+
+       Run a query suite (or some individual queries) against a CodeQL database, producing results, styled as
+       alerts or paths, in SARIF or another interpreted format.
+
+       This command combines the effect of the codeql database run-queries and codeql database interpret-results commands.
+
+     - How do you install/include the CodeQL bundles with the modified Customizations.qll?
+
+       We have not worked that part out in detail.  The CLI tool at
+       https://github.com/advanced-security/codeql-bundle does this, but to us
+       it is still a black box.
+
 *** Customizations via Model Editor: Jedis Example (Java Redis client)
    The Jedis example is a straightforward case with no unexpected
    behavior. Although the library contains many functions, they follow a simple
@@ -196,6 +297,19 @@
    and predicates -- can be identified by inspecting representative queries like
    [[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]].

+    - Find existing readline modeling
+      #+BEGIN_SRC text
+        hohn@ghm3 ~/work-gh/codeql-lab
+        0:$ rg -il 'readline' ql/java --type=yaml
+        ql/java/ql/lib/ext/com.google.common.io.model.yml
+        ql/java/ql/lib/ext/org.apache.cxf.helpers.model.yml
+        ql/java/ql/lib/ext/java.io.model.yml
+        ql/java/ql/lib/ext/generated/java.io.model.yml
+        ql/java/ql/lib/ext/generated/kotlinstdlib.model.yml
+        ql/java/ql/lib/ext/generated/jenkins.model.yml
+        ql/java/ql/lib/ext/generated/org.apache.commons.io.model.yml
+        ql/java/ql/lib/ext/experimental/com.google.common.io.model.yml
+      #+END_SRC

 *** Review: SQLite Injection Workshop (C)
    This is the C version of the injection workshop, based on
@@ -270,6 +384,16 @@
    in:
    [[./codeql-dataflow-sql-injection-c/README.org]]
       
+    - Same workflow as Java: extend RemoteFlowSource, and do it in
+      Customizations.qll to affect all queries.
+    - The model packs to use have to be specified explicitly.
+    - Option to control which model packs are used:
+      #+BEGIN_SRC text
+        --model-packs=<name@range>...
+        A list of CodeQL pack names, each with an optional version range, to be used as model packs to customize the queries that are about to be evaluated.
+      #+END_SRC
+
+
 ** TODO CodeQL Bundling
   This section will provide a detailed walkthrough of the CodeQL bundling process
   using the CLI tool at https://github.com/advanced-security/codeql-bundle. This
@@ -281,6 +405,119 @@
   from source. Notes and scripts will be collected in
   [[file:codeql-bundling/README.org::XX: continue]].

+   CodeQL bundle info:
+   - original bundles found at:
+     https://github.com/github/codeql-action/releases/tag/codeql-bundle-v2.22.2
+   - custom bundler found:
+     https://github.com/advanced-security/codeql-bundle?tab=readme-ov-file#codeql-customization-packs
+
+   - Use a Customizations.qll pack
+     (https://github.com/advanced-security/codeql-bundle?tab=readme-ov-file#codeql-customization-packs;
+     these are created as a pack separate from the rest of the library)
+
+   - This part of the custom bundle tool's source
+     (https://github.com/advanced-security/codeql-bundle/blob/main/codeql_bundle/helpers/bundle.py#L329)
+     explains how the tool reverses dependencies: the built-in libraries are
+     modified to depend on the custom library. The custom bundle must include a
+     file named Customizations (a convention enforced by the bundler), but it can
+     also contain additional libraries with arbitrary names.
+
+   - Generating CodeQL Models for API Endpoints
+
+     To support automatic generation of API endpoint models in a CodeQL workshop
+     (without using the model editor), you can leverage the existing
+     infrastructure used in the =cpp/ql/lib/ext/generated= directory of the CodeQL
+     repo:
+
+     - File Generation ::
+       Individual model files are generated using the following script:  
+       https://github.com/github/codeql/blob/main/misc/scripts/models-as-data/generate_mad.py
+
+     - Bulk Generation :: 
+       For batch processing, use:  
+       https://github.com/github/codeql/blob/main/misc/scripts/models-as-data/bulk_generate_mad.py
+
+       This script requires:
+       - A language-specific YAML config file — for C++:  
+         https://github.com/github/codeql/blob/main/cpp/bulk_generation_targets.yml
+       - A DCA run (Data-Collection Analysis) to provide the necessary input data.
+
+     These tools allow you to programmatically produce model files similar to
+     those found in =ql/lib/ext/generated=, making them suitable for automated or
+     instructional use cases.
+   
+   - Updated MAD Generator (no more DCA step)
+
+     The script =generate_mad.py= replaces the older DCA-based workflow. It runs a set of
+     language-specific CodeQL queries directly against a database and emits =.model.yml= files.
+  
+     - Queries used:
+       - CaptureSummaryModels.ql
+       - CaptureSinkModels.ql
+       - CaptureSourceModels.ql
+       - CaptureNeutralModels.ql
+       - CaptureTypeBasedSummaryModels.ql (optional)
+  
+     - These queries are located in:
+       <language>/ql/src/utils/modelgenerator/
+  
+     - Output files are written to:
+       <language>/ql/lib/ext/generated/<folder>/*.model.yml
+  
+     - Example usage:
+       #+BEGIN_SRC sh
+         python3 generate_mad.py --language cpp /path/to/db --with-sinks --with-sources --with-summaries
+       #+END_SRC
+  
+     There is no longer any need for intermediate =.dca.json= files or a "DCA run".
+
+     A compact shell script illustrating the steps is in
+     [[./models-as-data/generate-mad-core]]
+
+   - [ ] A compact shell/csvtk script illustrating the steps is in
+     [[./models-as-data/generate-mad-core.csvtk]]
+     brew install csvtk
+
+   - [ ] A compact shell/[[https://github.com/medialab/xan?tab=readme-ov-file#quick-tour][xan]] script illustrating the steps is in
+     [[./models-as-data/generate-mad-core.xan]]
+     brew install xan
+
+     https://github.com/github/codeql/tree/main/misc/scripts/models-as-data
+
+   - [ ] bundling semantics
+     good
+     - pack a_1
+       - depends b_1
+       - depends b_2
+         - depends java-all
+
+     good
+     - pack a_1
+       - depends b_1
+       - depends b_2
+         - depends java-all
+           - depends my-custom
+
+     Cycle; this is the actual current situation.  OK for libraries, not packs?
+     Is this import hierarchy
+     - pack a_1
+       - depends b_1
+       - depends b_2
+         - depends java-all
+           - depends my-custom
+             - depends java-all
+
+     turned into?
+     - pack a_1
+       - depends b_1
+       - depends b_2
+         - depends java-all-custom precompiled
+
+     The fundamental distinction:  Customizations.qll can *insert under* the stdlib.
+     Other packs are *on top of* the stdlib.
+ 
+     There is a transient dependency inserted.  See
+     codeql-bundle/codeql_bundle/helpers/bundle.py
 * Tool Setup
  Some scripts are used here, found in [[./bin/]].  To ensure the ones written in
  Python have access to prerequites, set up a virtual environment via 
--- a/codeql-dataflow-sql-injection-c/Explore.ql
+++ b/codeql-dataflow-sql-injection-c/Explore.ql
@@ -0,0 +1,15 @@
+/**
+* @name SQLI Vulnerability
+* @description Using untrusted strings in a sql query allows sql injection attacks.
+* @ kind path-problem
+* @id cpp/sqlivulnerable
+* @problem.severity warning
+*/
+
+import cpp
+// import semmle.code.cpp.dataflow.new.TaintTracking
+
+
+from FunctionCall exec
+where exec.getTarget().getName().matches("%snprintf%")
+select exec, exec.getTarget().getName(), exec.getAnArgument()
--- a/codeql-dataflow-sql-injection-c/FlowExploration.ql
+++ b/codeql-dataflow-sql-injection-c/FlowExploration.ql
@@ -0,0 +1,55 @@
+/**
+* @name SQLI Vulnerability
+* @description Using untrusted strings in a sql query allows sql injection attacks.
+* @kind path-problem
+* @id cpp/sqlivulnerable
+* @problem.severity warning
+*/
+
+import cpp
+import semmle.code.cpp.dataflow.new.TaintTracking
+
+module SqliFlowConfig implements DataFlow::ConfigSig {
+
+    predicate isSource(DataFlow::Node source) {
+        // count = read(STDIN_FILENO, buf, BUFSIZE);
+        exists(FunctionCall read |
+            read.getTarget().getName() = "read" and
+            (
+            read.getArgument(1) = source.asDefiningArgument()
+                or
+            read.getArgument(1) = source.asExpr()
+            )
+        )
+    }
+
+    predicate isBarrier(DataFlow::Node sanitizer) { none() }
+
+    predicate isSink(DataFlow::Node sink) {
+        // rc = sqlite3_exec(db, query, NULL, 0, &zErrMsg);
+        exists(FunctionCall exec |
+            exec.getTarget().getName() = "sqlite3_exec" and
+            exec.getArgument(1) = sink.asIndirectArgument()
+        )
+    }
+}
+
+int explorationLimit() { result = 100 }
+
+// We deliberately break the flow chain by using DataFlow::Global instead of TaintTracking::Global
+module MyFlow = DataFlow::Global<SqliFlowConfig>;
+
+module MyPartialFlow = MyFlow::FlowExplorationFwd<explorationLimit/0>;
+
+import MyPartialFlow::PartialPathGraph
+
+from MyPartialFlow::PartialPathNode start, MyPartialFlow::PartialPathNode end
+where MyPartialFlow::partialFlow(start, end, _)
+select end, start, end, "Sql injection from $@", start, "here"
+
+// note: using the pathgraph gives a more readable output, in the form
+// 'from here' 'to there' 
+
+// This query goes up to add-user.c:80:73.  
+// This indicates that the flow is not crossing the snprintf, so this is where 
+// further exploration is needed.  See Explore.ql
--- a/codeql-dataflow-sql-injection-c/qlpack.yml
+++ b/codeql-dataflow-sql-injection-c/qlpack.yml
@@ -1,4 +1,4 @@
-name: codeql-workshop/cpp-sql-injection
+name: codeql-workshop/cpp-sql-injection-c
 version: 0.0.1
 dependencies:
  codeql/cpp-all: "*"
--- a/codeql-sqlite-java/AddCustomization.ql
+++ b/codeql-sqlite-java/AddCustomization.ql
@@ -0,0 +1,30 @@
+import java
+
+// // Find the source
+// class ReadLine extends MethodCall {
+//     ReadLine() { 
+//         exists(MethodCall g | 
+//             g.getMethod().hasQualifiedName("java.io", "Console", "readLine") and
+//             this = g
+//         )
+//     }
+// }
+// from ReadLine rl
+// select rl
+
+private import semmle.code.java.dataflow.FlowSources
+
+// Find the source
+class ReadLine extends RemoteFlowSource {
+    ReadLine() { 
+        exists(MethodCall g | 
+            g.getMethod().hasQualifiedName("java.io", "Console", "readLine") and
+            this.asExpr() = g
+        )
+    }
+    override string getSourceType() { result = "readline input parameter" }
+
+}
+from ReadLine rl
+select rl
+
--- a/models-as-data/generate-mad-core
+++ b/models-as-data/generate-mad-core
@@ -0,0 +1,78 @@
+#!/bin/bash
+# generate_mad_core.sh
+# Minimal MAD generator for a given CodeQL database and language
+
+set -euo pipefail
+
+# --- Config ---
+DB="$1"                          # Path to CodeQL database
+LANG="$2"                        # Language, e.g., cpp, java
+OUT_DIR="$3"                     # Output directory, relative to repo root
+CODEQL="$(which codeql)"         # CodeQL CLI
+REPO_ROOT="$(git rev-parse --show-toplevel)"
+
+QUERY_DIR="$REPO_ROOT/$LANG/ql/src/utils/modelgenerator"
+TMP_DIR="$(mktemp -d)"
+BQRS_FILE="$TMP_DIR/out.bqrs"
+
+# Map query name to predicate name
+declare -A QUERIES=(
+    ["CaptureSinkModels.ql"]="sinkModel"
+    ["CaptureSourceModels.ql"]="sourceModel"
+    ["CaptureSummaryModels.ql"]="summaryModel"
+    ["CaptureNeutralModels.ql"]="neutralModel"
+)
+
+# Minimal YAML output template
+write_yaml() {
+    local ns="$1"
+    local pred="$2"
+    local body="$3"
+    local sanitized="${ns//[\/:]/-}"
+    mkdir -p "$REPO_ROOT/$LANG/ql/lib/ext/generated/$OUT_DIR"
+    cat <<EOF > "$REPO_ROOT/$LANG/ql/lib/ext/generated/$OUT_DIR/${sanitized}.${pred}.model.yml"
+# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
+extensions:
+  - addsTo:
+      pack: codeql/${LANG}-all
+      predicate: $pred
+      rows:
+$body
+EOF
+    echo "Wrote: $REPO_ROOT/$LANG/ql/lib/ext/generated/$OUT_DIR/${sanitized}.${pred}.model.yml"
+}
+
+# Run queries and convert output to addsTo rows
+for query in "${!QUERIES[@]}"; do
+    echo "Running $query..."
+    "$CODEQL" query run \
+              "$QUERY_DIR/$query" \
+              --database "$DB" \
+              --output "$BQRS_FILE"
+
+    # Extract result rows as text: drop the CSV header and the quotes wrapping each row
+    RAW_ROWS=$("$CODEQL" bqrs decode --format=csv "$BQRS_FILE" | tail -n +2 | sed 's/^"//; s/"$//')
+
+    # Group by namespace, format for YAML
+    declare -A ROWS=()
+    while IFS= read -r line; do [[ -n "$line" ]] || continue   # skip blank input (no results)
+        IFS=';' read -ra FIELDS <<< "$line"
+        ns="${FIELDS[0]}"
+        quoted=()
+        for f in "${FIELDS[@]}"; do
+            if [[ "$f" != "true" && "$f" != "false" ]]; then
+                quoted+=("\"$f\"")
+            else
+                cap="${f^}"  # capitalize
+                quoted+=("$cap")
+            fi
+        done
+        ROWS["$ns"]+=$'\n'"      - [$(IFS=,; printf '%s' "${quoted[*]}")]"  # comma-joined YAML list
+    done <<< "$RAW_ROWS"
+
+    for ns in "${!ROWS[@]}"; do
+        write_yaml "$ns" "${QUERIES[$query]}" "${ROWS[$ns]}"
+    done
+done
+
+rm -rf "$TMP_DIR"
--- a/models-as-data/generate-mad-core.csvtk
+++ b/models-as-data/generate-mad-core.csvtk
@@ -0,0 +1,82 @@
+#!/bin/bash
+# generate_mad_csvtk.sh — Full CSVTK-based MAD generator
+
+set -euo pipefail
+
+DB="$1"         # Path to CodeQL DB
+LANG="$2"       # e.g. cpp
+OUTDIR="$3"     # e.g. mylib
+CODEQL="$(which codeql)"
+REPO_ROOT="$(git rev-parse --show-toplevel)"
+QUERY_DIR="$REPO_ROOT/$LANG/ql/src/utils/modelgenerator"
+TARGET_ROOT="$REPO_ROOT/$LANG/ql/lib/ext/generated/$OUTDIR"
+TMP_DIR="$(mktemp -d)"
+
+mkdir -p "$TARGET_ROOT"
+
+declare -A QUERIES=(
+    ["CaptureSinkModels.ql"]="sinkModel"
+    ["CaptureSourceModels.ql"]="sourceModel"
+    ["CaptureSummaryModels.ql"]="summaryModel"
+    ["CaptureNeutralModels.ql"]="neutralModel"
+)
+
+# Quoting + capitalization logic as an inline function for csvtk
+quote_expr='
+function q(x) {
+  return (x == "true" || x == "false") ? toupper(substr(x, 1, 1)) substr(x, 2) : "\"" x "\""
+}
+[q($1), q($2), q($3), q($4)]
+'
+
+for query in "${!QUERIES[@]}"; do
+    echo "Running $query..."
+    BQRS_FILE="$TMP_DIR/out.bqrs"
+    CSV_FILE="$TMP_DIR/out.csv"
+
+    "$CODEQL" query run "$QUERY_DIR/$query" \
+              --database "$DB" \
+              --output "$BQRS_FILE"
+
+    "$CODEQL" bqrs decode --format=csv --output="$CSV_FILE" "$BQRS_FILE"
+    tail -n +2 "$CSV_FILE" > "$TMP_DIR/noheader.csv"
+
+    # Add header for csvtk compatibility
+    head -n1 "$CSV_FILE" | grep -q ',' || echo "namespace;f1;f2;f3;f4" > "$TMP_DIR/head.csv"
+    cat "$TMP_DIR/head.csv" "$TMP_DIR/noheader.csv" > "$TMP_DIR/input.csv"
+
+    # Mutate quoted fields
+    csvtk mutate -t -n quoted1,quoted2,quoted3,quoted4 -e '
+    if ($f1=="true" || $f1=="false") ucfirst($f1); else "\"" + $f1 + "\""
+  ' -e '
+    if ($f2=="true" || $f2=="false") ucfirst($f2); else "\"" + $f2 + "\""
+  ' -e '
+    if ($f3=="true" || $f3=="false") ucfirst($f3); else "\"" + $f3 + "\""
+  ' -e '
+    if ($f4=="true" || $f4=="false") ucfirst($f4); else "\"" + $f4 + "\""
+  ' "$TMP_DIR/input.csv" > "$TMP_DIR/quoted.csv"
+
+    # Group by namespace
+    csvtk cut -t -f namespace "$TMP_DIR/quoted.csv" | tail -n +2 | sort -u | while read -r ns; do
+        safe_ns=$(echo "$ns" | tr '/:' '--')
+        out="$TARGET_ROOT/$safe_ns.model.yml"
+
+        echo "# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT." > "$out"
+        echo "extensions:" >> "$out"
+        echo "  - addsTo:" >> "$out"
+        echo "      pack: codeql/$LANG-all" >> "$out"
+        echo "      predicate: ${QUERIES[$query]}" >> "$out"
+        echo "      rows:" >> "$out"
+
+        # Extract all quoted fields for this namespace
+        csvtk grep -t -f namespace -p "$ns" "$TMP_DIR/quoted.csv" |
+            csvtk cut -t -f quoted1,quoted2,quoted3,quoted4 |
+            tail -n +2 | # remove header
+            sed 's/^/        - [/' | sed 's/$/]/' >> "$out"
+
+        echo "Wrote $out"
+    done
+done
+
+rm -rf "$TMP_DIR"
+
--- a/models-as-data/generate-mad-core.xan
+++ b/models-as-data/generate-mad-core.xan
@@ -0,0 +1,66 @@
+#!/bin/bash
+# Model generator using `xan` for CSV processing
+
+set -euo pipefail
+
+DB="$1"         # CodeQL database path
+LANG="$2"       # Language (e.g. cpp)
+OUTDIR="$3"     # Output directory name under lib/ext/generated/
+CODEQL="$(which codeql)"
+REPO_ROOT="$(git rev-parse --show-toplevel)"
+QUERY_DIR="$REPO_ROOT/$LANG/ql/src/utils/modelgenerator"
+TARGET_ROOT="$REPO_ROOT/$LANG/ql/lib/ext/generated/$OUTDIR"
+TMP_DIR="$(mktemp -d)"
+
+mkdir -p "$TARGET_ROOT"
+
+declare -A QUERIES=(
+    ["CaptureSinkModels.ql"]="sinkModel"
+    ["CaptureSourceModels.ql"]="sourceModel"
+    ["CaptureSummaryModels.ql"]="summaryModel"
+    ["CaptureNeutralModels.ql"]="neutralModel"
+)
+
+for query in "${!QUERIES[@]}"; do
+    echo "Running $query..."
+    BQRS_FILE="$TMP_DIR/out.bqrs"
+    CSV_FILE="$TMP_DIR/result.csv"
+
+    "$CODEQL" query run "$QUERY_DIR/$query" \
+              --database "$DB" \
+              --output "$BQRS_FILE"
+
+    "$CODEQL" bqrs decode --format=csv --output="$CSV_FILE" "$BQRS_FILE"
+
+    echo "Grouping rows by namespace..."
+
+    xan map '
+    let q = |x| -> if (x == "true" || x == "false") { upper(x) } else { fmt("\"{}\"", x) };
+    fmt("        - [{}]", join(", ", [q(f1), q(f2), q(f3), q(f4)]))
+  ' row "$CSV_FILE" \
+        | xan groupby namespace 'collect(row) as rows' \
+        | xan explode rows \
+        | xan select namespace,row \
+        | xan groupby namespace 'collect(row) as block' \
+        | xan explode block \
+        | while IFS=',' read -r ns row; do
+        safe_ns=$(echo "$ns" | tr '/:' '--' | tr -d '"')
+        out="$TARGET_ROOT/$safe_ns.model.yml"
+        if [[ ! -f "$out" ]]; then
+            cat <<EOF > "$out"
+# THIS FILE IS AN AUTO-GENERATED MODELS AS DATA FILE. DO NOT EDIT.
+extensions:
+  - addsTo:
+      pack: codeql/$LANG-all
+      predicate: ${QUERIES[$query]}
+      rows:
+EOF
+        fi
+        echo "$row" >> "$out"
+    done
+
+    echo "Wrote models to: $TARGET_ROOT/"
+done
+
+rm -rf "$TMP_DIR"
+
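
Note on the row-formatting step that the three generator scripts above share (quote every field, capitalize booleans, join the fields into one YAML list item): it can be sketched in Python as follows. This is a minimal sketch, assuming each query result row is a single ';'-separated string that may be wrapped in CSV quotes. The function name `format_row` and the sample row are hypothetical and are not part of this commit.

```python
def format_row(raw: str) -> str:
    """Turn one ';'-separated model row into a YAML flow-sequence list item."""
    row = raw.strip()
    # Drop the pair of quotes that CSV output may wrap around the whole row.
    if len(row) >= 2 and row[0] == '"' and row[-1] == '"':
        row = row[1:-1]
    rendered = []
    for field in row.split(";"):
        if field in ("true", "false"):
            # Booleans stay unquoted and are capitalized: true -> True.
            rendered.append(field.capitalize())
        else:
            # Everything else becomes a double-quoted string.
            escaped = field.replace("\\", "\\\\").replace('"', '\\"')
            rendered.append(f'"{escaped}"')
    return "      - [" + ", ".join(rendered) + "]"


if __name__ == "__main__":
    # Hypothetical sample row, for illustration only.
    print(format_row("sqlite3;;false;sqlite3_exec;;;Argument[1];sql-injection;manual"))
```

Joining with ", " matters: a YAML flow sequence needs commas between its entries, so a space-joined row would not parse as a list of fields.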