* codeql-lab: Centralized Git Repository for CodeQL Development

** Overview
   codeql-lab is a consolidated Git repository that collects all relevant
   CodeQL components, resources, and tooling into a single
   version-controlled location.

** Purpose
   The goal of this repository is to provide an integrated development
   environment (“lab”) for CodeQL research, experimentation, and custom
   query development. It simplifies setup by maintaining all required
   submodules, configuration files, and datasets in one place.

** Repository Location
   The primary repository is hosted at:
   https://github.com/hohn/codeql-lab

** Intended Use Cases
   - Local experimentation with CodeQL queries and libraries.
   - End-to-end testing of custom model data and query logic.
     This includes writing and validating custom data flow models,
     adjusting model coverage, and confirming that query results behave
     as expected across controlled datasets. The lab setup supports rapid
     iteration on QL logic, helping detect unintended changes and enabling
     reproducible evaluations of taint tracking, control flow, or API usage
     patterns.
   - Structured collaboration and controlled updates across all
     CodeQL-related artifacts.
   - Simplified onboarding and reproducible setup for new contributors or
     analysis environments.

* Prerequisites

  Working with this repository assumes prior experience with:

  - *Git, Bash, and standard Unix command-line tools*. These are used
    throughout and are required for setup and day-to-day tasks.
    Tools such as [[https://man.archlinux.org/man/rg.1][ripgrep]], [[https://www.gnu.org/software/bash/][GNU Bash]], and [[https://en.wikipedia.org/wiki/Grep][grep/regex workflows]] are assumed.

  - *At least one supported programming language*, such as C, C++, Java,
    Python, Go, or Ruby. A solid understanding of the target language is
    necessary to interpret analysis results and write effective queries.
    See general background on [[https://en.wikipedia.org/wiki/Programming_language][programming languages]] if needed.

  - *Basic familiarity with program structure concepts*, including
    [[https://en.wikipedia.org/wiki/Abstract_syntax_tree][abstract syntax trees (ASTs)]], [[https://en.wikipedia.org/wiki/Control-flow_graph][control-flow graphs (CFGs)]], and
    [[https://en.wikipedia.org/wiki/Data-flow_analysis][data-flow graphs (DFGs)]]. These are core to how CodeQL models code behavior.

  - *Optional but helpful*: familiarity with structural or functional
    programming languages (e.g. [[https://en.wikipedia.org/wiki/Lisp_(programming_language)][Lisp]] or [[https://en.wikipedia.org/wiki/OCaml][OCaml]]) can make working with
    CodeQL’s query language and type system more intuitive.
    See overview of [[https://en.wikipedia.org/wiki/Functional_programming][functional programming]] for related context.

* Repository Layout
** Core Structure
   - Repository is based on: https://github.com/github/vscode-codeql-starter.git
   - All development work is done on the branch: qllab
   - CodeQL version is pinned via the =ql/= submodule:
     : commit 4d681f05bd671f8b5e31624f16a2b4d75e61c071 (tag: codeql-cli/v2.22.0)
   - A prebuilt CodeQL CLI binary is included:
     : 1104625939  assets/codeql-osx64.zip
   - Project-specific repositories can be added directly under the root.
     Example: the C dataflow workshop in =./codeql-dataflow-sql-injection-c=

** Additional Structure Notes
   - The original upstream README.md is preserved at [[./README-vscode-codeql-starter.md]]
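
   A minimal sketch of checking out this layout, assuming =git= is installed.
   The helper names =clone_lab= and =ql_pin_matches= are hypothetical and not
   part of the repository; the expected commit is the pin listed above.

   #+BEGIN_SRC sh
     # Hypothetical helpers; not part of the repository.
     EXPECTED_QL_COMMIT=4d681f05bd671f8b5e31624f16a2b4d75e61c071

     # Clone the qllab branch together with its submodules (including ql/).
     clone_lab() {
         git clone --branch qllab --recurse-submodules \
             https://github.com/hohn/codeql-lab "${1:-codeql-lab}"
     }

     # Succeed only if the ql/ submodule is checked out at the pinned commit.
     ql_pin_matches() {
         actual=$(git -C "$1/ql" rev-parse HEAD 2>/dev/null) || return 1
         [ "$actual" = "$EXPECTED_QL_COMMIT" ]
     }
   #+END_SRC

   Possible usage: =clone_lab && ql_pin_matches codeql-lab && echo "ql/ is at the pinned commit"=.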

* Possible Reading Orders

** Data Flow 
*** Review: SQLite Injection Workshop, Java
    We begin with a recap of the Java-based injection example, focusing on the
    vulnerable code in [[./codeql-sqlite-java/AddUser.java][AddUser.java]]. Following that, we examine a fully manual
    CodeQL query available in [[./codeql-sqlite-java/full-query.ql][full-query.ql]], which was written to explicitly trace
    tainted data through the program. Next, we explore the out-of-the-box query
    [[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]] included in the standard CodeQL packs, and conclude with an
    inspection of the relevant base classes and framework modeling in
    [[./codeql-sqlite-java/Illustrations.ql][Illustrations.ql]].

    - Start with =SqlTainted.ql= and note that it does not find our injection.

    - For this demonstration, break or comment out the pre-made additions in
      =.github/codeql/extensions/sqlite-db/models/sqlite.model.yml=.

*** Debugging data flow config (instead of taint flow), Java
    We can illustrate taint-flow debugging in the Java SQL injection sample using
    - the query [[./codeql-sqlite-java/TaintFlowDebugging.ql]]
    - and the walkthrough in [[./codeql-sqlite-java/TaintFlowDebugging.md]]

*** TODO Debugging data flow config (instead of taint flow), C
    A corresponding example for C is planned, using a simplified query to trace
    value propagation in [[./codeql-dataflow-sql-injection-c/add-user.c]].
    Unlike Java, C may require manual modeling even to visualize basic flows.

    - C details
      + Dataflow node types vs. AST vs. CFG: the C versions offer more choices,
        e.g. for the value after a call or the value behind a pointer.
        - =asDefiningArgument()=, =asExpr()=, =asIndirectArgument()=
        - =asExpr()= in C may now cause the path to fail, even though sink and
          source are found.
    - Use =getAQlClass()= to get the precise type.
    - See =ql/actions/ql/src/Debug/partial.ql=.
    - =ql/cpp/ql/lib/CHANGELOG.md=, line 176:
      : * Deleted the deprecated `explorationLimit` predicate from `DataFlow::Configuration`, use `FlowExploration<explorationLimit>` instead.
    - =codeql-sqlite-java/TaintFlowDebugging.md=, lines 54 and 58:
      : int explorationLimit() { result = 100 }
      : module MyPartialFlow = MyFlow::FlowExplorationFwd<explorationLimit/0>;

    - Debugging docs:
      https://codeql.github.com/docs/writing-codeql-queries/debugging-data-flow-queries-using-partial-flow/#debugging-data-flow-queries-using-partial-flow


** Modeling
   There are two primary approaches to modeling: direct use of CodeQL predicates
   and the models-as-data system. The models-as-data system is implemented in QL
   but relies on external YAML files that are interpreted at query evaluation
   time.

   The model editor provides a GUI for managing YAML-based models, but the
   underlying format is identical to that used by the models-as-data system. In C
   and other cases where GUI support is limited or unavailable, we write these
   YAML models manually and invoke them directly from queries.

   When YAML models are written directly, the use of GPT-based tooling becomes
   very natural. GPTs can extract function signatures, parameter semantics, and
   flow annotations from documentation or code examples, then generate valid YAML
   model entries automatically.

   - *XX* Models-as-data works well for APIs that are simple but have a large
     number of functions.  For anything complicated, use CodeQL directly.
   - The CodeQL parser is optimized for reading large CodeQL files; e.g., 14,000
     predicates are no problem.
   - At this scale, the predicates are generated rather than written by hand.
     The type checking you get from CodeQL is much more extensive than what
     models-as-data offers: models-as-data is text; CodeQL is a type-checked
     language.

*** TODO MaD (models as data) resources

    https://codeql.github.com/docs/codeql-language-guides/customizing-library-models-for-cpp/
    https://docs.github.com/en/code-security/codeql-for-vs-code/using-the-advanced-functionality-of-the-codeql-for-vs-code-extension/using-the-codeql-model-editor#testing-codeql-model-packs-in-vs-code
    https://docs.github.com/en/code-security/codeql-cli/codeql-cli-manual/database-analyze#--model-packsnamerange
    examples: https://github.com/github/codeql/blob/main/cpp/ql/lib/ext/Windows.model.yml#L8

    Documentation for the specific values accepted in MaD columns, beyond the
    most generic spec, can be found here:
    https://github.com/github/codeql/blob/main/cpp/ql/lib/semmle/code/cpp/dataflow/ExternalFlow.qll#L35
    This is covered in more detail in
    - Java workshop [[file:codeql-sqlite-java/README.org::*Supplement CodeQL: Add to models-as-data][Supplement CodeQL: Add to models-as-data]]
    - C workshop [[file:codeql-dataflow-sql-injection-c/README.org::*supplement codeql: Add to models-as-data][supplement codeql: Add to models-as-data]]
    - C++ CodeQL library [[file:ql/cpp/ql/lib/semmle/code/cpp/dataflow/internal/ExternalFlowExtensions.qll::This module provides extensible predicates for defining MaD models.]]
    - Java CodeQL library [[file:ql/java/ql/lib/semmle/code/java/dataflow/internal/ExternalFlowExtensions.qll::This module provides extensible predicates for defining MaD models.]]

    Each language has one of these ExternalFlow library files, and each includes
    more description of what the potential values actually mean.

*** Modeling overview as diagram
   #+BEGIN_SRC text
                                       +----------------------+
                                       |     Modeling in      |
                                       |       CodeQL         |
                                       +----------+-----------+
                                                  |
                   +------------------------------+------------------------------+
                   |                                                             |
          +--------v--------+                                          +---------v---------+
          | Direct CodeQL   |                                          |  Models-as-Data   |
          | (QL predicates) |                                          |  (YAML + QL eval) |
          +--------+--------+                                          +---------+---------+
                   |                                                             |
                   |                                                             |
        +----------v----------+                                  +---------------v---------------+
        | Manual customization|                                  |     YAML models via GUI       |
        | (Customizations.qll)|                                  |    (Model Editor frontend)    |
        +----------+----------+                                  +---------------+---------------+
                   |                                                             |
                   |                                                             |
         +---------v----------+                                      +-----------v-----------+
         | Java: built-in     |                                      | Java: Jedis + Console |
         | includes .qll hook |                                      | GUI modeling examples |
         +--------------------+                                      +-----------------------+
                   |
                   | Manual setup needed for:
                   v
          +------------------------+
          |   C / C++: requires    |
          |   cpp.qll patch +      |
          |   Customizations.qll   |
          +------------------------+
                   |
                   v
     +-------------------------------+
     | Use models-as-data directly   |
     | (YAML only, no editor)        |
     +-------------------------------+
                   |
                   v
     +-------------------------------+
     | GPT-assisted YAML generation |
     | from docs, code, or examples |
     +-------------------------------+
   #+END_SRC

*** Review: SQLite Injection Workshop, Java
    We begin with a recap of the Java-based injection example, focusing on the
    vulnerable code in [[./codeql-sqlite-java/AddUser.java][AddUser.java]]. Following that, we examine a fully manual
    CodeQL query available in [[./codeql-sqlite-java/full-query.ql][full-query.ql]], which was written to explicitly trace
    tainted data through the program. Next, we explore the out-of-the-box query
    [[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]] included in the standard CodeQL packs, and conclude with an
    inspection of the relevant base classes and framework modeling in
    [[./codeql-sqlite-java/Illustrations.ql][Illustrations.ql]].

    - Start with =SqlTainted.ql= and note that it does not find our injection.

    - For this demonstration, break or comment out the pre-made additions in
      =.github/codeql/extensions/sqlite-db/models/sqlite.model.yml=.
      
*** Customizations via codeql (Java)
    To customize CodeQL for Java, we identify and extend base classes to add
    custom flow sources and sinks. A general explanation of this approach is
    available in the file [[./codeql-dataflow-sql-injection-c/README.org][README.org]], particularly
    the section
    [[file:codeql-sqlite-java/README.org::*Supplement CodeQL: Add to FlowSource or a Subclass][Supplement CodeQL: Add to FlowSource or a Subclass]].
    For Java, [[./ql/java/ql/lib/java.qll][java.qll]] includes [[./ql/java/ql/lib/Customizations.qll][Customizations.qll]], which provides extension points for
    custom flow modeling -- this structure is common across most CodeQL-supported
    languages, with the notable exception of C. Further details on this
    customization process can be found in
    [[./codeql-sqlite-java/incoming.codeql-customizations-workshop.md][incoming.codeql-customizations-workshop.md]].

*** Customizations via Model Editor: Jedis Example (Java Redis client)
    The Jedis example is a straightforward case with no unexpected
    behavior. Although the library contains many functions, they follow a simple
    and repetitive pattern, making it ideal for large-scale modeling. The CodeQL
    model editor can be used to efficiently define sources and sinks for such
    cases. A detailed explanation is provided
    in [[file:codeql-jedis-java/README.org::*Modeling Jedis as a Dependency in Model Editor][Modeling Jedis as a Dependency in Model Editor]], while validation of
    the modeled sink is discussed in [[file:codeql-jedis-java/README.org::*Verifying the Modeled Sink][Verifying the Modeled Sink]].
    Finally, the query-level usage of these models can be seen
    in [[file:codeql-jedis-java/README.org::*Identify usage of injection-related models in existing queries][Identify usage of injection-related models in existing queries]].

*** Customizations via Model Editor: Single-function case (Java SQLite sample)
    We extend the Java SQLite example using the model editor, with both the
    necessary data and specification already available. This example highlights a
    subtle issue with the model editor: the method =java.io.Console.readLine()= is
    already modeled as a taint *step* and therefore does not appear in the editor
    interface, even though we need it modeled as a *source*. This requires special
    handling. The relevant extensions are defined in
    [[./.github/codeql/extensions/sqlite-db/codeql-pack.yml]], and the extension data
    is provided in
    [[./.github/codeql/extensions/sqlite-db/models/sqlite.model.yml]]. A detailed
    explanation is available in [[file:codeql-sqlite-java/README.org::*Using sqlite to illustrate models-as-data][Using sqlite to illustrate models-as-data]].

    To support this, we explain how the "models-as-data" system works
    internally. A diagnostic query can be used to enumerate currently recognized
    sources and sinks. From there, the relevant entry points -- such as QL classes
    and predicates -- can be identified by inspecting representative queries like
    [[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]].

    - Find the existing =readLine= modeling:
      #+BEGIN_SRC text
        hohn@ghm3 ~/work-gh/codeql-lab
        0:$ rg -il 'readline' ql/java --type=yaml
        ql/java/ql/lib/ext/com.google.common.io.model.yml
        ql/java/ql/lib/ext/org.apache.cxf.helpers.model.yml
        ql/java/ql/lib/ext/java.io.model.yml
        ql/java/ql/lib/ext/generated/java.io.model.yml
        ql/java/ql/lib/ext/generated/kotlinstdlib.model.yml
        ql/java/ql/lib/ext/generated/jenkins.model.yml
        ql/java/ql/lib/ext/generated/org.apache.commons.io.model.yml
        ql/java/ql/lib/ext/experimental/com.google.common.io.model.yml
      #+END_SRC
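
    As a minimal sketch of the kind of extension involved, the following writes a
    hypothetical extension pack that marks =Console.readLine()= as a source. The
    pack name =sample/sqlite-db-sketch=, the file name =console.model.yml=, and
    the =remote= kind are assumptions for illustration; the files under
    =.github/codeql/extensions/sqlite-db/= are the ones actually used in the
    workshop and may differ.

    #+BEGIN_SRC sh
      # Sketch only: write a hypothetical extension pack to a scratch directory.
      PACK_DIR="${PACK_DIR:-$(mktemp -d)/sqlite-db-sketch}"
      mkdir -p "$PACK_DIR/models"

      # Pack manifest: a library pack whose data extensions target codeql/java-all.
      # The pack name is a placeholder, not the one used in this repository.
      printf '%s\n' \
          'name: sample/sqlite-db-sketch' \
          'version: 0.0.1' \
          'library: true' \
          'extensionTargets:' \
          '  codeql/java-all: "*"' \
          'dataExtensions:' \
          '  - models/**/*.yml' \
          > "$PACK_DIR/codeql-pack.yml"

      # One sourceModel row.  Columns, in order: package, type, subtypes, name,
      # signature, ext, output, kind, provenance.  The kind "remote" is an assumption.
      printf '%s\n' \
          'extensions:' \
          '  - addsTo:' \
          '      pack: codeql/java-all' \
          '      extensible: sourceModel' \
          '    data:' \
          '      - ["java.io", "Console", False, "readLine", "()", "", "ReturnValue", "remote", "manual"]' \
          > "$PACK_DIR/models/console.model.yml"

      echo "wrote sketch pack to $PACK_DIR"
    #+END_SRC

    Writing to a scratch directory keeps the sketch from overwriting the
    repository's own extension files.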

*** Review: SQLite Injection Workshop (C)
    This is the C version of the injection workshop, based on
    [[./codeql-dataflow-sql-injection-c/add-user.c]]. It
    serves as the basis for both the "models-as-data" manual modeling and the
    extension via =Customizations.qll=.

*** (PARTIAL) Use models-as-data QL code directly (no graphical editor)
    This section focuses on using the models-as-data system *without* the
    graphical model editor. While model definition files and supporting data
    already exist, we manually write YAML files to add or override flow
    behavior. This approach is especially relevant for C, where graphical tooling
    is limited or nonexistent.

    As reinforcement, we reuse the C version of the SQLite injection workshop:
    - The code sample is at
      [[./codeql-dataflow-sql-injection-c/add-user.c]].
    - The accompanying query is
      [[./codeql-dataflow-sql-injection-c/SqlInjection.ql]].

    For structural reference, see the Java version’s documentation (not the editor
    interface): [[file:codeql-sqlite-java/README.org::*Using sqlite to illustrate models-as-data][Using sqlite to illustrate models-as-data]].  There is no separate
    C-specific walkthrough because the YAML structure and logic are nearly
    identical.

    For workshop use, we extend the example by modeling key functions manually:
    - Add a source model for: =count = read(STDIN_FILENO, buf, BUFSIZE);=
    - Add a sink model for: =rc = sqlite3_exec(db, query, NULL, 0, &zErrMsg);=

    We demonstrate how to define YAML-based models for standard functions like
    =read()= and verify their effect using the out-of-the-box query:
    [[./ql/cpp/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]].

    As an additional teaching case, we introduce the higher-level, redundant
    function =char* get_user_info()= as a custom source—even though it internally
    calls a function already modeled as a source—to illustrate how user-defined
    extensions affect propagation logic.
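
    The two manual models for =read()= and =sqlite3_exec()= might be sketched as
    follows. This is an illustration under assumptions, not the workshop's actual
    file: the file name =add-user.model.yml= and the kinds =local= and
    =sql-injection= are hypothetical choices, and whether =SqlTainted.ql= picks
    them up depends on the pinned library version.

    #+BEGIN_SRC sh
      # Sketch only: write hypothetical C/C++ model rows to a scratch directory.
      MODEL_DIR="${MODEL_DIR:-$(mktemp -d)}"

      # Columns, in order: namespace, type, subtypes, name, signature, ext,
      # output (sources) or input (sinks), kind, provenance.
      # "Argument[*1]" refers to what the second argument points to: the buffer
      # filled by read() and the query string passed to sqlite3_exec().
      # The kinds "local" and "sql-injection" are assumptions.
      printf '%s\n' \
          'extensions:' \
          '  - addsTo:' \
          '      pack: codeql/cpp-all' \
          '      extensible: sourceModel' \
          '    data:' \
          '      - ["", "", False, "read", "", "", "Argument[*1]", "local", "manual"]' \
          '  - addsTo:' \
          '      pack: codeql/cpp-all' \
          '      extensible: sinkModel' \
          '    data:' \
          '      - ["", "", False, "sqlite3_exec", "", "", "Argument[*1]", "sql-injection", "manual"]' \
          > "$MODEL_DIR/add-user.model.yml"

      echo "wrote $MODEL_DIR/add-user.model.yml"
    #+END_SRC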

*** (PARTIAL) Extending Queries with Customizations.qll for C
    The manual YAML modeling approach described earlier works well for isolated or
    prototype cases. However, for idiomatic, large-scale, or reusable CodeQL
    analysis, it is often preferable to define custom dataflow logic directly in
    QL—using =Customizations.qll=.

    Most CodeQL-supported languages (e.g., Java, Python) include built-in support
    for this mechanism. For example, Java’s primary entry point [[./ql/java/ql/lib/java.qll][java.qll]]
    automatically imports [[./ql/java/ql/lib/Customizations.qll][Customizations.qll]], exposing extension points for
    user-defined sources, sinks, and flow steps.

    In contrast, C and C++ do *not* support this out of the box. To enable it, you
    must manually patch the language pack and (optionally) rebuild the CodeQL
    bundle.

    This section is *partially complete*: we document the required source-level QL
    changes, but the bundling process is still pending.

    To enable =Customizations.qll= support for C/C++, perform the following:

    1. Modify =ql/cpp/ql/lib/cpp.qll= to import your =Customizations.qll= module.
    2. Create and populate =ql/cpp/ql/lib/Customizations.qll= with new
       source/sink/flow logic.
    3. *For full deployment:* Rebuild the CodeQL bundle to include the updated
       QL files.
       - This allows portable use in CLI runs and IDE workflows.
       - Once bundled, C/C++ customization behaves like any other supported
         language.
    4. *For workshops and local development:* No bundling is needed.
       - If you run queries directly from the modified source tree, the changes
         take effect immediately.

    A working demonstration of this modification (without bundling) is provided
    in:
    [[./codeql-dataflow-sql-injection-c/README.org]]
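
    Steps 1 and 2 above can be sketched as a small shell helper. The function
    name =enable_cpp_customizations= is hypothetical, and the generated
    =Customizations.qll= is an empty placeholder modeled on the Java arrangement
    (=java.qll= imports =Customizations=, which in turn imports =java=); the
    actual patch used in the workshop may differ.

    #+BEGIN_SRC sh
      # Hypothetical helper; mirrors the java.qll / Customizations.qll arrangement.
      enable_cpp_customizations() {
          lib_dir=$1    # e.g. ql/cpp/ql/lib

          # Step 2: create a placeholder Customizations.qll if none exists yet.
          if [ ! -f "$lib_dir/Customizations.qll" ]; then
              printf '%s\n' \
                  '/** Local customizations to the C/C++ standard library. */' \
                  '' \
                  'import cpp' \
                  > "$lib_dir/Customizations.qll"
          fi

          # Step 1: make cpp.qll import the module, but only once.
          if ! grep -qx 'import Customizations' "$lib_dir/cpp.qll"; then
              printf '\n%s\n' 'import Customizations' >> "$lib_dir/cpp.qll"
          fi
      }
    #+END_SRC

    Possible usage from the repository root: =enable_cpp_customizations ql/cpp/ql/lib=.
    The helper is idempotent, so running it twice leaves a single import line.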
       
    - The workflow is the same as for Java: extend =RemoteFlowSource=, and do it in
      =Customizations.qll= so that it affects all queries.
    - Model packs are not picked up implicitly; their existence has to be
      specified explicitly.
    - The CLI option that controls which model packs are used:
      #+BEGIN_SRC text
        --model-packs=<name@range>...
        A list of CodeQL pack names, each with an optional version range, to be used as model packs to customize the queries that are about to be evaluated.
      #+END_SRC
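
    A hedged sketch of passing a model pack on the command line. The wrapper
    name =analyze_with_models=, the output file =results.sarif=, and the
    assumption that the model pack lives under =.github/codeql/extensions= (as
    the Java sample's pack does) are illustrative choices; the pack name itself
    must be taken from the pack's =codeql-pack.yml=.

    #+BEGIN_SRC sh
      # Hypothetical wrapper; CODEQL can point at a specific CLI binary.
      analyze_with_models() {
          db=$1          # path to a CodeQL database
          model_pack=$2  # model pack name, optionally name@range
          query=$3       # query, directory, or suite to run

          # --additional-packs names the parent directory of the model pack so
          # that the pack given to --model-packs can be resolved.
          "${CODEQL:-codeql}" database analyze \
              --additional-packs=.github/codeql/extensions \
              --model-packs="$model_pack" \
              --format=sarif-latest \
              --output=results.sarif \
              -- "$db" "$query"
      }
    #+END_SRC

    Setting =CODEQL=echo= prints the assembled command instead of running it,
    which is a convenient way to inspect the arguments first.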


** TODO CodeQL Bundling
   This section will provide a detailed walkthrough of the CodeQL bundling process
   using the CLI tool at https://github.com/advanced-security/codeql-bundle. This
   tool enables custom pack composition and is necessary when extending language
   libraries (e.g., adding =Customizations.qll= support for C/C++).

   While the official tool is somewhat of a black box, we will demystify the
   underlying structure and show how to build, inspect, and deploy custom bundles
   from source. Notes and scripts will be collected in
   [[file:codeql-bundling/README.org::XX: continue]].

   CodeQL bundle info:
   - The original bundles are found at:
     https://github.com/github/codeql-action/releases/tag/codeql-bundle-v2.22.2
   - The custom bundler is found at:
     https://github.com/advanced-security/codeql-bundle?tab=readme-ov-file#codeql-customization-packs

   - Use a =Customizations.qll= pack, as described in the customization-packs
     section of the bundler README linked above. These packs do get created as a
     separate pack from the rest of the library.

   - This part of the custom bundle tool's source
     (https://github.com/advanced-security/codeql-bundle/blob/main/codeql_bundle/helpers/bundle.py#L329)
     explains how the tool reverses dependencies: the built-in libraries are
     modified to depend on the custom library. The custom bundle must include a
     file named Customizations (a convention enforced by the bundler), but it can
     also contain additional libraries with arbitrary names.

   - Generating CodeQL Models for API Endpoints

     To support automatic generation of API endpoint models in a CodeQL workshop
     (without using the model editor), you can leverage the existing
     infrastructure used in the =cpp/ql/lib/ext/generated= directory of the CodeQL
     repo:

     - File Generation ::
       Individual model files are generated using the following script:  
       https://github.com/github/codeql/blob/main/misc/scripts/models-as-data/generate_mad.py

     - Bulk Generation :: 
       For batch processing, use:  
       https://github.com/github/codeql/blob/main/misc/scripts/models-as-data/bulk_generate_mad.py

       This script requires:
       - A language-specific YAML config file — for C++:  
         https://github.com/github/codeql/blob/main/cpp/bulk_generation_targets.yml
       - A DCA run (Data-Collection Analysis) to provide the necessary input data.

     These tools allow you to programmatically produce model files similar to
     those found in =ql/lib/ext/generated=, making them suitable for automated or
     instructional use cases.
   
   - Updated MAD Generator (no more DCA step)

     The script =generate_mad.py= replaces the older DCA-based workflow. It runs a set of
     language-specific CodeQL queries directly against a database and emits =.model.yml= files.
  
     - Queries used:
       - CaptureSummaryModels.ql
       - CaptureSinkModels.ql
       - CaptureSourceModels.ql
       - CaptureNeutralModels.ql
       - CaptureTypeBasedSummaryModels.ql (optional)
  
     - These queries are located in:
       <language>/ql/src/utils/modelgenerator/
  
     - Output files are written to:
       <language>/ql/lib/ext/generated/<folder>/*.model.yml
  
     - Example usage:
       #+BEGIN_SRC sh
         python3 generate_mad.py --language cpp /path/to/db --with-sinks --with-sources --with-summaries
       #+END_SRC
  
     There is no longer any need for intermediate =.dca.json= files or a "DCA run".

     A compact shell script illustrating the steps is in
     [[./models-as-data/generate-mad-core]]

   - [ ] A compact shell/csvtk script illustrating the steps is in
     [[./models-as-data/generate-mad-core.csvtk]].
     Install csvtk via =brew install csvtk=.

   - [ ] A compact shell/[[https://github.com/medialab/xan?tab=readme-ov-file#quick-tour][xan]] script illustrating the steps is in
     [[./models-as-data/generate-mad-core.xan]].
     Install xan via =brew install xan=.

     The upstream generator scripts are at
     https://github.com/github/codeql/tree/main/misc/scripts/models-as-data

   - [ ] bundling semantics
     good
     - pack a_1
       - depends b_1
       - depends b_2
         - depends java-all

     good
     - pack a_1
       - depends b_1
       - depends b_2
         - depends java-all
           - depends my-custom

     Cycle; this is the actual current situation.  Is it OK for libraries, but
     not for packs?  Is this import hierarchy
     - pack a_1
       - depends b_1
       - depends b_2
         - depends java-all
           - depends my-custom
             - depends java-all

     turned into this?
     - pack a_1
       - depends b_1
       - depends b_2
         - depends java-all-custom precompiled

     The fundamental distinction: =Customizations.qll= can be *inserted under* the
     stdlib.  Other packs sit *on top of* the stdlib.

     A transitive dependency is inserted.  See
     =codeql-bundle/codeql_bundle/helpers/bundle.py=.

* Tool Setup
  Some scripts are used here, found in [[./bin/]].  To ensure the ones written in
  Python have access to their prerequisites, set up a virtual environment via
  #+BEGIN_SRC sh 
    # 1. Create the virtualenv
    python3 -m venv ~/codeql-lab/venv

    # 2. Install any packages
    source ~/codeql-lab/venv/bin/activate
    pip install pyyaml
  #+END_SRC

  For any of these scripts to work, add them to the PATH via
  #+BEGIN_SRC sh 
    export PATH="$HOME/codeql-lab/bin:$PATH"
  #+END_SRC