codeql-lab/README.org
2025-07-30 16:42:39 -07:00

* codeql-lab: Centralized Git Repository for CodeQL Development
** Overview
codeql-lab is a consolidated Git repository that collects all relevant
CodeQL components, resources, and tooling into a single
version-controlled location.
** Purpose
The goal of this repository is to provide an integrated development
environment (“lab”) for CodeQL research, experimentation, and custom
query development. It simplifies setup by maintaining all required
submodules, configuration files, and datasets in one place.
** Repository Location
The primary repository is hosted at:
https://github.com/hohn/codeql-lab
** Intended Use Cases
- Local experimentation with CodeQL queries and libraries.
- End-to-end testing of custom model data and query logic.
  This includes writing and validating custom data flow models,
  adjusting model coverage, and confirming that query results behave
  as expected across controlled datasets. The lab setup supports rapid
  iteration on QL logic, helping detect unintended changes and enabling
  reproducible evaluations of taint tracking, control flow, or API usage
  patterns.
- Structured collaboration and controlled updates across all
  CodeQL-related artifacts.
- Simplified onboarding and reproducible setup for new contributors or
  analysis environments.
* Prerequisites
Working with this repository assumes prior experience with:
- *Git, Bash, and standard Unix command-line tools*. These are used
  throughout and are required for setup and day-to-day tasks.
  Tools such as [[https://man.archlinux.org/man/rg.1][ripgrep]], [[https://www.gnu.org/software/bash/][GNU Bash]], and [[https://en.wikipedia.org/wiki/Grep][grep/regex workflows]] are assumed.
- *At least one supported programming language*, such as C, C++, Java,
  Python, Go, or Ruby. A solid understanding of the target language is
  necessary to interpret analysis results and write effective queries.
  See general background on [[https://en.wikipedia.org/wiki/Programming_language][programming languages]] if needed.
- *Basic familiarity with program structure concepts*, including
  [[https://en.wikipedia.org/wiki/Abstract_syntax_tree][abstract syntax trees (ASTs)]], [[https://en.wikipedia.org/wiki/Control-flow_graph][control-flow graphs (CFGs)]], and
  [[https://en.wikipedia.org/wiki/Data-flow_analysis][data-flow graphs (DFGs)]]. These are core to how CodeQL models code behavior.
- *Optional but helpful*: familiarity with structural or functional
  programming languages (e.g. [[https://en.wikipedia.org/wiki/Lisp_(programming_language)][Lisp]] or [[https://en.wikipedia.org/wiki/OCaml][OCaml]]) can make working with
  CodeQL's query language and type system more intuitive.
  See overview of [[https://en.wikipedia.org/wiki/Functional_programming][functional programming]] for related context.
* Repository Layout
** Core Structure
- Repository is based on: https://github.com/github/vscode-codeql-starter.git
- All development work is done on the branch: qllab
- CodeQL version is pinned via the =ql/= submodule:
  : commit 4d681f05bd671f8b5e31624f16a2b4d75e61c071 (tag: codeql-cli/v2.22.0)
- A prebuilt CodeQL CLI binary is included:
  : 1104625939 assets/codeql-osx64.zip
- Project-specific repositories can be added directly under the root.
  Example: the C dataflow workshop in =./codeql-dataflow-sql-injection-c=
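The submodule pin can be confirmed from a local checkout. The following is a minimal sketch, assuming a POSIX shell and a clone made with submodules; the helper name =check_ql_pin= is hypothetical and not part of this repository.

#+BEGIN_SRC sh
# A clone that includes the submodules could be obtained with, for example:
#   git clone --recurse-submodules --branch qllab https://github.com/hohn/codeql-lab

# Hypothetical helper: succeed only if the Git checkout in directory $1
# is at exactly the commit given as $2.
check_ql_pin() {
    actual=$(git -C "$1" rev-parse HEAD) || return 2
    [ "$actual" = "$2" ]
}

# Example use from the repository root, with the commit listed above:
#   check_ql_pin ql 4d681f05bd671f8b5e31624f16a2b4d75e61c071 && echo "ql pin OK"
#+END_SRC

A mismatch usually means the submodule was not updated after switching branches; =git submodule update --init= restores the recorded commit.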
** Additional Structure Notes
- The original upstream README.md is preserved at [[./README-vscode-codeql-starter.md]]
* Possible Reading Orders
** Data Flow
*** Debugging data flow config (instead of taint flow), Java
We can illustrate taint-flow debugging in the Java SQL injection sample:
- [[./codeql-sqlite-java/TaintFlowDebugging.ql]]
- [[./codeql-sqlite-java/TaintFlowDebugging.md]]
*** TODO Debugging data flow config (instead of taint flow), C
A corresponding example for C is planned, using a simplified query to trace
value propagation in [[./codeql-dataflow-sql-injection-c/add-user.c]].
Unlike Java, C may require manual modeling even to visualize basic flows.
** Modeling
There are two primary approaches to modeling: direct use of CodeQL predicates
and the models-as-data system. The models-as-data system is implemented in QL
but relies on external YAML files that are interpreted at query evaluation
time.
The model editor provides a GUI for managing YAML-based models, but the
underlying format is identical to that used by the models-as-data system. In C
and other cases where GUI support is limited or unavailable, we write these
YAML models manually and invoke them directly from queries.
When YAML models are written directly, the use of GPT-based tooling becomes
very natural. GPTs can extract function signatures, parameter semantics, and
flow annotations from documentation or code examples, then generate valid YAML
model entries automatically.
As a diagram:
#+BEGIN_SRC text
                      +----------------------+
                      |     Modeling in      |
                      |        CodeQL        |
                      +----------+-----------+
                                 |
          +----------------------+----------------------+
          |                                             |
+---------v--------------+                +-------------v-------------+
| Direct CodeQL          |                |      Models-as-Data       |
| (QL predicates)        |                |     (YAML + QL eval)      |
+---------+--------------+                +-------------+-------------+
          |                                             |
          |                                             |
+---------v--------------+                +-------------v-------------+
| Manual customization   |                |    YAML models via GUI    |
| via Customizations.qll |                |  (Model Editor frontend)  |
+---------+--------------+                +-------------+-------------+
          |                                             |
          |                                             |
+---------v--------------+                +-------------v-------------+
| Java: built-in         |                |   Java: Jedis + Console   |
| includes .qll hook     |                |   GUI modeling examples   |
+---------+--------------+                +---------------------------+
          |
          | Manual setup needed for:
          v
+------------------------+
| C / C++: requires      |
| cpp.qll patch +        |
| Customizations.qll     |
+---------+--------------+
          |
          v
+------------------------------+
| Use models-as-data directly  |
| (YAML only, no editor)       |
+---------+--------------------+
          |
          v
+------------------------------+
| GPT-assisted YAML generation |
| from docs, code, or examples |
+------------------------------+
#+END_SRC
*** Review: SQLite Injection Workshop, Java
We begin with a recap of the Java-based injection example, focusing on the
vulnerable code in [[./codeql-sqlite-java/AddUser.java][AddUser.java]]. Following that, we examine a fully manual
CodeQL query available in [[./codeql-sqlite-java/full-query.ql][full-query.ql]], which was written to explicitly trace
tainted data through the program. Next, we explore the out-of-the-box query
[[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]] included in the standard CodeQL packs, and conclude with an
inspection of the relevant base classes and framework modeling in
[[./codeql-sqlite-java/Illustrations.ql][Illustrations.ql]].
*** Customizations via CodeQL (Java)
To customize CodeQL for Java, we identify and extend base classes to add
custom flow sources and sinks. A general explanation of this approach is
available in the file [[./codeql-dataflow-sql-injection-c/README.org][README.org]], particularly
the section [[file:codeql-dataflow-sql-injection-c/README.org::*supplement codeql: Add to FlowSource or a subclass][supplement codeql: Add to FlowSource or a subclass]]. For Java,
[[./ql/java/ql/lib/java.qll][java.qll]] includes [[./ql/java/ql/lib/Customizations.qll][Customizations.qll]], which provides extension points for
custom flow modeling -- this structure is common across most CodeQL-supported
languages, with the notable exception of C and C++. Further details on this
customization process can be found in
[[./codeql-dataflow-sql-injection-c/incoming.codeql-customizations-workshop.md][incoming.codeql-customizations-workshop.md]].
*** Customizations via Model Editor: Jedis Example (Java Redis client)
The Jedis example is a straightforward case with no unexpected
behavior. Although the library contains many functions, they follow a simple
and repetitive pattern, making it ideal for large-scale modeling. The CodeQL
model editor can be used to efficiently define sources and sinks for such
cases. A detailed explanation is provided
in [[file:codeql-jedis-java/README.org::*Modeling Jedis as a Dependency in Model Editor][Modeling Jedis as a Dependency in Model Editor]], while validation of
the modeled sink is discussed in [[file:codeql-jedis-java/README.org::*Verifying the Modeled Sink][Verifying the Modeled Sink]].
Finally, the query-level usage of these models can be seen
in [[file:codeql-jedis-java/README.org::*Identify usage of injection-related models in existing queries][Identify usage of injection-related models in existing queries]].
*** Customizations via Model Editor: Single-function case (Java SQLite sample)
We extend the Java SQLite example using the model editor, with both the
necessary data and specification already available. This example highlights a
subtle issue with the model editor: the method =java.io.Console.readLine()= is
already modeled as a taint *step* and therefore does not appear in the editor
interface, even though we need it modeled as a *source*. This requires special
handling. The relevant extensions are defined in
[[./.github/codeql/extensions/sqlite-db/codeql-pack.yml]], and the extension data
is provided in
[[./.github/codeql/extensions/sqlite-db/models/sqlite.model.yml]]. A detailed
explanation is available in [[file:codeql-sqlite-java/README.org::*Using sqlite to illustrate models-as-data][Using sqlite to illustrate models-as-data]].
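For orientation, the general shape of such an extension pack can be sketched as follows. This is an illustration under stated assumptions, not a copy of the files in this repository: the pack name =sample/sqlite-db=, the helper name =write_java_model_pack=, and the source kind =local= are all assumptions, and the files are written to a scratch directory rather than to =.github/codeql/extensions/=.

#+BEGIN_SRC sh
# Hypothetical helper: write a sketch of a Java model pack into directory $1.
write_java_model_pack() {
    mkdir -p "$1/models" || return 1

    # Pack metadata: declares which library pack the data extends.
    cat > "$1/codeql-pack.yml" <<'EOF'
# Assumed pack name, for illustration only.
name: sample/sqlite-db
version: 0.0.0
library: true
extensionTargets:
  codeql/java-all: '*'
dataExtensions:
  - models/**/*.yml
EOF

    # Model data: one row marking Console.readLine() as a flow source.
    cat > "$1/models/sqlite.model.yml" <<'EOF'
extensions:
  - addsTo:
      pack: codeql/java-all
      extensible: sourceModel
    data:
      # package, type, subtypes, name, signature, ext, output, kind, provenance
      - ["java.io", "Console", False, "readLine", "", "", "ReturnValue", "local", "manual"]
EOF
}

# Example use:
#   write_java_model_pack /tmp/sqlite-db-sketch
#+END_SRC

Note that the default threat model covers only remote sources, so a row with kind =local= takes effect only when local sources are enabled for the analysis; the kind used in the repository's own model file may differ.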
To support this, we explain how the "models-as-data" system works
internally. A diagnostic query can be used to enumerate currently recognized
sources and sinks. From there, the relevant entry points -- such as QL classes
and predicates -- can be identified by inspecting representative queries like
[[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]].
*** Review: SQLite Injection Workshop (C)
This is the C version of the injection workshop, based on
[[./codeql-dataflow-sql-injection-c/add-user.c]]. It
serves as the basis for both the "models-as-data" manual modeling and the
extension via Customizations.qll.
*** Use models-as-data QL code directly (no graphical editor)
This section focuses on applying the models-as-data system without using the
graphical model editor. While model definition files and supporting data
already exist, we manually author YAML files for new models. This approach is
especially relevant for C, where graphical tooling is limited or nonexistent.
As reinforcement, we use the C version of the SQLite injection workshop:
- The code sample is at [[./codeql-dataflow-sql-injection-c/add-user.c]].
- The accompanying query is [[./codeql-dataflow-sql-injection-c/SqlInjection.ql]].
We extend this example by modeling key functions manually:
- Add a source model for =count = read(STDIN_FILENO, buf, BUFSIZE);=
- Add a sink model for =rc = sqlite3_exec(db, query, NULL, 0, &zErrMsg);=
For reference, see the Java version's structure (but not the graphical
editor): [[file:codeql-sqlite-java/README.org::*Using sqlite to illustrate models-as-data][Using sqlite to illustrate models-as-data]], and the corresponding
C-specific walkthrough: [[file:codeql-dataflow-sql-injection-c/README.org::*Using sqlite to illustrate models-as-data][Using sqlite to illustrate models-as-data]].
We demonstrate how to define YAML-based models for standard functions like
=read()= and verify their effect using the out-of-the-box query
[[./ql/cpp/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]]. As an additional example, we introduce the higher-level,
redundant =char* get_user_info()= as a custom source—even though it internally
calls a function already modeled as a source—to illustrate how user-defined
extensions propagate through the query logic.
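The two models listed above could look roughly like the following. This is a minimal sketch rather than the repository's actual model file: the helper name =write_cpp_models=, the output file name, the choice of =Argument[*1]= (the data pointed to by the second argument), and the kinds =local= and =sql-injection= are assumptions.

#+BEGIN_SRC sh
# Hypothetical helper: write a hand-authored C/C++ model file into directory $1.
write_cpp_models() {
    mkdir -p "$1/models" || return 1
    cat > "$1/models/sqlite-c.model.yml" <<'EOF'
extensions:
  # Source: the buffer filled by read(fd, buf, count).
  - addsTo:
      pack: codeql/cpp-all
      extensible: sourceModel
    data:
      # namespace, type, subtypes, name, signature, ext, output, kind, provenance
      - ["", "", False, "read", "", "", "Argument[*1]", "local", "manual"]
  # Sink: the SQL text passed to sqlite3_exec(db, sql, ...).
  - addsTo:
      pack: codeql/cpp-all
      extensible: sinkModel
    data:
      # namespace, type, subtypes, name, signature, ext, input, kind, provenance
      - ["", "", False, "sqlite3_exec", "", "", "Argument[*1]", "sql-injection", "manual"]
EOF
}

# Example use:
#   write_cpp_models /tmp/sqlite-c-sketch
#+END_SRC

Whether a stock query picks up these rows depends on the CodeQL version and on the active threat model, so results should be checked against the walkthrough linked above.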
*** Extending Queries with Customizations.qll for C
The manual YAML modeling approach from the previous section works well for
isolated cases. However, to integrate seamlessly with idiomatic CodeQL
queries, we show how to extend the standard QL libraries via
=Customizations.qll=.
While most CodeQL-supported languages provide out-of-the-box support for
=Customizations.qll=, C and C++ do not include this by default. However, it is
possible to enable such support by building a custom CodeQL bundle. This can
be done using the CLI tool at
https://github.com/advanced-security/codeql-bundle. Since the tool functions
largely as a black box, we provide a more detailed illustration of the
underlying steps.
A working demonstration is available in
[[./codeql-dataflow-sql-injection-c/README.org]]. In languages like Java,
=Customizations.qll= is included automatically via imports from
=<language>.qll=, such as [[./ql/java/ql/lib/java.qll][java.qll]] importing [[./ql/java/ql/lib/Customizations.qll][Customizations.qll]], which defines
user-extensible predicates for flow modeling.
For C/C++, the process requires explicit modification:
1. Modify =ql/cpp/ql/lib/cpp.qll= to import =Customizations.qll=.
2. Create and populate =ql/cpp/ql/lib/Customizations.qll= with custom sources/sinks or extensions.
3. Rebuild the CodeQL bundle to include these changes.
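Steps 1 and 2 can be sketched in shell as follows. This is an illustration under assumptions, not the procedure used by the bundling tool: the helper name =enable_cpp_customizations= is hypothetical, and the stub file layout is modeled on the Java =Customizations.qll=. Step 3 is not covered here.

#+BEGIN_SRC sh
# Hypothetical helper: apply steps 1 and 2 to the library directory given
# as $1 (in this repository, ql/cpp/ql/lib).
enable_cpp_customizations() {
    lib_dir=$1
    [ -f "$lib_dir/cpp.qll" ] || { echo "no cpp.qll in $lib_dir" >&2; return 1; }

    # Step 2: create a stub Customizations.qll unless one already exists.
    if [ ! -f "$lib_dir/Customizations.qll" ]; then
        cat > "$lib_dir/Customizations.qll" <<'EOF'
/**
 * Placeholder for user-defined sources, sinks, and other extensions.
 * Assumed layout, modeled on the Java file of the same name.
 */

import cpp
EOF
    fi

    # Step 1: add the import only once, so repeated runs are harmless.
    grep -qx 'import Customizations' "$lib_dir/cpp.qll" ||
        printf '\nimport Customizations\n' >> "$lib_dir/cpp.qll"
}

# Example use from the repository root:
#   enable_cpp_customizations ql/cpp/ql/lib
#+END_SRC

Because the helper checks for an existing import and an existing file before writing, it can be rerun after a submodule update without duplicating the import or overwriting custom models.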
This customization enables consistent user-defined flow modeling across
languages, making it possible to reuse modeling patterns from Java or Python
in C/C++ contexts.
** TODO CodeQL Bundling
This section will provide a detailed walkthrough of the CodeQL bundling process
using the CLI tool at https://github.com/advanced-security/codeql-bundle. This
tool enables custom pack composition and is necessary when extending language
libraries (e.g., adding =Customizations.qll= support for C/C++).
While the official tool is somewhat of a black box, we will demystify the
underlying structure and show how to build, inspect, and deploy custom bundles
from source. Notes and scripts will be collected in
[[file:codeql-bundling/README.org::XX: continue]].
* Tool Setup
This lab uses several scripts, found in [[./bin/]]. To ensure the ones written in
Python have access to their prerequisites, set up a virtual environment via
#+BEGIN_SRC sh
# 1. Create the virtualenv
python3 -m venv ~/codeql-lab/venv
# 2. Install any packages
source ~/codeql-lab/venv/bin/activate
pip install pyyaml
#+END_SRC
For any of these scripts to work, add them to the PATH via
#+BEGIN_SRC sh
export PATH="$HOME/codeql-lab/bin:$PATH"
#+END_SRC
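To confirm that the =PATH= change took effect in the current shell, a small check such as the following can be used. It is a sketch only; the helper name =on_path= is hypothetical.

#+BEGIN_SRC sh
# Hypothetical helper: succeed if directory $1 is an entry of $PATH.
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use:
#   on_path "$HOME/codeql-lab/bin" && echo "lab scripts are on PATH"
#+END_SRC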