* codeql-lab: Centralized Git Repository for CodeQL Development
** Overview
codeql-lab is a consolidated Git repository that collects all relevant
CodeQL components, resources, and tooling into a single
version-controlled location.
** Purpose
The goal of this repository is to provide an integrated development
environment (“lab”) for CodeQL research, experimentation, and custom
query development. It simplifies setup by maintaining all required
submodules, configuration files, and datasets in one place.
** Repository Location
The primary repository is hosted at:
https://github.com/hohn/codeql-lab
** Intended Use Cases
- Local experimentation with CodeQL queries and libraries.
- End-to-end testing of custom model data and query logic.
This includes writing and validating custom data flow models,
adjusting model coverage, and confirming that query results behave
as expected across controlled datasets. The lab setup supports rapid
iteration on QL logic, helping detect unintended changes and enabling
reproducible evaluations of taint tracking, control flow, or API usage
patterns.
- Structured collaboration and controlled updates across all
CodeQL-related artifacts.
- Simplified onboarding and reproducible setup for new contributors or
analysis environments.
* Prerequisites
Working with this repository assumes prior experience with:
- *Git, Bash, and standard Unix command-line tools*. These are used
throughout and are required for setup and day-to-day tasks.
Tools such as [[https://man.archlinux.org/man/rg.1][ripgrep]], [[https://www.gnu.org/software/bash/][GNU Bash]], and [[https://en.wikipedia.org/wiki/Grep][grep/regex workflows]] are assumed.
- *At least one supported programming language*, such as C, C++, Java,
Python, Go, or Ruby. A solid understanding of the target language is
necessary to interpret analysis results and write effective queries.
See general background on [[https://en.wikipedia.org/wiki/Programming_language][programming languages]] if needed.
- *Basic familiarity with program structure concepts*, including
[[https://en.wikipedia.org/wiki/Abstract_syntax_tree][abstract syntax trees (ASTs)]], [[https://en.wikipedia.org/wiki/Control-flow_graph][control-flow graphs (CFGs)]], and
[[https://en.wikipedia.org/wiki/Data-flow_analysis][data-flow graphs (DFGs)]]. These are core to how CodeQL models code behavior.
- *Optional but helpful*: familiarity with structural or functional
programming languages (e.g. [[https://en.wikipedia.org/wiki/Lisp_(programming_language)][Lisp]] or [[https://en.wikipedia.org/wiki/OCaml][OCaml]]) can make working with
CodeQL's query language and type system more intuitive.
See overview of [[https://en.wikipedia.org/wiki/Functional_programming][functional programming]] for related context.
* Repository Layout
** Core Structure
- Repository is based on: https://github.com/github/vscode-codeql-starter.git
- All development work is done on the branch: qllab
- CodeQL version is pinned via the =ql/= submodule:
: commit 4d681f05bd671f8b5e31624f16a2b4d75e61c071 (tag: codeql-cli/v2.22.0)
- A prebuilt CodeQL CLI binary is included:
: 1104625939 assets/codeql-osx64.zip
- Project-specific repositories can be added directly under the root.
Example: the C dataflow workshop in =./codeql-dataflow-sql-injection-c=
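Because the CodeQL version is pinned through the =ql/= submodule, it can be useful to confirm that a checkout really is at the pinned commit before running queries. The following is a minimal sketch, not part of the repository's scripts; the helper name =check_pin= is hypothetical, and it relies only on =git rev-parse=.
#+BEGIN_SRC sh
# Hypothetical helper: succeed only if the Git checkout in directory $1
# is at commit $2. Returns 2 if $1 is not a usable Git checkout.
check_pin() {
    actual=$(git -C "$1" rev-parse HEAD 2>/dev/null) || return 2
    [ "$actual" = "$2" ]
}

# Usage from the repository root (shown as a comment, not executed here):
#   check_pin ql 4d681f05bd671f8b5e31624f16a2b4d75e61c071 \
#       && echo "ql/ is at the pinned commit"
#+END_SRC
If the check fails after a fresh clone, the submodule has probably not been initialized or has been moved to a different commit.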
** Additional Structure Notes
- The original upstream README.md is preserved at [[./README-vscode-codeql-starter.md]]
* Possible Reading Orders
** Data Flow
*** Debugging data flow config (instead of taint flow), Java
Taint-flow debugging is illustrated in the Java SQL injection sample:
- [[./codeql-sqlite-java/TaintFlowDebugging.ql]]
- [[./codeql-sqlite-java/TaintFlowDebugging.md]]
*** TODO Debugging data flow config (instead of taint flow), C
A corresponding example for C is planned, using a simplified query to trace
value propagation in [[file:~/work-gh/codeql-lab/codeql-dataflow-sql-injection-c/add-user.c]].
Unlike Java, C may require manual modeling even to visualize basic flows.
** Modeling
There are two primary approaches to modeling: direct use of CodeQL predicates
and the models-as-data system. The models-as-data system is implemented in QL
but relies on external YAML files that are interpreted at query evaluation
time.
The model editor provides a GUI for managing YAML-based models, but the
underlying format is identical to that used by the models-as-data system. In C
and other cases where GUI support is limited or unavailable, we write these
YAML models manually and invoke them directly from queries.
When YAML models are written directly, the use of GPT-based tooling becomes
very natural. GPTs can extract function signatures, parameter semantics, and
flow annotations from documentation or code examples, then generate valid YAML
model entries automatically.
As a diagram:
#+BEGIN_SRC text
                   +----------------------+
                   |  Modeling in CodeQL  |
                   +-----------+----------+
                               |
            +------------------+-----------------+
            |                                    |
+-----------v------------+          +------------v------------+
|     Direct CodeQL      |          |     Models-as-Data      |
|    (QL predicates)     |          |    (YAML + QL eval)     |
+-----------+------------+          +------------+------------+
            |                                    |
+-----------v------------+          +------------v------------+
|  Manual customization  |          |   YAML models via GUI   |
| via Customizations.qll |          | (Model Editor frontend) |
+-----------+------------+          +------------+------------+
            |                                    |
+-----------v------------+          +------------v------------+
|     Java: built-in     |          |  Java: Jedis + Console  |
|   includes .qll hook   |          |  GUI modeling examples  |
+-----------+------------+          +-------------------------+
            |
            | Manual setup needed for:
            v
+------------------------+
|   C / C++: requires    |
|    cpp.qll patch +     |
|   Customizations.qll   |
+-----------+------------+
            |
            v
+-----------------------------+
| Use models-as-data directly |
|   (YAML only, no editor)    |
+-----------+-----------------+
            |
            v
+------------------------------+
| GPT-assisted YAML generation |
| from docs, code, or examples |
+------------------------------+
#+END_SRC
*** Review: SQLite Injection Workshop, Java
We begin with a recap of the Java-based injection example, focusing on the
vulnerable code in [[./codeql-sqlite-java/AddUser.java][AddUser.java]]. Following that, we examine a fully manual
CodeQL query available in [[./codeql-sqlite-java/full-query.ql][full-query.ql]], which was written to explicitly trace
tainted data through the program. Next, we explore the out-of-the-box query
[[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]] included in the standard CodeQL packs, and conclude with an
inspection of the relevant base classes and framework modeling in
[[./codeql-sqlite-java/Illustrations.ql][Illustrations.ql]].
*** Customizations via codeql (Java)
To customize CodeQL for Java, we identify and extend base classes to add
custom flow sources and sinks. A general explanation of this approach is
available in the file [[./codeql-dataflow-sql-injection-c/README.org][README.org]], particularly
the section [[file:codeql-dataflow-sql-injection-c/README.org::*supplement codeql: Add to FlowSource or a subclass][supplement codeql: Add to FlowSource or a subclass]]. For Java,
[[./ql/java/ql/lib/java.qll][java.qll]] includes [[./ql/java/ql/lib/Customizations.qll][Customizations.qll]], which provides extension points for
custom flow modeling -- this structure is common across most CodeQL-supported
languages, with the notable exception of C. Further details on this
customization process can be found in
[[./codeql-dataflow-sql-injection-c/incoming.codeql-customizations-workshop.md][incoming.codeql-customizations-workshop.md]].
*** Customizations via Model Editor: Jedis Example (Java Redis client)
The Jedis example is a straightforward case with no unexpected
behavior. Although the library contains many functions, they follow a simple
and repetitive pattern, making it ideal for large-scale modeling. The CodeQL
model editor can be used to efficiently define sources and sinks for such
cases. A detailed explanation is provided
in [[file:~/work-gh/codeql-lab/codeql-jedis-java/README.org::*Modeling Jedis as a Dependency in Model Editor][Modeling Jedis as a Dependency in Model Editor]], while validation of
the modeled sink is discussed in [[file:~/work-gh/codeql-lab/codeql-jedis-java/README.org::*Verifying the Modeled Sink][Verifying the Modeled Sink]].
Finally, the query-level usage of these models can be seen
in [[file:~/work-gh/codeql-lab/codeql-jedis-java/README.org::*Identify usage of injection-related models in existing queries][Identify usage of injection-related models in existing queries]].
*** Customizations via Model Editor: Single-function case (Java SQLite sample)
We extend the Java SQLite example using the model editor, with both the
necessary data and specification already available. This example highlights a
subtle issue with the model editor: the method =java.io.Console.readLine()= is
already modeled as a taint *step* and therefore does not appear in the editor
interface, even though we need it modeled as a *source*. This requires special
handling. The relevant extensions are defined in
[[./.github/codeql/extensions/sqlite-db/codeql-pack.yml]], and the extension data
is provided in
[[./.github/codeql/extensions/sqlite-db/models/sqlite.model.yml]]. A detailed
explanation is available in [[file:~/work-gh/codeql-lab/codeql-sqlite-java/README.org::*Using sqlite to illustrate models-as-data][Using sqlite to illustrate models-as-data]].
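To make the special handling concrete, here is a minimal sketch of what a hand-written source entry for =java.io.Console.readLine()= could look like. It is not a copy of the repository's =sqlite.model.yml=; the scratch directory, the file name =console.model.yml=, and the source kind =local= are assumptions for illustration. Choose the kind that matches the threat model your queries run under.
#+BEGIN_SRC sh
# Write into a scratch directory so the repository's real extension
# files are left untouched (directory and file names are hypothetical).
ext_dir=$(mktemp -d)/sqlite-db-sketch
mkdir -p "$ext_dir/models"

cat > "$ext_dir/models/console.model.yml" <<'EOF'
extensions:
  - addsTo:
      pack: codeql/java-all
      extensible: sourceModel
    data:
      # package, type, subtypes, name, signature, ext, output, kind, provenance
      # The kind "local" is an assumption; adjust it to your threat model.
      - ["java.io", "Console", False, "readLine", "()", "", "ReturnValue", "local", "manual"]
EOF

echo "wrote $ext_dir/models/console.model.yml"
#+END_SRC
Writing the row by hand sidesteps the editor limitation described above, since the YAML format does not care whether the method already has a taint-step model.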
To support this, we explain how the "models-as-data" system works
internally. A diagnostic query can be used to enumerate currently recognized
sources and sinks. From there, the relevant entry points -- such as QL classes
and predicates -- can be identified by inspecting representative queries like
[[./ql/java/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]].
*** Review: SQLite Injection Workshop (C)
This is the C version of the injection workshop, based on
[[./codeql-dataflow-sql-injection-c/add-user.c]]. It
serves as the basis for both the "models-as-data" manual modeling and the
extension via =Customizations.qll=.
*** (PARTIAL) Use models-as-data QL code directly (no graphical editor)
This section focuses on using the models-as-data system *without* the
graphical model editor. While model definition files and supporting data
already exist, we manually write YAML files to add or override flow
behavior. This approach is especially relevant for C, where graphical tooling
is limited or nonexistent.
As reinforcement, we reuse the C version of the SQLite injection workshop:
- The code sample is at
[[file:~/work-gh/codeql-lab/codeql-dataflow-sql-injection-c/add-user.c]].
- The accompanying query is
[[file:~/work-gh/codeql-lab/codeql-dataflow-sql-injection-c/SqlInjection.ql]].
For structural reference, see the Java version's documentation (not the editor
interface): [[file:~/work-gh/codeql-lab/codeql-sqlite-java/README.org::*Using sqlite to illustrate models-as-data][Using sqlite to illustrate models-as-data]]. There is no separate
C-specific walkthrough because the YAML structure and logic are nearly
identical.
For workshop use, we extend the example by modeling key functions manually:
- Add a source model for: =count = read(STDIN_FILENO, buf, BUFSIZE);=
- Add a sink model for: =rc = sqlite3_exec(db, query, NULL, 0, &zErrMsg);=
We demonstrate how to define YAML-based models for standard functions like
=read()= and verify their effect using the out-of-the-box query:
[[./ql/cpp/ql/src/Security/CWE/CWE-089/SqlTainted.ql][SqlTainted.ql]].
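The two models above can be sketched as a small model pack written by hand. This is an illustrative draft under stated assumptions, not the workshop's final files: the pack name =example/sqlite-db-c-models=, the scratch directory, and the kinds =local= and =sql-injection= are assumptions, and whether a given query picks up these kinds depends on the CodeQL version and the active threat model. =Argument[*1]= refers to the data pointed to by the second argument, which is =buf= for =read()= and =query= for =sqlite3_exec()=.
#+BEGIN_SRC sh
# Scratch location for the sketch (hypothetical directory name).
pack_dir=$(mktemp -d)/sqlite-db-c-sketch
mkdir -p "$pack_dir/models"

# Model pack definition targeting the C/C++ library pack.
cat > "$pack_dir/codeql-pack.yml" <<'EOF'
name: example/sqlite-db-c-models   # hypothetical pack name
version: 0.0.1
library: true
extensionTargets:
  codeql/cpp-all: "*"
dataExtensions:
  - models/**/*.yml
EOF

# Source and sink rows for the two calls in add-user.c.
cat > "$pack_dir/models/sqlite.model.yml" <<'EOF'
extensions:
  - addsTo:
      pack: codeql/cpp-all
      extensible: sourceModel
    data:
      # namespace, type, subtypes, name, signature, ext, output, kind, provenance
      - ["", "", False, "read", "", "", "Argument[*1]", "local", "manual"]
  - addsTo:
      pack: codeql/cpp-all
      extensible: sinkModel
    data:
      # namespace, type, subtypes, name, signature, ext, input, kind, provenance
      - ["", "", False, "sqlite3_exec", "", "", "Argument[*1]", "sql-injection", "manual"]
EOF

echo "model pack sketch written to $pack_dir"
#+END_SRC
Keeping the source and sink in one file mirrors the layout used for the Java extension under =.github/codeql/extensions/sqlite-db/=.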
As an additional teaching case, we introduce the higher-level, redundant
function =char* get_user_info()= as a custom source—even though it internally
calls a function already modeled as a source—to illustrate how user-defined
extensions affect propagation logic.
*** (PARTIAL) Extending Queries with Customizations.qll for C
The manual YAML modeling approach described earlier works well for isolated or
prototype cases. However, for idiomatic, large-scale, or reusable CodeQL
analysis, it is often preferable to define custom dataflow logic directly in
QL—using =Customizations.qll=.
Most CodeQL-supported languages (e.g., Java, Python) include built-in support
for this mechanism. For example, Java's primary entry point [[./ql/java/ql/lib/java.qll][java.qll]]
automatically imports [[./ql/java/ql/lib/Customizations.qll][Customizations.qll]], exposing extension points for
user-defined sources, sinks, and flow steps.
In contrast, C and C++ do *not* support this out of the box. To enable it, you
must manually patch the language pack and (optionally) rebuild the CodeQL
bundle.
This section is *partially complete*: we document the required source-level QL
changes, but the bundling process is still pending.
To enable =Customizations.qll= support for C/C++, perform the following:
1. Modify =ql/cpp/ql/lib/cpp.qll= to import your =Customizations.qll= module.
2. Create and populate =ql/cpp/ql/lib/Customizations.qll= with new
source/sink/flow logic.
3. *For full deployment:* Rebuild the CodeQL bundle to include the updated
QL files.
- This allows portable use in CLI runs and IDE workflows.
- Once bundled, C/C++ customization behaves like any other supported
language.
4. *For workshops and local development:* No bundling is needed.
- If you run queries directly from the modified source tree, the changes
take effect immediately.
A working demonstration of this modification (without bundling) is provided
in:
[[./codeql-dataflow-sql-injection-c/README.org]]
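Steps 1 and 2 above can be sketched as a small, idempotent shell snippet. It is an illustration rather than the repository's own procedure: by default it works on a scratch directory with a stand-in =cpp.qll=, and the =LIB_DIR= variable is a hypothetical knob for pointing it at =ql/cpp/ql/lib=. The generated =Customizations.qll= is only a stub; the actual source, sink, and flow-step definitions still have to be written in QL.
#+BEGIN_SRC sh
# Target directory; defaults to a scratch directory so nothing in the
# real tree is modified. Set LIB_DIR=ql/cpp/ql/lib to patch the submodule.
lib_dir=${LIB_DIR:-$(mktemp -d)}

# In the scratch case, create a stand-in for the real cpp.qll.
[ -f "$lib_dir/cpp.qll" ] ||
    printf '// stand-in for the real cpp.qll\n' > "$lib_dir/cpp.qll"

# Step 1: add the import once; rerunning the snippet does not duplicate it.
grep -qx 'import Customizations' "$lib_dir/cpp.qll" ||
    printf 'import Customizations\n' >> "$lib_dir/cpp.qll"

# Step 2: create a stub Customizations.qll unless one already exists.
[ -f "$lib_dir/Customizations.qll" ] ||
cat > "$lib_dir/Customizations.qll" <<'EOF'
/**
 * Local customizations for the C/C++ libraries (workshop stub).
 * Add source, sink, and flow-step definitions below.
 */

import cpp
EOF

echo "patched $lib_dir/cpp.qll"
#+END_SRC
Because the import is only appended when it is missing, the snippet can be rerun safely after updating the =ql/= submodule.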
** TODO CodeQL Bundling
This section will provide a detailed walkthrough of the CodeQL bundling process
using the CLI tool at https://github.com/advanced-security/codeql-bundle. This
tool enables custom pack composition and is necessary when extending language
libraries (e.g., adding =Customizations.qll= support for C/C++).
While the official tool is somewhat of a black box, we will demystify the
underlying structure and show how to build, inspect, and deploy custom bundles
from source. Notes and scripts will be collected in
[[file:codeql-bundling/README.org::XX: continue]].
* Tool Setup
This lab uses helper scripts, found in [[./bin/]]. To ensure that the ones written in
Python have access to their prerequisites, set up a virtual environment via
#+BEGIN_SRC sh
# 1. Create the virtualenv
python3 -m venv ~/codeql-lab/venv
# 2. Install any packages
source ~/codeql-lab/venv/bin/activate
pip install pyyaml
#+END_SRC
For any of these scripts to work, add them to the PATH via
#+BEGIN_SRC sh
export PATH="$HOME/codeql-lab/bin:$PATH"
#+END_SRC