mirror of https://github.com/hohn/codeql-for-firefox.git, synced 2025-12-16 07:33:04 +01:00

add build process description
committed by Michael Hohn
parent 4902a3628b
commit ef2d0ddcf9
.gitignore (vendored, new file, 6 lines)
@@ -0,0 +1,6 @@
firefox-db-cpp-scan-bm-none.formatted.sarif
firefox-db-cpp-scan-bm-none.formatted.sarif.zst
firefox-db-cpp-scan.formatted.sarif
firefox-db-cpp-scan.formatted.sarif.zst
firefox-db-bm-none.tar.zst
firefox-db.tar.zst
README.org (new file, 137 lines)
@@ -0,0 +1,137 @@
* Overview
This repo hosts a large-scale CodeQL demo database for **Firefox**.
Purpose: to demonstrate realistic CodeQL performance and scaling.
Smaller demo repos understate costs and mislead about practical usage.

This is work in progress.

* Download Artifacts
Base URL: https://github.com/hohn/codeql-for-firefox/releases

| Filename | Size | Description | URL |
|-------------------------------------------------+---------+-----------------------------------+-----|
| firefox-db-bm-none.tar.zst | 1.66 GB | Full CodeQL DB (build-mode=none) | [[https://github.com/hohn/codeql-for-firefox/releases/download/build-artifacts-1.0/firefox-db-bm-none.tar.zst][link]] |
| firefox-db-cpp-scan-bm-none.formatted.sarif.zst | 72.1 MB | SARIF results, C++ scan (bm=none) | [[https://github.com/hohn/codeql-for-firefox/releases/download/build-artifacts-1.0/firefox-db-cpp-scan-bm-none.formatted.sarif.zst][link]] |
| firefox-db-cpp-scan.formatted.sarif.zst | 986 KB | SARIF results, C++ scan (with bm) | [[https://github.com/hohn/codeql-for-firefox/releases/download/build-artifacts-1.0/firefox-db-cpp-scan.formatted.sarif.zst][link]] |
| firefox-db.tar.zst | 756 MB | Full CodeQL DB (trace build mode) | [[https://github.com/hohn/codeql-for-firefox/releases/download/build-artifacts-1.0/firefox-db.tar.zst][link]] |

The **bm** abbreviation stands for build mode:
- bm=none → extraction only, no actual build
- plain → traced build

A download-and-unpack sketch follows below.

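A minimal sketch for fetching and unpacking one of the database archives above.
The filename and release URL come from the table; the use of curl and zstd is an
assumption about local tooling, not a recorded step.

#+BEGIN_SRC sh
# Fetch the build-mode=none database archive (URL from the table above)
curl -L -O https://github.com/hohn/codeql-for-firefox/releases/download/build-artifacts-1.0/firefox-db-bm-none.tar.zst

# Unpack; requires zstd to be installed
tar --use-compress-program=zstd -xf firefox-db-bm-none.tar.zst

# firefox-db-bm-none/ is now a ready-to-query CodeQL database
#+END_SRC
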
* Building a CodeQL DB for Firefox (Linux, build-mode=trace-command)
#+BEGIN_SRC sh
cd ~/large-local-only/firefox/firefox
./mach clobber
export PATH=/home/hohn/large-local-only/codeql:$PATH
source /home/hohn/.cargo/env

codeql database init \
    --language=cpp \
    --source-root=. \
    firefox-db

./mach configure

# Run under tracing (OOM at -j20, succeeded with -j10)
codeql database trace-command firefox-db -- ./mach build -j10

codeql database finalize firefox-db
#+END_SRC

**Timings**
- Plain Firefox build: ~10 minutes
- Build with CodeQL trace: ~57 minutes
- Finalize: ~50 minutes

One way to reproduce these measurements is sketched below.

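How the timings were captured is not recorded; one way to reproduce them is to
wrap the long-running steps of the recipe above in the shell's time builtin.
This assumes the database-init and configure steps above have already run, and
that the tree is clobbered between the plain and traced builds:

#+BEGIN_SRC sh
time ./mach build -j10                                               # plain build
./mach clobber && ./mach configure
time codeql database trace-command firefox-db -- ./mach build -j10   # traced build
time codeql database finalize firefox-db                             # finalize
#+END_SRC
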
* Build Stats (firefox-db)
| Phase             | Directory   | Size  | Notes                  |
|-------------------+-------------+-------+------------------------|
| During finalize   | trap/       | 16.5G | TRAP facts             |
|                   | db-cpp/     | 7.0G  | Relational store       |
|                   | log/        | 2.8G  | Build + extractor logs |
|                   | src/        | 561M  | Source snapshot        |
| After finalize    | db-cpp/     | 2.5G  | Relational store       |
|                   | log/        | 2.9G  | Logs                   |
|                   | diagnostic/ | 32K   | Scratch                |
| Final DB size     | firefox-db/ | 5.5G  | Usable DB              |
| Distribution file | tar.zst     | 757M  | Compressed archive     |

Note: the interim DB size (~27 GB, roughly the sum of the during-finalize rows)
differs from the final size (5.5 GB). A sketch of producing the distribution
archive follows below.

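The distribution archive in the last row can be produced the same way the
bm=none archive is packaged later in this document; a minimal sketch:

#+BEGIN_SRC sh
# Compress the finalized database for distribution
tar --use-compress-program="zstd -19 -T0" -cvf \
    firefox-db.tar.zst firefox-db
#+END_SRC
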
* Building a DB with build-mode=none
#+BEGIN_SRC sh
cd ~/large-local-only/firefox/firefox
./mach clobber
rm -rf obj-x86_64-pc-linux-gnu/ firefox-db*
export PATH=/home/hohn/large-local-only/codeql:$PATH
source /home/hohn/.cargo/env

./mach configure

codeql database create \
    --language=cpp \
    --source-root=. \
    --threads 20 \
    --ram=50000 \
    --build-mode=none \
    firefox-db-bm-none

tar --use-compress-program="zstd -19 -T0" -cvf \
    firefox-db-bm-none.tar.zst firefox-db-bm-none
#+END_SRC

**Results**
- Extraction time: ~17 minutes (init → finalize)
- TRAP import: 11m14s
- Final DB size: ~2–3 GB relational store + 137 MB source archive
- Compressed archive: 1.66 GB

* Query Suite Selection
Available suites (codeql/cpp-queries 1.4.6):

| Suite                         | Queries |
|-------------------------------+---------|
| cpp-code-scanning.qls         |      60 |
| cpp-lgtm.qls                  |     108 |
| cpp-lgtm-full.qls             |     178 |
| cpp-security-and-quality.qls  |     181 |
| cpp-security-experimental.qls |     134 |
| cpp-security-extended.qls     |      97 |

The per-suite query counts can be re-derived with codeql resolve queries; see
the sketch below.

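How the query counts were obtained is not recorded. One way, assuming the
codeql/cpp-queries pack has been downloaded locally (e.g. with codeql pack
download), is to resolve a suite and count the query files it expands to:

#+BEGIN_SRC sh
# Expand a suite to its individual queries and count them
codeql resolve queries codeql/cpp-queries:codeql-suites/cpp-code-scanning.qls | wc -l
#+END_SRC
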
* Benchmarks (firefox-db, trace build)
- DB: firefox-db
- Suite: cpp-code-scanning (60 queries)
- Host: Mac Studio (28c / 256 GB RAM, CodeQL 2.22.4)

| Walltime | CPU% | User CPU (s) | Sys CPU (s) | Max RSS (GB) | Maj PF | Min PF | Invol CS | Vol CS |
|----------+------+--------------+-------------+--------------+--------+--------+----------+--------|
| 22:16    | 1065 | 13775        | 464         | 124          | 124776 | 8.2 M  | 48.3 M   | 230647 |

Notes:
- ~22 minutes for the “short” suite (60 queries)
- ~10.6 cores saturated
- Peak RAM: 124 GB, no swap
- Very high context-switch activity

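The exact analyze invocation and measurement method are not recorded here. A
sketch of a comparable run on macOS, assuming the codeql/cpp-queries pack is
available; the --threads/--ram values are illustrative and mirror the
database-create step above:

#+BEGIN_SRC sh
# /usr/bin/time -l (macOS) reports max RSS, page faults, and context switches
/usr/bin/time -l \
    codeql database analyze firefox-db \
        codeql/cpp-queries:codeql-suites/cpp-code-scanning.qls \
        --format=sarif-latest \
        --output=firefox-db-cpp-scan.sarif \
        --threads=20 --ram=50000
#+END_SRC
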
* Benchmarks (firefox-db-bm-none)
- Wall time: 23m41s
- CPU time: 13 555 s (≈953% CPU utilization, ~9.5 cores avg)
- Max RAM: 126 GB
- Page faults: 294 k major, 10 M minor
- Context switches: 25.9 M invol, 248 k vol
- SARIF output: 2.9 GB (!), vs 33 MB for the build-traced DB

* Run Summary
| DB                 | SARIF size | Log size |
|--------------------+------------+----------|
| firefox-db         | 33 MB      | 32 KB    |
| firefox-db-bm-none | 2.9 GB     | 32 KB    |

* SARIF Output Notes
Using minimize-sarif.py (usage sketch below):
- firefox-db-cpp-scan-bm-none: 4.06 GB → 805 KB, 104 results
- firefox-db-cpp-scan: 53 MB → 2.1 MB, 732 results

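A usage sketch for the two helper scripts added in this commit; the input
filenames follow the naming used above:

#+BEGIN_SRC sh
# Strip everything but results (writes <stem>.minimal.sarif next to the input)
./minimize-sarif.py firefox-db-cpp-scan.formatted.sarif

# Count the results in the minimized file (requires jq)
./count-sarif.sh firefox-db-cpp-scan.formatted.minimal.sarif
#+END_SRC
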
count-sarif.sh (new executable file, 11 lines)
@@ -0,0 +1,11 @@
#!/bin/sh
# Count SARIF results in a given file

if [ $# -ne 1 ]; then
    echo "Usage: $0 file.sarif" >&2
    exit 1
fi

file="$1"

jq '[.runs[].results[]] | length' "$file"
firefox-db-cpp-scan-bm-none.formatted.minimal.sarif (new file, 21950 lines; diff suppressed, file too large)
firefox-db-cpp-scan.formatted.minimal.sarif (new file, 62517 lines; diff suppressed, file too large)
minimize-sarif.py (new executable file, 247 lines)
@@ -0,0 +1,247 @@
#!/usr/bin/env python3
"""
Extract only results from SARIF files while maintaining valid SARIF structure.
Removes artifacts, conversion, invocations, and other non-essential data.
"""

import json
import argparse
import sys
from pathlib import Path
from typing import Dict, Any, List


def create_minimal_sarif(original_sarif: Dict[str, Any]) -> Dict[str, Any]:
    """
    Create a minimal valid SARIF file containing only results and required metadata.

    Args:
        original_sarif: Complete SARIF data structure

    Returns:
        Minimal valid SARIF with only results
    """

    # Start with minimal required SARIF structure
    minimal_sarif = {
        "version": original_sarif.get("version", "2.1.0"),
        "$schema": original_sarif.get(
            "$schema",
            "https://raw.githubusercontent.com/oasis-tcs/sarif-spec/master/Schemata/sarif-schema-2.1.0.json"),
        "runs": []
    }

    # Process each run
    for run in original_sarif.get("runs", []):
        minimal_run = {}

        # Required: tool information (minimal)
        if "tool" in run:
            minimal_tool = {"driver": {}}

            # Keep only essential tool.driver fields
            if "driver" in run["tool"]:
                driver = run["tool"]["driver"]
                minimal_driver = {}

                # Required fields
                if "name" in driver:
                    minimal_driver["name"] = driver["name"]
                else:
                    minimal_driver["name"] = "Unknown Tool"  # Fallback

                # Optional but useful fields
                for field in ["version", "informationUri", "semanticVersion"]:
                    if field in driver:
                        minimal_driver[field] = driver[field]

                # Keep rules if they exist (needed for result rule references)
                if "rules" in driver:
                    minimal_driver["rules"] = driver["rules"]

                minimal_tool["driver"] = minimal_driver

            minimal_run["tool"] = minimal_tool
        else:
            # Fallback if no tool information
            minimal_run["tool"] = {"driver": {"name": "Unknown Tool"}}

        # Main content: results
        if "results" in run:
            minimal_run["results"] = run["results"]
        else:
            minimal_run["results"] = []

        # Optional: keep taxonomies if present (sometimes referenced in results)
        if "taxonomies" in run:
            minimal_run["taxonomies"] = run["taxonomies"]

        # Optional: keep threadFlowLocations if present
        if "threadFlowLocations" in run:
            minimal_run["threadFlowLocations"] = run["threadFlowLocations"]

        # Optional: keep graphs if present (sometimes used for data flow)
        if "graphs" in run:
            minimal_run["graphs"] = run["graphs"]

        # Optional: keep logical locations if present
        if "logicalLocations" in run:
            minimal_run["logicalLocations"] = run["logicalLocations"]

        minimal_sarif["runs"].append(minimal_run)

    return minimal_sarif


def calculate_size_reduction(original_size: int, minimal_size: int) -> Dict[str, Any]:
    """
    Calculate size reduction metrics.

    Args:
        original_size: Size of original file in bytes
        minimal_size: Size of minimal file in bytes

    Returns:
        Dictionary with size metrics
    """
    reduction_bytes = original_size - minimal_size
    reduction_percent = (reduction_bytes / original_size) * 100 if original_size > 0 else 0

    return {
        "original_size": original_size,
        "minimal_size": minimal_size,
        "reduction_bytes": reduction_bytes,
        "reduction_percent": reduction_percent,
        "original_size_mb": original_size / (1024 * 1024),
        "minimal_size_mb": minimal_size / (1024 * 1024),
        "reduction_mb": reduction_bytes / (1024 * 1024)
    }


def format_size(size_bytes: int) -> str:
    """Format size in human-readable format."""
    for unit in ['B', 'KB', 'MB', 'GB']:
        if size_bytes < 1024.0:
            return f"{size_bytes:.2f} {unit}"
        size_bytes /= 1024.0
    return f"{size_bytes:.2f} TB"


def process_sarif_file(input_path: Path, output_path: Path = None,
                       verbose: bool = False) -> None:
    """
    Process a SARIF file to extract only results.

    Args:
        input_path: Path to input SARIF file
        output_path: Path to output file (optional, auto-generated if not provided)
        verbose: Print detailed information
    """

    if not input_path.exists():
        print(f"Error: Input file '{input_path}' does not exist.", file=sys.stderr)
        return

    # Auto-generate output path if not provided
    if output_path is None:
        output_path = input_path.parent / f"{input_path.stem}.minimal.sarif"

    try:
        # Read original SARIF
        if verbose:
            print(f"Reading: {input_path}")

        with open(input_path, 'r', encoding='utf-8') as f:
            original_sarif = json.load(f)

        # Create minimal SARIF
        minimal_sarif = create_minimal_sarif(original_sarif)

        # Count results
        total_results = sum(len(run.get("results", []))
                            for run in minimal_sarif.get("runs", []))

        # Write minimal SARIF
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(minimal_sarif, f, indent=2, ensure_ascii=False)

        # Calculate size reduction
        original_size = input_path.stat().st_size
        minimal_size = output_path.stat().st_size
        metrics = calculate_size_reduction(original_size, minimal_size)

        # Report results
        print(f"\n✓ Processed: {input_path.name}")
        print(f"  Output: {output_path.name}")
        print(f"  Results extracted: {total_results:,}")
        print(f"  Original size: {format_size(metrics['original_size'])}")
        print(f"  Minimal size: {format_size(metrics['minimal_size'])}")
        print(f"  Reduction: {format_size(metrics['reduction_bytes'])} ({metrics['reduction_percent']:.1f}%)")

        if verbose:
            # Show what was removed
            original_runs = original_sarif.get("runs", [])
            if original_runs:
                run = original_runs[0]
                removed_keys = set(run.keys()) - set(minimal_sarif["runs"][0].keys())
                if removed_keys:
                    print(f"  Removed sections: {', '.join(sorted(removed_keys))}")

    except json.JSONDecodeError as e:
        print(f"Error: Invalid JSON in '{input_path}': {e}", file=sys.stderr)
    except Exception as e:
        print(f"Error processing '{input_path}': {e}", file=sys.stderr)


def main():
    parser = argparse.ArgumentParser(
        description="Extract only results from SARIF files while maintaining valid SARIF structure.",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Process single file (auto-generates .minimal.sarif output)
  %(prog)s input.sarif

  # Process with custom output name
  %(prog)s input.sarif -o output.sarif

  # Process multiple files
  %(prog)s file1.sarif file2.sarif file3.sarif

  # Process with verbose output
  %(prog)s -v large-scan.sarif

  # Process all SARIF files in current directory
  %(prog)s *.sarif
"""
    )

    parser.add_argument(
        "input_files",
        nargs="+",
        type=Path,
        help="Input SARIF file(s) to process"
    )

    parser.add_argument(
        "-o", "--output",
        type=Path,
        help="Output file path (only valid with single input file)"
    )

    parser.add_argument(
        "-v", "--verbose",
        action="store_true",
        help="Show detailed information about processing"
    )

    args = parser.parse_args()

    # Validate arguments
    if args.output and len(args.input_files) > 1:
        parser.error("Cannot specify --output with multiple input files")

    # Process files
    for input_file in args.input_files:
        process_sarif_file(
            input_file,
            args.output if len(args.input_files) == 1 else None,
            args.verbose
        )


if __name__ == "__main__":
    main()