add build process description

2025-12-16 07:33:04 +01:00 · 2025-09-19 18:20:59 -07:00
parent 4902a3628b
commit ef2d0ddcf9
6 changed files with 84868 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,6 @@
+firefox-db-cpp-scan-bm-none.formatted.sarif
+firefox-db-cpp-scan-bm-none.formatted.sarif.zst
+firefox-db-cpp-scan.formatted.sarif
+firefox-db-cpp-scan.formatted.sarif.zst
+firefox-db-bm-none.tar.zst
+firefox-db.tar.zst
--- a/README.org
+++ b/README.org
@@ -0,0 +1,137 @@
+* Overview
+  This repo hosts a large-scale CodeQL demo database for **Firefox**.
+  Its purpose is to demonstrate realistic CodeQL performance and scaling.
+  Smaller demo repos understate the costs and give a misleading picture of
+  practical usage.
+
+  This is work in progress.
+
+* Download Artifacts
+  Base URL: https://github.com/hohn/codeql-for-firefox/releases
+
+  | Filename                                        | Size    | Description                       | URL |
+  |-------------------------------------------------+---------+-----------------------------------+-----|
+  | firefox-db-bm-none.tar.zst                      | 1.66 GB | Full CodeQL DB (build-mode=none)  | [[https://github.com/hohn/codeql-for-firefox/releases/download/build-artifacts-1.0/firefox-db-bm-none.tar.zst][link]] |
+  | firefox-db-cpp-scan-bm-none.formatted.sarif.zst | 72.1 MB | SARIF results, C++ scan (bm=none) | [[https://github.com/hohn/codeql-for-firefox/releases/download/build-artifacts-1.0/firefox-db-cpp-scan-bm-none.formatted.sarif.zst][link]] |
+  | firefox-db-cpp-scan.formatted.sarif.zst         | 986 KB  | SARIF results, C++ scan (with bm) | [[https://github.com/hohn/codeql-for-firefox/releases/download/build-artifacts-1.0/firefox-db-cpp-scan.formatted.sarif.zst][link]] |
+  | firefox-db.tar.zst                              | 756 MB  | Full CodeQL DB (trace build mode) | [[https://github.com/hohn/codeql-for-firefox/releases/download/build-artifacts-1.0/firefox-db.tar.zst][link]] |
+
+  The **bm** abbreviation in the filenames stands for build mode:
+  - bm-none → extraction only (--build-mode=none), no actual build
+  - no bm suffix → traced build
+
+* Building a CodeQL DB for Firefox (Linux, traced build via trace-command)
+  #+BEGIN_SRC sh
+    cd ~/large-local-only/firefox/firefox
+    ./mach clobber
+    export PATH=/home/hohn/large-local-only/codeql:$PATH
+    source /home/hohn/.cargo/env
+
+    codeql database init \
+      --language=cpp \
+      --source-root=. \
+      firefox-db
+
+    ./mach configure
+
+    # Run under tracing (OOM at -j20, succeeded with -j10)
+    codeql database trace-command firefox-db -- ./mach build -j10
+
+    codeql database finalize firefox-db
+  #+END_SRC
+
+  **Timings**
+  - Plain Firefox build: ~10 minutes
+  - Build with CodeQL trace: ~57 minutes
+  - Finalize: ~50 minutes
+
+* Build Stats (firefox-db)
+  | Phase             | Directory   | Size  | Notes                  |
+  |-------------------+-------------+-------+------------------------|
+  | During finalize   | trap/       | 16.5G | TRAP facts             |
+  |                   | db-cpp/     | 7.0G  | Relational store       |
+  |                   | log/        | 2.8G  | Build + extractor logs |
+  |                   | src/        | 561M  | Source snapshot        |
+  | After finalize    | db-cpp/     | 2.5G  | Relational store       |
+  |                   | log/        | 2.9G  | Logs                   |
+  |                   | diagnostic/ | 32K   | Scratch                |
+  | Final DB size     | firefox-db/ | 5.5G  | Usable DB              |
+  | Distribution file | tar.zst     | 757M  | Compressed archive     |
+
+  Note: the interim DB (during finalize) takes about 27 GB, the sum of
+  the four directories listed above; the finalized DB is 5.5 GB.
+
+* Building a DB with build-mode=none
+  #+BEGIN_SRC sh
+    cd ~/large-local-only/firefox/firefox
+    ./mach clobber
+    rm -rf obj-x86_64-pc-linux-gnu/ firefox-db*
+    export PATH=/home/hohn/large-local-only/codeql:$PATH
+    source /home/hohn/.cargo/env
+
+    ./mach configure
+
+    codeql database create \
+           --language=cpp \
+           --source-root=. \
+           --threads 20 \
+           --ram=50000 \
+           --build-mode=none \
+           firefox-db-bm-none
+
+    tar --use-compress-program="zstd -19 -T0" -cvf \
+        firefox-db-bm-none.tar.zst firefox-db-bm-none
+  #+END_SRC
+
+  **Results**
+  - Extraction time: ~17 minutes (init → finalize)
+  - TRAP import: 11m14s
+  - Final DB size: ~2–3 GB relational store + 137 MB source archive
+  - Compressed archive: 1.66 GB
+
+* Query Suite Selection
+  Available suites (codeql/cpp-queries pack 1.4.6):
+
+  | Suite                         | Queries |
+  |-------------------------------+---------|
+  | cpp-code-scanning.qls         | 60      |
+  | cpp-lgtm.qls                  | 108     |
+  | cpp-lgtm-full.qls             | 178     |
+  | cpp-security-and-quality.qls  | 181     |
+  | cpp-security-experimental.qls | 134     |
+  | cpp-security-extended.qls     | 97      |
+
+* Benchmarks (firefox-db, trace build)
+  - DB: firefox-db
+  - Suite: cpp-code-scanning (60 queries)
+  - Host: Mac Studio (28c / 256 GB RAM, CodeQL 2.22.4)
+
+  | Walltime | CPU% | User CPU (s) | Sys CPU (s) | Max RSS (GB) | Maj PF | Min PF  | Invol CS | Vol CS |
+  |----------+------+--------------+-------------+--------------+--------+---------+----------+--------|
+  | 22:16    | 1065 | 13775        | 464         | 124          | 124776 | 8.2 M   | 48.3 M   | 230647 |
+
+  Notes:
+  - ~22 minutes for the “short” suite (60 queries)
+  - ~10.6 cores busy on average (1065% CPU)
+  - Peak RSS: 124 GB, no swap
+  - Very high context-switch activity (48.3 M involuntary)
+
+* Benchmarks (firefox-db-bm-none)
+  - Wall time: 23m41s
+  - CPU time: 13 555 s (≈953% CPU utilization, ~9.5 cores avg)
+  - Max RAM: 126 GB
+  - Page faults: 294 k major, 10 M minor
+  - Context switches: 25.9 M invol, 248 k vol
+  - SARIF output: 2.9 GB (!), vs 33 MB for build-traced DB
+
+* Run Summary
+  | DB                 | SARIF size | Log size |
+  |--------------------+------------+----------|
+  | firefox-db         | 33 MB      | 32 KB    |
+  | firefox-db-bm-none | 2.9 GB     | 32 KB    |
+
+* SARIF Output Notes
+  File size before → after running minimize-sarif.py, with result counts:
+  - firefox-db-cpp-scan-bm-none: 4.06 GB → 805 KB, 104 results
+  - firefox-db-cpp-scan: 53 MB → 2.1 MB, 732 results
+
+
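The CPU figures in the two benchmark sections of the README can be cross-checked with a little arithmetic: the average number of busy cores is total CPU time divided by wall time. The following is a minimal Python sketch, assuming the 13 555 s reported for firefox-db-bm-none is user plus system time (the README does not say). The helper names are illustrative and not part of this repo.

```python
def parse_walltime(text: str) -> int:
    """Convert 'MM:SS' or 'H:MM:SS' into seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def avg_cores(cpu_seconds: float, wall_seconds: float) -> float:
    """Average number of busy cores = total CPU time / wall-clock time."""
    return cpu_seconds / wall_seconds


# firefox-db (traced): user 13775 s + sys 464 s over a 22:16 wall time
traced = avg_cores(13775 + 464, parse_walltime("22:16"))

# firefox-db-bm-none: 13555 s CPU (assumed user + sys) over 23m41s
bm_none = avg_cores(13555, parse_walltime("23:41"))

print(f"traced:  {traced:.2f} cores ({traced * 100:.0f}% CPU)")
print(f"bm-none: {bm_none:.2f} cores ({bm_none * 100:.0f}% CPU)")
```

Both values land within about one percentage point of the 1065% and ≈953% quoted in the README; the small gap is consistent with the wall times being rounded to whole seconds.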
--- a/count-sarif.sh
+++ b/count-sarif.sh
@@ -0,0 +1,11 @@
+#!/bin/sh
+# Count SARIF results in a given file
+
+if [ $# -ne 1 ]; then
+    echo "Usage: $0 file.sarif" >&2
+    exit 1
+fi
+
+file="$1"
+
+# The trailing "?" skips runs that have no "results" array instead of failing
+jq '[.runs[].results[]?] | length' "$file"
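For readers without jq, the same count can be sketched in Python. This is an illustrative equivalent, not part of the commit; the function name `count_results` and the sample data are made up for the example. It treats a run with a missing or null `results` property as contributing zero results.

```python
from typing import Any, Dict


def count_results(sarif: Dict[str, Any]) -> int:
    """Total number of results across all runs of a parsed SARIF log."""
    return sum(len(run.get("results") or []) for run in sarif.get("runs", []))


# Made-up sample: one run with two results, one without a "results" key,
# and one whose "results" is null.
sample = {
    "version": "2.1.0",
    "runs": [
        {"results": [{"ruleId": "a"}, {"ruleId": "b"}]},
        {"tool": {"driver": {"name": "Example"}}},
        {"results": None},
    ],
}

print(count_results(sample))  # prints 2
```

To use it on a real file, load the file with `json.load` and pass the resulting dictionary to `count_results`.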
--- a/firefox-db-cpp-scan-bm-none.formatted.minimal.sarif
+++ b/firefox-db-cpp-scan-bm-none.formatted.minimal.sarif
--- a/firefox-db-cpp-scan.formatted.minimal.sarif
+++ b/firefox-db-cpp-scan.formatted.minimal.sarif
--- a/minimize-sarif.py
+++ b/minimize-sarif.py
@@ -0,0 +1,247 @@
+#!/usr/bin/env python3
+"""
+Extract only results from SARIF files while maintaining valid SARIF structure.
+Removes artifacts, conversion, invocations, and other non-essential data.
+"""
+
+import json
+import argparse
+import sys
+from pathlib import Path
+from typing import Dict, Any
+
+def create_minimal_sarif(original_sarif: Dict[str, Any]) -> Dict[str, Any]:
+    """
+    Create a minimal valid SARIF file containing only results and required metadata.
+    
+    Args:
+        original_sarif: Complete SARIF data structure
+        
+    Returns:
+        Minimal valid SARIF with only results
+    """
+    
+    # Start with minimal required SARIF structure
+    minimal_sarif = {
+        "version": original_sarif.get("version", "2.1.0"),
+        "$schema": original_sarif.get("$schema", 
+                                     "https://raw.githubusercontent.com/oasis-tcs/sarif-spec/master/Schemata/sarif-schema-2.1.0.json"),
+        "runs": []
+    }
+    
+    # Process each run
+    for run in original_sarif.get("runs", []):
+        minimal_run = {}
+        
+        # Required: tool information (minimal)
+        if "tool" in run:
+            minimal_tool = {"driver": {}}
+            
+            # Keep only essential tool.driver fields
+            if "driver" in run["tool"]:
+                driver = run["tool"]["driver"]
+                minimal_driver = {}
+                
+                # Required fields
+                if "name" in driver:
+                    minimal_driver["name"] = driver["name"]
+                else:
+                    minimal_driver["name"] = "Unknown Tool"  # Fallback
+                
+                # Optional but useful fields
+                for field in ["version", "informationUri", "semanticVersion"]:
+                    if field in driver:
+                        minimal_driver[field] = driver[field]
+                
+                # Keep rules if they exist (needed for result rule references)
+                if "rules" in driver:
+                    minimal_driver["rules"] = driver["rules"]
+                
+                minimal_tool["driver"] = minimal_driver
+            
+            minimal_run["tool"] = minimal_tool
+        else:
+            # Fallback if no tool information
+            minimal_run["tool"] = {"driver": {"name": "Unknown Tool"}}
+        
+        # Main content: results
+        if "results" in run:
+            minimal_run["results"] = run["results"]
+        else:
+            minimal_run["results"] = []
+        
+        # Optional: keep taxonomies if present (sometimes referenced in results)
+        if "taxonomies" in run:
+            minimal_run["taxonomies"] = run["taxonomies"]
+        
+        # Optional: keep threadFlowLocations if present
+        if "threadFlowLocations" in run:
+            minimal_run["threadFlowLocations"] = run["threadFlowLocations"]
+        
+        # Optional: keep graphs if present (sometimes used for data flow)
+        if "graphs" in run:
+            minimal_run["graphs"] = run["graphs"]
+        
+        # Optional: keep logical locations if present
+        if "logicalLocations" in run:
+            minimal_run["logicalLocations"] = run["logicalLocations"]
+        
+        minimal_sarif["runs"].append(minimal_run)
+    
+    return minimal_sarif
+
+def calculate_size_reduction(original_size: int, minimal_size: int) -> Dict[str, Any]:
+    """
+    Calculate size reduction metrics.
+    
+    Args:
+        original_size: Size of original file in bytes
+        minimal_size: Size of minimal file in bytes
+        
+    Returns:
+        Dictionary with size metrics
+    """
+    reduction_bytes = original_size - minimal_size
+    reduction_percent = (reduction_bytes / original_size) * 100 if original_size > 0 else 0
+    
+    return {
+        "original_size": original_size,
+        "minimal_size": minimal_size,
+        "reduction_bytes": reduction_bytes,
+        "reduction_percent": reduction_percent,
+        "original_size_mb": original_size / (1024 * 1024),
+        "minimal_size_mb": minimal_size / (1024 * 1024),
+        "reduction_mb": reduction_bytes / (1024 * 1024)
+    }
+
+def format_size(size_bytes: int) -> str:
+    """Format size in human-readable format."""
+    for unit in ['B', 'KB', 'MB', 'GB']:
+        if size_bytes < 1024.0:
+            return f"{size_bytes:.2f} {unit}"
+        size_bytes /= 1024.0
+    return f"{size_bytes:.2f} TB"
+
+def process_sarif_file(input_path: Path, output_path: Path = None, 
+                      verbose: bool = False) -> None:
+    """
+    Process a SARIF file to extract only results.
+    
+    Args:
+        input_path: Path to input SARIF file
+        output_path: Path to output file (optional, auto-generated if not provided)
+        verbose: Print detailed information
+    """
+    
+    if not input_path.exists():
+        print(f"Error: Input file '{input_path}' does not exist.", file=sys.stderr)
+        return
+    
+    # Auto-generate output path if not provided
+    if output_path is None:
+        output_path = input_path.parent / f"{input_path.stem}.minimal.sarif"
+    
+    try:
+        # Read original SARIF
+        if verbose:
+            print(f"Reading: {input_path}")
+        
+        with open(input_path, 'r', encoding='utf-8') as f:
+            original_sarif = json.load(f)
+        
+        # Create minimal SARIF
+        minimal_sarif = create_minimal_sarif(original_sarif)
+        
+        # Count results
+        total_results = sum(len(run.get("results", [])) 
+                          for run in minimal_sarif.get("runs", []))
+        
+        # Write minimal SARIF
+        with open(output_path, 'w', encoding='utf-8') as f:
+            json.dump(minimal_sarif, f, indent=2, ensure_ascii=False)
+        
+        # Calculate size reduction
+        original_size = input_path.stat().st_size
+        minimal_size = output_path.stat().st_size
+        metrics = calculate_size_reduction(original_size, minimal_size)
+        
+        # Report results
+        print(f"\n✓ Processed: {input_path.name}")
+        print(f"  Output: {output_path.name}")
+        print(f"  Results extracted: {total_results:,}")
+        print(f"  Original size: {format_size(metrics['original_size'])}")
+        print(f"  Minimal size: {format_size(metrics['minimal_size'])}")
+        print(f"  Reduction: {format_size(metrics['reduction_bytes'])} ({metrics['reduction_percent']:.1f}%)")
+        
+        if verbose:
+            # Show what was removed
+            original_runs = original_sarif.get("runs", [])
+            if original_runs:
+                run = original_runs[0]
+                removed_keys = set(run.keys()) - set(minimal_sarif["runs"][0].keys())
+                if removed_keys:
+                    print(f"  Removed sections: {', '.join(sorted(removed_keys))}")
+        
+    except json.JSONDecodeError as e:
+        print(f"Error: Invalid JSON in '{input_path}': {e}", file=sys.stderr)
+    except Exception as e:
+        print(f"Error processing '{input_path}': {e}", file=sys.stderr)
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Extract only results from SARIF files while maintaining valid SARIF structure.",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Process single file (auto-generates .minimal.sarif output)
+  %(prog)s input.sarif
+  
+  # Process with custom output name
+  %(prog)s input.sarif -o output.sarif
+  
+  # Process multiple files
+  %(prog)s file1.sarif file2.sarif file3.sarif
+  
+  # Process with verbose output
+  %(prog)s -v large-scan.sarif
+  
+  # Process all SARIF files in current directory
+  %(prog)s *.sarif
+        """
+    )
+    
+    parser.add_argument(
+        "input_files",
+        nargs="+",
+        type=Path,
+        help="Input SARIF file(s) to process"
+    )
+    
+    parser.add_argument(
+        "-o", "--output",
+        type=Path,
+        help="Output file path (only valid with single input file)"
+    )
+    
+    parser.add_argument(
+        "-v", "--verbose",
+        action="store_true",
+        help="Show detailed information about processing"
+    )
+    
+    args = parser.parse_args()
+    
+    # Validate arguments
+    if args.output and len(args.input_files) > 1:
+        parser.error("Cannot specify --output with multiple input files")
+    
+    # Process files
+    for input_file in args.input_files:
+        process_sarif_file(
+            input_file,
+            args.output if len(args.input_files) == 1 else None,
+            args.verbose
+        )
+
+if __name__ == "__main__":
+    main()
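Because minimize-sarif.py keeps each run's `results` and the driver's `rules`, the minimal files remain usable for simple summaries. The sketch below tallies results per `ruleId` (a standard SARIF result property). It is a hypothetical companion helper, not part of this commit, and the rule IDs in the sample are invented for illustration.

```python
from collections import Counter
from typing import Any, Dict


def results_by_rule(sarif: Dict[str, Any]) -> Counter:
    """Count results per ruleId across all runs of a parsed SARIF log."""
    counts: Counter = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results") or []:
            counts[result.get("ruleId", "<no ruleId>")] += 1
    return counts


# Invented sample data shaped like the output of minimize-sarif.py.
sample = {
    "version": "2.1.0",
    "runs": [
        {
            "tool": {"driver": {"name": "Example"}},
            "results": [
                {"ruleId": "cpp/example-rule-a"},
                {"ruleId": "cpp/example-rule-b"},
                {"ruleId": "cpp/example-rule-a"},
            ],
        }
    ],
}

# Print the most frequent rules first.
for rule_id, n in results_by_rule(sample).most_common():
    print(f"{n:6d}  {rule_id}")
```

Run against the two `.minimal.sarif` files, a tally like this would show which queries account for the result counts reported in the README.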