# QA / Verification log — NITSurrogate-Kor

**Date:** 2026-05-22
**Environment:** macOS (darwin 25.5.0), Python 3.9.6, numpy 2.0.2, scipy 1.13.1,
pandas 2.3.3, matplotlib 3.9.4, streamlit 1.50.0. statsmodels NOT installed
(meta-regression implemented in numpy/scipy as required).
**Mode:** fully offline, no network calls.

**Overall status: ✅ PASS** — all required checks pass; no RuntimeWarnings on
default runs.

---

## 1. Syntax / parse checks

```
$ python3 -c "import ast; ast.parse(open('main.py').read())"   -> OK
$ python3 -c "import ast; ast.parse(open('app.py').read())"    -> OK
```
Both pass.

## 2. CLI help

```
$ python3 main.py --help            -> exit code 0
```
Lists all flags: `--data --chain --compare-nit --paradox --gaps --hypotheses
--top --alpha --grade-strong --grade-moderate --no-banner`. PASS.

## 3. Bare invocation (summary)

`python3 main.py` prints the 3-stage snapshot:
```
  NIT->histology   n=30  R2=0.312   grade=weak
  histology->hard  n=4   R2=0.927   grade=INVALID(paradox)
  NIT->hard        n=4   R2=0.775   grade=strong
  Weakest / most-uncertain stage : NIT->histology (R2=0.312, grade=weak)
  Trials with full NIT+histology+hard chain : 4  (histology->hard is the structural data gap)
  Dataset: 14 trials, 9 drugs, 5 NIT metrics.
```
R² all in [0,1]; grades present; structural gap surfaced. PASS.

## 4. `--chain` (core methodology)

```
stage               n      R2        R2 95% CI     slope        grade
NIT->histology     30   0.312     [0.06, 0.59]     0.172         weak
histology->hard     4   0.927     [0.00, 1.00]     0.805 INVALID(paradox)
NIT->hard           4   0.775     [0.00, 0.99]     0.173       strong
...
PTE mediation: trials with full chain = 4; PTE(raw)=2.191 -> clamped 1.000;
  plausible=False -> flagged "mediation estimate unstable (sparse hard data)".
```
- R²_trial ∈ [0,1] for every stage. ✅
- 95% CIs printed; sparse stages (n=4) show wide CIs as expected. ✅
- STE computed per stage (achievable / NOT-achievable correctly distinguished). ✅
- PTE computed and **out-of-range value flagged** rather than silently shown. ✅
- Paradox correctly forces histology→hard to INVALID. ✅
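
The clamp-and-flag behaviour checked above can be sketched as follows. This is an illustration only, not the engine's code: the names `PTEResult` and `clamp_pte` are hypothetical, and the only details taken from the log are the [0, 1] range, the clamp, the `plausible` flag, and the flag message.

```python
from typing import NamedTuple


class PTEResult(NamedTuple):
    raw: float        # estimate as computed
    clamped: float    # estimate restricted to [0, 1]
    plausible: bool   # True only if the raw value was already inside [0, 1]
    note: str         # empty when plausible, otherwise the flag text


def clamp_pte(raw: float) -> PTEResult:
    """Clamp a PTE estimate to [0, 1] and flag out-of-range values (hypothetical helper)."""
    clamped = min(1.0, max(0.0, raw))
    plausible = 0.0 <= raw <= 1.0
    # Flag text copied from the log; the real engine may choose it differently.
    note = "" if plausible else "mediation estimate unstable (sparse hard data)"
    return PTEResult(raw, clamped, plausible, note)


result = clamp_pte(2.191)  # raw value reported in the --chain output
print(f"PTE(raw)={result.raw:.3f} -> clamped {result.clamped:.3f}; plausible={result.plausible}")
```

Returning the raw value alongside the clamped one keeps the out-of-range estimate visible instead of silently replacing it.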

## 5. `--compare-nit` (NIT ranking)

```
NIT           n      R2        R2 95% CI        grade
ELF           4   0.954     [0.07, 1.00]       strong
FIB4         13   0.669     [0.24, 0.89]     moderate
LSM_VCTE      3   0.401          [ n/a ]         weak
MRI_PDFF      8   0.043     [0.00, 0.63]         weak
MRE           2     n/a          [ n/a ] n/a(insufficient n)
```
All 5 NITs ranked; insufficient-n metric handled gracefully; NIT→hard shown as
sparse (only FIB4 has data). PASS.

## 6. `--paradox`

```
Stage histology->hard -- paradox rows (upstream benefit, downstream harm):
  - STELLAR-4   simtuzumab   NIT=FIB4   up=+0.200  down=-0.180
```
The planted surrogate-paradox case (simtuzumab) is detected. PASS.
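
A minimal sketch of the sign rule implied by "upstream benefit, downstream harm". It assumes both effects are oriented so that a positive value means benefit, which matches the signs in the printed row; the `StageRow` type, the `is_paradox` helper, and the second (made-up) row are illustrative and not part of the project.

```python
from dataclasses import dataclass


@dataclass
class StageRow:
    trial: str
    up: float    # upstream effect, assumed oriented so that > 0 means benefit
    down: float  # downstream effect, same assumed orientation


def is_paradox(row: StageRow) -> bool:
    """True when the upstream effect shows benefit but the downstream effect shows harm."""
    return row.up > 0 and row.down < 0


rows = [
    StageRow("STELLAR-4", up=0.200, down=-0.180),          # values from the log above
    StageRow("EXAMPLE-CONCORDANT", up=0.150, down=0.100),  # made-up row for contrast
]
paradox_trials = [r.trial for r in rows if is_paradox(r)]
print(paradox_trials)
```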

## 7. `--gaps`

Flags pooled stages that are weak, have wide CIs, or contain a paradox, and
also flags the **histology→hard structural gap per drug class**: FGF21, FXR,
GCG_GLP1, GIP_GLP1, GLP1, LOXL2, THR-beta, and panPPAR each have 0–1
hard-outcome trials and are marked UNVALIDATED. PASS.

## 8. `--hypotheses`

Emits validation-study hypotheses, e.g.:
```
H1  Is the histology->hard surrogacy relationship validated across the MASH
    drug landscape ...?
    required : 380 adjudicated events | ≈2147 participants | follow-up ≈ 4.0 yr
H2..H9  per-drug-class histology->hard validation (FGF21, FXR, ... panPPAR)
H10..H12 NIT-qualification sub-studies for FIB4 / LSM / MRI-PDFF (R² sub-strong)
H13 MRE not estimable -> prospective paired-data registry
```
Required events (Schoenfeld, HR=0.75, α=0.05, power=80%) = 380; N ≈ 2147;
follow-up 4 yr. PASS.
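
The required-event count can be cross-checked with the standard Schoenfeld formula. The sketch below assumes a two-sided α and 1:1 allocation, neither of which is stated in the log; the function name is hypothetical. The participant count is not reproduced here because it depends on an assumed event probability over follow-up that the log does not give.

```python
import math
from statistics import NormalDist


def schoenfeld_events(hr: float, alpha: float = 0.05, power: float = 0.80,
                      alloc: float = 0.5) -> int:
    """Events needed to detect hazard ratio `hr` (Schoenfeld approximation).

    Assumes a two-sided test at level `alpha` and a fraction `alloc` of
    participants in the treatment arm (0.5 means 1:1 allocation).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    d = (z_alpha + z_beta) ** 2 / (alloc * (1 - alloc) * math.log(hr) ** 2)
    return math.ceil(d)


events = schoenfeld_events(0.75)
print(events)  # 380
```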

## 9. Demo CSV parses

```
$ python3 -c "import pandas as pd; df=pd.read_csv('data/masld_surrogacy_demo.csv', comment='#'); print(len(df), list(df.columns))"
30 ['trial','drug','drug_class','nit_metric','delta_nit','delta_nit_se',
    'histo_metric','histo_effect','histo_se','hard_outcome','loghr','loghr_se',
    'stage_status']
```
PASS.

## 10. Flags / robustness

- `--top 3` correctly limits hypothesis list to H1–H3. ✅
- `--alpha 0.10` narrows CIs; `--grade-strong 0.8` reclassifies NIT→hard
  strong→moderate. ✅
- `--data /nonexistent.csv` -> "ERROR: data file not found", exit code 2. ✅
- `engine.load_data(io.StringIO(...))` works (Streamlit upload path). ✅
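
The `--grade-strong` reclassification above is what a simple threshold rule would produce. A sketch under stated assumptions: the default cut-offs (0.75 and 0.50) and the inclusive comparisons are hypothetical, chosen only because they are consistent with the grades printed in this log; the actual defaults are not recorded here.

```python
def grade(r2: float, strong: float = 0.75, moderate: float = 0.50) -> str:
    """Map a trial-level R² to a grade using two thresholds (assumed defaults)."""
    if r2 >= strong:
        return "strong"
    if r2 >= moderate:
        return "moderate"
    return "weak"


# NIT->hard has R2=0.775 in the --chain output.
print(grade(0.775))              # strong under the assumed default threshold
print(grade(0.775, strong=0.8))  # moderate, as reported for --grade-strong 0.8
```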

## 11. app.py (Streamlit UI)

- `ast.parse` OK; module imports cleanly (matplotlib `Agg` backend, streamlit
  1.50, reuses `main.py` engine). ✅
- Plotting helpers `_stage_scatter`, `_ste_curve`, `_pte_diagram` all run
  headlessly without error. ✅
- Not launched (per spec — must only import cleanly). ✅

## 12. Numerical hygiene

A combined run of all CLI flags produced **0 bytes on stderr** (no
RuntimeWarnings). Near-singular designs are guarded via `pinv(rcond=1e-10)`,
`np.nan_to_num`, and `np.errstate` in the prediction-band math. ✅
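
One way to automate the stderr check is to capture stderr from each invocation and count the bytes. This is a sketch, not the script used for this log: the helper name is hypothetical, and the commands below are stand-ins so the example runs anywhere. In the project the command list would be something like `[sys.executable, "main.py", "--chain"]`.

```python
import subprocess
import sys


def stderr_byte_count(cmd: list) -> int:
    """Run `cmd` and return the number of bytes it wrote to stderr."""
    proc = subprocess.run(cmd, capture_output=True, check=False)
    return len(proc.stderr)


# Stand-in commands: one silent, one that emits a RuntimeWarning on stderr.
quiet = stderr_byte_count([sys.executable, "-c", "print('ok')"])
noisy = stderr_byte_count(
    [sys.executable, "-c", "import warnings; warnings.warn('x', RuntimeWarning)"]
)
print(quiet, noisy > 0)
```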

## 13. Disclaimer presence

Research/reference-only disclaimer appears in: CLI banner header, CLI footer,
README (3×), and app.py (top + bottom). ✅

---

## Retries / failures

None required. All checks passed on the first full run. Three intentional
refinements were made during development before that run: within-NIT
standardization for pooled NIT stages to avoid scale artifacts; numeric guards
to remove matmul RuntimeWarnings; and conversion of one demo row into a genuine
surrogate paradox so that `--paradox` demonstrably fires. No `FAILED:`
conditions.
