# GlpHypoMine (글피하이포마인)

> **WARNING / Disclaimer: this tool is for research and reference use only. Do not use it for clinical decision-making.**
> Generated hypotheses require expert validation before grant submission or experimental design.

GLP-1 off-target hypothesis mining engine. Pulls together GLP-1RA-class signals from PubMed, FAERS, and ClinicalTrials.gov (offline synthetic samples for the MVP), and produces ranked `(drug, organ, mechanism, population)` 4-tuple hypothesis cards highlighting *unexplored* combinations.

- **Domain**: DM + Obesity
- **Category**: Research idea generation (hypothesis generation)
- **Form factor**: Python 3 CLI, stdlib only
- **Date**: 2026-04-25

## Why
Pharma/academic groups overwhelmingly chase the same 5–6 GLP-1RA off-target stories (CV, MASLD, AD, addiction…). 36 organ systems × 8 GLP-1RA-class drugs = 288 (drug, organ) cells; the long tail of mechanistically plausible *but un-trialed* combinations is where novel grant-worthy hypotheses live. GlpHypoMine surfaces that long tail by combining literature density, FAERS disproportionality, and trial coverage.

## Features (MVP)
1. **PubMed E-utilities daily incremental fetch** — GLP-1RA × 36 organ systems. *Mocked offline (`data/pubmed_sample.json`); no network calls.*
2. **FAERS quarterly dump parser** — adverse event signal extraction with PRR + ROR (Haldane-Anscombe corrected). Synthetic sample in `data/faers_sample.csv`.
3. **NER + local LLM tuple extraction**: `(drug, organ, mechanism, population)`. Mocked in the MVP; a production drop-in would replace `extract_tuples()` with a quantized local model.
4. **ClinicalTrials.gov cross-reference** — exclude combinations already under investigation. Synthetic registry: `data/clinicaltrials_sample.json`.
5. **Rank unexplored combos** — combined plausibility score: organ prior + log(ROR) + PubMed mechanistic citation count − already-in-trials penalty.
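
The disproportionality step in feature 2 can be sketched as follows. This is a minimal illustration of the standard 2×2 PRR/ROR formulas with a Haldane-Anscombe correction, not the code in `main.py`. The function name `disproportionality`, its parameters, and the example counts are hypothetical, and applying the 0.5 correction to every table (rather than only to tables with a zero cell) is an assumption.

```python
import math


def disproportionality(a, b, c, d, correction=0.5):
    """Return (PRR, ROR) for a 2x2 contingency table of report counts.

    a: reports with the drug and the event of interest
    b: reports with the drug and any other event
    c: reports with other drugs and the event of interest
    d: reports with other drugs and any other event
    """
    # Haldane-Anscombe correction: add 0.5 to every cell so that zero
    # cells cannot cause a division by zero.
    # Assumption: applied unconditionally, not only when a cell is zero.
    a, b, c, d = (x + correction for x in (a, b, c, d))
    prr = (a / (a + b)) / (c / (c + d))  # proportional reporting ratio
    ror = (a * d) / (b * c)              # reporting odds ratio
    return prr, ror


# Made-up counts, for illustration only.
prr, ror = disproportionality(a=4, b=16, c=10, d=970)
print(f"PRR={prr:.2f} ROR={ror:.2f} log(ROR)={math.log(ror):.2f}")
```

With the correction in place, a cell with zero reports still yields finite PRR and ROR values, which matters for small samples such as the 50-report synthetic FAERS file.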

## Run

```bash
cd projects/2026-04-25-1-glp-hypo-mine

python3 main.py --help
python3 main.py stats
python3 main.py rank
python3 main.py rank --domain cardiovascular
python3 main.py rank --top 5 --json
```

No installs required. Python 3.9+ stdlib only (`json`, `csv`, `argparse`, `math`, `collections`).

## File layout

```
projects/2026-04-25-1-glp-hypo-mine/
  README.md
  main.py
  QA.md
  data/
    pubmed_sample.json         # 18 synthetic abstracts
    faers_sample.csv           # 50 synthetic adverse event reports
    clinicaltrials_sample.json # 12 synthetic ongoing/completed trials
    glp1_drugs.json            # 8 GLP-1RA-class drugs
    organ_systems.json         # 36 organ systems with KO/EN keywords
```
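
As a rough sketch of how the five data files might be read with the standard library alone, the helper below loads the four JSON files and the FAERS CSV into one dictionary. The function name `load_corpus` is hypothetical and is not taken from `main.py`; the sketch makes no assumption about the internal schema of any file.

```python
import csv
import json
import os

# File names are taken from the layout above.
JSON_FILES = ("pubmed_sample", "clinicaltrials_sample", "glp1_drugs", "organ_systems")


def load_corpus(data_dir):
    """Load every data file under data_dir into a dict keyed by file stem."""
    corpus = {}
    for stem in JSON_FILES:
        path = os.path.join(data_dir, stem + ".json")
        with open(path, encoding="utf-8") as fh:
            corpus[stem] = json.load(fh)
    # csv.DictReader yields one dict per report row, keyed by the header row.
    path = os.path.join(data_dir, "faers_sample.csv")
    with open(path, newline="", encoding="utf-8") as fh:
        corpus["faers_sample"] = list(csv.DictReader(fh))
    return corpus
```

Calling `load_corpus("data")` from the project directory and then taking `len()` of each value gives a quick per-file record count to compare against the numbers in the layout comments.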

## Hypothesis card format

Each card includes:
- `drug`, `organ`
- `mechanism_candidates` — extracted from PubMed abstracts (or "FAERS-only signal")
- `population_candidates`
- `faers` — `{PRR, ROR, n}` if any
- `n_pubmed` — how many abstracts mention this combo
- `in_trial` — whether ClinicalTrials.gov already covers it
- `components` — score breakdown (`prior`, `ror_term`, `pub_term`, `in_trial`)
- `score` — final ranking score
- `novelty` — `"unexplored"` (no matching trial) or `"explored"`
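
To make the `components`, `score`, and `novelty` fields concrete, here is a minimal sketch of how a card could be scored from the formula in the feature list (organ prior + log(ROR) + PubMed count − in-trial penalty). The function `score_card`, the `IN_TRIAL_PENALTY` value, and the prior weights shown are hypothetical placeholders; the real weights and any scaling of the PubMed term are defined in `main.py` and may differ.

```python
import math

# Hypothetical values for illustration; not the weights used in main.py.
ORGAN_PRIOR = {"cardiovascular": 1.0, "renal": 0.6}
IN_TRIAL_PENALTY = 2.0


def score_card(drug, organ, ror, n_pubmed, in_trial):
    """Build the scoring fields of a hypothesis card."""
    components = {
        "prior": ORGAN_PRIOR.get(organ, 0.0),
        # No FAERS signal (ror is None) contributes nothing to the score.
        "ror_term": math.log(ror) if ror else 0.0,
        # Assumption: the raw abstract count is used without scaling.
        "pub_term": float(n_pubmed),
        "in_trial": -IN_TRIAL_PENALTY if in_trial else 0.0,
    }
    return {
        "drug": drug,
        "organ": organ,
        "components": components,
        "score": sum(components.values()),
        "novelty": "explored" if in_trial else "unexplored",
    }


# Made-up inputs; rank cards by descending score.
cards = [
    score_card("drug_a", "renal", ror=2.5, n_pubmed=2, in_trial=False),
    score_card("drug_b", "cardiovascular", ror=3.0, n_pubmed=4, in_trial=True),
]
for card in sorted(cards, key=lambda c: c["score"], reverse=True):
    print(card["drug"], card["organ"], round(card["score"], 2), card["novelty"])
```

Keeping the per-term breakdown in `components` makes it possible to see why a card ranks where it does, for example whether a high score comes from the FAERS signal or from literature density.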

## Sources / pointers

- PubMed E-utilities: `https://www.ncbi.nlm.nih.gov/books/NBK25497/` (production target; not called in MVP)
- FAERS quarterly data: `https://fis.fda.gov/extensions/FPD-QDE-FAERS/FPD-QDE-FAERS.html`
- ClinicalTrials.gov API v2: `https://clinicaltrials.gov/data-api/api`
- PRR / ROR formulas: Evans 2001; van Puijenbroek 2002.

## QA checklist

- [x] Syntax check on `main.py`
- [x] `python3 main.py --help` renders
- [x] `python3 main.py rank` produces cards
- [x] `python3 main.py rank --domain cardiovascular` filters
- [x] `python3 main.py stats` reports corpus counts
- [x] All five `data/*` files load cleanly
- [x] Disclaimer printed in CLI output and at top of README

See `QA.md` for the recorded verification log.

## Caveats

- All upstream data is **synthetic**. Numbers, PMIDs, NCT IDs, and FAERS case IDs are fabricated and do **not** correspond to real reports.
- `ORGAN_PRIOR` weights are author-graded heuristics, not literature-derived priors.
- Mechanism plausibility scoring is intentionally simple — replace with a calibrated model before any downstream use.
