

What it's for
  • A local clinical-Q&A console behind your own retrieval layer, fully offline
  • Medical-reasoning experiments where the visible think-chain is the point
  • Picking a quant variant by workload shape, not just RAM budget

Audience — Local-LLM power users and clinical-informatics builders who want an offline medical-reasoning model on a consumer GPU with the reasoning trace visible — not a hosted API and not a medical device.

Quant economics: quality × speed per build
Variant   Perplexity   tok/s   MedMCQA accuracy (n=50, mcq_letter)
Q4_K_M    16.550       43.6    0.42
Q5_K_M    16.242       36.4    0.52   (sweet spot)
Q6_K      16.014       32.8    0.46
Q8_0      16.296       28.4    0.48
F16       16.268       15.9    0.48

Perplexity lower = better; tok/s measured on the DGX Spark (GB10, 128 GB unified).
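
The MedMCQA column rests on 50 questions per variant, so it helps to see how wide the sampling uncertainty is. The sketch below is illustrative only: it recovers the correct-answer counts as accuracy × 50 and computes a standard Wilson score interval for each. The function name and the 95% level (z = 1.96) are assumptions of this example, not part of the published eval.

```python
import math
from typing import Dict, Tuple


def wilson_interval(correct: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Correct-answer counts recovered from the accuracy column (accuracy * 50).
COUNTS: Dict[str, int] = {
    "Q4_K_M": 21,
    "Q5_K_M": 26,
    "Q6_K": 23,
    "Q8_0": 24,
    "F16": 24,
}

INTERVALS = {name: wilson_interval(k, 50) for name, k in COUNTS.items()}

for name, (low, high) in INTERVALS.items():
    print(f"{name:7s} {COUNTS[name]}/50  95% CI [{low:.2f}, {high:.2f}]")
```

At n=50 each interval is roughly 13 percentage points wide on either side, and the lowest-scoring and highest-scoring variants overlap, which is why the accuracy column is best read as indicative rather than as a ranking.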

[Figure: efficiency curve, quality index × tok/s]
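
One way to read a quality × speed trade-off from the table is to keep only the variants that no other variant beats on both axes. The source does not define its "quality index", so this sketch uses perplexity (lower is better) as a stand-in; the `Variant` type and function names are hypothetical.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    name: str
    perplexity: float  # lower is better (assumed stand-in for "quality index")
    tok_s: float       # higher is better


# Figures copied from the quant economics table.
VARIANTS = [
    Variant("Q4_K_M", 16.550, 43.6),
    Variant("Q5_K_M", 16.242, 36.4),
    Variant("Q6_K", 16.014, 32.8),
    Variant("Q8_0", 16.296, 28.4),
    Variant("F16", 16.268, 15.9),
]


def dominates(a: Variant, b: Variant) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.perplexity <= b.perplexity and a.tok_s >= b.tok_s
    better = a.perplexity < b.perplexity or a.tok_s > b.tok_s
    return no_worse and better


def pareto_front(variants: List[Variant]) -> List[Variant]:
    """Keep the variants that no other variant dominates."""
    return [v for v in variants if not any(dominates(o, v) for o in variants)]


FRONT = [v.name for v in pareto_front(VARIANTS)]
print(FRONT)  # ['Q4_K_M', 'Q5_K_M', 'Q6_K']
```

Under this reading, Q8_0 and F16 fall off the frontier because Q5_K_M has both lower perplexity and higher throughput on this hardware. A different quality measure could change the result.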
Known drift: bounded, honest
  • Reasoning models need a generous n_predict (≥1024): a clinical-MCQ reasoning trace runs 400–800 tokens before the closing think tag, and the answer is a single token after it. At n_predict=256 the budget runs out mid-differential and the answer never lands, so set n_predict to 1024 or more. This is a measurement gotcha, not a model defect.
  • MedMCQA accuracy ceiling (8B, n=50 mini-eval): accuracy on MedMCQA (n=50, mcq_letter) lands at 42–52% across the five variants, peaking at Q5_K_M (26/50). That reflects an 8B reasoning ceiling on a 50-question mini-eval, not a quantization failure. Treat it as indicative, not as a clinical validation.
  • Not medical advice: this is an 8B reasoning model inherited from the upstream base, intended for study, retrieval-grounded drafting, and triage UX, not for diagnosis or treatment decisions. No clinical-grade validation is claimed.
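
The n_predict gotcha can be caught at scoring time by separating "reasoning never closed" from "wrong answer". This is a minimal sketch, assuming the closing think tag is the literal string `</think>` (the source does not spell it out) and that answers are single letters A to D; `extract_mcq_letter` is a hypothetical helper, not part of any eval harness.

```python
import re
from typing import Optional

# Assumption: the exact closing tag is not given in the source.
THINK_CLOSE = "</think>"


def extract_mcq_letter(completion: str, close_tag: str = THINK_CLOSE) -> Optional[str]:
    """Return the answer letter that follows the closing think tag.

    Returns None when the tag is missing (the token budget ran out
    mid-reasoning) or when no standalone A-D letter follows it.
    """
    _, sep, tail = completion.partition(close_tag)
    if not sep:
        return None  # truncated before the reasoning trace closed
    match = re.search(r"\b([A-D])\b", tail)
    return match.group(1) if match else None


# A trace cut off by a small n_predict never reaches the closing tag.
truncated = "<think>Differential: iron deficiency versus thalassaemia trait, and"
# A trace with enough budget closes the tag and emits the letter.
complete = "<think>Low ferritin points to iron deficiency.</think>\nB"

print(extract_mcq_letter(truncated))  # None
print(extract_mcq_letter(complete))   # B
```

Counting `None` results separately from incorrect letters makes it visible when a low score comes from the token budget rather than from the model.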