Bronze
Back the work
$10 / month
- Your name on our supporters list
- A vote on what we build next
- A thank you in the build log
Open-weight model
An open AI model for medical questions and clinical text. It runs fully offline on a small desktop, so patient details never leave the clinic.
II-Medical 8B
II-Medical 8B is an open AI model for medical questions and clinical text. It runs fully offline on a small desktop, so patient details never leave the clinic.
It answers health and medical questions, explains conditions and terms in plain words, and works through clinical text step by step. It is built on Intelligent-Internet’s II-Medical-8B, which learned to reason its way to an answer rather than just guess, and it is packed into ready-to-run files for a single desktop.
We scored five builds on MedMCQA, a 50-question medical exam test, on a small Spark desktop. The Q5_K_M build scored the best at 52 percent, above the full-size build, while running at about 36 tokens a second. The table above shows every build.
This is a short test, and the model is a study and drafting helper, not a doctor. It can be wrong, so never use it to make a real medical decision. Always check with a qualified clinician.
Download the GGUF files (the ready-to-run format) and run them with llama.cpp on a Spark-class desktop, a small AI machine with 128 GB of memory. The Q5_K_M build is the sweet spot here: the best score and a healthy 36 tokens a second.
huggingface-cli download Orionfold/II-Medical-8B-GGUFllama-cli -hf Orionfold/II-Medical-8B-GGUF:Q5_K_Mllama-cli -hf Orionfold/II-Medical-8B-GGUF:Q5_K_M -p "Explain the difference between Type 1 and Type 2 diabetes."from llama_cpp import Llama
llm = Llama.from_pretrained(
repo_id="Orionfold/II-Medical-8B-GGUF",
filename="*Q5_K_M.gguf",
)
out = llm("Explain the difference between Type 1 and Type 2 diabetes.")
print(out["choices"][0]["text"])
| Build | MedMCQA score | Speed on a Spark |
|---|---|---|
| Q4_K_M | 42% | 44 tokens a second |
| Q5_K_M (best pick) | 52% | 36 tokens a second |
| Q6_K | 46% | 33 tokens a second |
| Q8_0 | 48% | 28 tokens a second |
| F16 (full size) | 48% | 16 tokens a second |
Live counts from HuggingFace, refreshed when the site builds. Built and maintained in the open by Orionfold.
Back this work with a monthly tier. Your support moves your requests up the list, and Gold or Platinum earns a badge on the roadmap item you back.
Back the work
$10 / month
Get a say
$25 / month
Move it up the list
$50 / month
Shape the roadmap
$100 / month
Need something specific? Send an enquiry from the roadmap.

Offline patent reasoning in ready-to-run files, built with the NeMo toolkit. Nothing leaves your desktop.

Real notes from doing AI research on one desktop. The NVIDIA DGX Spark is a small machine with huge power (petascale means it runs about a quadrillion math steps a second), so you can push local AI further with no cloud needed. Every lesson is backed by code that runs.