Open software

Orionfold Arena

Run, compare, and score AI models on your own desktop. Live speed and memory readouts, a private leaderboard, and two models side by side. Free, and nothing leaves your machine.

  • Runs local
  • Leaderboard
  • Live telemetry
  • Private
Orionfold Arena
Language: Python
Run it: fieldkit arena serve
Proven on: DGX Spark
License: Free and local

Orionfold Arena is a single screen for running, comparing, and scoring the AI models on your own desktop. Open it on an NVIDIA DGX Spark and you see the machine’s live readouts, every model you have built, the tests those models were measured on, and a private board that ranks them from your own results. All of it runs on the machine under your desk, and none of it leaves.

Why it exists

If you build models on a Spark, you end up with a shelf of them and nowhere to drive them from. Picking one meant remembering a long command. Comparing two meant a terminal and a notebook. Knowing which small build was the good one meant digging up notes you wrote weeks ago. Arena turns that shelf into a control room. Chat with the model that is already warm and loaded, set two of them against each other, score an answer against a known-good answer, and read one chart to decide which build is worth shipping.

What you can do

  • Watch the machine. A live strip across the top shows how hard the chip is working, how hot it is, how much of the shared 128 GB of memory is in use, and how fast answers come back. On a Spark the chip and the system share one pool of memory, so watching that number is how you see a shortage coming before you actually run out.
  • Rank your models. The board ranks your models from real results and folds in every new chat and test as you go. It is built from a safe slice that shares only scores, never your prompts or the model’s replies, so you can publish the board and still keep your data private.
  • Pick what to ship. One chart plots quality against speed for every build and draws the best trade-offs in gold. Quality here means how well the model answers; the gold line is the set of builds where you cannot get more quality without giving up speed, or more speed without giving up quality.
  • Try and test in one place. Chat with any model, pull the exact test it was measured on, and score its answer against a gold answer without leaving the chat. Or put two models head to head and read the trade in plain numbers: quality, speed, wait time, length, and cost.
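
The gold line on the quality-versus-speed chart is what is usually called a Pareto front: the builds that no other build beats on both axes at once. The sketch below shows one way to compute such a set. It is an illustration only, not Arena's or fieldkit's code; the `Build` type, the build names, and every number in it are made up for the example.

```python
from typing import NamedTuple


class Build(NamedTuple):
    name: str
    quality: float  # higher is better (hypothetical score between 0 and 1)
    speed: float    # higher is better (hypothetical tokens per second)


def pareto_front(builds: list[Build]) -> list[Build]:
    """Return the builds that no other build matches or beats on both axes."""
    front = []
    for b in builds:
        # b is dominated if some other build is at least as good on both
        # axes and strictly better on at least one of them.
        dominated = any(
            o.quality >= b.quality
            and o.speed >= b.speed
            and (o.quality > b.quality or o.speed > b.speed)
            for o in builds
        )
        if not dominated:
            front.append(b)
    # Sort fastest first so the result reads like a line across the chart.
    return sorted(front, key=lambda b: b.speed, reverse=True)


# Made-up builds, purely for illustration.
builds = [
    Build("q4", quality=0.71, speed=62.0),
    Build("q5", quality=0.70, speed=48.0),  # slower and worse than q4
    Build("q8", quality=0.78, speed=35.0),
    Build("f16", quality=0.79, speed=18.0),
]

for b in pareto_front(builds):
    print(f"{b.name}: quality={b.quality:.2f} speed={b.speed:.0f}")
```

In this made-up data the `q5` build drops out because `q4` is both faster and better, which is the situation the chart is meant to make visible at a glance.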

Jargon, in plain words: a GGUF is just a packaged model file you can run on your own machine. To quantize a model is to shrink it by storing its numbers at lower precision, so it takes less memory and usually runs faster; the chart shows what you trade away when you do. Throughput is how fast the model produces text, measured in tokens (word pieces) per second.
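
To put rough numbers on "shrink" and "throughput", here is a back-of-the-envelope sketch. The helper names are hypothetical and the model size is an example, not a measurement from Arena. Real quantized files also store some extra bookkeeping per block of weights, so actual sizes run somewhat above this estimate.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


def throughput(tokens_generated: int, seconds: float) -> float:
    """Tokens produced per second of generation."""
    return tokens_generated / seconds


# Example: a hypothetical 8-billion-parameter model.
full = model_size_gb(8, 16)   # 16-bit weights
small = model_size_gb(8, 4)   # nominal 4-bit quantization
print(f"16-bit: about {full:.0f} GB, 4-bit: about {small:.0f} GB")

# Example: 240 tokens generated in 6 seconds.
print(f"throughput: {throughput(240, 6.0):.0f} tokens per second")
```

On a machine where the chip and the system share one memory pool, the difference between those two sizes is memory left over for everything else, which is why the size of a build matters as much as its speed.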

Private by design

Nothing you type is uploaded, and nothing you compare phones home, unless you deliberately pick a hosted model. The whole thing is a tool you could run on a plane.
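
The same idea sits behind the board's safe slice: scores can be shared while prompts and replies stay put. The sketch below shows one way such a slice could be cut. It is an assumption-laden illustration, not fieldkit's implementation: the record fields, the model names, and the choice to rank by mean score are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records; the real storage format is not described here.
results = [
    {"model": "build-a", "prompt": "private question 1", "reply": "private answer 1", "score": 0.8},
    {"model": "build-a", "prompt": "private question 2", "reply": "private answer 2", "score": 0.6},
    {"model": "build-b", "prompt": "private question 1", "reply": "private answer 3", "score": 0.9},
]


def safe_slice(records):
    """Keep only the model name and the score; drop prompts and replies."""
    return [{"model": r["model"], "score": r["score"]} for r in records]


def leaderboard(slice_rows):
    """Rank models by mean score, best first (the ranking rule is assumed)."""
    by_model = defaultdict(list)
    for row in slice_rows:
        by_model[row["model"]].append(row["score"])
    ranked = [(model, mean(scores)) for model, scores in by_model.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


rows = safe_slice(results)
for model, score in leaderboard(rows):
    print(f"{model}: {score:.2f}")
```

Because the board is built only from `rows`, publishing it cannot leak text that was never copied into the slice.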

Built on fieldkit

Arena is a thin cockpit over the fieldkit toolbox and a year of real research. fieldkit serves the models, runs the tests, and scores the answers; Arena gives that work a screen. Because the parts already existed, the whole cockpit, fourteen screens and 125 tests, came together in about fifteen hours of work, with an AI agent doing the typing. The honest version of that story is the better one: the cockpit is the sum of a lot of compounding work, not a fresh start.

A closer look

One home screen. Live machine readouts up top, your best runs, and what happened recently.
Your models ranked from real results, built from a safe slice that never shares your prompts.
Quality against speed on one chart. The gold line is the set worth shipping.
Put any two models head to head and read the trade in plain numbers.
Pull a real test, send it from the chat, and score the answer against a gold answer.

Install

pip install fieldkit

Use it

# Start the Arena on your Spark and open it in a browser.
# It reads your own models, your benches, and your past results.
fieldkit arena serve

Specs

What it is: A single-screen cockpit to run, compare, and score local AI models
Live readouts: GPU use, heat, memory, and speed, updated as a model runs
Leaderboard: Your models ranked from real results, with no private text shared
Pick what to ship: A quality-versus-speed chart that marks the best trade-offs in gold
Try and test: Chat with any model, score an answer against a gold answer, or duel two side by side
Private: Runs on your own machine; nothing is uploaded unless you choose a hosted model
Built on: The fieldkit toolbox (arena, eval, harness, nim, notebook)