What it's for
  • Offline finance-domain chat and 10-K Q&A on consumer hardware
  • A worked reference for GGUF quantization fidelity (Q8_0 matches F16 perplexity at the reported precision)
  • Picking a quant variant by workload shape, not just RAM budget

Audience: Local-LLM power users who want an offline finance chat model on a 4–8 GB consumer GPU, and publishers studying how to measure quantization fidelity with a four-axis card on Spark-class hardware.

Quant economics: quality × speed per build

Variant   Perplexity   tok/s   FinanceBench (n=50, numeric_match)
Q4_K_M    6.221        31.1    0.14
Q5_K_M    6.164        26.9    0.16
Q6_K      6.147        23.9    0.16
Q8_0      6.137         8.9    0.18
F16       6.137        11.5    0.18   (sweet spot)

Lower perplexity is better; tok/s was measured on the DGX Spark (GB10, 128 GB unified memory).
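
The trade-off in the table can be made concrete with a little arithmetic. The sketch below copies the table values, expresses each variant relative to F16, and picks a variant from two workload limits. The function names and the selection rule (lowest perplexity first, then highest speed) are illustrative assumptions, not a method described on this card.

```python
# Values copied from the table: (perplexity, tok/s, FinanceBench score).
VARIANTS = {
    "Q4_K_M": (6.221, 31.1, 0.14),
    "Q5_K_M": (6.164, 26.9, 0.16),
    "Q6_K": (6.147, 23.9, 0.16),
    "Q8_0": (6.137, 8.9, 0.18),
    "F16": (6.137, 11.5, 0.18),
}


def relative_to_f16(variants):
    """Return {name: (perplexity increase in %, throughput ratio)} against F16."""
    base_ppl, base_tps, _ = variants["F16"]
    return {
        name: (100.0 * (ppl - base_ppl) / base_ppl, tps / base_tps)
        for name, (ppl, tps, _) in variants.items()
    }


def pick_variant(variants, min_tok_s, max_ppl_increase_pct):
    """Hypothetical picker: the lowest-perplexity variant that meets both
    limits, with ties broken by higher tok/s. Returns None if nothing fits."""
    rel = relative_to_f16(variants)
    candidates = [
        name
        for name, (_, tps, _) in variants.items()
        if tps >= min_tok_s and rel[name][0] <= max_ppl_increase_pct
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (variants[n][0], -variants[n][1]))


if __name__ == "__main__":
    for name, (ppl_pct, tps_ratio) in relative_to_f16(VARIANTS).items():
        print(f"{name:7s} perplexity +{ppl_pct:.2f}%  throughput x{tps_ratio:.2f}")
    # A throughput-sensitive workload that tolerates a 0.5% perplexity increase.
    print(pick_variant(VARIANTS, min_tok_s=20.0, max_ppl_increase_pct=0.5))
```

With these numbers, a floor of 20 tok/s and a 0.5% perplexity allowance leaves Q5_K_M and Q6_K as candidates, and the rule selects Q6_K. The thresholds are placeholders; measure tok/s on your own hardware before relying on them.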

[Figure: Efficiency curve, quality index × tok/s]
Known drift (bounded · honest)
  • FinanceBench accuracy ceiling (7B base, not a quant defect): Open-book FinanceBench (n=50, numeric_match) lands at 14–18% across all five variants. This is a reasoning ceiling inherited from the Llama-2-Chat base, not a quantization failure. Fine for finance chat; not for high-stakes quantitative tasks, where a larger base is the only path up.
  • Q8_0 sustained-throughput anomaly: Q8_0 generates at 8.9 tok/s, about 23% below F16's 11.5 and slower than every K-quant, likely a thermal/run-order or GB10 Q8_0-kernel effect. Perplexity favors Q8_0 (matches F16 to 4 decimals), but Q6_K is the safer pick for throughput-sensitive workloads; verify on your own hardware.
  • No modern chat_template in the tokenizer config: A usage gotcha inherited from the upstream Llama-2-era base. The tokenizer ships no chat_template field, so apply_chat_template won't format prompts. Wrap prompts manually in the [INST] … [/INST] shape (llama-server, LM Studio, and Ollama handle this automatically).
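
For callers that tokenize prompts themselves, the manual wrapping could look like the minimal sketch below. It assumes this model follows the upstream Llama-2-Chat single-turn convention, including the optional <<SYS>> block for a system prompt, and that the tokenizer adds the BOS token on its own. The function name build_llama2_prompt is hypothetical.

```python
def build_llama2_prompt(user_message, system_prompt=None):
    """Wrap one user turn in the Llama-2 [INST] ... [/INST] shape.

    Assumption: the model expects the upstream Llama-2-Chat format. The BOS
    token is deliberately left out so the tokenizer can add it once.
    """
    user_message = user_message.strip()
    if system_prompt:
        # Llama-2-Chat places the system prompt inside the first [INST] block.
        body = f"<<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n{user_message}"
    else:
        body = user_message
    return f"[INST] {body} [/INST]"


if __name__ == "__main__":
    print(build_llama2_prompt(
        "Summarize the liquidity risks in this 10-K excerpt.",
        system_prompt="You are a careful finance assistant.",
    ))
```

Multi-turn conversations need each earlier exchange wrapped the same way and joined with the model's end-of-sequence token, which this single-turn sketch leaves out.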