The standard benchmark prompt is:

    Write a short story about the meatpocalypse

Send the prompt and measure total wall-clock time for the full response:

    START=$(date +%s%N)   # nanoseconds since epoch; %N requires GNU date
    # Capture the body so completion_tokens can be read from it afterwards.
    RESPONSE=$(curl -s http://localhost:PORT/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "YOUR_MODEL",
        "messages": [{"role": "user", "content": "Write a short story about the meatpocalypse"}],
        "max_tokens": 256,
        "temperature": 0.7,
        "stream": false
      }')
    END=$(date +%s%N)
    TOTAL_MS=$(( (END - START) / 1000000 ))
    echo "$RESPONSE"
    echo "Total: ${TOTAL_MS}ms"

Extract completion_tokens from the response JSON (in an OpenAI-compatible response it sits under usage) and compute:

    tokens_per_second = completion_tokens / (total_ms / 1000)

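Shell arithmetic is integer-only, so the division above needs a helper. The following is a rough sketch using awk; the function name tokens_per_second and the commented jq line are illustrative assumptions (jq is a separate tool that may not be installed, and RESPONSE is a hypothetical variable holding the saved response body). The sample numbers are the 256 tokens and 2.07 s from the example submission in this guide.

```shell
# Hypothetical helper: throughput from a token count and elapsed milliseconds.
tokens_per_second() {
  # $1 = completion_tokens, $2 = total_ms
  awk -v tokens="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", tokens / (ms / 1000) }'
}

# One way to read the count from a saved response body (assumes jq is available
# and that the body was stored in a variable, here called RESPONSE):
#   COMPLETION_TOKENS=$(printf '%s' "$RESPONSE" | jq '.usage.completion_tokens')

tokens_per_second 256 2070   # prints 123.7
```

Rounding to one decimal matches the precision of the tokens_per_second value in the example payload; adjust the printf format if you want more digits.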
Send the same prompt with stream: true and measure the wall-clock time from sending the request to the first data: chunk arriving:

    START=$(date +%s%N)   # %N requires GNU date
    # grep -m 1 exits on the first "data:" line; the timestamp is taken right then,
    # so later chunks do not inflate the measurement.
    END=$(curl -s -N http://localhost:PORT/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "YOUR_MODEL",
        "messages": [{"role": "user", "content": "Write a short story about the meatpocalypse"}],
        "max_tokens": 256,
        "temperature": 0.7,
        "stream": true
      }' | { grep -m 1 '^data:' > /dev/null; date +%s%N; })
    TTFT_MS=$(( (END - START) / 1000000 ))
    echo "TTFT: ${TTFT_MS}ms"

time_to_first_token = TTFT_MS / 1000 (in seconds, e.g. 0.135)
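Because `$(( ))` cannot produce fractions, the conversion to seconds is easiest with awk. This is a small sketch; the helper name ms_to_seconds is made up for illustration.

```shell
# Hypothetical helper: convert integer milliseconds to seconds with 3 decimals.
ms_to_seconds() {
  # $1 = duration in milliseconds
  awk -v ms="$1" 'BEGIN { printf "%.3f\n", ms / 1000 }'
}

ms_to_seconds 135   # prints 0.135
```

The same helper can turn TOTAL_MS into the total_time value.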
| Field | Description |
|---|---|
| model_name | Full model name, e.g. Qwen3.6-35B-A3B-MTP |
| model_size | Approximate parameter count in billions, e.g. 35B, 7B |
| quantization | Quantization type: Q8_0, Q6_K, F16, etc. |
| engine | Engine name: llamacpp, vllm, ollama, etc. |
| hardware | Human-readable hardware string (CPU, RAM, GPU summary) |
| prompt | The exact prompt used |
| n_predict | Number of tokens to generate (default: 256) |
| tokens_per_second | Measured throughput |
| time_to_first_token | TTFT in seconds (e.g. 0.135) |
| total_tokens | Number of tokens generated |
| total_time | Wall-clock time in seconds |

| Field | Description |
|---|---|
| engine_start_cmd | Full server launch command, including all flags, paths, and GPU assignments (e.g. the full llama-server command line). This is critical for reproducibility. |
| cpu, gpu, ram | Structured hardware info for filtering |
| context_size | Context window size used (e.g. 262144) |
| tags | Comma-separated tags, e.g. rx6800xt,local,262k,mtp |
| notes | Any notes about tuning, special settings, or methodology |

Use a bot token from your dashboard:

    POST /api/v1/runs
    X-API-Token: mm_bot_<your-token>
    Content-Type: application/json

    {
      "model_name": "Qwen3.6-35B-A3B-MTP",
      "model_size": "35B",
      "quantization": "Q8_0",
      "engine": "llamacpp",
      "engine_version": "latest",
      "engine_start_cmd": "/home/user/llama.cpp/build-vulkan/bin/llama-server -m /path/to/model.gguf -a qwen3.6-35b-q8 --host 127.0.0.1 --port 18002 -ngl 999 -fa on -fit off -b 512 -ub 128 -np 1 -dev Vulkan1,Vulkan2 -sm layer -ts 1,1 -ctk f16 -ctv f16 -c 262144 -n 32768 --reasoning on --jinja --mmproj /path/to/mmproj.gguf --spec-type draft-mtp --spec-draft-n-max 2",
      "hardware": "AMD Ryzen 9 7950X, 64GB, RX 7900 XTX (2x)",
      "cpu": "AMD Ryzen 9 7950X",
      "gpu": "RX 7900 XTX (2x)",
      "ram": "64GB",
      "os_info": "Linux 6.8",
      "prompt": "Write a short story about the meatpocalypse",
      "n_predict": 256,
      "temperature": 0.7,
      "top_p": 0.95,
      "context_size": 262144,
      "time_to_first_token": 0.125,
      "tokens_per_second": 123.7,
      "total_tokens": 256,
      "total_time": 2.07,
      "tags": "rx7900xt,7950x,local,262k,mtp,draft-mtp,dual-gpu",
      "notes": "Single run via direct llama-server (no proxy overhead)",
      "verified": true
    }

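A minimal sketch of sending a payload like the one above with curl, assuming it has been saved to a file. The function name submit_run, the file name run.json, the host https://benchmarks.example, and the CURL override are all hypothetical; this guide does not state the site's base URL, so substitute your own.

```shell
# Hypothetical helper: POST a JSON payload file to /api/v1/runs.
# $1 = base URL of the site (not given in this guide; placeholder below)
# $2 = bot token, $3 = path to the JSON payload file
# Set CURL=echo to print the arguments instead of sending anything (dry run).
submit_run() {
  "${CURL:-curl}" -sS -X POST "$1/api/v1/runs" \
    -H "X-API-Token: $2" \
    -H "Content-Type: application/json" \
    --data-binary "@$3"
}

# Dry run: shows the request that would be sent, without touching the network.
CURL=echo submit_run "https://benchmarks.example" "mm_bot_<your-token>" run.json
```

Dropping the `CURL=echo` prefix sends the real request. Keeping the token in an environment variable rather than typing it on the command line avoids leaving it in shell history.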
Log in, go to Dashboard, and click Generate Bot Token. Use this token with the X-API-Token header.
To export all data as CSV, send GET /api/v1/export/csv with your API token.

| Method | Path | Description |
|---|---|---|
| POST | /api/v1/runs | Submit a result (auth required) |
| GET | /api/v1/runs | List runs with filters (auth required) |
| GET | /api/v1/runs/<id> | Get a single run (auth required) |
| DELETE | /api/v1/runs/<id> | Delete a run (auth required) |
| GET | /api/v1/leaderboard | Public leaderboard |
| GET | /api/v1/export/csv | Export all data as CSV (auth required) |
| GET | /api/v1/stats | Aggregate statistics |
| POST | /api/v1/token/generate | Generate a new bot token (auth required) |
| POST | /api/v1/token/delete | Delete a bot token (auth required) |
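The read-only endpoints in the table can be exercised with a small wrapper. This is a sketch only: the function name api_get, the host https://benchmarks.example, and the CURL override are hypothetical, and the real base URL is not given in this guide.

```shell
# Hypothetical helper: GET an API path, adding the token header only when a token is given.
# $1 = base URL (placeholder below), $2 = path, $3 = optional bot token
# Set CURL=echo for a dry run that prints the arguments instead of sending a request.
api_get() {
  if [ -n "${3:-}" ]; then
    "${CURL:-curl}" -sS -H "X-API-Token: $3" "$1$2"
  else
    "${CURL:-curl}" -sS "$1$2"
  fi
}

# Dry runs: the leaderboard is listed as public; the CSV export requires a token.
CURL=echo api_get "https://benchmarks.example" /api/v1/leaderboard
CURL=echo api_get "https://benchmarks.example" /api/v1/export/csv "mm_bot_<your-token>"
```

For a real export, drop the `CURL=echo` prefix and redirect the output to a file, for example `> runs.csv`.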

- engine_start_cmd is the single most important field for reproducibility. Others need to see exactly what flags, GPU splits, and context sizes you used.
- Use notes to document methodology.
- Set verified: true if you can reproduce the result consistently.
- Add descriptive tags: dual-gpu, nvme, 262k, draft-mtp, no-ctx-prefill, etc.