jetson-llm-benchmark

Benchmark Jetson LLM/VLM serving performance across vLLM, llama.cpp, and Ollama with structured JSON output.

SKILL.md preview

---
name: jetson-llm-benchmark
description: Benchmark Jetson LLM/VLM serving performance across vLLM, llama.cpp, and Ollama with structured JSON output.
version: 0.0.2
license: "Apache-2.0"
metadata:
  author: "Jetson Team"
  tags: [jetson, llm, benchmark]
  languages: [bash]
  data-classification: public
---

# Jetson LLM Benchmark

Reproducible Jetson benchmarks with **structured JSON output** so an agent can compare runs. Encodes the workflow from the [Jetson AI Lab GenAI Benchmarking tutorial](https://www.jetson-ai-lab.com/tutorials/genai-benchmarking/).

## Purpose

Measure deployed LLM latency and throughput on a Jetson target using the correct
runtime-specific benchmark wrapper. Use the JSON output to compare models,
runtime flags, power modes, and before/after tuning changes.

## Prerequisites

- Run on the Jetson device that hosts the model runtime.
- For vLLM, start the OpenAI-compatible vLLM server first and know the served
model ID.
- For Ollama, ensure the Ollama daemon is reachable at `--endpoint` and the
named model is already pulled.
- For llama.cpp/GGUF, provide a readable `.gguf` model path on the host.
- Put the device in the intended power mode before measuring. MAXN is preferred
for comparable performance numbers.
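
The prerequisites above can be checked before a run with a small preflight sketch. The helper names (`check_gguf`, `list_vllm_models`, `list_ollama_models`, `show_power_mode`) are hypothetical and not part of the skill. The base URLs are assumptions: `localhost:8000` for vLLM and Ollama's default port 11434. `nvpmodel -q` only reports the active mode; mode IDs differ between Jetson modules, so the sketch does not try to set MAXN.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; none of these names ship with the skill.

# Succeed only if the path ends in .gguf and is a readable regular file.
check_gguf() {
  case "$1" in
    *.gguf) [ -f "$1" ] && [ -r "$1" ] ;;
    *) return 1 ;;
  esac
}

# Print the /v1/models response of a vLLM OpenAI-compatible server,
# which includes the served model IDs (base URL is an assumption).
list_vllm_models() {
  curl -fsS "${1:-http://localhost:8000}/v1/models"
}

# Print the models already pulled into Ollama (11434 is Ollama's default port).
list_ollama_models() {
  curl -fsS "${1:-http://localhost:11434}/api/tags"
}

# Report the active power mode; nvpmodel exists only on Jetson and may need sudo.
show_power_mode() {
  if command -v nvpmodel >/dev/null 2>&1; then
    nvpmodel -q
  else
    echo "nvpmodel not found; not a Jetson device?" >&2
    return 1
  fi
}
```

A typical use would be `check_gguf /path/to/model.gguf && show_power_mode` before composing the benchmark command.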

## Available Scripts

| Script | Purpose | Arguments |
|--------|---------|-----------|
| `scripts/bench_vllm.sh` | Runs `vllm bench serve` against a running OpenAI-compatible vLLM server. | `--model`, `--endpoint`, `--concurrency`, `--input-len`, `--output-len`, `--num-prompts`, `--no-warmup`, `--container`, `--native`. |
| `scripts/bench_llama_cpp.sh` | Runs `llama-bench` for a local GGUF model through the Jetson-appropriate NVIDIA-AI-IOT llama.cpp container. | `--model`, `--n-prompt`, `--n-gen`, `--n-gpu-layers`, `--threads`, `--container`. |
| `scripts/bench_ollama.sh` | Benchmarks a local or containerized Ollama daemon through the `/api/generate` REST API. | `--model`, `--endpoint`, `--num-prompts`, `--input-len`, `--output-len`, `--no-warmup`. |

If your agent runtime supports `run_script`, invoke the selected wrapper directly with the user-provided model identifier or local model path, then summarize the returned JSON. Otherwise run the wrapper with `bash {baseDir}/scripts/<wrapper-name> ...`.
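
When the wrapper is run through `bash`, one way to keep runs comparable is to save each result to its own file and confirm it parses as JSON before summarizing it. The sketch below assumes the wrapper prints only JSON on stdout, which should be confirmed against the wrapper's own output. `save_run` is a hypothetical helper, `BASE_DIR` stands in for `{baseDir}`, and the file name is illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, store its stdout in a file,
# and fail if that file is not valid JSON.
save_run() {
  local out="$1"
  shift
  mkdir -p "$(dirname "$out")" || return 1
  "$@" > "$out" || return 1
  # Validate with jq when present, otherwise with Python's json.tool.
  if command -v jq >/dev/null 2>&1; then
    jq empty "$out" >/dev/null 2>&1
  else
    python3 -m json.tool "$out" >/dev/null 2>&1
  fi
}

# Example invocation (placeholders, not executed here):
# save_run results/vllm-maxn.json \
#   bash "$BASE_DIR/scripts/bench_vllm.sh" --model "$SERVED_MODEL_ID" --concurrency 1,8
```

Keeping one file per configuration makes before/after comparisons a matter of reading two saved files rather than rerunning the benchmark.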

## Instructions

Always use the matching wrapper script for the runtime; do **not** invoke `vllm bench serve`, `llama-bench`, or `curl` against `/api/generate` by hand:

- vLLM → `scripts/bench_vllm.sh` (required for the vLLM path)
- llama.cpp / GGUF → `scripts/bench_llama_cpp.sh` (required for the GGUF path)
- Ollama → `scripts/bench_ollama.sh` (required for the Ollama path)

These wrappers handle warmup, the NVIDIA-AI-IOT container selection, and JSON emission. Calling the underlying tool directly will not satisfy the output contract below.

For "how do I benchmark/measure" questions, first run the matching wrapper with
`--help` to verify the exact options, then answer with the wrapper command. Do
not run a full benchmark unless the user asks you to execute it or the required
server/model path is already confirmed.
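
The help-first rule can be sketched as a lookup from the runtime the user named to the wrapper's `--help` command. The function names are hypothetical, `BASE_DIR` stands in for `{baseDir}`, and the accepted runtime spellings are an assumption. The sketch only prints the command and leaves execution to the caller.

```shell
#!/usr/bin/env bash
# BASE_DIR stands in for {baseDir}; defaults to the current directory.
BASE_DIR="${BASE_DIR:-.}"

# Map a runtime name (case-insensitive) to its wrapper script file name.
wrapper_for_runtime() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    vllm) echo "bench_vllm.sh" ;;
    llama.cpp|gguf|llama-server) echo "bench_llama_cpp.sh" ;;
    ollama) echo "bench_ollama.sh" ;;
    *) echo "unsupported runtime: $1" >&2; return 1 ;;
  esac
}

# Print the --help command to run before composing an answer.
help_command() {
  local wrapper
  wrapper="$(wrapper_for_runtime "$1")" || return 1
  echo "bash $BASE_DIR/scripts/$wrapper --help"
}
```

For example, `help_command vllm` prints `bash ./scripts/bench_vllm.sh --help` when `BASE_DIR` is unset.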

## Expected Workflow

Pick exactly one wrapper based on the LLM runtime the user named, and invoke that
wrapper with `--help` before composing the answer. Do not merely mention the
script name. If the agent runtime does not execute scripts relative to the skill
directory, use `{baseDir}/scripts/<wrapper-name>`.

- Existing vLLM OpenAI-compatible server at `localhost:8000`: run
  `{baseDir}/scripts/bench_vllm.sh --help`, then show a command using
  `--concurrency 1,8` and the served model ID.
- llama.cpp / GGUF / `llama-server`: run
  `{baseDir}/scripts/bench_llama_cpp.sh --help`, then show a command for the
  GGUF model path and report that prompt-processing speed maps to TTFT and
  generation speed maps to ITL/TPOT and throughput.
- Ollama: run `{baseDir}/scripts/bench_ollama.sh --help`, then show a command
  with `--model <ollama-tag>`. Do not use the vLLM or llama.cpp wrappers for
  Ollama.
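
The mapping from llama.cpp prompt/generation speed to serving metrics can be approximated with simple arithmetic. This is a rough single-stream estimate that ignores scheduling and network overhead: TTFT is about the prompt length divided by the prompt-processing rate, and TPOT/ITL is about the inverse of the generation rate. The function names are hypothetical, and the numbers in the usage note are arbitrary inputs, not measurements.

```shell
#!/usr/bin/env bash
# Rough TTFT in seconds: prompt tokens / prompt-processing tokens per second.
estimate_ttft_s() {
  awk -v n="$1" -v pp="$2" 'BEGIN { printf "%.3f\n", n / pp }'
}

# Rough TPOT/ITL in milliseconds: 1000 / generation tokens per second.
estimate_tpot_ms() {
  awk -v tg="$1" 'BEGIN { printf "%.1f\n", 1000 / tg }'
}
```

With arbitrary inputs, `estimate_ttft_s 512 1000` prints `0.512` and `estimate_tpot_ms 25` prints `40.0`.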

## When to use

- "Benchma

…