- **Performance:** 40–60% lower latency than Candle, 2–3× higher throughput, 30–50% lower memory usage, and 16× faster model loading.
- **Backends:** CPU (AVX-512/NEON), CUDA, Metal, Vulkan, WebGPU, and ROCm through a unified abstraction layer.
- **Formats & architectures:** GGUF, SafeTensors, PyTorch, ONNX, TensorRT, and MLX formats; LLaMA, GPT, BERT, and 20+ other architectures.
- **Multimodal:** vision (LLaVA, CLIP, Qwen-VL), speech (Whisper, VITS), audio (MusicGen), and 3D (PointNet++, NeRF).
- **Agentic:** built-in RAG pipeline, MCP protocol support, RDF/SPARQL knowledge base, and multi-agent orchestration.
- **Security & compliance:** FIPS 140-3, CMMC 2.0, CVE tracking, the MITRE ATT&CK framework, and hash-chain audit logs.
```bash
# Serve a GGUF model with an OpenAI-compatible API
ironworks serve llama-3-70b.Q4_K_M.gguf \
  --port 8080 \
  --gpu-layers 80 \
  --ctx-size 8192
```
```bash
# Multi-model serving: register each model under an alias
ironworks serve \
  --model llama-3-70b.gguf --alias llama \
  --model qwen2-7b.gguf --alias qwen \
  --port 8080
```
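With several aliases registered, it helps to confirm what the server actually loaded. A minimal check, assuming ironworks exposes the standard OpenAI-style `GET /v1/models` listing alongside the chat route shown below:

```bash
# List registered model aliases (assumes the standard /v1/models route)
curl http://localhost:8080/v1/models
```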
```bash
# Chat completion (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'
```
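For token-by-token output, the OpenAI chat schema defines a `stream` flag that switches the response to server-sent events. A sketch, assuming ironworks honors that flag as part of its OpenAI compatibility:

```bash
# Stream tokens as server-sent events; -N disables curl's output buffering
# (assumes the server supports the standard "stream" field)
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```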
```bash
# Embedding generation from the CLI
ironworks embed --model nomic-embed.gguf --input "Hello world"
```
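The CLI call covers one-off use; for applications, an HTTP route in the OpenAI embeddings shape would be the natural counterpart. A sketch, assuming the server's OpenAI compatibility extends to `/v1/embeddings` and that the embedding model is served under a hypothetical `nomic` alias:

```bash
# Embeddings over HTTP (assumes a /v1/embeddings route and a "nomic" alias)
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic",
    "input": "Hello world"
  }'
```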