- **Performance:** 40–60% lower latency than Candle, 2–3× higher throughput, 30–50% lower memory usage, and 16× faster model loading.
- **Backends:** CPU (AVX-512/NEON), CUDA, Metal, Vulkan, WebGPU, and ROCm through a unified abstraction layer.
- **Formats & architectures:** GGUF, SafeTensors, PyTorch, ONNX, TensorRT, and MLX formats; LLaMA, GPT, BERT, and 20+ other architectures.
- **Multimodal:** vision (LLaVA, CLIP, Qwen-VL), speech (Whisper, VITS), audio (MusicGen), and 3D (PointNet++, NeRF).
- **Agentic:** built-in RAG pipeline, MCP protocol support, RDF/SPARQL knowledge base, and multi-agent orchestration.
- **Security & compliance:** FIPS 140-3, CMMC 2.0, CVE tracking, the MITRE ATT&CK framework, and hash-chain audit logs.
```bash
# Serve a GGUF model with an OpenAI-compatible API
ironworks serve llama-3-70b.Q4_K_M.gguf \
  --port 8080 \
  --gpu-layers 80 \
  --ctx-size 8192
```
```bash
# Multi-model serving: register each model under an alias
ironworks serve \
  --model llama-3-70b.gguf --alias llama \
  --model qwen2-7b.gguf --alias qwen \
  --port 8080
```
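With several aliases registered, it helps to confirm what the server actually loaded. A minimal check, assuming ironworks exposes the standard OpenAI-style `GET /v1/models` listing alongside the chat route shown below:

```bash
# List registered model aliases (assumes the standard /v1/models route)
curl http://localhost:8080/v1/models
```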
```bash
# Chat completion (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
  }'
```
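For token-by-token output, the OpenAI chat schema defines a `stream` flag that switches the response to server-sent events. A sketch, assuming ironworks honors that flag as part of its OpenAI compatibility:

```bash
# Stream tokens as server-sent events; -N disables curl's output buffering
# (assumes the server supports the standard "stream" field)
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```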
```bash
# Embedding generation from the CLI
ironworks embed --model nomic-embed.gguf --input "Hello world"
```
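The CLI call covers one-off use; for applications, an HTTP route in the OpenAI embeddings shape would be the natural counterpart. A sketch, assuming the server's OpenAI compatibility extends to `/v1/embeddings` and that the embedding model is served under a hypothetical `nomic` alias:

```bash
# Embeddings over HTTP (assumes a /v1/embeddings route and a "nomic" alias)
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic",
    "input": "Hello world"
  }'
```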