
Run Models.
Skip the Ops.

OpenAI-compatible inference endpoints on production-ready GPUs. No cluster management. Just your model, live in minutes.

inferfly — bash
curl \
  https://{id}.api.inferfly.ai/v1/chat/completions \
  -H "Authorization: Bearer ifk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "{model}",
    "messages": [{"role": "user", "content": "{message}"}]
  }'
OpenAI-compatible · vLLM-powered · GPU-backed · Bring your own model

From zero to inference in three steps

Choose a model

Browse the catalog of pre-validated open-source models. Each comes with a tested GPU configuration.

Configure your deployment

Select your GPUs and deployment hours. Pricing is transparent and prepaid, with a 1-hour minimum. Unused whole hours are refunded to your account balance.

Get your endpoint

Your OpenAI-compatible API endpoint goes live in under 5 minutes. Drop it into your existing code: LangChain, LlamaIndex, or raw HTTP.
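For the raw HTTP route, the curl call above can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_chat_request` and `send` are hypothetical, and the deployment ID, API key, and model name are placeholders you would replace with your own values.

```python
import json
import urllib.request


def build_chat_request(deployment_id: str, api_key: str,
                       model: str, message: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for one deployment.

    The URL pattern mirrors the curl example on this page; everything else
    about the service is an assumption.
    """
    url = f"https://{deployment_id}.api.inferfly.ai/v1/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": message}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Building the request separately from sending it lets you inspect the URL, headers, and payload before any network traffic happens.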

Run the models that matter

Every model ships with a pre-validated GPU configuration. No trial-and-error, no OOM surprises.

GPT-OSS 20B

OpenAI

RTX Pro 6000 / H200

Managed inference. Not managed complexity.

Everything you need to run models in production.

API surface

Drop-in OpenAI compatibility

Your LangChain, LlamaIndex, or OpenAI SDK code works unchanged. No wrapper libraries, no migration. Just swap the base URL.
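As an illustration of swapping the base URL, here is a hedged sketch using the OpenAI Python SDK (version 1.x, installed separately with `pip install openai`). The helper names are hypothetical, and the base URL is inferred from the curl example on this page, on the assumption that the SDK appends `/chat/completions` to a base ending in `/v1`.

```python
def inferfly_base_url(deployment_id: str) -> str:
    """Return the assumed OpenAI-style base URL for one deployment."""
    return f"https://{deployment_id}.api.inferfly.ai/v1"


def make_client(deployment_id: str, api_key: str):
    """Create an OpenAI SDK client that points at the deployment.

    The import is local so the URL helper works without the SDK installed.
    """
    from openai import OpenAI  # third-party package

    return OpenAI(base_url=inferfly_base_url(deployment_id), api_key=api_key)


# Usage (placeholders, not real credentials):
# client = make_client("your-deployment-id", "ifk_...")
# reply = client.chat.completions.create(
#     model="your-model",
#     messages=[{"role": "user", "content": "Hello"}],
# )
# print(reply.choices[0].message.content)
```

The rest of your application code stays as it was; only the client construction changes.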

Hardware

Pre-validated GPU configs

Every model ships with a tested GPU pairing. No quantization guesswork, no OOM surprises at 3 AM.

Observability

Built-in metrics dashboard

Latency, throughput, GPU utilization, and token usage at a glance. All in one place.

Deployment

Live in minutes, not days

Go from model selection to a working endpoint in under 5 minutes. No Docker builds, no vLLM flag tuning, no cluster provisioning.

Realtime

Streaming SSE out of the box

Server-sent events for real-time token streaming. Same interface as the OpenAI streaming API.
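To show what consuming such a stream can look like, here is a minimal sketch of a parser for OpenAI-style streaming lines. It assumes the endpoint emits `data: {json}` events whose chunks carry text in `choices[].delta.content`, and ends the stream with `data: [DONE]`, as the OpenAI streaming API does. The function name `iter_stream_text` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # Some chunks carry only a role or a finish reason, no text.
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

You would feed it decoded lines from a streaming HTTP response, for example by iterating over the response object and decoding each line as UTF-8, and print each fragment as it arrives.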

Billing

Pay for hours, not guesswork

Prepaid hourly billing with a 1-hour minimum. Stop your deployment anytime and unused whole hours go back to your account balance.
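The refund rule can be sketched as simple arithmetic. This is an illustration under stated assumptions, not the actual billing logic: it assumes any hour that has started counts as used, and that at least one hour is always billed. The function name `refundable_hours` is hypothetical.

```python
import math

SECONDS_PER_HOUR = 3600
MINIMUM_BILLED_HOURS = 1  # the 1-hour minimum stated on this page


def refundable_hours(prepaid_hours: int, elapsed_seconds: float) -> int:
    """Return how many whole prepaid hours would go back to the balance.

    Assumption: a partially used hour is billed in full.
    """
    if prepaid_hours < MINIMUM_BILLED_HOURS:
        raise ValueError("prepaid_hours must be at least 1")
    if elapsed_seconds < 0:
        raise ValueError("elapsed_seconds must not be negative")
    started_hours = math.ceil(elapsed_seconds / SECONDS_PER_HOUR)
    billed_hours = min(prepaid_hours, max(MINIMUM_BILLED_HOURS, started_hours))
    return prepaid_hours - billed_hours
```

Under these assumptions, stopping a 5-hour deployment after 90 minutes bills 2 hours and returns 3.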

Ready to ship

Your model.
Your endpoint.
Live in minutes.