Run Models.
Skip the Ops.

OpenAI-compatible inference endpoints on production-ready GPUs. No cluster management. Just your model, live in minutes.

Join the Waitlist

inferfly — bash

curl \
  https://{id}.api.inferfly.ai/v1/chat/completions \
  -H "Authorization: Bearer ifk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "{model}",
    "messages": [{"role": "user", "content": "{message}"}]
  }'

OpenAI-compatible · vLLM-powered · GPU-backed · Bring your own model

From zero to inference in three steps

Choose a model

Browse the catalog of pre-validated open-source models. Each comes with a tested GPU configuration.

Configure your deployment

Select your GPUs and deployment hours. Pricing is transparent and prepaid, with a 1-hour minimum. Unused whole hours are refunded to your account balance.

Get your endpoint

Your OpenAI-compatible API endpoint goes live in under 5 minutes. Drop it into your existing code: LangChain, LlamaIndex, or raw HTTP.
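
For the raw HTTP route, the curl call at the top of the page maps onto a few lines of standard-library Python. This is a minimal sketch, not an official client: `build_chat_request` and `send` are illustrative helper names, and the deployment id, key, and model name you pass in are placeholders for the values shown on your own deployment.

```python
import json
import urllib.request


def build_chat_request(deployment_id: str, api_key: str, model: str, message: str) -> urllib.request.Request:
    """Build (but do not send) a chat completions request for one deployment."""
    # URL pattern taken from the curl example: https://{id}.api.inferfly.ai/v1/chat/completions
    url = f"https://{deployment_id}.api.inferfly.ai/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": message}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and body before any network call is made.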

Run the models that matter

Every model ships with a pre-validated GPU configuration. No trial-and-error, no OOM surprises.

Llama 3.1 8B

Llama 3.3 70B

Llama 4 Scout

GPT-OSS 20B (OpenAI): RTX Pro 6000 / H200

GPT-OSS 120B (OpenAI): B200

Qwen 3.5 9B (Alibaba): L40S

Qwen 3.5 27B (Alibaba): A100 / H100

Qwen 3.5 35B (Alibaba): A100 / H100

Qwen 3.5 122B (Alibaba): H100 / H200

MiniMax M2.5 (MiniMax): H100 / H200

Your Model (custom deployment): BYOM

Managed inference. Not managed complexity.

Everything you need to run models in production.

API surface

Drop-in OpenAI compatibility

Your LangChain, LlamaIndex, or OpenAI SDK code works unchanged. No wrapper libraries, no migration. Just swap the base URL.
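
As a rough illustration of the base URL swap with the OpenAI Python SDK: the sketch below assumes the `openai` package (v1 or later) is installed and that the base URL follows the hostname pattern from the curl example. `base_url_for`, `make_client`, and `ask` are hypothetical helper names used only for this example.

```python
def base_url_for(deployment_id: str) -> str:
    """Return the OpenAI-style base URL for a deployment (pattern assumed from the curl example)."""
    return f"https://{deployment_id}.api.inferfly.ai/v1"


def make_client(deployment_id: str, api_key: str):
    """Create an OpenAI SDK client pointed at the deployment instead of the default API host."""
    from openai import OpenAI  # third-party package: pip install openai

    return OpenAI(base_url=base_url_for(deployment_id), api_key=api_key)


def ask(client, model: str, message: str) -> str:
    """Send one user message and return the assistant's reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": message}],
    )
    return response.choices[0].message.content
```

Only `make_client` knows about the endpoint. The rest of the calling code is the same as it would be against any other OpenAI-compatible server.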

Hardware

Pre-validated GPU configs

Every model ships with a tested GPU pairing. No quantization guesswork, no OOM surprises at 3 AM.

Observability

Built-in metrics dashboard

Latency, throughput, GPU utilization, and token usage at a glance. All in one place.

Deployment

Live in minutes, not days

Go from model selection to a working endpoint in under 5 minutes. No Docker builds, no vLLM flag tuning, no cluster provisioning.

Realtime

Streaming SSE out of the box

Server-sent events for real-time token streaming. Same interface as the OpenAI streaming API.
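
To show what consuming such a stream might look like, here is a small parser for OpenAI-style SSE lines. It is a sketch under the assumption that the endpoint emits the same `data: {...}` chunks and closing `data: [DONE]` marker as the OpenAI streaming API when the request sets `"stream": true`. The `sample` lines are made-up input for illustration, not captured output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample events in the OpenAI chunk shape.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))  # prints "Hello"
```

The same function can be fed the decoded lines of a live HTTP response, so tokens can be displayed as they arrive.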

Billing

Pay for hours, not guesswork

Prepaid hourly billing with a 1-hour minimum. Stop your deployment anytime and unused whole hours go back to your account balance.
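
A worked example of the refund arithmetic may help. The sketch below assumes that runtime is billed in whole hours rounded up, so a partly used hour is not refunded. The page states only the 1-hour minimum and the refund of unused whole hours, so the rounding rule is an assumption and `refundable_hours` is a hypothetical helper.

```python
import math


def refundable_hours(prepaid_hours: int, hours_run: float) -> int:
    """Return how many prepaid whole hours would go back to the account balance."""
    if prepaid_hours < 1:
        raise ValueError("deployments have a 1-hour minimum")
    if hours_run < 0:
        raise ValueError("hours_run cannot be negative")
    # Assumption: started hours count as used, and at least 1 hour is always billed.
    billed_hours = max(1, math.ceil(hours_run))
    return max(0, prepaid_hours - billed_hours)


print(refundable_hours(10, 2.5))  # prints 7
print(refundable_hours(1, 0.2))   # prints 0
```

Under these assumptions, prepaying 10 hours and stopping after 2.5 hours bills 3 hours and returns 7. Multiplying the result by the deployment's hourly rate gives the amount credited.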

Ready to ship

Your model.
Your endpoint.
Live in minutes.

Deploy Your First Model
