Pricing

Transparent pricing.
No hidden fees.

Pick a model, deploy in minutes, pay by the hour. 1-hour minimum per deployment. Unused whole hours are refunded to your account balance.

Tier	Parameters	Approx. price / hr	Example models
Small	Up to 13B	~$1.50	Llama 3.1 8B, Qwen 3.5 9B
Medium	20B – 40B	~$5.00	Qwen 3.5 27B, Qwen 3.5 35B, GPT OSS 20B
Large	70B – 130B	~$11.50	Llama 3.3 70B, Qwen 3.5 122B, Qwen3 Coder Next
XL / MoE	200B+	~$17.00	Llama 4 Scout, MiniMax M2.5

Exact pricing depends on the model. See the full catalog below.

Full model catalog

Every row is a deployable configuration.

Llama

Model name	Parameters	Max context length	Price / hr
Llama 3.1 8B Instruct	8B	128k	$1.50
Llama 3.3 70B Instruct	70B	128k	$12.00
Llama 4 Scout 17B-16E Instruct	109B (17B active)	10M	$17.00

Qwen

Model name	Parameters	Max context length	Price / hr
Qwen 3.5 9B	9B	256k	$1.50
Qwen 3.5 27B	27B	256k	$6.00
Qwen 3.5 35B	35B	256k	$6.00
Qwen 3.5 122B (FP8)	125B	256k	$13.00
Qwen3 Coder Next	80B	256k	$12.00
Qwen3 Coder Next (FP8)	80B	256k	$6.50

GPT OSS

Model name	Parameters	Max context length	Price / hr
GPT OSS 20B	22B	128k	$3.50
GPT OSS 120B	120B	128k	$6.50

MiniMax

Model name	Parameters	Max context length	Price / hr
MiniMax M2.5	229B	200k	$17.00
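
As a rough illustration of how the catalog can be used when planning, the sketch below copies the rows above into a list and filters them by hourly budget and minimum context length. The function name `models_within` and the data layout are hypothetical, not part of any Inferfly API, and the prices are the reference prices listed on this page.

```python
# Rows copied from the catalog tables above:
# (model name, max context in thousands of tokens, reference price per hour in USD)
CATALOG = [
    ("Llama 3.1 8B Instruct", 128, 1.50),
    ("Llama 3.3 70B Instruct", 128, 12.00),
    ("Llama 4 Scout 17B-16E Instruct", 10_000, 17.00),
    ("Qwen 3.5 9B", 256, 1.50),
    ("Qwen 3.5 27B", 256, 6.00),
    ("Qwen 3.5 35B", 256, 6.00),
    ("Qwen 3.5 122B (FP8)", 256, 13.00),
    ("Qwen3 Coder Next", 256, 12.00),
    ("Qwen3 Coder Next (FP8)", 256, 6.50),
    ("GPT OSS 20B", 128, 3.50),
    ("GPT OSS 120B", 128, 6.50),
    ("MiniMax M2.5", 200, 17.00),
]


def models_within(budget_per_hour, min_context_k=0):
    """Return model names that fit the budget and context need, cheapest first."""
    matches = [
        row for row in CATALOG
        if row[2] <= budget_per_hour and row[1] >= min_context_k
    ]
    # sorted() is stable, so equally priced models keep their catalog order.
    return [name for name, _, _ in sorted(matches, key=lambda row: row[2])]


if __name__ == "__main__":
    # Models costing at most $6.50/hr with at least a 256k context window.
    for name in models_within(6.50, min_context_k=256):
        print(name)
```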

Bring Your Own Model

Have a fine-tuned or custom model? Contact us and we'll help you deploy it on Inferfly. Pricing is custom based on your GPU tier. Volume and term discounts available.

GPU tier	VRAM	Max context length	Price / hr
L40S	48 GB	Custom	Custom
A100 SXM	80 GB	Custom	Custom
H100 SXM	80 GB	Custom	Custom
H200 SXM	141 GB	Custom	Custom
B200	180 GB	Custom	Custom
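
To get a feel for which GPU tier a custom model might need before contacting us, the sketch below applies a common rule of thumb: weight memory is roughly parameter count times bytes per parameter, plus headroom for the KV cache and activations. The 20% overhead factor and the function name `min_gpus_per_tier` are assumptions for illustration only; this is not how Inferfly validates configurations, and real requirements depend on context length, batch size, and serving engine.

```python
import math

# VRAM per GPU in GB, taken from the GPU tier table above.
GPU_VRAM_GB = {
    "L40S": 48,
    "A100 SXM": 80,
    "H100 SXM": 80,
    "H200 SXM": 141,
    "B200": 180,
}


def min_gpus_per_tier(params_billion, bytes_per_param, overhead=1.2):
    """Estimate the minimum GPU count per tier for a model.

    params_billion:  parameter count in billions (e.g. 70 for a 70B model)
    bytes_per_param: 2.0 for FP16/BF16, 1.0 for FP8, 0.5 for 4-bit
    overhead:        assumed multiplier for KV cache and activations
    """
    # 1e9 parameters at N bytes each is about N GB, so billions * bytes ~ GB.
    needed_gb = params_billion * bytes_per_param * overhead
    return {
        gpu: math.ceil(needed_gb / vram_gb)
        for gpu, vram_gb in GPU_VRAM_GB.items()
    }


if __name__ == "__main__":
    # A 70B model in FP16: about 140 GB of weights, about 168 GB with headroom.
    print(min_gpus_per_tier(70, 2.0))
```

Treat the result as a lower bound. Multi-GPU serving setups often need a GPU count that evenly divides the model's attention heads, so the practical count can be higher than this estimate.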

Disclaimer: The prices above are for reference only. Exact pricing depends on the model, the number of replicas, the number of concurrent users, the GPU type, and the number of GPUs.

FAQ

Frequently asked questions

How does billing work?

You pay by the hour, prepaid, with a 1-hour minimum per deployment. When you stop a deployment, any unused whole hours are refunded to your account balance.
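
The arithmetic can be sketched as follows. This is a minimal illustration, assuming that a partially used hour counts as a full billed hour (which is how we read "unused whole hours") and that hours are prepaid in whole numbers. The function name `refund_on_stop` is hypothetical and not part of any Inferfly API.

```python
import math


def refund_on_stop(rate_per_hour, prepaid_hours, hours_run):
    """Return (amount_charged, amount_refunded) in USD when a deployment stops.

    Assumptions: prepaid_hours is a whole number of at least 1, and any
    started hour is billed in full. The 1-hour minimum always applies.
    """
    if prepaid_hours < 1:
        raise ValueError("at least 1 hour must be prepaid")
    if not 0 <= hours_run <= prepaid_hours:
        raise ValueError("hours_run must be between 0 and prepaid_hours")

    # Round the used time up to whole hours, never below the 1-hour minimum.
    billed_hours = max(1, math.ceil(hours_run))
    refunded_hours = prepaid_hours - billed_hours
    return billed_hours * rate_per_hour, refunded_hours * rate_per_hour


if __name__ == "__main__":
    # Prepay 8 hours at $12.00/hr, stop after 2.5 hours:
    # 3 hours are billed and 5 whole hours return to the account balance.
    charged, refunded = refund_on_stop(12.00, prepaid_hours=8, hours_run=2.5)
    print(f"charged ${charged:.2f}, refunded ${refunded:.2f}")
```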

Can I withdraw my account balance?

Your account balance can be applied to future deployments, but it cannot be withdrawn as cash.

What happens if my deployment crashes?

If a deployment fails due to an infrastructure issue on our end, the affected hours are refunded to your account balance automatically.

Do prices change based on how many requests I send?

No. You pay for compute time, not per request or per token. Once your endpoint is live, you can send as many requests as the GPU can handle.
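
Because billing is per hour rather than per token, the effective cost per token depends entirely on how much traffic you push through the endpoint. The sketch below converts an hourly price into a cost per million tokens. The throughput and utilization figures are hypothetical inputs you would measure on your own workload, not numbers published by Inferfly.

```python
def cost_per_million_tokens(price_per_hour, tokens_per_second, utilization=1.0):
    """Effective USD cost per one million tokens on an hourly-billed endpoint.

    tokens_per_second: measured aggregate throughput while the endpoint is busy
    utilization:       fraction of each hour the endpoint is actually busy (0 to 1]
    """
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")

    tokens_per_hour = tokens_per_second * 3600 * utilization
    return price_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical: a $6.00/hr deployment sustaining 1,500 tokens/s, busy half the time.
    print(f"${cost_per_million_tokens(6.00, 1500, utilization=0.5):.2f} per 1M tokens")
```

The same hourly price becomes cheaper per token as utilization rises, which is why sustained workloads tend to benefit most from dedicated deployments.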

Can I run multiple models on the same GPU?

No. Each deployment is a dedicated GPU (or set of GPUs) running a single model. This ensures predictable latency and no noisy-neighbor issues.

What does "pre-validated GPU configuration" mean?

We test every model-GPU pairing before adding it to the catalog. You won't hit out-of-memory errors or need to guess quantization settings. The config that ships is the one that works.

Can I deploy a fine-tuned or custom model?

Yes. Contact us with your model details (parameter count, base architecture, quantization format) and we will set up a dedicated deployment with custom pricing.

Is there a free trial?

Not currently. The 1-hour minimum lets you validate your setup at low cost before committing to longer runs.

How fast does my endpoint go live?

Most models are live within 5 minutes of deployment. Larger models that require multi-GPU setups may take slightly longer.

Do you offer volume discounts?

Yes. If you are running sustained workloads or multiple deployments, contact us for custom pricing.

Enterprise

Need something bigger?

Looking to run Llama 4 Maverick, Qwen 3.5 397B, Kimi K2.5, or other large-scale models? We provision these on demand. Tell us what you need and we'll get you a dedicated deployment with custom pricing.

Talk to us
