
Pricing

Transparent pricing.
No hidden fees.

Pick a model, deploy in minutes, pay by the hour. 1-hour minimum per deployment. Unused whole hours are refunded to your account balance.

Small

Up to 13B params

~$1.50/hr

Llama 3.1 8B, Qwen 3.5 9B

Medium

20B – 40B params

~$5.00/hr

Qwen 3.5 27B, Qwen 3.5 35B, GPT OSS 20B

Large

70B – 130B params

~$11.50/hr

Llama 3.3 70B, Qwen 3.5 122B, Qwen3 Coder Next

XL / MoE

200B+ params

~$17.00/hr

Llama 4 Scout, MiniMax M2.5

Exact pricing depends on the model. See the full catalog below.

Full model catalog

Every row is a deployable configuration.

Llama
Model name                      Parameters         Max context length  Price / hr
Llama 3.1 8B Instruct           8B                 128k                $1.50
Llama 3.3 70B Instruct          70B                128k                $12.00
Llama 4 Scout 17B-16E Instruct  109B (17B active)  10M                 $17.00

Qwen
Model name                      Parameters         Max context length  Price / hr
Qwen 3.5 9B                     9B                 256k                $1.50
Qwen 3.5 27B                    27B                256k                $6.00
Qwen 3.5 35B                    35B                256k                $6.00
Qwen 3.5 122B (FP8)             125B               256k                $13.00
Qwen3 Coder Next                80B                256k                $12.00
Qwen3 Coder Next (FP8)          80B                256k                $6.50

GPT OSS
Model name                      Parameters         Max context length  Price / hr
GPT OSS 20B                     22B                128k                $3.50
GPT OSS 120B                    120B               128k                $6.50

MiniMax
Model name                      Parameters         Max context length  Price / hr
MiniMax M2.5                    229B               200k                $17.00

Bring Your Own Model

Have a fine-tuned or custom model? Contact us and we'll help you deploy it on Inferfly. Pricing is custom based on your GPU tier. Volume and term discounts available.

GPU tier  VRAM    Max context length  Price / hr
L40S      48 GB   Custom              Custom
A100 SXM  80 GB   Custom              Custom
H100 SXM  80 GB   Custom              Custom
H200 SXM  141 GB  Custom              Custom
B200      180 GB  Custom              Custom

Disclaimer: The prices above are for reference only. Exact pricing depends on the model, replica sets, number of concurrent users, GPU type, and number of GPUs.

FAQ

Frequently asked questions

How does billing work?
You pay by the hour, prepaid, with a 1-hour minimum per deployment. When you stop a deployment, any unused whole hours are refunded to your account balance.
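As a rough illustration of the billing rule above, the arithmetic can be sketched in Python. This is only a sketch under stated assumptions: the function name `settle_deployment` is hypothetical, and the assumption that any started hour counts as used (partial hours round up) is ours; the page only says that unused whole hours are refunded.

```python
import math


def settle_deployment(prepaid_hours: int, hours_run: float, price_per_hour: float) -> dict:
    """Estimate the charge and refund for one stopped deployment.

    Assumption (not stated on the pricing page): a started hour counts as
    a used hour, so partial hours are rounded up before the refund is computed.
    """
    if prepaid_hours < 1:
        raise ValueError("a deployment needs at least 1 prepaid hour")
    if hours_run < 0:
        raise ValueError("hours_run cannot be negative")

    # 1-hour minimum per deployment; never bill more than was prepaid.
    billed_hours = min(prepaid_hours, max(1, math.ceil(hours_run)))
    # Only whole unused hours go back to the account balance.
    refunded_hours = prepaid_hours - billed_hours

    return {
        "billed_hours": billed_hours,
        "charge": billed_hours * price_per_hour,
        "refunded_hours": refunded_hours,
        "refund": refunded_hours * price_per_hour,
    }


# Example: 5 hours prepaid at $12.00/hr, stopped after 2.25 hours.
result = settle_deployment(prepaid_hours=5, hours_run=2.25, price_per_hour=12.00)
print(result)
```

Under these assumptions, the example bills 3 hours ($36.00) and returns 2 hours ($24.00) to the account balance.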
Can I withdraw my account balance?
Account balance can be used toward future deployments but cannot be withdrawn as cash.
What happens if my deployment crashes?
If a deployment fails due to an infrastructure issue on our end, the affected hours are refunded to your account balance automatically.
Do prices change based on how many requests I send?
No. You pay for compute time, not per request or per token. Once your endpoint is live, you can send as many requests as the GPU can handle.
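Because the price is per hour rather than per token, the effective per-token cost depends on how much traffic you push through the endpoint. The sketch below shows that arithmetic; the function name and the throughput figure are hypothetical placeholders, not measured or published numbers.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Effective cost per 1M tokens at a sustained throughput.

    tokens_per_second is whatever your own workload achieves on the
    deployment; it is an input you measure, not a value from the catalog.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# Example: a $1.50/hr deployment at a hypothetical sustained 1,000 tokens/s.
print(round(cost_per_million_tokens(1.50, 1000), 4))
```

The higher your sustained utilization, the lower the effective per-token cost; an idle endpoint still costs the full hourly rate.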
Can I run multiple models on the same GPU?
No. Each deployment is a dedicated GPU (or set of GPUs) running a single model. This ensures predictable latency and no noisy-neighbor issues.
What does "pre-validated GPU configuration" mean?
We test every model-GPU pairing before adding it to the catalog. You won't hit out-of-memory errors or need to guess quantization settings. The config that ships is the one that works.
Can I deploy a fine-tuned or custom model?
Yes. Contact us with your model details (parameter count, base architecture, quantization format) and we will set up a dedicated deployment with custom pricing.
Is there a free trial?
Not currently. We offer a 1-hour minimum so you can validate your setup at low cost before committing to longer runs.
How fast does my endpoint go live?
Most models are live within 5 minutes of deployment. Larger models that require multi-GPU setups may take slightly longer.
Do you offer volume discounts?
Yes. If you are running sustained workloads or multiple deployments, contact us for custom pricing.

Enterprise

Need something bigger?

Looking to run Llama 4 Maverick, Qwen 3.5 397B, Kimi K2.5, or other large-scale models? We provision these on demand. Tell us what you need and we'll get you a dedicated deployment with custom pricing.

Talk to us