Llama 3.1 8B (Meta)
OpenAI-compatible inference endpoints on production-ready GPUs. No cluster management. Just your model, live in minutes.
curl https://{id}.api.inferfly.ai/v1/chat/completions \
  -H "Authorization: Bearer ifk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "{model}",
    "messages": [{"role": "user", "content": "{message}"}]
  }'
Browse the catalog of pre-validated open-source models. Each comes with a tested GPU configuration.
Select your GPUs and deployment hours. Transparent, prepaid pricing with a 1-hour minimum. Unused whole hours are refunded.
Your OpenAI-compatible API endpoint goes live in under 5 minutes. Drop it into your existing code — LangChain, LlamaIndex, or raw HTTP.
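The curl call above can also be made from Python with nothing but the standard library. A minimal sketch, assuming a hypothetical endpoint id, API key, and model id (substitute your own values from the dashboard):

```python
import json
import urllib.request

def build_chat_request(endpoint_id: str, api_key: str,
                       model: str, message: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completion request.

    All four arguments are placeholders; fill in your own deployment values.
    Send the returned request with urllib.request.urlopen(req).
    """
    url = f"https://{endpoint_id}.api.inferfly.ai/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": message}],
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Hypothetical values for illustration only.
req = build_chat_request("abc123", "ifk_example", "llama-3.1-8b", "Hello!")
print(req.full_url)
```

Because the endpoint follows the OpenAI wire format, the same request body works against any OpenAI-compatible client.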
Every model ships with a pre-validated GPU configuration. No trial-and-error, no OOM surprises.
Catalog vendors include Meta, Microsoft, Alibaba, and MiniMax. Custom deployment available on request.
Everything you need to run models in production.
API surface
Your existing LangChain, LlamaIndex, or OpenAI SDK code works unchanged. Swap the base URL. That's it.
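The base-URL swap amounts to rewriting the host while keeping the /v1/... path intact. A small stdlib sketch, using a hypothetical endpoint id:

```python
from urllib.parse import urlsplit, urlunsplit

def swap_base_url(request_url: str, new_base: str) -> str:
    """Repoint an OpenAI-style request URL at a different base.

    The /v1/... path, query, and fragment are preserved, so client
    code that builds paths on top of the base URL needs no other change.
    """
    old = urlsplit(request_url)
    new = urlsplit(new_base)
    return urlunsplit((new.scheme, new.netloc, old.path, old.query, old.fragment))

# Hypothetical endpoint id; only the host changes.
print(swap_base_url(
    "https://api.openai.com/v1/chat/completions",
    "https://abc123.api.inferfly.ai",
))
```

In practice you set this once, e.g. as the base_url/api_base option of whichever SDK you already use.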
Hardware
Every model ships with a tested GPU pairing. No quantization guesswork, no OOM surprises at 3 AM.
Observability
Latency, throughput, and token usage at a glance. All in one place.
Governance
Usage tracking and rate limits per API key. Built for teams and agencies serving multiple clients.
Realtime
Server-sent events for real-time token streaming. Same interface as the OpenAI streaming API.
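In the OpenAI streaming format, each SSE event carries a JSON chunk whose delta holds the next token(s), and the stream ends with a [DONE] sentinel. A minimal parser sketch over already-decoded lines (the sample stream below is illustrative, not captured from the API):

```python
import json

def iter_sse_deltas(lines):
    """Yield content deltas from OpenAI-style SSE chat-completion chunks.

    Event lines look like: data: {"choices":[{"delta":{"content":"..."}}]}
    The stream terminates with the sentinel line: data: [DONE]
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and comment keep-alives
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Illustrative sample stream following the OpenAI chunk shape.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    '',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
print("".join(iter_sse_deltas(sample)))  # Hello
```

Since the interface matches the OpenAI streaming API, SDKs that already handle streaming need no parser of their own.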
Billing
Pay by the hour, prepaid. Unused whole hours are returned as account balance.