Calling Ollama from Python: How to Use the Local API Endpoint
Want to use Ollama in your own Python scripts? This guide shows you how to call the local API endpoint with both basic and streaming examples—no cloud, no API keys, just fast local inference. Perfect for building apps, bots, or tools with open models.
If you're ready to move beyond the terminal and start integrating Ollama into your own apps, you'll want to get familiar with the Ollama API endpoint — and how to use it with Python.
In this guide, I’ll show you how to call Ollama programmatically using both `curl` and Python. Whether you’re building a local chatbot, scripting bulk generations, or wiring up an AI-powered tool, Ollama’s HTTP interface makes it simple.
The Ollama Endpoint
Once Ollama is installed and running, it exposes a local API at `http://localhost:11434`.

To generate a response from a model, make a POST request to the `/api/generate` endpoint. Here’s what it looks like with `curl`:
```bash
curl http://localhost:11434/api/generate -d '{
  "model": "phi3",
  "prompt": "Why is the sky blue?"
}'
```
This sends your prompt to the `phi3` model and streams the response back.
Let’s break that down:
- `model`: The name of the model you want to use (must already be installed via `ollama run` or `ollama pull`)
- `prompt`: Your input or question
The response is streamed as newline-delimited JSON chunks, which makes it ideal for real-time interfaces or applications.
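To give a sense of what that stream looks like, each line is a small JSON object. The snippet below is illustrative only (the timestamps and token splits are made up, and exact fields can vary by Ollama version):

```json
{"model":"phi3","created_at":"2024-05-01T12:00:00Z","response":"The","done":false}
{"model":"phi3","created_at":"2024-05-01T12:00:00Z","response":" sky","done":false}
{"model":"phi3","created_at":"2024-05-01T12:00:01Z","response":"","done":true}
```

You rebuild the full answer by concatenating the `response` fields in order; the final chunk has `done` set to `true` and typically carries extra metadata such as timing stats.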
🧠 Tip: You can swap `"phi3"` with any model you’ve installed locally, like `"llama3"` or `"mistral"`. Use `ollama list` to see what’s available.
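If you’d rather check from code instead of shelling out to `ollama list`, the same local server also exposes a `/api/tags` endpoint that lists installed models. A minimal sketch, assuming the default port and the `requests` library:

```python
import requests

# /api/tags returns the locally installed models as JSON
tags = requests.get('http://localhost:11434/api/tags')
tags.raise_for_status()

for model in tags.json().get('models', []):
    print(model['name'])  # e.g. "phi3:latest"
```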
Using the Ollama API in Python
Ollama's local API makes it easy to integrate models into your own Python scripts. Let’s start with a simple request-response flow, then move on to streaming.
Basic Response (Non-Streaming)
If you just want the full response back — no fancy streaming — you can send a regular POST request and read the result once it's done:
```python
import requests

url = 'http://localhost:11434/api/generate'
# 'stream': False tells Ollama to return one JSON object
# instead of newline-delimited chunks
data = {'model': 'phi3', 'prompt': 'Why is the sky blue?', 'stream': False}

response = requests.post(url, json=data)
response.raise_for_status()
print(response.json()['response'])
```
This works like a traditional API call: send a prompt, get the full answer back when it’s ready.
🧠 Tip: This method is easier to debug and great for quick scripts, one-off generations, or logging results.
Streaming Response (Live Output)
For a more interactive feel — like seeing the response unfold in real time — you can stream the output instead:
```python
import requests
import json

url = 'http://localhost:11434/api/generate'
data = {'model': 'phi3', 'prompt': 'Why is the sky blue?'}

# By default, Ollama streams newline-delimited JSON chunks
with requests.post(url, json=data, stream=True) as response:
    for line in response.iter_lines():
        if line:
            json_line = line.decode('utf-8')
            response_data = json.loads(json_line)
            if response_data['response']:
                print(response_data['response'], end='', flush=True)
print()  # end the line once the stream finishes
```
This streams back each chunk of the model’s response as it's generated — perfect for building chat apps, terminal tools, or anything that benefits from live feedback.
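If you’re wiring this into a larger app, it can help to wrap the loop in a small generator so the rest of your code just consumes text fragments as they arrive. Here’s a minimal sketch along those lines (the `stream_generate` name and its parameters are my own, not part of Ollama’s API):

```python
import json
import requests

def stream_generate(prompt, model='phi3', url='http://localhost:11434/api/generate'):
    """Yield response fragments from a local Ollama model as they arrive."""
    data = {'model': model, 'prompt': prompt}
    with requests.post(url, json=data, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line.decode('utf-8'))
            if chunk.get('response'):
                yield chunk['response']
            if chunk.get('done'):
                break

# Usage: print fragments as they stream in
for token in stream_generate('Why is the sky blue?'):
    print(token, end='', flush=True)
print()
```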
Why Use the Ollama API?
Running models locally with an API gives you a lot of flexibility:
- Build custom UIs or command-line tools
- Create bots or assistants that run entirely offline
- Script workflows using local generation
- Avoid latency and privacy concerns of cloud models
And because it’s just HTTP, you can use any language, not just Python.
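As a quick example of scripting a workflow, the non-streaming call from earlier drops straight into a batch loop over multiple prompts. A rough sketch (the prompts are just placeholders):

```python
import requests

url = 'http://localhost:11434/api/generate'
prompts = [
    'Summarize the water cycle in one sentence.',
    'Explain recursion to a five-year-old.',
    'Give three uses for a paperclip.',
]

# Run each prompt through the local model and print the answers
for prompt in prompts:
    data = {'model': 'phi3', 'prompt': prompt, 'stream': False}
    answer = requests.post(url, json=data).json()['response']
    print(f'> {prompt}\n{answer}\n')
```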
Next Steps
- Want more control? Check out the full Ollama API reference.