Running an LLM locally using Ollama
If you are developing or integrating software with a Large Language Model (LLM) to add chatbot or other AI functionality, you may need to host and run an LLM on your local workstation for development or testing. Several open source LLMs are available that you can freely run on your own system, provided that you have enough resources to do so.
LLMs themselves are just inert objects; they require special software that allows people and other programs to interact with them (called running the model). Unfortunately, setting up the software components to run a model yourself is akin to nailing Jello to a tree.
Luckily, Ollama makes it incredibly easy to host LLMs, and is available for x86- and ARM-based Windows, macOS, and Linux systems. It leverages the llama.cpp software library to host several different open source or custom LLMs via a service that provides the necessary model functionality. You can interact directly with this service via the ollama command to download, manage, modify, and run models, or use the REST API provided by the service to do the same from an app.
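For example, here is a quick sketch of both approaches, assuming Ollama is installed and its service is listening on the default port 11434: the ollama list command and the service's /api/tags endpoint both report the models stored on your system.

# List downloaded models with the command-line tool
ollama list

# List the same models through the REST API provided by the Ollama service
curl http://localhost:11434/api/tags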
After installing Ollama, you can open a Terminal app on your Windows, macOS, or Linux system and run the ollama pull command to download an open source model of your choice. For example, ollama pull llama3.2 will download the Llama 3.2 model to your system. You can then use ollama list to view your downloaded models, and ollama run llama3.2 to run the model in an interactive session, where you can send it requests, get help with /?, or quit with /bye.
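A complete session might look something like the following sketch; the download progress, the interactive prompt's hints, and the generated output will differ on your system:

ollama pull llama3.2    # download the Llama 3.2 model
ollama list             # confirm the model is available locally
ollama run llama3.2     # start an interactive session
>>> Write me a short poem about AI
... (the model streams its generated poem here) ...
>>> /bye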
Note that on systems without a supported GPU, Ollama performs all calculations on the CPU (it can take advantage of GPU acceleration such as Apple Metal, NVIDIA CUDA, and AMD ROCm where available, but it does not use NPUs). When a model runs on the CPU, sending it a request such as Write me a short poem about AI or What is the answer to life, the universe and everything? will spike your CPU usage to around 70% or more while the response is generated.
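You can check which processor a loaded model is using with the ollama ps command (covered in more detail below); its PROCESSOR column reports whether the model is running on the CPU, the GPU, or split across both. The sample output below is illustrative only, and the exact columns may vary between Ollama versions:

ollama ps
# NAME             ID     SIZE    PROCESSOR    UNTIL
# llama3.2:latest  ...    ...     100% CPU     ...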
If you’d rather interact with the model using a REST API, use the appropriate API code in your app to interface with it. For example, you could use the curl command within a shell script to interact with the Ollama API and the Llama 3.2 model running on the local system:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Write me a short poem about AI" }
  ]
}'
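By default, the /api/chat endpoint streams its reply back as a series of JSON objects, one per generated chunk. If you want a single JSON response that is easier to handle in a script, set stream to false. The following sketch assumes the jq utility is installed to pull the assistant's reply text out of the response:

#!/bin/sh
# Ask the local Llama 3.2 model a question and print only the reply text.
# Assumes the Ollama service is running on the default port and jq is installed.
curl -s http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "stream": false,
  "messages": [
    { "role": "user", "content": "What is the answer to life, the universe and everything?" }
  ]
}' | jq -r '.message.content'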
You can use ollama ps to view models that are currently running. When you no longer need to run a model, you can use ollama stop to free up resources. For example, ollama stop llama3.2 will stop running the Llama 3.2 model, but keep it on the system in case you want to run it again.
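For example, a minimal sketch of winding things down when you're finished (the comments describe what each command does):

ollama ps             # view models currently loaded in memory
ollama stop llama3.2  # unload Llama 3.2 and free its resources
ollama list           # the model files remain on disk for next time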