Running an LLM locally using Ollama
If you are developing or integrating software with a Large Language Model (LLM) to add chatbot or other AI functionality, you may need to host and run an LLM on your local workstation for development or testing. Several open source LLMs are available that you can freely run on your own system, provided that you have enough resources to do so.
LLMs themselves are just inert objects; they require special software that allows people and other programs to interact with them (called running the model). Unfortunately, setting up the software components to run a model yourself is akin to nailing Jello to a tree.
Luckily, Ollama makes it incredibly easy to host LLMs, and is available for x86- and ARM-based Windows, macOS, and Linux systems. It leverages the llama.cpp software library to host several different open source or custom LLMs via a service that provides the necessary model functionality. You can interact directly with this service via the ollama command to download, manage, modify, and run models, or use the REST API provided by the service to do the same from an app.
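For example, here is a quick sketch of both approaches, assuming Ollama is installed and its service is listening on the default port 11434: the ollama list command and the service's /api/tags endpoint both report the models stored on your system.

# List downloaded models with the command-line tool
ollama list

# List the same models through the REST API provided by the Ollama service
curl http://localhost:11434/api/tags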
After installing Ollama, you can open a Terminal app on your Windows, macOS, or Linux system and run the ollama pull command to download an open source model of your choice. For example, ollama pull llama3.2 will download the Llama 3.2 model to your system. You can then use ollama list to view your downloaded models, and ollama run llama3.2 to run the model in an interactive session, where you can send it requests, get help with /?, or quit with /bye.
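A complete session might look something like the following sketch; the download progress, the interactive prompt's hints, and the generated output will differ on your system:

ollama pull llama3.2    # download the Llama 3.2 model
ollama list             # confirm the model is available locally
ollama run llama3.2     # start an interactive session
>>> Write me a short poem about AI
... (the model streams its generated poem here) ...
>>> /bye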
Note that on systems without a supported GPU, Ollama performs all calculations on the CPU (it can take advantage of GPU acceleration such as Apple Metal, NVIDIA CUDA, and AMD ROCm where available, but it does not use NPUs). When a model runs on the CPU, sending it a request such as Write me a short poem about AI or What is the answer to life, the universe and everything? will spike your CPU usage to around 70% or more while the response is generated.
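You can check which processor a loaded model is using with the ollama ps command (covered in more detail below); its PROCESSOR column reports whether the model is running on the CPU, the GPU, or split across both. The sample output below is illustrative only, and the exact columns may vary between Ollama versions:

ollama ps
# NAME             ID     SIZE    PROCESSOR    UNTIL
# llama3.2:latest  ...    ...     100% CPU     ...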
If you’d rather interact with the model using a REST API, use the appropriate API code in your app to interface with it. For example, you could use the curl command within a shell script to interact with the Ollama API and the Llama 3.2 model running on the local system:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Write me a short poem about AI" }
  ]
}'
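By default, the /api/chat endpoint streams its reply back as a series of JSON objects, one per generated chunk. If you want a single JSON response that is easier to handle in a script, set stream to false. The following sketch assumes the jq utility is installed to pull the assistant's reply text out of the response:

#!/bin/sh
# Ask the local Llama 3.2 model a question and print only the reply text.
# Assumes the Ollama service is running on the default port and jq is installed.
curl -s http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "stream": false,
  "messages": [
    { "role": "user", "content": "What is the answer to life, the universe and everything?" }
  ]
}' | jq -r '.message.content'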
You can use ollama ps to view models that are currently running. When you no longer need to run a model, you can use ollama stop to free up resources. For example, ollama stop llama3.2 will stop running the Llama 3.2 model, but keep it on the system in case you want to run it again.
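For example, a minimal sketch of winding things down when you're finished (the comments describe what each command does):

ollama ps             # view models currently loaded in memory
ollama stop llama3.2  # unload Llama 3.2 and free its resources
ollama list           # the model files remain on disk for next time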