Running LLMs locally
Context
Running LLMs (large language models) locally is now possible1 thanks to the abundance of highly parallelised compute (GPUs) at affordable prices and the advances in deep learning over the past decade.
As such, even modestly powerful consumer devices, such as my M1 MacBook Pro with 8 GB of RAM, can run a small LLM. The purpose of this post is to investigate the token speed and accuracy of a variety of LLMs on my machines.
Instructions
Courtesy of Alex Ziskind's tutorial:
- clone the frontend https://github.com/open-webui/open-webui.git
- install ollama from https://ollama.com
- in the command line, pull whichever model you like (a consolidated sketch of these steps follows below):
ollama pull llama3
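For reference, the whole setup condenses to a handful of commands. The sketch below assumes the Docker route for serving open-webui, taken from that project's README; running the frontend from the cloned source works too, but the exact commands depend on the repo's current instructions, so treat this as a sketch rather than the definitive procedure.

```sh
# Backend: install ollama (installer from https://ollama.com), then pull a model.
ollama pull llama3        # downloads the quantised llama3 weights (~4.7 GB)
ollama run llama3         # chat in the terminal; ollama also serves an API on localhost:11434

# Frontend: clone open-webui.
git clone https://github.com/open-webui/open-webui.git

# One way to serve the UI (assumed here): the Docker image from the open-webui README.
# It talks to the local ollama server and exposes a chat UI at http://localhost:3000.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```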
Results
>>> write me a 500 word summary on Dante's divine comedy

The Divine Comedy: A Journey Through Hell, Purgatory, and Paradise

Dante Alighieri's masterpiece, The Divine Comedy, is an epic poem ...
The above is an excerpt of the output produced by the prompt shown after the >>> marker, and the whole stack was trivially easy to configure.
The 8-billion-parameter model (llama3) utilised ~90% of my available RAM.
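Since the stated goal is to measure token speed, it helps to get actual numbers rather than eyeballing the stream. The sketch below leans on ollama's --verbose flag and its local /api/generate endpoint (the eval_count and eval_duration fields); if those have changed in newer releases, the current ollama docs are the authority.

```sh
# Option 1: the CLI prints timing stats after each response,
# including "prompt eval rate" and "eval rate" in tokens/s.
ollama run llama3 --verbose

# Option 2: hit the local ollama API; eval_count / eval_duration gives the
# generation speed (eval_duration is reported in nanoseconds).
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "summarise the Divine Comedy in 500 words",
  "stream": false
}' | python3 -c 'import sys, json; r = json.load(sys.stdin); print(round(r["eval_count"] / r["eval_duration"] * 1e9, 1), "tokens/s")'
```

As I understand it, "eval rate" is the generation speed in tokens per second, while "prompt eval rate" measures how quickly the prompt itself was processed.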
Opinion
I did not even have to create a new conda environment or run pip install -r requirements.txt for the open-webui framework.
It just feels as if everything has already been done for me.
Conclusion
I shall come back and rework this document once I receive my NVIDIA Orin Nano Super. I think this page will have some utility for benchmarking my hardware, but cloning and running pre-implemented LLMs is an intellectually trivial task not worth doing.
Footnotes
1 As of 2024.