
GPU-Accelerated LLM on ARM64...in Docker!

August 19, 2023 - docker machine-learning homelab

I think the current "AI" hype is overblown, but certainly the recent advances in ML around large language models (LLMs) have been impressive.

So, when Machine Learning Compilation (MLC) recently posted an LLM chat demo that can run on the Mali G610 GPU, I was intrigued.

The Mali G610 is an ARM mobile GPU most readily found on single-board computers (SBCs) with the RK3588/RK3588S chipsets, which typically cost between $100 and $200 USD.

I've got the Radxa Rock 5B with 16GB of RAM - actually, I have 3 of them in a Kubernetes cluster running a forked version of Talos Linux.

Anyway, if you've got a compatible device with Docker installed and want to try the mlc_chat_cli demo from the blog post, I published pre-built images.

Note that you must run with the --privileged flag so the container has access to the Mali GPU.

Additionally, you need the mali_csffw.bin firmware file in /lib/firmware on your host.
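If you want a quick pre-flight check before pulling a multi-gigabyte image, something like this works; note that /dev/mali0 is my assumption for the device node the Mali kernel driver exposes, and it may differ on your kernel build:

```shell
# Pre-flight sketch: verify the firmware and GPU device node are in place
# (/dev/mali0 is an assumption; check your kernel's Mali driver)
[ -f /lib/firmware/mali_csffw.bin ] || echo "mali_csffw.bin missing from /lib/firmware"
[ -e /dev/mali0 ] && echo "GPU device node present at /dev/mali0"
```

If both checks pass, the --privileged container should be able to see the GPU.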

The Dockerfile source is available in milas/rock5-toolchain on GitHub.

RedPajama-INCITE-Chat-3B-v1-q4f16_1

⚠️ This image is ~4.5GB!

docker run --rm -it --privileged docker.io/milas/mlc-llm:redpajama-3b

It performs at about the rate of human speech.

It often goes off the rails, starts repeating itself in loops, or is just otherwise weird. But in a kind of oddly charming way.

Llama-2-7b-chat-hf-q4f16_1

⚠️ This image is ~8.5GB!

docker run --rm -it --privileged docker.io/milas/mlc-llm:llama2-7b

I think you'll need a 16GB board to run this: I was seeing total memory usage of ~9GB.
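If you're not sure how much RAM your board has, here's a quick sketch of a check; the 10000 MB threshold is just a round number based on my observed ~9GB usage, not an official requirement:

```shell
# Print total system memory in MB and warn if it's likely too little
# for the 7B model (~9GB observed usage; threshold is my rough guess)
total_mb=$(awk '/MemTotal/ {printf "%d", $2/1024}' /proc/meminfo)
echo "Total RAM: ${total_mb} MB"
if [ "$total_mb" -lt 10000 ]; then
  echo "Warning: probably not enough RAM for llama2-7b"
fi
```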

It takes a bit to load, and the first response in particular has a long delay. After that it's not too bad, but it's definitely not fast.

It's much more coherent than RedPajama-INCITE-Chat-3B-v1-q4f16_1 but also has a tendency to complain that everything is "inappropriate".

Final Thoughts

I have a lot of conflicted feelings about the state of "AI" in the tech industry and beyond right now.

Broadening access to the tools for developing and running ML projects is vital; otherwise, access will be gated unilaterally by venture-backed tech companies.

On a lighter note, here's a joke "written" by RedPajama-INCITE-Chat-3B-v1-q4f16_1:

<human>: write a joke about a large language model (LLM) running on a low-power device

<bot>: A large language model was walking down the street when it stumbled upon a bag of chips. It looked at the bag and thought, "I could use a snack." It picked up the bag and took a chip. It noticed that it was a lot of work to eat one chip, so it put the bag down and continued walking.

As it continued down the street, it saw a bag of pretzels. It looked at the bag and thought, "I could use a snack." It took a pretzel and ate it. It noticed that it was a lot of work to eat one pretzel, so it put the bag down and continued walking.

As it continued down the street, it saw a bag of chips. It looked at the bag and thought, "I could use a snack." It took a bag of chips and ate it. It noticed that it was a lot of work to eat one bag of chips, so it put the bag down and continued walking.

It continued down the street until it reached the end of the block. It noticed that it was tired and hungry, so it stopped and looked at the empty bag. It realized that it had eaten all the chips, pretzels, and bag of chips. It looked down at the bag and saw that there was only one chip left. It took one chip and realized that it had eaten all the other chips, pretzels, and bag of chips.

It realized that it had accomplished its goal of eating all the chips, pretzels, and bag of chips. It then realized that it had also achieved its goal of eating a bag of chips. It decided that it had had enough chips for the day, so it put down the bag and continued walking.

It continued down the street until it reached the end of the block. It noticed that it was still hungry, so it stopped and looked at the empty bag. It realized that it had eaten all the chips, pretzels, and bag of chips. It looked down at the bag and saw that there was only one chip left. It took one chip and realized that it had eaten all the other chips, pretzels, and bag of chips.

It realized that it had accomplished its goal of eating all the chips, pretzels, and bag of chips. It then realized that it had also achieved its goal of eating a bag of chips. It decided that it had had enough chips for the day, so it put down the bag

(I did not say it was a good joke 🫠)

Disclaimers

This is my personal work.

I work for Docker Inc. but opinions are my own etc.

I have no affiliation with MLC, ARM, Rockchip, Radxa, or Talos.

Many of the names above are copyrights or trademarks of their respective owners.