Turn Multiple TV Boxes into an LLM Cluster Supercomputer with Dllama

LLM stands for Large Language Model. Normies will never have access to the giant horsepower needed to run one, except our homies, who are part of the global elite!

Zuhri

June 16, 2025 ( 4 min read )

Introduction #

It started in the community: my homies were running Linux on TV boxes, but they had bought them only for cheap internet purposes like bypassing the ISP. That got me interested! Time went on, and then the supercomputer was created.

The Definition of a Supercomputer #

A supercomputer is a computer, but with superpowers! tf? Yes! So what does that mean? It means the computer has far more power than usual. That compute power can be achieved with a computer cluster. What is a computer cluster? It is like having one task to compute and working on it together with your friends: it gets done quickly if the requirements are met. If not? Bottleneck! So let me introduce my TV box specification.

root@sv64
---------
OS: Debian GNU/Linux 12 (bookworm) aarch64
Host: H96 Max X3
Kernel: 6.6.93-flippy-93+
Uptime: 12 hours, 36 mins
Packages: 281 (dpkg)
Shell: bash 5.2.15
Resolution: 720x576i
Terminal: /dev/pts/0
CPU: ARMv8 rev 0 (v8l) (4) @ 1.908GHz
Memory: 115MiB / 3322MiB

That's mine! I have two of them, and the only difference is 32GB versus 64GB of storage. The compute power is about the same as an RPi 4, which makes it a good alternative. The most important requirement is flawless I/O, so make sure there is no bottleneck. My machines use eMMC and GbE.

Installation #

Starting Point #

First, you need to get Linux working on it! Then connect the machines to a switch or router, and use ssh to take control of each machine.

ssh root@192.168.1.2
ssh root@192.168.1.3

After that, install the required packages.

apt install git build-essential

Make it Works #

These commands clone the repository and then cd into it.

Note: These commands should be run on all devices.

git clone https://github.com/b4rtaz/distributed-llama.git
cd distributed-llama

After cloning it, compile the source.

make dllama
make dllama-api
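Since the install, clone, and build steps have to be repeated on every device, they can be driven from one machine. This is only a minimal sketch, assuming the two IP addresses from the ssh example, root SSH access, and that a non-interactive `apt install -y` is fine on your nodes. The names `NODES`, `build_cmd`, and `build_all` are made up for this illustration.

```shell
#!/bin/sh
# Hypothetical node list: the two addresses used in the ssh example above.
NODES="192.168.1.2 192.168.1.3"

# Print the install, clone, and build steps as one remote command line.
build_cmd() {
    printf '%s' \
        "apt install -y git build-essential && " \
        "git clone https://github.com/b4rtaz/distributed-llama.git && " \
        "cd distributed-llama && make dllama && make dllama-api"
}

# Run the build on every node.
# SSH can be overridden (for example SSH=echo) to print instead of execute.
build_all() {
    for node in $NODES; do
        ${SSH:-ssh} "root@$node" "$(build_cmd)"
    done
}
```

Nothing runs until `build_all` is called. Setting `SSH=echo` first gives a dry run that only prints what would be sent to each node.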

Then choose which machine will be the “root node” and which will be the “worker nodes”.

Root node:
The root node serves the model files (LLM data) and works as the controller. You can choose whether it runs in single-node or multi-node (clustering) mode. There is only one root node; the rest are worker nodes.
Worker node:
A worker node works as an assistant and helps the root node.

python launch.py llama3_2_1b_instruct_q40

Run the command above on the root node. It automates the configuration and downloads the LLM data, see here. I stuck with a lightweight model and waited for it to finish. Once it is done, the root node is ready to work.

So what now? Let's do a benchmark. You can put this command into a shell script.

#!/bin/sh

sudo ./dllama inference \
--model models/llama3_2_1b_instruct_q40/dllama_model_llama3_2_1b_instruct_q40.m \
--tokenizer models/llama3_2_1b_instruct_q40/dllama_tokenizer_llama3_2_1b_instruct_q40.t \
--buffer-float-type q80 \
--nthreads $(nproc) \
--max-seq-len 4096 \
--prompt "Hello world" \
--steps 16

Below is the result from my root node machine.

Evaluation
   nBatches: 32
    nTokens: 2
   tokens/s: 3.19 (313.24 ms/tok)
Prediction
    nTokens: 14
   tokens/s: 5.76 (173.63 ms/tok)
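When comparing several runs, it helps to pull just the tokens/s figures out of that output. Here is a rough sketch that assumes the output layout shown above (an `Evaluation` or `Prediction` header followed by a `tokens/s:` line). `parse_tps` is a helper name invented for this post, not part of dllama.

```shell
#!/bin/sh
# Read dllama inference output on stdin and print "<section> <tokens/s>"
# for each section. Assumes the layout shown in the result above.
parse_tps() {
    awk '
        /^Evaluation/ || /^Prediction/ { section = $1 }
        $1 == "tokens/s:" { print section, $2 }
    '
}
```

Piping the benchmark script into `parse_tps` would print `Evaluation 3.19` and `Prediction 5.76` for the run above.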

That is just a single node (the root node). What about adding a worker node? You need to add --workers 192.168.1.3:9999 to the end of the previous command. Don't forget to change the IP address and port number to match yours. Here is an example for my case.

Before you start, configure the “worker node” to listen for the root node.

sudo nice -n -20 ./dllama worker --port 9999 --nthreads $(nproc)

Then run this command on the “root node”.

#!/bin/sh

sudo ./dllama inference \
--model models/llama3_2_1b_instruct_q40/dllama_model_llama3_2_1b_instruct_q40.m \
--tokenizer models/llama3_2_1b_instruct_q40/dllama_tokenizer_llama3_2_1b_instruct_q40.t \
--buffer-float-type q80 \
--nthreads $(nproc) \
--max-seq-len 4096 \
--prompt "Hello world" \
--steps 16 \
--workers 192.168.1.3:9999

Here is my result. Evaluation speed went up by about 50% (3.19 to 4.79 tokens/s), and prediction speed by about 27% (5.76 to 7.31 tokens/s)!

Evaluation
   nBatches: 32
    nTokens: 2
   tokens/s: 4.79 (208.55 ms/tok)
Prediction
    nTokens: 14
   tokens/s: 7.31 (136.78 ms/tok)

H96 Max X3 (TvBox)	Llama3.2:1b evaluation result
1 node	tokens/s: 3.19 (313.24 ms/token)
2 nodes	tokens/s: 4.79 (208.55 ms/token)
It doesn't end here. If you want to try chat mode, you can use the command below.

#!/bin/sh

./dllama chat \
--model models/llama3_2_1b_instruct_q40/dllama_model_llama3_2_1b_instruct_q40.m \
--tokenizer models/llama3_2_1b_instruct_q40/dllama_tokenizer_llama3_2_1b_instruct_q40.t \
--buffer-float-type q80 \
--nthreads 4 \
--max-seq-len 4096

That is just a single node. Add --workers ip:port for 2 nodes, or --workers ip:port ip:port ... for more nodes.
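With more than a couple of workers, typing the list by hand gets error-prone. A possible helper, assuming every worker listens on the same port, is sketched below. `workers_arg` is a made-up name, and the .4 and .5 addresses are placeholders for illustration.

```shell
#!/bin/sh
# Build the "--workers ip:port ip:port ..." argument from a port
# followed by a list of worker IP addresses.
workers_arg() {
    port=$1
    shift
    out="--workers"
    for ip in "$@"; do
        out="$out $ip:$port"
    done
    printf '%s\n' "$out"
}

# Placeholder addresses; replace them with your own worker nodes.
workers_arg 9999 192.168.1.3 192.168.1.4 192.168.1.5
```

The result can be appended to the chat or inference command as an unquoted `$(workers_arg 9999 192.168.1.3)`, so the shell splits it into separate arguments.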

Known Limitations:

You can run Distributed Llama only on 1, 2, 4… 2^n nodes.
The maximum number of nodes is equal to the number of KV heads in the model.
Only the following quantizations are supported:
- q40 model with q80 buffer-float-type
- f32 model with f32 buffer-float-type
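The two node-count rules above can be checked before buying more boxes. This is a minimal sketch of that check. `valid_node_count` is an illustrative name, and the KV head count of 8 used below is only a placeholder, so look up the real value for your model.

```shell
#!/bin/sh
# Succeed if n is a power of two (1, 2, 4, ...) and does not exceed
# the number of KV heads in the model.
valid_node_count() {
    n=$1
    kv_heads=$2
    [ "$n" -ge 1 ] || return 1
    # A power of two has exactly one bit set, so n & (n - 1) is zero.
    [ $((n & (n - 1))) -eq 0 ] || return 1
    [ "$n" -le "$kv_heads" ]
}

# kv_heads=8 is a placeholder value, not taken from the model files.
for n in 1 2 3 4 8 16; do
    if valid_node_count "$n" 8; then
        echo "$n nodes: ok"
    else
        echo "$n nodes: not supported"
    fi
done
```

With the placeholder of 8 KV heads, 3 nodes fail the power-of-two rule and 16 nodes exceed the head count, while 1, 2, 4, and 8 pass.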

The end #

Thanks for coming. Shoutout to the dllama project.
