Spatial Asset Recommender – Demo LLM Server Setup
One of the powerful features of the Spatial Asset Recommender is the ability to ask a large language model (LLM) for weighted tags. To try out this feature, you need to set up an LLM server that supports chat completions.
Note: Following this procedure, you’ll have a local LLM server. Depending on your hardware, you may only be able to run smaller models. For production use, you’ll want a powerful machine with plenty of VRAM. If you have access to a machine like that, you can use a similar process to run larger models there.
Note: This guide is meant for Windows users. Linux and Mac users will need to find their own way, though the process likely looks similar.
There are two main ways to get llama.cpp:
Note: I have nothing to do with llama.cpp or the creator of llama.cpp. Downloading and running their software is at your own risk!
- Install it with winget: open a terminal by pressing Win+R and executing `cmd`, then run `winget install ggml.llamacpp`.
- Download a prebuilt release and add its folder to your `%PATH%` environment variable. Note: The CUDA version only works with an NVIDIA card.
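To confirm the installation, you can check that the server binary is discoverable via the search path. A minimal sketch in Python; the binary name `llama-server` is an assumption based on the batch file in this guide, and `shutil.which` also resolves `.exe` files on Windows:

```python
import shutil

def server_on_path(name: str = "llama-server") -> bool:
    """Return True if the executable can be found via the PATH
    (%PATH% on Windows) environment variable."""
    return shutil.which(name) is not None

print(server_on_path())
```

If this prints `False`, the install either failed or the folder containing the binary hasn’t been added to `%PATH%` yet.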
There are three ways to get the batch file I prepared; the simplest is to create it yourself. Save the following content as a `*.bat` file (e.g. `llama-server.bat`):

```bat
@echo off
llama-server.exe ^
-hf ggml-org/gemma-3-12b-it-GGUF:Q4_K_M ^
--port 8080 --host 127.0.0.1 ^
--ctx-size 16000
```
Running the server is as simple as executing the batch file. Windows will probably ask for firewall access on the first run. Other than that, there are a few things to note:
- To make the server reachable from other machines, replace `127.0.0.1` with a different address, e.g. `0.0.0.0`. Warning: Exposing your server at `0.0.0.0` makes it reachable to anyone on your network. Be sure to only run this on a secure network and at your own risk!
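Once the server is up, you can talk to it over the OpenAI-compatible chat-completions endpoint that llama.cpp’s server exposes. A minimal sketch in Python, assuming the default host and port from the batch file above; the prompt wording is only an illustration, not what the Spatial Asset Recommender actually sends:

```python
import json
import urllib.request

def build_payload(user_message: str) -> dict:
    """Assemble an OpenAI-style chat-completions request body."""
    return {
        "messages": [
            {"role": "system", "content": "Answer with weighted tags."},
            {"role": "user", "content": user_message},
        ],
        "temperature": 0.2,
    }

def ask_server(user_message: str, host: str = "127.0.0.1", port: int = 8080) -> str:
    """POST the request to the llama.cpp server and return the reply text."""
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(build_payload(user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(ask_server("Suggest tags for a medieval tavern scene."))` should print the model’s reply.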
There are a few ways to improve the performance and accuracy of the LLM server:
- `--n-gpu-layers` sets the number of LLM model layers that will be offloaded to the GPU. You can use this to maximize performance, but you’ll need to try out different values.
- `--threads` sets the number of CPU threads that will be used. It’s often good to set it to the number of physical CPU cores; sometimes it’s better to reduce that number by 1, and in some scenarios a completely different value works even better.
- If you don’t need image input, you can skip loading the multimodal projector with the `--no-mmproj` flag.
- To reduce memory usage, set the `--ctx-size` parameter to a lower value.

These are a few of many parameters you can adjust, and it’s a matter of playing around with them to find the sweet spot. There are external tools that can run benchmarks on your computer to find the best values if you need them. See also the full documentation for the llama.cpp server.
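If you’d rather measure the effect of these flags yourself than rely on external benchmark tools, a tiny timing harness is enough. A sketch in Python; `make_request` stands in for whatever client call you use against the server:

```python
import time

def average_seconds(make_request, repeats: int = 3) -> float:
    """Call make_request repeatedly and return the average wall-clock
    time per call, so different --threads / --n-gpu-layers settings
    can be compared against each other."""
    start = time.perf_counter()
    for _ in range(repeats):
        make_request()
    return (time.perf_counter() - start) / repeats
```

Run the same prompt through `average_seconds` once per server configuration and keep the flag values that give the lowest number.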