Prerequisites
A private server with 8 to 12 cores (or 8 to 12 virtual CPUs if your server is a virtual machine).
Install Ollama + LLM models
To easily manage LLM models, you can install Ollama. It is easier to manage than llama.cpp, but a little slower (around 20% less performance).
[root@llm ~]# curl -fsSL https://ollama.com/install.sh | sh
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100,0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.
If your server doesn’t have a GPU, the Ollama installer will warn you: “Ollama will run in CPU-only mode”.
We can install the codellama model tagged 7b (for 7 billion parameters).
[root@llm ~]# ollama run codellama:7b
pulling manifest
pulling 3a43f93b78ec... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 3.8 GB
pulling 8c17c2ebb0ea... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 7.0 KB
pulling 590d74a5569b... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 4.8 KB
pulling 2e0493f67d0c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 59 B
pulling 7f6a57943a88... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 120 B
pulling 316526ac7323... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 529 B
verifying sha256 digest
writing manifest
success
>>> /bye
[root@llm ~]#
We can also install deepseek-coder with 1.3B parameters; because it has fewer parameters, it will be faster while still giving good results:
[root@llm ~]# ollama run deepseek-coder
Configure the Ollama API
The Ollama API supports the GET and POST methods:
- POST /api/generate
- GET /api/tags
You can test the ollama API locally:
[root@llm ~]# curl http://localhost:11434/api/tags
{"models":[{"name":"codellama:7b","model":"codellama:7b","modified_at":"2025-04-17T16:33:25.752300427+02:00","size":3825910662,"digest":"8fdf8f752f6e80de33e82f381aba784c025982752cd1ae9377add66449d2225f","details":{"parent_model":"","format":"gguf","family":"llama","families":null,"parameter_size":"7B","quantization_level":"Q4_0"}}]}
Open the Ollama listening port (11434/tcp) to your private network:
[root@llm ~]# firewall-cmd --add-port=11434/tcp --permanent
success
[root@llm ~]# firewall-cmd --reload
success
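Note: if you prefer not to open the port for every source, you can instead restrict it to your private subnet with a firewalld rich rule (192.168.1.0/24 below is only an example, adapt it to your own network):
[root@llm ~]# firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="192.168.1.0/24" port port="11434" protocol="tcp" accept'
[root@llm ~]# firewall-cmd --reload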
You can test through the server’s network address, but you’ll get an error:
[root@llm ~]# curl http://llm.yourdomain.com:11434/api/tags
curl: (7) Failed to connect to llm.yourdomain.com port 11434: Connection refused
[root@llm ~]# systemctl status ollama
● ollama.service - Ollama Service
Loaded: loaded (/etc/systemd/system/ollama.service; enabled; preset: disabled)
Active: active (running) since Thu 2025-04-17 16:14:02 CEST; 54min ago
Main PID: 34775 (ollama)
Tasks: 17 (limit: 153159)
Memory: 3.6G
CPU: 1min 285ms
CGroup: /system.slice/ollama.service
└─34775 /usr/local/bin/ollama serve
avril 17 16:33:28 llm ollama[34775]: llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
avril 17 16:33:29 llm ollama[34775]: llama_kv_cache_init: CPU KV buffer size = 4096.00 MiB
avril 17 16:33:29 llm ollama[34775]: llama_init_from_model: KV self size = 4096.00 MiB, K (f16): 2048.00 MiB, V (f16): 2048.00 MiB
avril 17 16:33:29 llm ollama[34775]: llama_init_from_model: CPU output buffer size = 0.55 MiB
avril 17 16:33:29 llm ollama[34775]: llama_init_from_model: CPU compute buffer size = 560.01 MiB
avril 17 16:33:29 llm ollama[34775]: llama_init_from_model: graph nodes = 1030
avril 17 16:33:29 llm ollama[34775]: llama_init_from_model: graph splits = 1
avril 17 16:33:30 llm ollama[34775]: time=2025-04-17T16:33:30.039+02:00 level=INFO source=server.go:619 msg="llama runner started in 4.02 seconds"
avril 17 16:33:30 llm ollama[34775]: [GIN] 2025/04/17 - 16:33:30 | 200 | 4.245900053s | 127.0.0.1 | POST "/api/generate"
avril 17 17:01:00 llm ollama[34775]: [GIN] 2025/04/17 - 17:01:00 | 200 | 937.782µs | 127.0.0.1 | GET "/api/tags"
By default, Ollama only listens on 127.0.0.1, so you need to make it reachable from your private network. The easiest way is to bind it to all addresses (acceptable here because the server sits on a private network and is not exposed publicly). To do this, edit the systemd service and add the environment variable Environment="OLLAMA_HOST=0.0.0.0" under the [Service] section:
[root@llm ~]# systemctl edit --full ollama.service
Save the file with the following content:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
[Install]
WantedBy=default.target
Reload systemd configuration and restart ollama:
[root@llm ~]# systemctl daemon-reload
[root@llm ~]# systemctl restart ollama
[root@llm ~]# curl http://llm.yourdomain.com:11434/api/tags
{"models":[{"name":"codellama:7b","model":"codellama:7b","modified_at":"2025-04-17T16:33:25.752300427+02:00","size":3825910662,"digest":"8fdf8f752f6e80de33e82f381aba784c025982752cd1ae9377add66449d2225f","details":{"parent_model":"","format":"gguf","family":"llama","families":null,"parameter_size":"7B","quantization_level":"Q4_0"}}]}
You can curl the API from your laptop, or access it via a web browser: http://llm.yourdomain.com:11434/api/tags
Access your LLM from VSCode
Install the “Continue” extension.

Edit the settings file config.yaml. If you don’t see it in VSCode, you can also open it directly (on Windows it is located at C:\Users\<your_user>\.continue\config.yaml).
name: Local Assistant
version: 1.0.0
schema: v1
models:
  # Auto detect your local models if you have some
  - name: Autodetect
    provider: ollama
    model: AUTODETECT
  # Add your remote codellama model
  - name: Test remote ollama
    provider: ollama
    apiBase: http://llm.yourdomain.com:11434
    model: codellama:7b
    capabilities:
      - tool_use
      - image_input
    roles:
      - chat
      - edit
      - apply
      - summarize
  # Add your remote Deepseek-coder
  - name: Remote Deepseek-coder
    provider: ollama
    apiBase: http://llm.yourdomain.com:11434
    model: deepseek-coder:latest
context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
Now you can select your remote model:

You’ll see that you need a lot of CPU cores to get answers in less than a minute… but if you have a powerful server with a GPU, I imagine it can be very good.
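To put a number on that, the final JSON object of a non-streaming /api/generate call includes eval_count (generated tokens) and eval_duration (in nanoseconds), so you can estimate how many tokens per second your server produces. A quick sketch from your laptop, assuming jq is installed and using the same example host and model as above:
curl -s http://llm.yourdomain.com:11434/api/generate \
  -d '{"model": "codellama:7b", "prompt": "Write a hello world in Python", "stream": false}' \
  | jq '.eval_count / (.eval_duration / 1e9)'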