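Linux Kernel Hacks: An Introduction to LLM Inference and Serving Engines

In this post I would like to introduce a few LLM (Large Language Model) inference and serving engines, even though much of this is already widely known. 😎

Meeting AI - Part 1

1. Trying out the major LLM inference & serving engines

2. Running LLM inference & serving engines on the Jetson Orin Nano Kit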

References

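Keywords: LLM inference & Serving, Huggingface, LLaMA.cpp, Ollama, vLLM, SGLang, Modular MAX, Open WebUI, PyTorch, TensorFlow(Keras), NVIDIA GPU CUDA, AMD GPU ROCm

"Learn these 5 skills before AI replaces you." "This is the last programming tutorial you will ever need." "Coding is now officially over."

- (omitted) -

"The predictions came with deadlines. The deadlines have passed, yet developers are still here. Roles are certainly changing, and the tools that matter really do exist. But the people who said 'in five years we won't need programmers' were not making a prediction; they were producing content to pitch to investors. The end of coding is simply the most profitable story the tech industry has told over the past three years."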

https://medium.com/@vndpal/ai-was-supposed-to-replace-developers-by-now-what-happened-fc63aa466749

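1. Trying out the major LLM inference & serving engines

So far, we have looked at the hardware specifications and the BSP of the Jetson Orin Nano Super Dev Kit. Now it is time to talk about AI, and in particular about open-source LLM inference & serving engines. 💬

Before running an LLM inference & serving system on the Jetson Orin Nano board, I will first show how to run four LLM inference engines (LLaMA.cpp, vLLM, SGLang, Modular MAX) on an Ubuntu PC (22.04 LTS) equipped with an NVIDIA GPU.

1.1) Preparing for LLM inference & serving engines

An LLM inference and serving engine is a system that loads a trained model into GPU memory, receives external requests, performs inference, and returns the results. Such an engine works in three stages: model loading, request processing (tokenization, batching, inference), and response return (decoding, streaming). The list below briefly summarizes what you should learn beforehand in order to understand LLM inference and serving engines. 😋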

[1] Understanding PyTorch or TensorFlow (Keras)

-> Understanding an ML framework

[2] Understanding NVIDIA CUDA or AMD ROCm GPU programming

-> In particular, cuTile Python, recently added in NVIDIA CUDA 13.1, is worth watching.

[3] How an LLM (Large Language Model) works internally

-> Understanding the Transformer, the architecture on which the various LLMs are built

[Figure 1.1] LLM Transformer Architecture [Source - Reference 13]

[4] Understanding the Hugging Face ecosystem and the transformers library

-> Understanding how to work with the various LLMs in practice

[Figure 1.2] Overview of the Hugging Face transformers pipeline [Source - Reference 12]

[5] Core operating principles and code analysis of the major LLM inference & serving engines

-> LLaMA.cpp, vLLM, SGLang, Modular MAX

-> Rather than running inference through Hugging Face transformers, each of these engines improves performance with its own optimized runtime.

[Figure 1.3] LLM Inference Overview [Source - Reference 13]
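As a rough illustration of the three stages described at the start of this section (model loading, request processing, response return), here is a toy sketch in Python. Everything in it is hypothetical: the five-word vocabulary, the lookup-table "model", and the function names are made up for illustration and say nothing about how a real engine stores weights or schedules batches.

```python
# Toy sketch of the three stages of a serving engine (all names are hypothetical).
VOCAB = ["<eos>", "hello", "world", "paris", "france"]
TOKEN_ID = {token: i for i, token in enumerate(VOCAB)}
EOS = TOKEN_ID["<eos>"]


def load_model():
    """Stage 1 (model loading): a lookup table stands in for real weights."""
    return {TOKEN_ID["hello"]: TOKEN_ID["world"], TOKEN_ID["france"]: TOKEN_ID["paris"]}


def tokenize(text):
    """Stage 2a (tokenization): split on whitespace and map words to ids."""
    return [TOKEN_ID[word] for word in text.lower().split()]


def infer_batch(model, batch, max_new_tokens=4):
    """Stage 2b (batching + inference): advance every request one token per step."""
    outputs = [[] for _ in batch]
    for _ in range(max_new_tokens):
        for prompt_ids, out in zip(batch, outputs):
            if out and out[-1] == EOS:
                continue  # this request has already finished
            last = (prompt_ids + out)[-1]
            out.append(model.get(last, EOS))  # greedy "next token" lookup
    return outputs


def detokenize(ids):
    """Stage 3 (decoding): map ids back to text, dropping the end marker."""
    return " ".join(VOCAB[i] for i in ids if i != EOS)


def serve(model, prompts):
    batch = [tokenize(p) for p in prompts]
    return [detokenize(out) for out in infer_batch(model, batch)]


print(serve(load_model(), ["hello", "france"]))
```

The point of the sketch is only the shape of the loop: requests are tokenized, grouped into a batch, stepped forward together, and decoded back to text.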

Now let us run the four LLM inference & serving engines, LLaMA.cpp, vLLM, SGLang, and Modular MAX, one by one.

1.2) Installing LLaMA.cpp

LLaMA.cpp is a project that started out as a way to run LLMs (Large Language Models) on a local computer and, as the name suggests, it is implemented in pure C/C++.

[Figure 1.4] LLaMA.cpp logo [Source - Reference 1]

Fortunately, the PC I am using has an NVIDIA GPU installed (albeit a somewhat old one), so from here on I will walk through building and running LLaMA.cpp.

[1] Checking that the NVIDIA driver is installed

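$ nvidia-smi

-> Running this command shows the name NVIDIA GeForce GTX 1650. It is an older model, but (as I confirmed) it fortunately supports CUDA.

[Figure 1.5] Output of the nvidia-smi command

📌 Had I known I would be using the GPU for something like this, I would have bought a more recent card. 😓

[2] Installing the CUDA Toolkit

Next, install the CUDA Toolkit from the site below (do this only if the nvcc --version command does not work).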

https://developer.nvidia.com/cuda/toolkit

Note that the toolkit you install must match the CUDA version reported by nvidia-smi (e.g., 13.0) and must not be newer, since that is the highest version the installed driver supports.

[Figure 1.6] NVIDIA CUDA Toolkit download page (1)

📌 "Download Now" -> Archive of Previous CUDA Releases -> select CUDA Toolkit 13.0.0

[Figure 1.7] NVIDIA CUDA Toolkit download page (2)

📌 Select Linux -> x86_64 -> Ubuntu -> 22.04 -> deb (network)

[Figure 1.8] NVIDIA CUDA Toolkit download page (3)

Install the CUDA Toolkit by following the instructions on that web page.

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb

$ sudo dpkg -i cuda-keyring_1.1-1_all.deb

$ sudo apt-get update

$ sudo apt-get -y install cuda-toolkit-13-0

$ vi ~/.bashrc

export PATH=/usr/local/cuda-13.0/bin${PATH:+:${PATH}}

export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:${LD_LIBRARY_PATH}

$ source ~/.bashrc

$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Built on Wed_Aug_20_01:58:59_PM_PDT_2025

Cuda compilation tools, release 13.0, V13.0.88

Build cuda_13.0.r13.0/compiler.36424714_0

[3] Installing PyTorch with CUDA support

$ sudo apt install python3 python3-pip

$ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

Check that the CUDA-enabled PyTorch library was installed correctly.

$ cat check.py

import torch

print(torch.cuda.is_available())

print(torch.cuda.get_device_name(0))

$ python3 ./check.py

True

NVIDIA GeForce GTX 1650

-> OK, it is installed correctly.

[4] Downloading and building the LLaMA.cpp source code

Download the LLaMA.cpp source code from the GitHub repository below.

[Figure 1.9] LLaMA.cpp GitHub page

$ git clone https://github.com/ggerganov/llama.cpp.git

$ cd llama.cpp

Compile with the CUDA option enabled, as shown below.

$ cmake -B build -DGGML_CUDA=ON

$ cmake --build build --config Release -j 16

...

[ 97%] Linking CXX executable ../../bin/llama-mtmd-debug

[ 97%] Built target llama-mtmd-debug

[ 97%] Building CXX object tools/cvector-generator/CMakeFiles/llama-cvector-generator.dir/cvector-generator.cpp.o

[ 98%] Linking CXX executable ../../bin/llama-cvector-generator

[ 98%] Built target llama-cvector-generator

[ 98%] Building CXX object tools/export-lora/CMakeFiles/llama-export-lora.dir/export-lora.cpp.o

[ 99%] Linking CXX executable ../../bin/llama-export-lora

[ 99%] Built target llama-export-lora

[ 99%] Building CXX object tools/fit-params/CMakeFiles/llama-fit-params.dir/fit-params.cpp.o

[ 99%] Linking CXX executable ../../bin/llama-fit-params

[ 99%] Built target llama-fit-params

[100%] Building CXX object tools/results/CMakeFiles/llama-results.dir/results.cpp.o

[100%] Linking CXX executable ../../bin/llama-results

[100%] Built target llama-results

Check that the build completed successfully with the command below.

$ ./build/bin/llama-cli

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 3714 MiB):

Device 0: NVIDIA GeForce GTX 1650, compute capability 7.5, VMM: yes, VRAM: 3714 MiB

The following devices will have suboptimal performance due to a lack of tensor cores:

Device 0: NVIDIA GeForce GTX 1650

Consider compiling with CMAKE_CUDA_ARCHITECTURES=61-virtual;80-virtual and DGGML_CUDA_FORCE_MMQ to force the use of the Pascal code for Turing.

error: --model is required

$ pip3 install -r requirements.txt

-> Install the Python packages that LLaMA.cpp requires.

[5] Creating a Hugging Face account and access token

-> Create an account at https://huggingface.co/.

-> After logging in, create an access token (e.g., with Read permission) under Settings -> Access Tokens -> Create new token.

-> This step must be completed beforehand in order to download and use LLM models from Hugging Face (gated models in particular cannot be downloaded without it).

[Figure 1.10] Hugging Face web page - click the Sign Up button at the top right

[6] Logging in to Hugging Face

$ pip uninstall huggingface_hub typer -y

-> (If the commands below fail) cleanly remove any previously installed copies first.

$ pip3 install "huggingface_hub[cli]"

$ hf --help

-> From now on, the hf command is available.

$ hf auth login

-> Now try logging in to Hugging Face.

_| _| _| _| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _|_|_|_| _|_| _|_|_| _|_|_|_|

_| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|

_|_|_|_| _| _| _| _|_| _| _|_| _| _| _| _| _| _|_| _|_|_| _|_|_|_| _| _|_|_|

_| _| _| _| _| _| _| _| _| _| _|_| _| _| _| _| _| _| _|

_| _| _|_| _|_|_| _|_|_| _|_|_| _| _| _|_|_| _| _| _| _|_|_| _|_|_|_|

To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .

Enter your token (input will not be visible): (enter your own Hugging Face token)

Add token as git credential? [y/N]: n

Token is valid (permission: write).

The token `slowbootllm` has been saved to /home/chyi/.cache/huggingface/stored_tokens

Your token has been saved to /home/chyi/.cache/huggingface/token

The current active token is: `slowbootllm`

📌 This step only needs to be repeated when the access token changes. It is therefore skipped when testing the other LLM inference & serving engines later on.

[7] Downloading a model from Hugging Face

Search the Hugging Face site for the model you want. For example, the Llama-3.2-1B-Instruct-GGUF model looks like this:

[Figure 1.11] The Llama-3.2-1B-Instruct-GGUF model page on Hugging Face

📌 [Important] Depending on the model, you may have to request access on the site before you can use it.

Use the hf download command to download the model you want.

$ mkdir models

$ hf download bartowski/Llama-3.2-1B-Instruct-GGUF --include "Llama-3.2-1B-Instruct-Q4_K_M.gguf" --local-dir ./models/llama-3.2-quants

-> This downloads the Llama-3.2-1B model file.

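📌 GGUF (commonly expanded as GPT-Generated Unified Format) is a binary file format designed for running large language models (LLMs) efficiently on local CPUs and low-end GPUs (consumer hardware). It usually stores quantized model weights, so files are small and load quickly, and it is mainly used by llama.cpp, Ollama, LM Studio, and similar tools.

📌 LLM quantization is a model-compression technique that converts a large language model's parameters (weights) and computations from FP32/FP16 (32/16-bit floating point) to lower precision such as INT8/INT4 (8/4-bit integers), reducing model size and speeding up inference. It greatly reduces VRAM usage while keeping the loss in quality small, which makes it possible to run large models on ordinary GPUs and edge devices.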
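To make the idea of quantization concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is a simplification chosen for illustration: the function names and sample weights are made up, and the Q4_K_M scheme used by the GGUF file above is block-wise and considerably more elaborate.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [1.27, -0.64, 0.0, 0.32]           # made-up sample weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

print(quantized)                              # small integers, 1 byte each
print(max(abs(a - b) for a, b in zip(weights, restored)))  # rounding error
```

Each weight now needs one byte instead of four (FP32) or two (FP16), at the cost of a rounding error of at most half a quantization step.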

[8] Running the CLI

$ ./llama.cpp/build/bin/llama-cli -m ./models/llama-3.2-quants/Llama-3.2-1B-Instruct-Q4_K_M.gguf -p "The capital of France is"

-> Use llama-cli to ask the LLM for the capital of France.

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 3714 MiB):

Device 0: NVIDIA GeForce GTX 1650, compute capability 7.5, VMM: yes, VRAM: 3714 MiB

The following devices will have suboptimal performance due to a lack of tensor cores:

Device 0: NVIDIA GeForce GTX 1650

Consider compiling with CMAKE_CUDA_ARCHITECTURES=61-virtual;80-virtual and DGGML_CUDA_FORCE_MMQ to force the use of the Pascal code for Turing.

Loading model...

▄▄ ▄▄

██ ██

██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄

██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██

██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀

██ ██

▀▀ ▀▀

build : b9014-d4b0c22f9

model : Llama-3.2-1B-Instruct-Q4_K_M.gguf

modalities : text

available commands:

/exit or Ctrl+C stop or exit

/regen regenerate the last response

/clear clear the chat history

/read <file> add a text file

/glob <pattern> add text files using globbing pattern

> The capital of France is

The capital of France is Paris.

[ Prompt: 2.5 t/s | Generation: 129.6 t/s ]

> 대한민국의 수도는 ?

대한민국의 수도는 Seoul입니다.

[ Prompt: 389.9 t/s | Generation: 120.6 t/s ]

> /exit

Exiting...

common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |

common_memory_breakdown_print: | - CUDA0 (GTX 1650) | 3714 = 1005 + (2061 = 762 + 992 + 306) + 647 |

common_memory_breakdown_print: | - Host | 275 = 205 + 0 + 70 |

----------------------------------------------------------------------------------------------

[9] Starting the server and connecting with a web browser

$ ./llama.cpp/build/bin/llama-server -m ./models/llama-3.2-quants/Llama-3.2-1B-Instruct-Q4_K_M.gguf -ngl 35

-> This time, start llama-server.

...

Cutting Knowledge Date: December 2023

Today Date: 05 May 2026

srv init: init: chat template, thinking = 0

main: model loaded

main: server is listening on http://127.0.0.1:8080

main: starting the main loop...

srv update_slots: all slots are idle

With the server running, open a web browser and go to http://localhost:8080. OK, a prompt window similar to today's popular generative AI services appears. 😎

[Figure 1.12] Connected to llama-server (1)

Ask a question in the prompt window, for example:

please let me know llama.cpp project.

[Figure 1.13] Connected to llama-server (2)

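1.3) Installing vLLM

vLLM is a fast and easy-to-use LLM inference and serving framework, with performance and stability good enough for production use. It is therefore regarded as one of the most popular projects in this field. Internally, vLLM uses a PyTorch-based backend: it loads a model trained with PyTorch and serves it through its own optimized engine, built around PagedAttention, which greatly improves speed.

[Figure 1.14] vLLM logo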

https://github.com/vllm-project/vllm

For reference, the blog post below describes vLLM's internal architecture in detail. 👍

https://www.aleksagordic.com/blog/vllm
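As a rough illustration of the idea behind PagedAttention mentioned above, the sketch below manages a KV cache as fixed-size blocks handed out on demand, with a per-sequence block table. This is a toy model under stated assumptions, not vLLM's implementation: the class name, method names, and block sizes are hypothetical, and no attention is actually computed.

```python
class BlockAllocator:
    """Toy paged KV cache: sequences borrow fixed-size blocks as they grow."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> [physical block ids]
        self.lengths = {}                           # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:           # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")   # 3 tokens -> 2 blocks
for _ in range(2):
    alloc.append_token("b")   # 2 tokens -> 1 block
print(alloc.block_tables, alloc.free_blocks)
```

Because memory is reserved block by block rather than for the maximum sequence length up front, short requests do not tie up space they never use, which is what lets more requests share the same GPU memory.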

Meanwhile, the table below briefly compares LLaMA.cpp, introduced in the previous section, with vLLM.

[Figure 1.15] Technical comparison of vLLM and LLaMA.cpp

First, let us install quickly from the prebuilt package, following the site below.

https://docs.vllm.ai/en/stable/getting_started/quickstart/#offline-batched-inference

[1] uv install

$ curl -LsSf https://astral.sh/uv/install.sh | sh

-> To install the vllm package, first install the uv tool.

$ uv --version

uv 0.11.9 (x86_64-unknown-linux-gnu)

📌 uv is a tool written in Rust. pip or conda would also work, but uv is faster and cleaner. A similar tool called pixi also exists (likewise written in Rust).

[2] Installing vllm

$ uv venv --python 3.10 --seed

-> Hmm, a warning message is printed.

Using CPython 3.10.12 interpreter at: /usr/bin/python3.10

Creating virtual environment with seed packages at: .venv

warning: Failed to hardlink files; falling back to full copy. This may lead to degraded performance.

If the cache and target directories are on different filesystems, hardlinking may not be supported.

If this is intentional, set `export UV_LINK_MODE=copy` or use `--link-mode=copy` to suppress this warning.

+ packaging==26.2

+ pip==26.1.1

+ setuptools==82.0.1

+ wheel==0.47.0

Activate with: source .venv/bin/activate

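$ export UV_CACHE_DIR=/mnt/hdd/workspace/LLMs/.cache

-> Move the uv cache from ~/.cache into the current test directory, so that the cache and the virtual environment are on the same filesystem.

$ uv venv --python 3.12 --seed

-> Run the command again (this time using Python 3.12.x). Thanks to the UV_CACHE_DIR change above, the warning message is gone this time.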

Using CPython 3.12.13 interpreter at: /usr/bin/python3.12

Creating virtual environment with seed packages at: .venv

✔ A virtual environment already exists at `.venv`. Do you want to replace it? · yes

+ pip==26.1.1

Activate with: source .venv/bin/activate

[3] Entering the virtual environment

$ source .venv/bin/activate

-> Enter the Python virtual environment.

(LLMs) chyi@earth:/mnt/hdd/workspace/LLMs$ uv pip install vllm --torch-backend=auto

-> Install the vllm package. A large number of packages are installed and it takes quite a while. Still, uv does seem to be faster than pip. 😍

Resolved 180 packages in 6.28s

Prepared 51 packages in 54.11s

Installed 180 packages in 1.33s

+ aiohappyeyeballs==2.6.1

+ aiohttp==3.13.5

+ aiosignal==1.4.0

+ annotated-doc==0.0.4

+ annotated-types==0.7.0

+ anthropic==0.99.0

+ anyio==4.13.0

...

(LLMs) chyi@earth:/mnt/hdd/workspace/LLMs$ uv run --with vllm vllm --help

-> Check that vllm was installed correctly.

usage: vllm [-h] [-v] {chat,complete,serve,launch,bench,collect-env,run-batch} ...

vLLM CLI

positional arguments:

{chat,complete,serve,launch,bench,collect-env,run-batch}

chat Generate chat completions via the running API server.

complete Generate text completions based on the given prompt via the running API server.

serve Launch a local OpenAI-compatible API server to serve LLM completions via HTTP.

launch Launch individual vLLM components.

bench vLLM bench subcommand.

collect-env Start collecting environment information.

run-batch Run batch prompts and write results to file.

options:

-h, --help show this help message and exit

-v, --version show program's version number and exit

For full list: vllm [subcommand] --help=all

For a section: vllm [subcommand] --help=ModelConfig (case-insensitive)

For a flag: vllm [subcommand] --help=max-model-len (_ or - accepted)

Documentation: https://docs.vllm.ai

[4] Logging in to Hugging Face

This was already done in the previous section, so there is no need to repeat it here.

[5] Testing offline batched inference

Now that vllm is installed, it is time to test it. First, without starting a server, create an LLM object in Python code and pass it some prompts to check that it works.

$ vi basic.py

-> Copy the file from the link below.

-> https://github.com/vllm-project/vllm/blob/main/examples/basic/offline_inference/basic.py

<Summary of what basic.py does>

1) Create the LLM

llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)

2) Generate text for the prompts

outputs = llm.generate(prompts, sampling_params)

3) Print the results in a for loop

[Figure 1.16] basic.py

📌 The code itself is not difficult, so I will not explain it in detail (see the site below for more).

https://docs.vllm.ai/en/stable/getting_started/quickstart/#installation

$ python3 ./vllm_tests/basic.py

-> Running basic.py produces an output for each prompt.

INFO 05-06 12:08:03 [utils.py:233] non-default args: {'gpu_memory_utilization': 0.5, 'disable_log_stats': True, 'model': 'facebook/opt-125m'}

INFO 05-06 12:08:05 [model.py:555] Resolved architecture: OPTForCausalLM

INFO 05-06 12:08:05 [model.py:1680] Using max model len 2048

INFO 05-06 12:08:05 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=8192.

INFO 05-06 12:08:05 [vllm.py:840] Asynchronous scheduling is enabled.

INFO 05-06 12:08:05 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])

(EngineCore pid=24044) INFO 05-06 12:08:07 [core.py:109] Initializing a V1 LLM engine (v0.20.1) with config: model='facebook/opt-125m',

...

Rendering prompts: 100%|██████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 28.58it/s]

Processed prompts: 100%|████████████████| 4/4 [00:00<00:00, 4.42it/s, est. speed input: 28.73 toks/s, output: 70.73 toks/s]

Generated Outputs:

------------------------------------------------------------

Prompt: 'Hello, my name is'

Output: " Joel, I'm a high school teacher in Colorado. I'm here for the"

------------------------------------------------------------

Prompt: 'The president of the United States is'

Output: ' congratulating a foreign leader on his election and gives a thoughtful speech on the history'

------------------------------------------------------------

Prompt: 'The capital of France is'

Output: ' bordered on the west by Greece, Cyprus, Greece, Malta, the Republic'

------------------------------------------------------------

Prompt: 'The future of AI is'

Output: ' uncertain. The technology will be as pervasive as the future of human computing.\n'

------------------------------------------------------------

(EngineCore pid=24044) INFO 05-06 12:08:57 [core.py:1238] Shutdown initiated (timeout=0)

(EngineCore pid=24044) INFO 05-06 12:08:57 [core.py:1261] Shutdown complete

This time, let us run vllm as a server.

[6] Running as an OpenAI-compatible server

(vllm_tests) chyi@earth:/mnt/hdd/workspace/LLMs$ vllm serve facebook/opt-125m --port 8000 --gpu-memory-utilization 0.5

-> To allow for the limited GPU memory, a small model (facebook/opt-125m) is used.

-> With a somewhat larger model, the server crashes because it runs out of GPU memory.

(APIServer pid=34321) INFO 05-06 13:15:07 [utils.py:299]

(APIServer pid=34321) INFO 05-06 13:15:07 [utils.py:299] █ █ █▄ ▄█

(APIServer pid=34321) INFO 05-06 13:15:07 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.20.1

(APIServer pid=34321) INFO 05-06 13:15:07 [utils.py:299] █▄█▀ █ █ █ █ model facebook/opt-125m

(APIServer pid=34321) INFO 05-06 13:15:07 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀

(APIServer pid=34321) INFO 05-06 13:15:07 [utils.py:299]

(APIServer pid=34321) INFO 05-06 13:15:07 [utils.py:233] non-default args: {'model_tag': 'facebook/opt-125m', 'model': 'facebook/opt-125m', 'gpu_memory_utilization': 0.5}

...

(APIServer pid=34321) INFO 05-06 13:15:43 [api_server.py:602] Starting vLLM server on http://0.0.0.0:8000

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:37] Available routes are:

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /docs, Methods: GET, HEAD

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /redoc, Methods: GET, HEAD

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /tokenize, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /detokenize, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /load, Methods: GET

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /version, Methods: GET

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /health, Methods: GET

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /metrics, Methods: GET

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /v1/models, Methods: GET

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /ping, Methods: GET

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /ping, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /invocations, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /v1/chat/completions, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /v1/responses, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /v1/completions, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /v1/messages, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /inference/v1/generate, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /generative_scoring, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST

(APIServer pid=34321) INFO 05-06 13:15:43 [launcher.py:46] Route: /v1/completions/render, Methods: POST

(APIServer pid=34321) INFO: Started server process [34321]

(APIServer pid=34321) INFO: Waiting for application startup.

(APIServer pid=34321) INFO: Application startup complete.

With the server running, open another terminal and send a curl-based HTTP request (a chat-style inference request) as follows.

$ curl http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
       "model": "facebook/opt-125m",
       "messages": [
           {"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": "Who won the world series in 2020?"}
       ]
   }'
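The same request can be sketched in Python using only the standard library. This assumes the vLLM server started above is listening on localhost:8000; the helper names build_chat_request and send are made up for this example, and the response is assumed to follow the usual OpenAI chat-completions layout.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and parse the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_chat_request(
    "http://localhost:8000",
    "facebook/opt-125m",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)

# Uncomment once the server is up:
# reply = send(request)
# print(reply["choices"][0]["message"]["content"])
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before anything is sent.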

Hmm, on receiving the curl request above, the server throws an error.

$ vllm serve facebook/opt-125m --port 8000 --gpu-memory-utilization 0.5

...

(APIServer pid=38771) INFO: 127.0.0.1:59062 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

(APIServer pid=38771) ERROR: Exception in ASGI application

(APIServer pid=38771) Traceback (most recent call last):

(APIServer pid=38771) File "/mnt/hdd/workspace/LLMs/vllm_tests/.venv/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 421, in run_asgi

(APIServer pid=38771) result = await app( # type: ignore[func-returns-value]

(APIServer pid=38771) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

...

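To solve this problem, (after a Google search) prepare a template.jinja file as shown below. The likely cause is that facebook/opt-125m is a base model that ships without a chat template, which the /v1/chat/completions endpoint needs.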

$ cat template.jinja

{% for message in messages %}\n{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}

📌 In vLLM, the template.jinja file (chat template) defines how the roles (system, user, assistant) and messages of a conversation are formatted into a single string.
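To see what such a chat template produces, the sketch below imitates a ChatML-style template in plain Python. It is an approximation for illustration only: render_chatml is a made-up helper, it does not use Jinja, and vLLM itself renders the real template file.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Flatten role/content messages into one ChatML-style prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ]
)
print(prompt)
```

The model never sees the list of messages, only this single string, which is why a model without a chat template cannot serve chat requests until one is supplied.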

Add one more option and run the vllm server again.

(vllm_tests) chyi@earth:/mnt/hdd/workspace/LLMs$ vllm serve facebook/opt-125m --port 8000 --gpu-memory-utilization 0.5 --chat-template ./vllm_tests/template.jinja

...

(APIServer pid=39887) INFO: 127.0.0.1:55040 - "POST /v1/chat/completions HTTP/1.1" 200 OK

(APIServer pid=39887) INFO 05-06 14:01:01 [loggers.py:271] Engine 000: Avg prompt throughput: 2.6 tokens/s, Avg generation throughput: 90.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 19.0%, Prefix cache hit rate: 0.0%

Now, trying the curl command again, the response is printed correctly this time. 😎

[Figure 1.17] Output of the curl command

For reference, the web server built into vllm is implemented with FastAPI (a Python-based project) and provides the following RESTful API.

[Figure 1.18] vllm's HTTP server built with FastAPI

Like LLaMA.cpp, vLLM can also be used with a web UI that accepts prompts (Open WebUI); we will look at that in Section 2.

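1.4) Installing SGLang

SGLang emerged to address the inefficiencies that arise in complex prompt pipelines and multi-turn conversations. Through a technique called RadixAttention, it automatically reuses the KV cache of prompt prefixes shared across requests. It is reported to perform particularly well for agent-based workflows and when structured output such as JSON is required.

[Figure 1.19] SGLang logo

[Source - https://www.sglang.io/ ]

For reference, the table below compares the characteristics of vLLM, introduced earlier, and SGLang.

[Figure 1.20] Technical comparison of vLLM and SGLang

[Source - https://wikidocs.net/337222]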
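The prefix reuse described above can be sketched as follows. This is a deliberately simplified stand-in: RadixAttention organizes cached prefixes in a radix tree, whereas this toy scans a plain list, and the class and function names are hypothetical.

```python
def shared_prefix_len(a, b):
    """Count how many leading tokens two sequences have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


class PrefixCache:
    """Toy prefix cache: remembers earlier token sequences and reuses their prefixes."""

    def __init__(self):
        self.cached = []  # token sequences whose KV entries we pretend to keep

    def process(self, tokens):
        """Return (tokens served from cache, tokens that must be computed)."""
        reused = max((shared_prefix_len(tokens, c) for c in self.cached), default=0)
        self.cached.append(list(tokens))
        return reused, len(tokens) - reused


system_prompt = ["You", "are", "a", "helpful", "assistant", "."]
cache = PrefixCache()
first = cache.process(system_prompt + ["Hi"])
second = cache.process(system_prompt + ["Bye", "now"])
print(first, second)
```

The second request only pays for its own two new tokens, because the six system-prompt tokens were already processed for the first request. That is the saving multi-turn chats and agent loops benefit from, since they resend the same prefix over and over.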

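Since the basic environment (CUDA, etc.) is already in place, let us proceed quickly with the installation.

[1] Installing uv

$ pip install --upgrade pip

$ pip install uv

-> uv is already installed, so this step is not actually needed.

[2] Installing the sglang package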

$ export UV_CACHE_DIR=/mnt/hdd/workspace/LLMs/sglang/.cache

$ uv pip install sglang

Using Python 3.12.13 environment at: /mnt/hdd/workspace/LLMs/.venv

Checked 1 package in 2.04s

[3] Entering the virtual environment

chyi@earth:/mnt/hdd/workspace/LLMs$ source .venv/bin/activate

-> Enter the virtual environment.

[4] Logging in to Hugging Face

This was already done in the previous section, so there is no need to repeat it here.

[5] Running a model (first attempt)

(LLMs) chyi@earth:/mnt/hdd/workspace/LLMs$ python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --port 30000 --mem-fraction-static 0.5

As expected, this fails for lack of GPU memory. So let us try again with the model that was used for vLLM, facebook/opt-125m.

[6] Running a model (second attempt)

(LLMs) chyi@earth:/mnt/hdd/workspace/LLMs$ python3 -m sglang.launch_server --model-path facebook/opt-125m --host 0.0.0.0 --port 30000 --mem-fraction-static 0.5

...

50, 'top_p': 1.0}

[2026-05-12 18:08:56] INFO: Application startup complete.

[2026-05-12 18:08:56] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)

[2026-05-12 18:08:57] INFO: 127.0.0.1:46584 - "GET /model_info HTTP/1.1" 200 OK

[2026-05-12 18:09:07] Prefill batch, #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.00, cuda graph: False

[2026-05-12 18:09:07] INFO: 127.0.0.1:46596 - "POST /generate HTTP/1.1" 200 OK

[2026-05-12 18:09:07] The server is fired up and ready to roll!

[2026-05-12 18:09:13] INFO: 127.0.0.1:38278 - "GET / HTTP/1.1" 200 OK

[2026-05-12 18:09:14] INFO: 127.0.0.1:38278 - "GET /favicon.ico HTTP/1.1" 404 Not Found

[2026-05-12 18:09:19] INFO: 127.0.0.1:38290 - "GET /docs HTTP/1.1" 200 OK

OK, this time it starts up correctly. Open a web browser and connect to port 30000.

http://localhost:30000

[Figure 1.21] sglang's HTTP server built with FastAPI

Since it would repeat what was shown earlier, the curl client part is omitted. 😋

1.5) Installing Modular MAX

In an earlier post I introduced the Mojo programming language created by Modular. This time the topic is Modular MAX, an LLM inference & serving engine built on Mojo.

https://docs.modular.com/max/get-started/

[Figure 1.22] Comparison of vLLM and Modular MAX

https://www.modular.com/open-source/max

[Figure 1.23] Modular architecture

https://docs.modular.com/max/intro/

Seeing it a hundred times is no match for typing it once! Let us start the installation right away. 😎

[1] Installing pixi

$ curl -fsSL https://pixi.sh/install.sh | sh

📌 uv could also be used for the installation, but let us try pixi.

$ source ~/.bashrc

[2] Installing Modular MAX

$ pixi init modular_max -c https://conda.modular.com/max-nightly/ -c conda-forge && cd modular_max

-> Create a project under a directory named modular_max.

✔ Created /mnt/hdd/workspace/LLMs/modular_max/pixi.toml

$ ls -la

-rw-rw-r-- 1 chyi chyi 128 5월 7 13:23 .gitattributes

-rw-rw-r-- 1 chyi chyi 47 5월 7 13:23 .gitignore

-rw-rw-r-- 1 chyi chyi 174 5월 7 13:23 pixi.toml

$ pixi add modular

-> Install modular.

WARN Skipped running the post-link scripts because `run-post-link-scripts` = `false`

- bin/.librsvg-pre-unlink.sh

To enable them, run:

pixi config set --local run-post-link-scripts insecure

More info:

https://pixi.sh/latest/reference/pixi_configuration/#run-post-link-scripts

✔ Added modular >=26.4.0.dev2026050406,<27

$ pixi config set --local run-post-link-scripts insecure

-> Run this command to resolve the warning shown above.

✅ Updated config at /mnt/hdd/workspace/LLMs/modular_max/.pixi/config.toml

[3] Entering the virtual environment

$ pixi shell

-> Run pixi shell to enter the virtual environment.

(modular_max) chyi@earth:/mnt/hdd/workspace/LLMs/modular_max$

[4] Setting the Hugging Face token

$ export HF_TOKEN="hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXX"

-> Export your own Hugging Face access token.

If the model you want to use is gemma-4-31B-it, request access to it on the site below if necessary (some models do not require a request).

https://huggingface.co/google/gemma-4-31B-it

Everything is ready, so let us run a model with the max serve command.

[5] Running a Hugging Face model with the max serve command

$ max serve --model google/gemma-4-31B-it

-> (As expected) an out-of-memory error appears.

17:11:46.407 INFO: Metrics initialized.

generation_config.json: 100%|███████████████████████████████████████████████████████████████| 208/208 [00:00<00:00, 539kB/s]

config.json: 4.62kB [00:00, 7.98MB/s]

17:11:49.401 WARNING: Architecture 'Gemma4ForConditionalGeneration' requires PipelineRuntimeConfig.max_num_steps=1, overriding current value 10

Traceback (most recent call last):

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/bin/max", line 10, in <module>

sys.exit(main())

~~~~^^

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/click/core.py", line 1514, in __call__

return self.main(*args, **kwargs)

~~~~~~~~~^^^^^^^^^^^^^^^^^

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/click/core.py", line 1435, in main

rv = self.invoke(ctx)

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/click/core.py", line 1902, in invoke

return _process_result(sub_ctx.command.invoke(sub_ctx))

~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/pipelines.py", line 102, in invoke

return super().invoke(ctx)

~~~~~~~~~~~~~~^^^^^

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/click/core.py", line 1298, in invoke

return ctx.invoke(self.callback, **ctx.params)

~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/click/core.py", line 853, in invoke

return callback(*args, **kwargs)

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/cli/config.py", line 368, in wrapped

return func(*args, **kwargs)

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/cli/config.py", line 368, in wrapped

return func(*args, **kwargs)

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/cli/config.py", line 368, in wrapped

return func(*args, **kwargs)

[Previous line repeated 7 more times]

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/cli/config.py", line 495, in wrapper

return func(*args, **kwargs)

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/pipelines.py", line 196, in wrapper

return func(*args, **kwargs)

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/pipelines.py", line 260, in cli_serve

pipeline_config = PipelineConfig(**config_kwargs)

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/pydantic/main.py", line 263, in __init__

validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/pipelines/lib/config/config.py", line 758, in __postprocess_configs

self.resolve()

~~~~~~~~~~~~^^

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/pipelines/lib/config/config.py", line 927, in resolve

self._validate_and_resolve_remaining_pipeline_config(

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^

model_config=self.model

^^^^^^^^^^^^^^^^^^^^^^^

)

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/pipelines/lib/config/config.py", line 1392, in _validate_and_resolve_remaining_pipeline_config

MemoryEstimator.estimate_memory_footprint(

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^

self,

^^^^^

...<4 lines>...

activation_size,

^^^^^^^^^^^^^^^^

)

File "/mnt/hdd/workspace/LLMs/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/pipelines/lib/memory_estimation.py", line 221, in estimate_memory_footprint

raise RuntimeError(error_msg)

RuntimeError: Model size exceeds available memory (73.25 GiB > 2.66 GiB). Model weights: 58.25 GiB, Activation memory: 15.00 GiB. Try running a smaller model, using a smaller precision, or using a device with more memory.

In that case, what happens if we try a different model?

$ max serve --model meta-llama/Llama-3.1-8B-Instruct

...

$ max serve --model LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct

...

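Unfortunately, the result is the same in every case: MAX reports that there is far too little GPU memory. Perhaps because the project is still young, it does not yet seem to support a wide range of models. I am not expecting much, but the next step is to try it on the Jetson Orin Nano Kit. 😓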
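A back-of-the-envelope calculation shows why all of these attempts hit the same wall. The sketch below assumes 2 bytes per parameter (16-bit weights) and the nominal parameter counts implied by the model names (8B and 2.4B). Both are assumptions for illustration, not values read from MAX, and a real footprint also includes activations and KV cache. The 2.66 GiB figure is the available memory reported in the error message.

```python
# Rough weight-memory estimate: parameter count x bytes per parameter.
# Assumptions (not taken from MAX itself): nominal parameter counts from the
# model names, and 2 bytes per parameter (bfloat16/float16 weights).

GIB = 2**30
AVAILABLE_GIB = 2.66  # available memory reported in the error message


def weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the approximate size of the model weights in GiB."""
    return num_params * bytes_per_param / GIB


models = {
    "meta-llama/Llama-3.1-8B-Instruct": 8e9,
    "LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct": 2.4e9,
}

for name, params in models.items():
    need = weight_gib(params)
    verdict = "fits" if need <= AVAILABLE_GIB else "does not fit"
    print(f"{name}: ~{need:.2f} GiB of weights, {verdict} in {AVAILABLE_GIB} GiB")

# The total in the error message is simply weights + activations.
reported_total = 58.25 + 15.00
```

Under these assumptions even the 2.4B model needs more than the 2.66 GiB that was available, which matches what max serve reported.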

2. Running LLM Inference & Serving Engines on the Jetson Orin Nano Kit

From here on, let's run three LLM inference & serving engines (Ollama, vLLM, Modular MAX) on the Jetson Orin Nano Dev Kit. I pulled the sleeping Jetson Kit back out. 😂 💤

[Figure 2.1] Jetson Orin Nano Super Board Reloaded!

📌 The Jetson Orin Nano Super Dev Kit carries an NVIDIA Ampere GPU (1024 CUDA cores, 32 tensor cores) and 8GB of LPDDR5 RAM, which the CPU and GPU share. See the link below for details.

https://slowbootkernelhacks.blogspot.com/2025/06/nvidia-jetson-orin-nano-developer-kit.html

2.1) Installing Ollama

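Ollama is an open-source platform that makes it easy to manage open-source LLMs such as Llama 3, Mistral, and Gemma on a local PC, from installation through execution. Much like Docker, a single command downloads and runs a model, and because everything stays on the local machine it suits privacy-sensitive and offline use. Ollama is in fact built on the LLaMA.cpp project introduced in Section 1.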

[Figure 2.2] Ollama Project

[Source - https://docs.ollama.com/]

📌 It feels well made and well suited to running on a local machine.

[1] Installing Ollama

chyi@jetsonkit:~/workspace$ curl -fsSL https://ollama.com/install.sh | sh

📌 This installs ollama.
>>> Installing ollama to /usr/local
[sudo] password for chyi:
>>> Downloading Linux arm64 bundle
######################################################################## 100.0%
>>> Downloading JetPack 6 components
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA JetPack ready.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.

[Figure 2.3] /etc/systemd/system/ollama.service file

With Ollama installed, ps shows the following. Note that the ollama serve command opens TCP port 11434 and waits for incoming HTTP requests.

chyi@jetsonkit:~$ ps -ef|grep ollama
ollama 4267 1 1 14:09 ? 00:01:40 /usr/local/bin/ollama serve
chyi 8184 8170 0 15:34 pts/3 00:00:00 grep --color=auto ollama
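Since ollama serve is an HTTP server on port 11434, a quick reachability check needs nothing beyond the Python standard library. The following is a minimal sketch; the helper name ollama_is_up is made up for this example, and it only tests that a GET on the root URL succeeds.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET on the given base URL succeeds with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status, and so on.
        return False


print("Ollama reachable:", ollama_is_up())
```

If the service is stopped (for example with sudo systemctl stop ollama), the function returns False instead of raising.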

The ollama command has a compact usage, as shown below.

chyi@jetsonkit:~/workspace$ ollama --help

Large language model runner

Usage:

ollama [flags]

ollama [command]

Available Commands:

serve Start Ollama

create Create a model

show Show information for a model

run Run a model

stop Stop a running model

pull Pull a model from a registry

push Push a model to a registry

signin Sign in to ollama.com

signout Sign out from ollama.com

list List models

ps List running models

cp Copy a model

rm Remove a model

launch Launch the Ollama menu or an integration

help Help about any command

Flags:

-h, --help help for ollama

--nowordwrap Don't wrap words to the next line automatically

--verbose Show timings for response

-v, --version Show version information

Use "ollama [command] --help" for more information about a command.

chyi@jetsonkit:~/workspace$ ollama --version

ollama version is 0.23.1

[2] Running the gemma3:4b model on Ollama

Now let's pick a model from the link below and run it.

https://ollama.com/search

chyi@jetsonkit:~/workspace$ ollama pull gemma3:4b

📌 Most models are several GB in size, so the download inevitably takes a while.

pulling manifest

pulling aeda25e63ebd: 100% ▕████████████████████████████████████████████▏ 3.3 GB

pulling e0a42594d802: 100% ▕████████████████████████████████████████████▏ 358 B

pulling dd084c7d92a3: 100% ▕████████████████████████████████████████████▏ 8.4 KB

pulling 3116c5225075: 100% ▕████████████████████████████████████████████▏ 77 B

pulling b6ae5839783f: 100% ▕████████████████████████████████████████████▏ 489 B

verifying sha256 digest

writing manifest

success

The download is done, so it is time to run the model (ollama run). If it starts up properly, you get a prompt where you can type input.

chyi@jetsonkit:~/workspace$ ollama run gemma3:4b

📌 In fact, the run command alone is enough, because it includes the pull step automatically.

>>> hi

Hi there! How's your day going so far? 😊

Is there anything I can help you with today? Do you want to:

* Chat about something?

* Get an answer to a question?

* Play a game?

>>> what are you ?

I’m a large language model, also known as a conversational AI. Basically, I was created by Google!

Here's a breakdown of what that means:

* **Large Language Model:** I’ve been trained on a *massive* amount of text data – think books,

articles, websites, code, and more. This training allows me to understand and generate human-like

text.

* **Conversational AI:** My main purpose is to have conversations with you. I can respond to your

prompts, answer your questions, and even try to be creative!

**I don't *think* or *feel* in the same way humans do.** I operate based on patterns I’ve learned

from the data I was trained on.

**Think of me like a really, really advanced autocomplete.** I predict what words should come next

based on what you've already said.

Do you want to know more about *how* I was created, or would you like to just keep chatting?

>>> Send a message (/? for help)

>>> /bye

Running ps while ollama run is active shows the ollama run process alongside the server, as follows.

chyi@jetsonkit:~$ ps -ef|grep ollama
ollama 4267 1 3 14:09 ? 00:03:57 /usr/local/bin/ollama serve
chyi 8269 3563 0 15:55 pts/1 00:00:01 ollama run gemma3:4b

[3] Running the llama3 model on Ollama

This time, let's use the run command to pull and run the llama3 model in one step.

chyi@jetsonkit:~/workspace$ ollama run llama3

pulling manifest

pulling 6a0746a1ec1a: 100% ▕████████████████████████████████████████████▏ 4.7 GB

pulling 4fa551d4f938: 100% ▕████████████████████████████████████████████▏ 12 KB

pulling 8ab4849b038c: 100% ▕████████████████████████████████████████████▏ 254 B

pulling 577073ffcc6c: 100% ▕████████████████████████████████████████████▏ 110 B

pulling 3f8eb4da87fa: 100% ▕████████████████████████████████████████████▏ 485 B

verifying sha256 digest

writing manifest

success

>>> Send a message (/? for help)

In this state, open another terminal and send the following HTTP request (a chat completion) to port 11434.

chyi@jetsonkit:~$ curl http://localhost:11434/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
       "model": "llama3",
       "messages": [
           {"role": "system", "content": "You are a helpful assistant."},
           {"role": "user", "content": "Who won the world series in 2020?"}
       ]
   }'

After a moment, the following response is printed.

{"id":"chatcmpl-113","object":"chat.completion","created":1778137491,"model":"llama3","system_fingerprint":"fp_ol
lama","choices":[{"index":0,"message":{"role":"assistant","content":"The Los Angeles Dodgers won the World Series
in 2020! They defeated the Tampa Bay Rays in the series, taking the final game on October 27, 2020. It was their
first World Series title since 1988!"},"finish_reason":"stop"}],"usage":{"prompt_tokens":31,"completion_tokens":
49,"total_tokens":80}}

[Figure 2.4] Sending a request to the ollama server and the resulting output
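The same chat completion request can be issued from Python. The sketch below mirrors the curl call using only the standard library; the helper names (build_chat_payload, extract_reply, chat) are hypothetical, and the response layout is assumed to match the JSON printed by the server in this section.

```python
import json
import urllib.request

# Endpoint used by the curl example in this section.
OLLAMA_CHAT_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_payload(model: str, user_text: str,
                       system_text: str = "You are a helpful assistant.") -> dict:
    """Build the JSON body for a chat completion request."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
    }


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a chat completion response."""
    return response["choices"][0]["message"]["content"]


def chat(model: str, user_text: str, url: str = OLLAMA_CHAT_URL) -> str:
    """POST one question to the server and return the assistant's answer."""
    body = json.dumps(build_chat_payload(model, user_text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

With the server running, chat("llama3", "Who won the world series in 2020?") should return only the assistant text rather than the whole JSON document.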

[4] Installing Open WebUI and connecting it to Ollama

Next, let's walk through connecting Open WebUI to Ollama, following the site below.

https://docs.openwebui.com/getting-started/quick-start

<Jetson Orin Nano Super Dev Kit>

$ curl -LsSf https://astral.sh/uv/install.sh | sh

-> First, install the uv tool.

chyi@jetsonkit:~/workspace$ source $HOME/.local/bin/env

Next, start open-webui with uvx.

chyi@jetsonkit:~/workspace$ DATA_DIR=~/.open-webui uvx --python 3.11 open-webui@latest serve

...

INFO: Started server process [2567]

INFO: Waiting for application startup.

2026-05-07 21:07:00.353 | INFO | open_webui.utils.logger:start_logger:194 - GLOBAL_LOG_LEVEL: INFO

2026-05-07 21:07:00.353 | INFO | open_webui.main:lifespan:659 - Installing external dependencies of functions and tools...

2026-05-07 21:07:00.394 | INFO | open_webui.utils.plugin:install_frontmatter_requirements:407 - No requirements found in frontmatter.

2026-05-07 21:07:00.394 | INFO | open_webui.utils.automations:scheduler_worker_loop:172 - Scheduler worker started (poll interval: 10s)

With this running, restart ollama.

$ sudo service ollama restart

OK, now use a web browser to connect to Open WebUI (on port 8080).

http://192.168.8.187:8080

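[Figure 2.5] Open WebUI connected to Ollama (1)

[Figure 2.6] Open WebUI connected to Ollama (2)

📌 The first connection shows a user account registration screen, which is omitted here.

<A quick aside - about the LM Studio project>

There is also LM Studio, a project similar to Ollama. It is not open source, but it is likewise used to run an LLM inference & serving engine on a local machine, and it supports both a desktop environment (built with TypeScript & Node) and a server (daemon) environment. Below is the desktop version after installation; the UI (apparently written in TypeScript) looks quite polished.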

[Figure 2.7] LM Studio installation - Model downloading

[Figure 2.8] Chat example after installing LM Studio

-------------------------------------------------------------------------------------------------

2.2) Installing vLLM

vLLM supports both LLMs (Large Language Models) and VLMs (Vision-Language Models). This time, let's run vLLM under docker.

https://www.jetson-ai-lab.com/tutorials/genai-on-jetson-llms-vlms/

[1] Running Open WebUI under docker

First, start the open-webui docker image.

$ docker run -d \
  --network=host \
  -v ${HOME}/open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URL=http://0.0.0.0:8000/v1 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

📌 ghcr.io (GitHub Container Registry) is GitHub's hosting service for Docker container images and OCI artifacts.

[2] Running the vllm server under docker

Next, run the vllm docker image: first bring up the docker container, then start vllm inside it.

$ docker run --pull=always --rm -it \
  --network host \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --runtime=nvidia \
  --name=vllm \
  -v $HOME/data/models/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin

root@jetsonkit:/home# vllm serve facebook/opt-125m --gpu-memory-utilization 0.5 --chat-template /home/template.jinja

...

(APIServer pid=1) INFO 05-13 08:26:28 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8000

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:37] Available routes are:

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /docs, Methods: GET, HEAD

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /redoc, Methods: GET, HEAD

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /tokenize, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /detokenize, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /load, Methods: GET

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /version, Methods: GET

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /health, Methods: GET

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /metrics, Methods: GET

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /v1/models, Methods: GET

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /ping, Methods: GET

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /ping, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /invocations, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /v1/chat/completions, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /v1/responses, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /v1/completions, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /v1/messages, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /inference/v1/generate, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST

(APIServer pid=1) INFO 05-13 08:26:28 [launcher.py:46] Route: /v1/completions/render, Methods: POST

(APIServer pid=1) INFO: Started server process [1]

(APIServer pid=1) INFO: Waiting for application startup.

(APIServer pid=1) INFO: Application startup complete.

Running docker ps -a shows that two containers are up, as below.

$ docker ps -a

[Figure 2.9] Checking the two running docker containers

In this state, connect to port 8080, the open webui port.

http://192.168.8.187:8080

-> 8080(open webui port) => 8000(vllm port)

[Figure 2.10] Open WebUI connected to the vLLM server
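The vLLM startup log lists a /v1/models route, so the model that Open WebUI sees can also be checked directly on port 8000. This is a minimal sketch under the assumption that the response follows the OpenAI-style list layout (an object with a "data" array of entries carrying an "id"); the function names and the sample dictionary are illustrative, not captured output.

```python
import json
import urllib.request

VLLM_BASE_URL = "http://localhost:8000"  # port taken from the vLLM startup log


def model_ids(models_response: dict) -> list[str]:
    """Extract the model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_served_models(base_url: str = VLLM_BASE_URL) -> list[str]:
    """Ask a running server which models it serves."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return model_ids(json.load(resp))


# Hand-written example of the assumed response shape (not captured output).
example = {"object": "list", "data": [{"id": "facebook/opt-125m", "object": "model"}]}
print(model_ids(example))
```

While the vllm container is up, list_served_models() can be called from the host, since both containers use host networking.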

2.3) Installing Modular MAX

This time, let's install Modular MAX, which failed in Section 1, again on the Jetson Dev Kit.

[1] Installing pixi

chyi@jetsonkit:~/workspace$ curl -fsSL https://pixi.sh/install.sh | sh

chyi@jetsonkit:~/workspace$ source ~/.bashrc

[2] Creating the project and installing modular

$ pixi init modular_max -c https://conda.modular.com/max-nightly/ -c conda-forge && cd modular_max

✔ Created /home/chyi/workspace/modular_max/pixi.toml

$ pixi add modular

WARN Skipped running the post-link scripts because `run-post-link-scripts` = `false`

- bin/.librsvg-pre-unlink.sh

To enable them, run:

pixi config set --local run-post-link-scripts insecure

More info:

https://pixi.sh/latest/reference/pixi_configuration/#run-post-link-scripts

✔ Added modular >=26.4.0.dev2026050616,<27

chyi@jetsonkit:~/workspace/modular_max$ pixi config set --local run-post-link-scripts insecure

✅ Updated config at /home/chyi/workspace/modular_max/.pixi/config.toml

[3] Entering the virtual environment

chyi@jetsonkit:~/workspace/modular_max$ pixi shell

[4] Setting the huggingface access token

$ export HF_TOKEN="hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXX"

-> Export your own huggingface access token.

[5] Running a model with the max serve command

Let's run the google/gemma-4-31B-it model.

$ max serve --model google/gemma-4-31B-it

13:28:27.490 INFO: Metrics initialized.

generation_config.json: 100%|█████████████████████████████████████████████| 208/208 [00:00<00:00, 479kB/s]

config.json: 4.62kB [00:00, 4.69MB/s]

13:28:30.239 WARNING: Architecture 'Gemma4ForConditionalGeneration' requires PipelineRuntimeConfig.max_num_steps=1, overriding current value 10

Traceback (most recent call last):

File "/home/chyi/workspace/modular_max/.pixi/envs/default/bin/max", line 10, in <module>

sys.exit(main())

~~~~^^

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/click/core.py", line 1514, in __call__

return self.main(*args, **kwargs)

~~~~~~~~~^^^^^^^^^^^^^^^^^

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/click/core.py", line 1435, in main

rv = self.invoke(ctx)

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/click/core.py", line 1902, in invoke

return _process_result(sub_ctx.command.invoke(sub_ctx))

~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/pipelines.py", line 102, in invoke

return super().invoke(ctx)

~~~~~~~~~~~~~~^^^^^

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/click/core.py", line 1298, in invoke

return ctx.invoke(self.callback, **ctx.params)

~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/click/core.py", line 853, in invoke

return callback(*args, **kwargs)

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/cli/config.py", line 368, in wrapped

return func(*args, **kwargs)

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/cli/config.py", line 368, in wrapped

return func(*args, **kwargs)

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/cli/config.py", line 368, in wrapped

return func(*args, **kwargs)

[Previous line repeated 7 more times]

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/cli/config.py", line 495, in wrapper

return func(*args, **kwargs)

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/pipelines.py", line 196, in wrapper

return func(*args, **kwargs)

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/max/entrypoints/pipelines.py", line 260, in cli_serve

pipeline_config = PipelineConfig(**config_kwargs)

File "/home/chyi/workspace/modular_max/.pixi/envs/default/lib/python3.14/site-packages/pydantic/main.py", line 263, in __init__

validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)

pydantic_core._pydantic_core.ValidationError: 1 validation error for PipelineConfig

Value error, failed to create device: MAX doesn't support your current NVIDIA GPU driver. MAX requires a minimum driver version of 580 and CUDA version 13.0. Your driver version is 540.4.0 [type=value_error, input_value={'model_path': 'google/ge...=0, device_type='gpu')]}, input_type=dict]

For further information visit https://errors.pydantic.dev/2.13/v/value_error

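Oh dear, it fails again. According to the message, Modular MAX requires a driver version of at least 580 and CUDA 13.0, while the Jetson Orin Nano Dev Kit is currently based on CUDA 12.6 (the reported driver version is 540.4.0). A bit of searching suggests that CUDA 13.0 support will only arrive with JetPack 7.2, which the roadmap lists for release in Q2 2026. 😈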
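The check that fails here boils down to a version comparison. As a rough illustration only (this is not MAX's actual code), the sketch below compares a dotted driver version string against the minimum of 580 quoted in the error message; the function names are invented for this example.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string such as '540.4.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_supported(driver_version: str, minimum: str = "580") -> bool:
    """Return True if driver_version is at least the minimum version."""
    return parse_version(driver_version) >= parse_version(minimum)


# Driver version reported on the Jetson in the error message.
print(driver_is_supported("540.4.0"))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings character by character.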

-------------------------------------------------------------------

In the end, on both the Ubuntu PC (Intel® Core™ i7, NVIDIA GeForce GTX 1650) and the Jetson Orin Nano Dev Kit, insufficient GPU memory meant that, regrettably, I could not test a wider range of models. 😂

For reference, there is also a project called AirLLM, an open-source library aimed at running large models even on a low-end GPU with 4GB of memory.

https://github.com/lyogavin/airllm

That wraps up this post on LLM inference & serving engines, although it leaves a lot to be desired in many ways. Thank you for reading to the end. 😎

To be continued...

References

Open source Inference & Serving engine

[1] https://llama-cpp.com/#why-choose-llama-cpp

[2] https://balaakshay.medium.com/beginners-guide-setting-up-llama-cpp-for-local-llm-experiments-gpu-optimized-291adc5b7ba2

[3] https://ollama.com/

[4] https://github.com/vllm-project/vllm