Hands-On with the Llama 3.2 Lightweight Models

Trying out Llama 3.2 in its lightweight 1B and 3B sizes.

Mar 12, 2026

Contents

Llama 3.2 Overview
Trying Out the Llama Lightweight Models
Where LLMs Are Heading

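On September 25, 2024, Meta announced Llama 3.2, its latest LLM release. Unlike earlier releases, it includes lightweight models with a drastically reduced parameter count, which are lighter and faster to run, along with multimodal models that can analyze images.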

Llama 3.2: Revolutionizing edge AI and vision with open, customizable models

Today, we’re releasing Llama 3.2, which includes small and medium-sized vision LLMs, and lightweight, text-only models that fit onto edge and mobile devices.

https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/

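Until now, Llama and most other model families have consisted mainly of large models that perform well but are hard to run. This release instead introduces small models with far fewer parameters.

In this post, we look at the newly released Llama 3.2 and at the advantages of lightweight models.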

Llama 3.2 Overview

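Llama is an LLM (large language model) released by Meta. Meta released the models as open source, and since Llama 2 the license has also allowed commercial use. The models come in several parameter sizes, such as 8B to 70B, and are easy to download and try out.

Llama 3.2 adds lightweight models and multimodal models.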

Lightweight Models

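Where earlier large models focused on improving the accuracy of their inference results, the lightweight models go in a different direction: they run fast and no longer need the heavy systems required to run a large model.

Compared with the existing 70B model, the 1B model is dramatically smaller.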
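
To put that size gap in numbers, here is a rough back-of-the-envelope sketch that estimates the memory needed for the weights alone as parameter count times bytes per parameter. The parameter counts are the nominal sizes from the model names, and the precisions listed are assumptions for illustration; real checkpoints differ slightly, and activations and the KV cache are ignored.

```python
# Rough weight-only memory estimate: parameters x bytes per parameter.
# Assumption: nominal parameter counts taken from the model names.
# Activations, KV cache, and runtime overhead are not included.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Return the approximate size of the weights in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for name, params in [("1B", 1e9), ("3B", 3e9), ("70B", 70e9)]:
    sizes = ", ".join(
        f"{precision}: {weight_memory_gb(params, precision):.1f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{name:>3} -> {sizes}")
```

Under these assumptions the 70B weights are 70 times larger than the 1B weights at any given precision, which is the difference between needing server-class hardware and fitting on a laptop or phone.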

Llama 3.2 Lightweight models

To make the models lightweight, Meta started from the Llama 3.1 8B model and applied pruning. Pruning removes unnecessary connections between nodes in an existing model to produce a more efficient one.
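
As a toy illustration of the idea of removing unneeded connections, the sketch below zeroes out the smallest-magnitude weights in a small matrix. This is simple unstructured magnitude pruning chosen for clarity, not the procedure Meta used for Llama 3.2; the function name and the example weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    `weights` is a list of rows (lists of floats); a new matrix is returned.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)  # how many weights to remove
    if k == 0:
        return [row[:] for row in weights]

    threshold = magnitudes[k - 1]  # largest magnitude that still gets pruned
    removed = 0
    pruned = []
    for row in weights:
        new_row = []
        for w in row:
            # `removed < k` keeps the count exact when several weights tie.
            if abs(w) <= threshold and removed < k:
                new_row.append(0.0)
                removed += 1
            else:
                new_row.append(w)
        pruned.append(new_row)
    return pruned


layer = [[0.8, -0.05, 0.3],
         [0.01, -0.6, 0.1]]
print(magnitude_prune(layer, sparsity=0.5))
# [[0.8, 0.0, 0.3], [0.0, -0.6, 0.0]]
```

The pruned matrix keeps the three largest weights and drops the rest. In practice a pruned model is usually trained further to recover accuracy, which is where the distillation step comes in.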

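However, simply shrinking a model can hurt its performance, so a technique called knowledge distillation is used. Knowledge distillation transfers what an existing model has learned to a smaller model. By passing along the probabilities that the large model assigns to its possible outputs, the small model's performance can be improved without significantly increasing its size.

In addition, to improve the quality of the responses, synthetic data generated by an already trained model such as the 405B model can be used when building these models.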
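
The idea of passing the large model's output probabilities to the small model can be sketched as a loss function. The following is a minimal, generic version of a distillation loss (KL divergence between temperature-softened teacher and student distributions), not Meta's training code; the function names, the temperature value, and the example logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's current prediction
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.5, 0.2]       # hypothetical logits for three tokens
good_student = [3.8, 1.6, 0.3]  # close to the teacher
poor_student = [0.2, 1.5, 4.0]  # ranks the tokens the wrong way round

print(distillation_loss(teacher, good_student))
print(distillation_loss(teacher, poor_student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so minimizing it pushes the small model toward the large model's full probability distribution rather than only its top answer.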

Multimodal Models

Llama 3.2 Vision models

Llama 3.2 is the first Llama release to include multimodal models that can analyze images.

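To reason over images, an image encoder was integrated with the existing language model. Rather than modifying the language model directly, Meta trained adapters that connect the image encoder to it. As a result, the model keeps its existing language capabilities while also being able to understand images.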

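The model is first trained on paired images and text. It then goes through several further stages, such as fine-tuning, sampling, and preference optimization, after which it can reason over both images and text.

A model that could previously reason only over text can now reason over inputs that include images.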

Trying Out the Llama Lightweight Models

Now let's actually try the Llama lightweight models.

Llama models can be run easily through Ollama.

Ollama

Get up and running with large language models.

https://ollama.com/

The Llama 3.2 models can be downloaded from the official Llama site.

Llama 3.2

The open source AI model you can fine-tune, distill and deploy anywhere. Our latest models are available in 8B, 70B, and 405B variants.

https://www.llama.com/

With Ollama and the model ready, let's run it locally. We'll start with the smallest size, 1B.


❯ ollama run llama3.2:1b
>>> Why use a smaller large language model?
Using a smaller Large Language Model (LLM) like BERT, RoBERTa, or XLNet can provide several benefits. Here are some reasons why
you might want to consider using a smaller LLM:

1. **Faster training times**: Smaller models typically require less computational resources and can be trained faster than larger
models. This means you can try out different language models quickly without having to wait for hours or days.
2. **Easier deployment**: With smaller models, it's often easier to deploy them in production, as they don't require the same
level of computing power or data storage as larger models.
3. **Reduced model size and complexity**: Smaller models are typically more lightweight and have fewer parameters, which can make
them easier to train and deploy.

...

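The 1B model ran easily on a local machine (tested on an M1 MacBook) and responded very quickly. As the output above shows, the answer is also sensible: asked why one would use a small LLM, it gives reasonable points such as faster training times, easier deployment, and lower cost.

Next, let's try the 3B model.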


❯ ollama run llama3.2
>>> Why use a smaller large language model?
There are several reasons why using a smaller large language model (LLM) might be beneficial:

1. **Reduced computational requirements**: Smaller LLMs require less memory and computation power to train and deploy, making
them more suitable for resource-constrained environments, such as edge devices or mobile apps.
2. **Faster inference times**: Due to their smaller size, smaller LLMs can process text inputs faster, resulting in quicker
response times and improved user experience.
3. **Lower latency**: Smaller LLMs can handle concurrent requests more efficiently, reducing the overall latency of your
application.

...

Responses seemed slightly slower than with 1B, but the difference was not large. The quality of the answer is also solid.

Looking at the memory used while running the 3B model, it consumes very few resources for an LLM running locally.
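
To turn the impression of speed into a number, the two models can be queried through Ollama's local REST API and compared by generated tokens per second. The sketch below assumes an Ollama server running on its default port (11434) with both models already pulled; the helper names are hypothetical, and the actual network call is left in a comment so the block can be read and adapted without a running server.

```python
import json
import urllib.request

# Assumption: Ollama is serving on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt):
    """Build a non-streaming POST request for Ollama's /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def tokens_per_second(result):
    """Generation speed from the response's timing fields.

    eval_count is the number of generated tokens and eval_duration is the
    generation time in nanoseconds.
    """
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(model, prompt):
    """Send the prompt to the local server and return the parsed JSON reply."""
    with urllib.request.urlopen(build_request(model, prompt)) as response:
        return json.loads(response.read())


# Usage (needs a running Ollama server with both models available):
# for model in ("llama3.2:1b", "llama3.2"):
#     result = generate(model, "Why use a smaller large language model?")
#     print(model, f"{tokens_per_second(result):.1f} tokens/s")
```

Measured throughput depends heavily on the hardware and on whether the model is already loaded, so several runs per model give a fairer comparison than a single request.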


>>> What is your last data update?
My last data update was December 2023. However, please note that my training data is constantly being updated and expanded by my
developers to ensure I have the most accurate and up-to-date information available.

...

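Asking the model directly about its last data update also shows that it answers December 2023.

As shown above, the strengths of a small model are fast responses and no need for large resources to run it. For model builders, too, a small model can be produced with less training, which opens up a range of possibilities.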

Where LLMs Are Heading

Llama 3.2 Lightweight models Mobile writing assistant demo

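What becomes possible as LLMs get lighter? As in the image above, personalized models will be able to run on mobile devices. Existing models need so many resources that they could not run on the device itself; the response had to be generated on a server running the LLM and then sent back. Once models are small enough to run on a personal mobile device, they will be able to provide varied responses tailored to each individual.

In addition, models tailored to different mobile devices could be developed to take actions in response to user requests. For example, just as Siri on the iPhone listens to the user's voice and acts on it, an LLM could reason more precisely and produce the corresponding actions.

For model developers, smaller models allow faster development, and they also run faster, so both builders and users get a better experience.

If you are interested in LLMs, the progress of these lightweight models is worth following closely.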
