GPU Cloud Server

Practice large model inference using Llama 2 as an example

2026-01-28 02:47:34

1. What is Llama 2

On July 18, 2023, Meta released Llama 2, an open-source large language model that is free for research and commercial use.

Llama 2's training method involves unsupervised pre-training, followed by supervised fine-tuning, reward model training, and reinforcement learning from human feedback. Llama 2 was trained on 2 trillion tokens, 40% more training data than Llama 1, and has double the context length of Llama 1. The Llama 2 model comes in three size variants: 7B, 13B, and 70B.

According to official data published by Meta, Llama 2 outperforms other open-source language models on many external benchmarks, including reasoning, coding, proficiency, and knowledge tests, and even outperforms some closed-source models in terms of helpfulness and safety.

Llama 2-Chat builds on Llama 2 with fine-tuning and safety improvements for dialogue use cases. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

Llama 2-Chat is focused on chatbot scenarios and is mainly used in the following areas:

· Customer service: Llama 2-Chat can be used for online customer service to answer FAQs about products and services, and provide help and support to users.

· Social entertainment: Llama 2-Chat can serve as a fun chat partner for casual, relaxed conversations, providing entertainment content such as jokes, riddles, and stories to enhance the user's experience.

· Personal Assistant: Llama 2-Chat can answer everyday questions, such as weather queries, time settings, and reminders, helping users complete simple tasks and providing practical functions.

· Mental Health: Llama 2-Chat can be used as a simple mental health support tool that can communicate with users, provide advice and tips for emotional regulation and stress relief, and provide comfort and support to users.

2. Build a model running environment on the GPU cloud server

Step 1: Download the model and upload

Download the Llama-2-7b-chat-hf model from the Hugging Face official website, then upload the downloaded model to the GPU cloud server.


Description                   

For more information on how to upload local files to the Linux-based cloud server, see How to Upload Local Files to Linux-based Cloud Server.
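If the GPU cloud server itself has Internet access, you can also download the model directly on the server instead of uploading it. The following is a minimal sketch using the huggingface_hub Python package; the local directory and token are placeholders, and access to the meta-llama/Llama-2-7b-chat-hf repository must first be granted on Hugging Face.

# Minimal sketch: download the model directly on the server
# (assumes "pip install huggingface_hub" and that repository access has been granted)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-7b-chat-hf",
    local_dir="/data/models/Llama-2-7b-chat-hf",  # assumed target path on the server
    token="hf_xxx",                               # replace with your Hugging Face access token
)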


Step 2: Build an environment

1. Upload and install the GPU driver

Download the GPU driver from the Nvidia official website and upload it to the GPU cloud server. Install the driver in the following steps.

# Add the execution permission to the installation package

chmod +x NVIDIA-Linux-x86_64-515.105.01.run

# Install gcc and linux-kernel-headers

sudo apt-get install gcc linux-headers-$(uname -r)

# Run the driver installer

sudo sh NVIDIA-Linux-x86_64-515.105.01.run --disable-nouveau

# Check whether the driver is successfully installed

nvidia-smi


Description          

For more information on how to select a driver, library, and software version, see How to Select a Driver, Library, or Software Version.            

2. Install the Nvidia CUDA Toolkit component

wget http://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run

# Install CUDA

bash cuda_11.7.0_515.43.04_linux.run

# Edit the environment variable file

vi ~/.bashrc

# Add environment variables

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

# Take environment variables into effect

source ~/.bashrc

# Check whether it is successfully installed

nvcc -V

3. Install Miniconda

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Install Miniconda3

bash Miniconda3-latest-Linux-x86_64.sh

# Configure environment variables of Conda

vim /etc/profile

# Add environment variables

export ANACONDA_PATH=~/miniconda3
export PATH=$PATH:$ANACONDA_PATH/bin

# Take environment variables into effect

source /etc/profile

# Check whether it is successfully installed

which conda
conda --version
conda info -e
python

# Check the virtual environment

conda env list
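To confirm that the Python interpreter on the PATH is the one provided by Miniconda, a quick check such as the following can be used (a minimal sketch that only prints the interpreter path and version):

# The printed path should point into ~/miniconda3
python -c "import sys; print(sys.executable); print(sys.version)"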

4. Install cuDNN

Download the cuDNN archive from cudnn-download and upload it to the GPU cloud server. Install cuDNN in the following steps.

# Extract the archive

tar -xf cudnn-linux-x86_64-8.9.4.25_cuda11-archive.tar.xz

# Go to the directory

cd cudnn-linux-x86_64-8.9.4.25_cuda11-archive

# Copy

cp ./include/*  /usr/local/cuda-11.7/include/
cp ./lib/libcudnn*  /usr/local/cuda-11.7/lib64/

# Authorize

chmod a+r /usr/local/cuda-11.7/include/* /usr/local/cuda-11.7/lib64/libcudnn*

# Check whether it is successfully installed

cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2

5. Install dependencies

a. Download the code for the Llama model

git clone https://github.com/facebookresearch/llama.git

b. Install dependencies online

python -m pip install --upgrade pip -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

# Go to the llama directory and install dependencies

cd llama
pip install -e . -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
pip install transformers -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

# Download peft

git clone https://github.com/huggingface/peft.git

# If installing on an offline server, upload the peft repository first. Then go to the peft directory and check out the specific peft version

cd peft
git checkout 13e53fc

# Install peft

pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn
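After the dependencies are installed, it is worth confirming that PyTorch can see the GPU, CUDA, and cuDNN before moving on. The following is a minimal sketch, assuming torch was installed as a dependency of the llama package:

# check_env.py: verify that the GPU stack is usable from Python
import torch
import transformers

print("torch version:", torch.__version__)
print("transformers version:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())        # should print True
print("CUDA version seen by torch:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))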


Step 3: Package the image

To help you build the model running environment faster, we packaged the system disk of a GPU cloud server on which Step 1 and Step 2 had been completed, and generated a standard GPU cloud server image. The image has been uploaded to the eSurfing Cloud Chengdu 4 and Haikou 2 resource pools, so you can use it directly.

Package the image in the following steps:

echo "nameserver 114.114.114.114" > /etc/resolv.conf
echo "localhost" > /etc/hostname

# Clear machine-id.

yes | cp -f /dev/null /etc/machine-id

# If /var/lib/dbus/machine-id exists,

# rm -f /var/lib/dbus/machine-id

# ln -s /etc/machine-id /var/lib/dbus/machine-id

cloud-init clean -l  # clear cloud-init. If this command is unavailable, try: rm -rf /var/lib/cloud
rm -f /tmp/*.log  # clear the image script log.

# Clear /var/log log.

read -r -d '' script <<-"EOF"
import os
def clear_logs(base_path="/var/log"):
    files = os.listdir(base_path)
    for file in files:
        file_path = os.path.join(base_path, file)
        if os.path.isfile(file_path):
            with open(file_path, "w") as f:
                f.truncate()
        elif os.path.isdir(file_path):
            clear_logs(base_path=file_path)
 
if __name__ == "__main__":
    clear_logs()
EOF
if [ -e /usr/bin/python ]; then
    python -c "$script"
elif [ -e /usr/bin/python2 ]; then
    python2 -c "$script"
elif [ -e /usr/bin/python3 ]; then
    python3 -c "$script"
else
    echo "### no python env in /usr/bin. clear_logs failed ! ###"
fi

# Clear the history.

rm -f /root/.python_history
rm -f /root/.bash_history
rm -f /root/.wget-hsts


3. Rapidly deploy the model with the foundation model image

Step 1: Create a GPU cloud server

Log in to the eSurfing Cloud console, go to the ECS ordering page, select a compute-accelerated GPU cloud server, and select the foundation model image LLaMA2-7B-Chat from the public images.

The recommended minimum specification for the foundation model image LLaMA2-7B-Chat is p2v.2xlarge.4: 8 vCPUs, 32 GB memory, and a single V100 GPU.
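As a rough sanity check on why a single V100 is enough for the 7B variant, the weight memory can be estimated as the parameter count times the bytes per parameter. The following is a back-of-the-envelope sketch, assuming the weights are loaded in fp16 (2 bytes per parameter) and ignoring activation and KV-cache overhead:

# Rough estimate of Llama-2-7b weight memory in fp16
params = 7e9             # ~7 billion parameters
bytes_per_param = 2      # fp16
weight_gib = params * bytes_per_param / 1024**3
print(f"approx. weight memory: {weight_gib:.1f} GiB")  # about 13 GiB, which fits a 16 GB V100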

Step 2: Online inference

Log in to the GPU cloud server and run the inference task in the following steps.

# Go to the llama directory and run the run.sh script

cd /opt/llama && sh run.sh


# Enter your question at the "please input your question :" prompt as instructed

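The contents of run.sh are provided by the image and are not reproduced here. As an illustration only, an interactive inference loop of this kind can be sketched with the transformers library as follows; the model path is an assumption, and the [INST] ... [/INST] wrapping follows the Llama 2 chat prompt format.

# Illustrative sketch of an interactive inference loop (not the actual run.sh)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "/opt/llama/Llama-2-7b-chat-hf"  # assumed location of the uploaded model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")

while True:
    question = input("please input your question : ")
    prompt = f"[INST] {question} [/INST]"     # Llama 2 chat prompt format
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(answer)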
