
How to Install Qwen-2.5 Model on a Local Server Using Hugging Face

 


Problem

We want to install and run the Qwen-2.5 model on our local server using Hugging Face, but are unsure how to properly set up the environment, manage dependencies, and execute a prompt.

Solution

We will go through the step-by-step process of:

  • Setting up the local server with required dependencies.
  • Installing Hugging Face Transformers & PyTorch for model inference.
  • Downloading and loading the Qwen-2.5 model for text generation.
  • Running the model locally and testing an AI-generated response.

 

1. System Requirements

Before installation, ensure that the local server has the following (a quick check script follows the list):

  • Operating System: Ubuntu 22.04 (or similar)
  • GPU Support (Optional but Recommended): NVIDIA GPU with CUDA support
  • RAM: At least 16GB (32GB+ recommended for large models)
  • Disk Space: At least 50GB free for model storage
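
A minimal Python sketch to verify these requirements on the server (Linux-only, standard library; the check_server.py name and the printed thresholds simply mirror the list above):

# check_server.py - quick sanity check of the requirements listed above
import os
import platform
import shutil

# Operating system
print(f"OS: {platform.system()} {platform.release()}")

# Total RAM (Linux): page size * number of physical pages
ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / (1024 ** 3)
print(f"RAM: {ram_gb:.1f} GB (16GB minimum, 32GB+ recommended)")

# Free disk space on the root filesystem
free_gb = shutil.disk_usage("/").free / (1024 ** 3)
print(f"Free disk: {free_gb:.1f} GB (50GB+ recommended for model storage)")

# NVIDIA driver present? (optional, but recommended for GPU inference)
print(f"nvidia-smi found: {shutil.which('nvidia-smi') is not None}")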

 

2. Install System Dependencies

Start by updating the system and installing required packages:

sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-pip git

 

For an NVIDIA GPU, install the driver and a CUDA-enabled PyTorch build:

sudo apt install -y nvidia-driver-525
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

 

Verify GPU installation:

nvidia-smi 

If you see GPU details, it’s installed correctly.

3. Set Up a Virtual Environment (Recommended)

To isolate dependencies, create and activate a virtual environment:

python3 -m venv qwen_env
source qwen_env/bin/activate

4. Install Hugging Face Transformers & Dependencies

Now, install Hugging Face Transformers, PyTorch, and other required libraries:

pip install torch transformers accelerate
pip install sentencepiece

Confirm installation:

python -c "import torch; print(torch.cuda.is_available())" 

If it prints True, CUDA is enabled for GPU acceleration.
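
For a more detailed check, a small sketch using standard PyTorch calls prints the installed version and the detected GPU:

import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU count:", torch.cuda.device_count())
    print("GPU name:", torch.cuda.get_device_name(0))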

5. Download the Qwen-2.5 Model from Hugging Face

First, install the Hugging Face Hub client and log in if needed:

pip install huggingface_hub
huggingface-cli login  # (Optional, required for some models)

Then, download and load the Qwen-2.5 model:

 

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # Change to Qwen2.5-14B if needed

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

print("Model loaded successfully!")
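
Alternatively, if you prefer to pre-download the weights into a local folder first (for example, to reuse them across environments), here is a minimal sketch using huggingface_hub's snapshot_download; the "./models/qwen2.5-7b" path is only an example:

from huggingface_hub import snapshot_download

# Download all files of the Qwen-2.5 repository into a local folder
local_path = snapshot_download(
    repo_id="Qwen/Qwen2.5-7B",
    local_dir="./models/qwen2.5-7b",  # example path, change as needed
)
print("Model files stored at:", local_path)

# from_pretrained() can then load from that folder instead of the Hub:
# tokenizer = AutoTokenizer.from_pretrained(local_path)
# model = AutoModelForCausalLM.from_pretrained(local_path, device_map="auto")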

6. Running Qwen-2.5 Locally & Executing a Prompt

Now, let’s test text generation using Qwen-2.5:

def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
    output = model.generate(**inputs, max_length=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage
print(generate_text("What is the meaning of life?"))

 

If the setup is correct, we should see an AI-generated response.
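
Note that Qwen/Qwen2.5-7B is the base model. If you instead load the instruction-tuned variant (for example Qwen/Qwen2.5-7B-Instruct), prompts are normally formatted with the tokenizer's chat template. A sketch of that flow, assuming the tokenizer and model were loaded from the Instruct repository:

# Assumes tokenizer/model were loaded from "Qwen/Qwen2.5-7B-Instruct"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the meaning of life?"},
]

# Build the prompt in the chat format the model was trained on
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)

# Decode only the newly generated tokens
response = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)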

7. Optimizing Performance (For Large Models)

Enable Half-Precision (FP16) for Faster Inference

Modify the model loading to use torch_dtype=torch.float16:

import torch

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

Use DeepSpeed or BitsAndBytes for Memory Efficiency

 

Install additional tools for better memory usage:

pip install bitsandbytes deepspeed

 

Then, modify model loading:

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
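
If 8-bit is still too large for your GPU, bitsandbytes also supports 4-bit loading. A minimal sketch (the parameter choices below are common defaults, not tuned values):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights instead of 8-bit
    bnb_4bit_compute_dtype=torch.float16,  # run compute in FP16
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")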

 

8. Running Qwen-2.5 as an API (Optional)

To access Qwen-2.5 via an API, use FastAPI:

pip install fastapi uvicorn

 

Create a simple API (app.py):

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

# Load the model once at startup
model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Request body for the /generate endpoint
class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
async def generate(request: GenerateRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
    output = model.generate(**inputs, max_length=200)
    return {"response": tokenizer.decode(output[0], skip_special_tokens=True)}

# Run API
# uvicorn app:app --host 0.0.0.0 --port 8000

 

This allows you to send prompts via HTTP requests:

curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{"prompt": "Tell me about quantum physics"}'
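
The same request from Python, using the requests library (a small sketch; install it with pip install requests if needed):

import requests

resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Tell me about quantum physics"},
)
resp.raise_for_status()
print(resp.json()["response"])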

 

Conclusion

Hosting Qwen-2.5 on a local server provides:

  • Full control over deployment and performance tuning
  • Lower long-term costs vs. cloud-hosted models
  • Better security since no data leaves your server

For better performance, enable FP16, quantization, or DeepSpeed optimizations.

 

Ready to transform your business with our technology solutions? Contact us today to leverage our AI/ML expertise.
