Running an open-source model locally offers cost efficiency, lower long-term costs, full control over your data, and a ready-to-use pre-trained model.
Get Your Qwen 2.5 AI Model Running in a Day
We want to install and run the Qwen-2.5 model on our local server using Hugging Face, but are unsure how to properly set up the environment, manage dependencies, and execute a prompt.
We will go through the step-by-step process of preparing the server, setting up a Python environment, downloading the Qwen-2.5 model from Hugging Face, and running a first prompt.
Before installation, ensure that the local server runs a Debian/Ubuntu-based Linux distribution, has Python 3 installed, and ideally has an NVIDIA GPU with enough VRAM and disk space for the model size you plan to run.
Start by updating the system and installing required packages:
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-pip git
For an NVIDIA GPU, install the driver and a CUDA-enabled PyTorch build:
sudo apt install -y nvidia-driver-525
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Verify GPU installation:
nvidia-smi
If you see GPU details, it’s installed correctly.
To isolate dependencies, create and activate a virtual environment:
python3 -m venv qwen_env
source qwen_env/bin/activate
Now, install Hugging Face Transformers, PyTorch, and other required libraries:
pip install torch transformers accelerate
pip install sentencepiece
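Optionally, you can snapshot the installed package versions so the environment can be recreated later; this is a convenience step, not part of the original setup:
# Optional: record exact package versions for reproducibility
pip freeze > requirements.txt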
Confirm installation:
python -c "import torch; print(torch.cuda.is_available())"
If it prints True, CUDA is enabled for GPU acceleration.
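As an optional extra check (not part of the original steps), you can also print the GPU that PyTorch has detected:
python -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'No CUDA GPU detected')"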
Install the Hugging Face Hub CLI, which handles authentication and can also pre-download the model:
pip install huggingface_hub
huggingface-cli login # (Optional, required for some models)
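If you prefer to fetch the weights up front rather than on first use, here is a minimal sketch using the CLI's download command (available in recent huggingface_hub releases):
# Optional: pre-download the weights into the local Hugging Face cache
huggingface-cli download Qwen/Qwen2.5-7B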
Then, download and load the Qwen-2.5 model from Python (the weights are fetched automatically on first use):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # Change to Qwen2.5-14B if needed

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
print("Model loaded successfully!")
Now, let’s test text generation using Qwen-2.5:
def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
    output = model.generate(**inputs, max_length=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)
# Example usage
print(generate_text("What is the meaning of life?"))
If the setup is correct, we should see an AI-generated response.
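If the output looks too repetitive or deterministic, you can explicitly enable sampling. A minimal sketch with illustrative, untuned settings (not specific guidance from the Qwen documentation):
# Optional: sampled generation for more varied output (illustrative values)
inputs = tokenizer("What is the meaning of life?", return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
output = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))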
Enable Half-Precision (FP16) for Faster Inference
Modify the model loading to use torch_dtype=torch.float16 (this requires importing torch):
import torch

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
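To sanity-check the effect, you can print roughly how much GPU memory the loaded weights occupy (assuming a CUDA GPU is present; this is an extra check, not part of the original steps):
import torch

# Rough measure of GPU memory held by the loaded model (CUDA only)
print(f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated")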
Use DeepSpeed or BitsAndBytes for Memory Efficiency
Install additional tools for better memory usage:
pip install bitsandbytes deepspeed
Then, modify model loading:
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
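If 8-bit loading still does not fit in VRAM, recent transformers and bitsandbytes versions also support 4-bit quantization; a minimal sketch, trading some accuracy for memory:
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization: further reduces memory at some cost in output quality
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")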
To access Qwen-2.5 via an API, use FastAPI:
pip install fastapi uvicorn
Create a simple API (app.py):
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Request body schema so the endpoint accepts JSON like {"prompt": "..."}
class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
async def generate(request: GenerateRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_length=200)
    return {"response": tokenizer.decode(output[0], skip_special_tokens=True)}

# Run API
# uvicorn app:app --host 0.0.0.0 --port 8000
This allows you to send prompts via HTTP requests:
curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{"prompt": "Tell me about quantum physics"}'
Hosting Qwen-2.5 on a local server provides lower long-term costs, full control over your data, and a ready-to-use pre-trained model without per-request API fees.
For better performance, enable FP16, quantization, or DeepSpeed optimizations.
Ready to transform your business with our technology solutions? Contact us today to leverage our AI/ML expertise.