Running an open-source model locally offers cost efficiency, lower long-term costs, full control over your data, and a ready-to-use pre-trained model.
Get Your Qwen 2.5 AI Model Running in a Day
We want to install and run the Qwen-2.5 model on our local server using Hugging Face, but are unsure how to properly set up the environment, manage dependencies, and execute a prompt.
We will go through the step-by-step process of preparing the server, setting up a Python environment, downloading the Qwen-2.5 model from Hugging Face, and running a first prompt.
Before installation, ensure that the local server runs a Debian/Ubuntu-based Linux distribution, has Python 3 installed, and ideally has an NVIDIA GPU with enough VRAM and disk space for the model size you plan to run.
Start by updating the system and installing required packages:
sudo apt update && sudo apt upgrade -y
sudo apt install -y python3 python3-pip git
For an NVIDIA GPU, install the driver and a CUDA-enabled PyTorch build:
sudo apt install -y nvidia-driver-525
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Verify GPU installation:
nvidia-smi
If you see GPU details, it’s installed correctly.
To isolate dependencies, create and activate a virtual environment:
python3 -m venv qwen_env
source qwen_env/bin/activate
Now, install Hugging Face Transformers, PyTorch, and other required libraries:
pip install torch transformers accelerate
pip install sentencepiece
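Optionally, you can snapshot the installed package versions so the environment can be recreated later; this is a convenience step, not part of the original setup:
# Optional: record exact package versions for reproducibility
pip freeze > requirements.txt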
Confirm installation:
python -c "import torch; print(torch.cuda.is_available())"
If it prints True, CUDA is enabled for GPU acceleration.
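As an optional extra check (not part of the original steps), you can also print the GPU that PyTorch has detected:
python -c "import torch; print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'No CUDA GPU detected')"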
Install the Hugging Face Hub CLI, which handles authentication and can also pre-download the model:
pip install huggingface_hub
huggingface-cli login # (Optional, required for some models)
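If you prefer to fetch the weights up front rather than on first use, here is a minimal sketch using the CLI's download command (available in recent huggingface_hub releases):
# Optional: pre-download the weights into the local Hugging Face cache
huggingface-cli download Qwen/Qwen2.5-7B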
Then, download and load the Qwen-2.5 model from Python (the weights are fetched automatically on first use):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"  # Change to Qwen2.5-14B if needed

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
print("Model loaded successfully!")
Now, let’s test text generation using Qwen-2.5:
def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
    output = model.generate(**inputs, max_length=100)
    return tokenizer.decode(output[0], skip_special_tokens=True)
# Example usage
print(generate_text("What is the meaning of life?"))
If the setup is correct, we should see an AI-generated response.
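If the output looks too repetitive or deterministic, you can explicitly enable sampling. A minimal sketch with illustrative, untuned settings (not specific guidance from the Qwen documentation):
# Optional: sampled generation for more varied output (illustrative values)
inputs = tokenizer("What is the meaning of life?", return_tensors="pt").to("cuda")  # Use "cpu" if no GPU
output = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))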
Enable Half-Precision (FP16) for Faster Inference
Modify the model loading to use torch_dtype=torch.float16 (this requires importing torch):
import torch

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
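To sanity-check the effect, you can print roughly how much GPU memory the loaded weights occupy (assuming a CUDA GPU is present; this is an extra check, not part of the original steps):
import torch

# Rough measure of GPU memory held by the loaded model (CUDA only)
print(f"{torch.cuda.memory_allocated() / 1024**3:.2f} GiB allocated")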
Use DeepSpeed or BitsAndBytes for Memory Efficiency
Install additional tools for better memory usage:
pip install bitsandbytes deepspeed
Then, modify model loading:
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")
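If 8-bit loading still does not fit in VRAM, recent transformers and bitsandbytes versions also support 4-bit quantization; a minimal sketch, trading some accuracy for memory:
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization: further reduces memory at some cost in output quality
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config, device_map="auto")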
To access Qwen-2.5 via an API, use FastAPI:
pip install fastapi uvicorn
Create a simple API (app.py):
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Request body schema so the endpoint accepts JSON like {"prompt": "..."}
class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
async def generate(request: GenerateRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_length=200)
    return {"response": tokenizer.decode(output[0], skip_special_tokens=True)}

# Run API
# uvicorn app:app --host 0.0.0.0 --port 8000
This allows you to send prompts via HTTP requests:
curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{"prompt": "Tell me about quantum physics"}'
Hosting Qwen-2.5 on a local server provides lower long-term costs, full control over your data, and a ready-to-use pre-trained model without per-request API fees.
For better performance, enable FP16, quantization, or DeepSpeed optimizations.
Ready to transform your business with our technology solutions? Contact us today to leverage our AI/ML expertise.