Object Detection with DINO using Hugging Face (Python)
DINO is a modern transformer-based object detection model that delivers strong accuracy without complex post-processing. This article shows how to run DINO using Hugging Face in Python with a clean, practical setup.
Why DINO
Traditional object detection pipelines rely on multiple stages and heuristics. DINO simplifies this by using a transformer-based approach with end-to-end training. With Hugging Face, running DINO becomes straightforward and reproducible.
This setup is ideal for:
- Prototyping detection pipelines
- Research experiments
- Backend AI services
What this guide covers
- Installing dependencies
- Loading a pretrained DINO model
- Running inference on an image
- Visualizing predictions
- Common pitfalls and tips
Install dependencies
Make sure you have Python 3.9+ installed.
pip install torch torchvision transformers pillow matplotlib
GPU is optional but recommended for faster inference.
Load the DINO model
Hugging Face provides pretrained DINO models. Load both the processor and the model:
from transformers import DetrImageProcessor, DetrForObjectDetection
import torch
# Note: these DETR classes pair with the DETR ResNet-50 checkpoint on the Hub.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
model.eval()
This handles resizing, normalization, and tensor conversion automatically.
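Under the hood, the processor rescales pixel values to [0, 1] and normalizes each channel. A minimal pure-Python sketch of that normalization step (assuming the usual ImageNet mean/std defaults for these checkpoints):

```python
# Per-channel normalization as the processor applies it (assumed ImageNet stats).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then apply per-channel mean/std."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )

# A pixel close to the dataset mean normalizes to values near zero.
print(normalize_pixel((124, 116, 104)))
```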
Prepare the input image
Load an image using Pillow:
from PIL import Image
image = Image.open("input.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
Move tensors to GPU if available:
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
Run inference
Run the model without gradient tracking:
with torch.no_grad():
    outputs = model(**inputs)
The raw outputs contain class logits and bounding boxes.
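The class logits become per-class probabilities via a softmax; for DETR-style models the last class is a "no object" slot that gets dropped before picking the best label. A toy sketch of that step (hypothetical logits, not real model output):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for one query: the last entry plays the "no object" role.
query_logits = [2.0, 0.5, -1.0, 3.5]
probs = softmax(query_logits)[:-1]  # drop "no object"
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(probs[best], 3))
```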
Post-process predictions
Convert model outputs to usable bounding boxes:
target_sizes = torch.tensor([image.size[::-1]]).to(device)
results = processor.post_process_object_detection(
    outputs,
    target_sizes=target_sizes,
    threshold=0.7,
)[0]
Each result contains:
- scores
- labels
- boxes (in pixel coordinates)
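The post-processing step is what converts the model's normalized (cx, cy, w, h) boxes into pixel (x0, y0, x1, y1) coordinates. A minimal sketch of that conversion for a single box (an illustrative helper, not the library's API):

```python
def to_pixel_xyxy(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * width,
        (cy - h / 2) * height,
        (cx + w / 2) * width,
        (cy + h / 2) * height,
    )

# A box centered in a 640x480 image, a quarter of its width and half its height:
print(to_pixel_xyxy((0.5, 0.5, 0.25, 0.5), width=640, height=480))
# -> (240.0, 120.0, 400.0, 360.0)
```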
Visualize detections
Draw bounding boxes on the image:
import matplotlib.pyplot as plt
plt.imshow(image)
ax = plt.gca()
for score, label, box in zip(
    results["scores"],
    results["labels"],
    results["boxes"],
):
    x0, y0, x1, y1 = box.tolist()
    ax.add_patch(
        plt.Rectangle(
            (x0, y0),
            x1 - x0,
            y1 - y0,
            fill=False,
            color="red",
            linewidth=2,
        )
    )
    ax.text(x0, y0, f"{model.config.id2label[label.item()]} {score:.2f}", color="white")
plt.axis("off")
plt.show()
This makes debugging and validation much easier.
Performance notes
- DINO performs best on GPU.
- Adjust the confidence threshold based on your use case.
- Batch images when running inference at scale.
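On the batching point: the processor accepts a list of images, so splitting a dataset into fixed-size chunks is usually all that is needed. A small chunking helper (hypothetical file names, illustration only):

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks of a list for batched inference."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Each chunk would be passed as processor(images=[...], return_tensors="pt").
paths = [f"img_{i}.jpg" for i in range(10)]
print([len(chunk) for chunk in batched(paths, 4)])  # -> [4, 4, 2]
```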
Common mistakes
- Forgetting to call model.eval()
- Using incorrect image sizes by skipping the processor
- Not moving tensors to the same device
- Setting the threshold too low and getting noisy results
When to use DINO
DINO is a good fit when:
- You want strong detection without heavy tuning
- You prefer transformer-based architectures
- You need a clean API for production or research
Final thoughts
Hugging Face removes most of the friction in running modern detection models like DINO. With a few lines of Python, you can get reliable object detection suitable for research, demos, or backend services.
This is an excerpt from Abdulvhab Shaikh's Object Detection with DINO using Hugging Face (Python) article. I highly recommend you give it a read!