Object detection output showing bounding boxes detected by DINO.

Object Detection with DINO using Hugging Face (Python)

DINO (DETR with Improved deNoising anchOr boxes) is a modern transformer-based object detector that delivers strong accuracy without hand-crafted post-processing such as non-maximum suppression. This article shows how to run it with Hugging Face in Python using a clean, practical setup.

Why DINO

Traditional object detection pipelines rely on multiple stages and hand-tuned heuristics such as anchor generation and non-maximum suppression. DINO simplifies this with a transformer-based architecture trained end to end. With Hugging Face, running it becomes straightforward and reproducible.

This setup is ideal for:

  • Prototyping detection pipelines
  • Research experiments
  • Backend AI services

What this guide covers

  1. Installing dependencies
  2. Loading a pretrained DINO model
  3. Running inference on an image
  4. Visualizing predictions
  5. Common pitfalls and tips

Install dependencies

Make sure you have Python 3.9+ installed.

pip install torch torchvision transformers pillow matplotlib

GPU is optional but recommended for faster inference.
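To confirm the install worked and check whether a GPU is visible, a quick sanity check (this assumes only that PyTorch installed correctly):

```python
import torch

# Print the installed PyTorch version and whether CUDA is usable.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```

If CUDA shows as unavailable on a GPU machine, the CPU-only wheel of torch is usually the culprit.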


Load the DINO model

Hugging Face's `DetrImageProcessor` and `DetrForObjectDetection` classes load DETR-family checkpoints, the architecture DINO builds on. Load both the processor and the model:

from transformers import DetrImageProcessor, DetrForObjectDetection
import torch

# Note: these classes expect a DETR checkpoint such as "facebook/detr-resnet-50".
# The Hub's "facebook/dino-*" checkpoints are self-supervised ViT backbones, not detectors.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

model.eval()  # disable dropout; use inference-mode behavior

This handles resizing, normalization, and tensor conversion automatically.


Prepare the input image

Load an image using Pillow:

from PIL import Image

image = Image.open("input.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

Move tensors to GPU if available:

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

Run inference

Run the model without gradient tracking:

with torch.no_grad():
    outputs = model(**inputs)

The raw outputs contain class logits and bounding boxes.


Post-process predictions

Convert model outputs to usable bounding boxes:

target_sizes = torch.tensor([image.size[::-1]]).to(device)

results = processor.post_process_object_detection(
    outputs,
    target_sizes=target_sizes,
    threshold=0.7
)[0]

Each result contains:

  • scores
  • labels
  • boxes (in pixel coordinates)
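Before plotting, it can help to print the detections in a readable form. The helper below is a hypothetical convenience, not part of the article's code; it formats the same scores/labels/boxes dict that post_process_object_detection returns, shown here with stand-in tensors:

```python
import torch

def format_detections(results, id2label):
    """Turn a post-processed results dict into human-readable lines."""
    lines = []
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        x0, y0, x1, y1 = box.tolist()
        lines.append(
            f"{id2label[label.item()]}: {score.item():.2f} "
            f"at [{x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f}]"
        )
    return lines

# Stand-in outputs with the same structure as the real `results`;
# in the article's pipeline, pass `results` and `model.config.id2label`.
results = {
    "scores": torch.tensor([0.98, 0.87]),
    "labels": torch.tensor([1, 17]),
    "boxes": torch.tensor([[10.0, 20.0, 110.0, 220.0],
                           [50.0, 60.0, 150.0, 160.0]]),
}
id2label = {1: "person", 17: "dog"}

for line in format_detections(results, id2label):
    print(line)
# person: 0.98 at [10, 20, 110, 220]
# dog: 0.87 at [50, 60, 150, 160]
```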

Visualize detections

Draw bounding boxes on the image:

import matplotlib.pyplot as plt

plt.imshow(image)
ax = plt.gca()

for score, label, box in zip(
    results["scores"],
    results["labels"],
    results["boxes"]
):
    x0, y0, x1, y1 = box.tolist()
    ax.add_patch(
        plt.Rectangle(
            (x0, y0),
            x1 - x0,
            y1 - y0,
            fill=False,
            color="red",
            linewidth=2
        )
    )
    ax.text(x0, y0, f"{model.config.id2label[label.item()]} {score:.2f}", color="white", bbox=dict(facecolor="red", alpha=0.5))

plt.axis("off")
plt.show()

This makes debugging and validation much easier.


Performance notes

  • DINO performs best on GPU.
  • Adjust the confidence threshold based on your use case.
  • Batch images when running inference at scale.
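On the batching point: the processor accepts a list of images and pads them to a common size. As a rough sketch of what that padding does (pure PyTorch, not the processor's actual implementation):

```python
import torch

def pad_batch(images):
    """Zero-pad a list of CHW tensors to the largest H and W, then stack.
    This mirrors, roughly, how DetrImageProcessor batches mixed-size images."""
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    batch = torch.zeros(len(images), images[0].shape[0], max_h, max_w)
    for i, img in enumerate(images):
        batch[i, :, : img.shape[1], : img.shape[2]] = img
    return batch

imgs = [torch.rand(3, 480, 640), torch.rand(3, 600, 800)]
print(pad_batch(imgs).shape)  # torch.Size([2, 3, 600, 800])
```

In practice you do not write this yourself: pass a list to the processor, e.g. processor(images=[img1, img2], return_tensors="pt"), and run the model once on the stacked batch.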

Common mistakes

  • Forgetting to call model.eval()
  • Using incorrect image sizes without processor
  • Not moving tensors to the same device
  • Setting threshold too low and getting noisy results
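The threshold point is easy to see with synthetic scores (hypothetical values, not real model output):

```python
import torch

scores = torch.tensor([0.95, 0.72, 0.31, 0.08])  # hypothetical confidences
for threshold in (0.3, 0.7):
    kept = int((scores > threshold).sum())
    print(f"threshold={threshold}: {kept} detections kept")
# threshold=0.3: 3 detections kept
# threshold=0.7: 2 detections kept
```

A low threshold keeps low-confidence boxes that are often spurious; start around 0.7 and tune per dataset.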

When to use DINO

DINO is a good fit when:

  • You want strong detection without heavy tuning
  • You prefer transformer-based architectures
  • You need a clean API for production or research

Final thoughts

Hugging Face removes most of the friction in running modern detection models like DINO. With a few lines of Python, you can get reliable object detection suitable for research, demos, or backend services.

This is an excerpt from Abdulvhab Shaikh's Object Detection with DINO using Hugging Face (Python) article. I highly recommend you give it a read!
