Object detection output showing bounding boxes detected by DINO.

Object Detection with DINO using Hugging Face (Python)

DINO (DETR with Improved deNoising anchOr boxes) is a modern transformer-based object detector that delivers strong accuracy without hand-crafted post-processing such as non-maximum suppression. This article shows how to run it with Hugging Face in Python using a clean, practical setup.

Why DINO

Traditional object detection pipelines rely on multiple stages and hand-tuned heuristics such as anchor generation and non-maximum suppression. DINO simplifies this with a transformer-based architecture trained end to end. With Hugging Face, running it becomes straightforward and reproducible.

This setup is ideal for:

  • Prototyping detection pipelines
  • Research experiments
  • Backend AI services

What this guide covers

  1. Installing dependencies
  2. Loading a pretrained DINO model
  3. Running inference on an image
  4. Visualizing predictions
  5. Common pitfalls and tips

Install dependencies

Make sure you have Python 3.9+ installed.

pip install torch torchvision transformers pillow matplotlib

GPU is optional but recommended for faster inference.
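To confirm the install worked and check whether a GPU is visible, a quick sanity check (this assumes only that PyTorch installed correctly):

```python
import torch

# Print the installed PyTorch version and whether CUDA is usable.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```

If CUDA shows as unavailable on a GPU machine, the CPU-only wheel of torch is usually the culprit.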


Load the DINO model

Hugging Face's `DetrImageProcessor` and `DetrForObjectDetection` classes load DETR-family checkpoints, the architecture DINO builds on. Load both the processor and the model:

from transformers import DetrImageProcessor, DetrForObjectDetection
import torch

# Note: these classes expect a DETR checkpoint such as "facebook/detr-resnet-50".
# The Hub's "facebook/dino-*" checkpoints are self-supervised ViT backbones, not detectors.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

model.eval()  # disable dropout; use inference-mode behavior

This handles resizing, normalization, and tensor conversion automatically.


Prepare the input image

Load an image using Pillow:

from PIL import Image

image = Image.open("input.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

Move tensors to GPU if available:

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

Run inference

Run the model without gradient tracking:

with torch.no_grad():
    outputs = model(**inputs)

The raw outputs contain class logits and bounding boxes.


Post-process predictions

Convert model outputs to usable bounding boxes:

target_sizes = torch.tensor([image.size[::-1]]).to(device)

results = processor.post_process_object_detection(
    outputs,
    target_sizes=target_sizes,
    threshold=0.7
)[0]

Each result contains:

  • scores
  • labels
  • boxes (in pixel coordinates)
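Before plotting, it can help to print the detections in a readable form. The helper below is a hypothetical convenience, not part of the article's code; it formats the same scores/labels/boxes dict that post_process_object_detection returns, shown here with stand-in tensors:

```python
import torch

def format_detections(results, id2label):
    """Turn a post-processed results dict into human-readable lines."""
    lines = []
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        x0, y0, x1, y1 = box.tolist()
        lines.append(
            f"{id2label[label.item()]}: {score.item():.2f} "
            f"at [{x0:.0f}, {y0:.0f}, {x1:.0f}, {y1:.0f}]"
        )
    return lines

# Stand-in outputs with the same structure as the real `results`;
# in the article's pipeline, pass `results` and `model.config.id2label`.
results = {
    "scores": torch.tensor([0.98, 0.87]),
    "labels": torch.tensor([1, 17]),
    "boxes": torch.tensor([[10.0, 20.0, 110.0, 220.0],
                           [50.0, 60.0, 150.0, 160.0]]),
}
id2label = {1: "person", 17: "dog"}

for line in format_detections(results, id2label):
    print(line)
# person: 0.98 at [10, 20, 110, 220]
# dog: 0.87 at [50, 60, 150, 160]
```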

Visualize detections

Draw bounding boxes on the image:

import matplotlib.pyplot as plt

plt.imshow(image)
ax = plt.gca()

for score, label, box in zip(
    results["scores"],
    results["labels"],
    results["boxes"]
):
    x0, y0, x1, y1 = box.tolist()
    ax.add_patch(
        plt.Rectangle(
            (x0, y0),
            x1 - x0,
            y1 - y0,
            fill=False,
            color="red",
            linewidth=2
        )
    )
    ax.text(x0, y0, f"{model.config.id2label[label.item()]} {score:.2f}", color="white", bbox=dict(facecolor="red", alpha=0.5))

plt.axis("off")
plt.show()

This makes debugging and validation much easier.


Performance notes

  • DINO performs best on GPU.
  • Adjust the confidence threshold based on your use case.
  • Batch images when running inference at scale.
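On the batching point: the processor accepts a list of images and pads them to a common size. As a rough sketch of what that padding does (pure PyTorch, not the processor's actual implementation):

```python
import torch

def pad_batch(images):
    """Zero-pad a list of CHW tensors to the largest H and W, then stack.
    This mirrors, roughly, how DetrImageProcessor batches mixed-size images."""
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    batch = torch.zeros(len(images), images[0].shape[0], max_h, max_w)
    for i, img in enumerate(images):
        batch[i, :, : img.shape[1], : img.shape[2]] = img
    return batch

imgs = [torch.rand(3, 480, 640), torch.rand(3, 600, 800)]
print(pad_batch(imgs).shape)  # torch.Size([2, 3, 600, 800])
```

In practice you do not write this yourself: pass a list to the processor, e.g. processor(images=[img1, img2], return_tensors="pt"), and run the model once on the stacked batch.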

Common mistakes

  • Forgetting to call model.eval()
  • Using incorrect image sizes without processor
  • Not moving tensors to the same device
  • Setting threshold too low and getting noisy results
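The threshold point is easy to see with synthetic scores (hypothetical values, not real model output):

```python
import torch

scores = torch.tensor([0.95, 0.72, 0.31, 0.08])  # hypothetical confidences
for threshold in (0.3, 0.7):
    kept = int((scores > threshold).sum())
    print(f"threshold={threshold}: {kept} detections kept")
# threshold=0.3: 3 detections kept
# threshold=0.7: 2 detections kept
```

A low threshold keeps low-confidence boxes that are often spurious; start around 0.7 and tune per dataset.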

When to use DINO

DINO is a good fit when:

  • You want strong detection without heavy tuning
  • You prefer transformer-based architectures
  • You need a clean API for production or research

Final thoughts

Hugging Face removes most of the friction in running modern detection models like DINO. With a few lines of Python, you can get reliable object detection suitable for research, demos, or backend services.

This is an excerpt from Abdulvhab Shaikh's Object Detection with DINO using Hugging Face (Python) article. I highly recommend you give it a read!
