
Thermal Hotspots Detection in Photovoltaic Modules

Computer Vision January 26, 2026

Abstract

Thermal hotspots in photovoltaic modules are a leading cause of performance loss, accelerated degradation, and safety concerns, making timely and scalable detection essential for modern solar farm operations. This article presents a hybrid approach that combines classical computer vision with deep learning to analyze radiometric thermal imagery, demonstrating that complementary methods can be orchestrated to balance interpretability, robustness, and speed. The proposed workflow emphasizes rapid prototyping and modularity, enabling practitioners to assemble a practical end-to-end pipeline—from data capture and preprocessing through candidate generation, classification, and reporting—using readily available tools and modest datasets.

The Problem

Thermal hotspots in photovoltaic (PV) modules occur when individual cells or connections exhibit abnormally high temperatures due to defects, shading, or degradation. These localized overheating events cause energy to dissipate as heat rather than electricity, reducing module efficiency and accelerating material degradation.

A single hotspot can reduce a panel's output by 10–20%, and prolonged exposure may lead to permanent cell damage or, in severe cases, fire hazards. As utility-scale solar installations grow to encompass thousands of modules, manual inspection with handheld thermal cameras becomes impractical—both in terms of time and cost.

Automated Inspection

Unmanned aerial vehicles (UAVs) equipped with thermal cameras can survey large installations rapidly, while computer vision algorithms enable real-time detection and localization of thermal anomalies—providing a scalable solution for predictive maintenance.

Key Takeaways

  • YOLO11 outperforms RT-DETR for thermal hotspot detection—CNN-based approaches handle localized features better than transformers for this task
  • Edge-ready models work—YOLO11-N runs at 6+ FPS on CPU with less than 10MB storage, perfect for drone deployment
  • Data augmentation matters—perspective transforms expanded our dataset from 934 to 3,472 images
  • Smaller models are viable—performance gap between model sizes is small, so deployment constraints can drive model selection

How It Works

Our research pipeline consists of two phases: training and inference evaluation.

Training Pipeline

1. Dataset: public thermal dataset with hotspot labels
2. Augment: random perspective transformations
3. Train: YOLO11 & RT-DETR models
4. Evaluate: mAP, precision, and recall metrics

Inference Pipeline

1. Test Images: held-out thermal images for evaluation
2. Inference: run on CPU or GPU hardware
3. Benchmark: measure timings & resource usage

Data Augmentation: Perspective Transforms

Since drone cameras capture panels from varying angles and altitudes, we simulate these conditions using homography-based perspective transformations. This technique warps images as if viewed from different positions while correctly transforming the bounding box labels—expanding our training dataset nearly 4x without collecting new images.

Three perspectives of the same image generated through homography transforms, with labels shown as bright rectangles.
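
The snippet below is a minimal sketch of this kind of augmentation using OpenCV and NumPy (not our exact implementation; the function name and jitter range are illustrative). The same homography warps the image and the bounding-box corners, and a new axis-aligned box is then fitted around the warped corners.

# Illustrative homography-based perspective augmentation (not the exact
# project code): warp a thermal image and its boxes with one random homography.
import cv2
import numpy as np

def random_perspective(image, boxes, max_shift=0.15, seed=None):
    """Warp `image` and its [x1, y1, x2, y2] `boxes` with a random homography."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    # Jitter each corner by up to max_shift of the image dimensions.
    dst = (src + rng.uniform(-max_shift, max_shift, (4, 2)) * [w, h]).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(image, H, (w, h))

    new_boxes = []
    for x1, y1, x2, y2 in boxes:
        corners = np.float32([[x1, y1], [x2, y1], [x2, y2], [x1, y2]]).reshape(-1, 1, 2)
        warped_corners = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
        # Re-fit an axis-aligned box around the warped corners, clipped to the image.
        x_min, y_min = warped_corners.min(axis=0).clip(0, [w, h])
        x_max, y_max = warped_corners.max(axis=0).clip(0, [w, h])
        new_boxes.append([x_min, y_min, x_max, y_max])
    return warped, new_boxes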

The Models

We evaluated two fundamentally different approaches to object detection: a fast CNN-based detector and a transformer-based architecture with global attention.

YOLO11

"You Only Look Once"—processes the entire image in a single pass through the network.

  • Architecture: CNN-based (Backbone → Neck → Head)
  • Strength: Speed and efficiency
  • Key feature: Anchor-free detection
  • Variants tested: Nano, Small, Medium, Large, X-Large
Best for: Edge deployment, real-time processing, resource-constrained devices

RT-DETR

Real-Time Detection Transformer—uses attention mechanisms to understand global context.

  • Architecture: Hybrid encoder + transformer decoder
  • Strength: Global context understanding
  • Key feature: NMS-free (no post-processing)
  • Variants tested: Large, X-Large
Best for: Complex scenes, when context matters, GPU-equipped systems

Why test both? Hotspots are simple, localized features (bright spots on panels), but they require context to distinguish from reflections or normal variations. We wanted to see if transformers' global attention would help—spoiler: for this task, CNNs win.

Training

All models were trained for 100 epochs on the augmented dataset using an NVIDIA RTX 3090 GPU. Training time varied significantly between architectures, with YOLO11 variants generally training faster than their RT-DETR counterparts.
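
For reference, a minimal sketch of this training setup through the Ultralytics Python API is shown below; the dataset YAML path, image size, and device index are placeholders rather than our exact configuration.

# Minimal training sketch with the Ultralytics API (paths and
# hyperparameters are placeholders, not the exact project settings).
from ultralytics import RTDETR, YOLO

# CNN-based detector; the nano variant is shown, s/m/l/x follow the same pattern.
yolo = YOLO("yolo11n.pt")
yolo.train(data="hotspots.yaml", epochs=100, imgsz=640, device=0)

# Transformer-based detector, trained identically for a fair comparison.
rtdetr = RTDETR("rtdetr-l.pt")
rtdetr.train(data="hotspots.yaml", epochs=100, imgsz=640, device=0)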

Training Time Analysis

Training time comparison across all models (100 epochs).

Training Insights

  • YOLO11-N trains in just 1.4 hours—ideal for rapid prototyping and iteration
  • RT-DETR-X requires 10.8 hours—the transformer architecture demands significantly more compute
  • Time per epoch scales with model size—from 0.6 min (YOLO11-N) to 5.3 min (RT-DETR-X)

Results

Performance Comparison

YOLO11 consistently outperformed RT-DETR across all metrics. The best overall model was YOLO11-M, while YOLO11-N offers the best efficiency for edge deployment.

Model       mAP50    mAP50-95   Precision   Recall   Size
YOLO11-M    66.2%    30.7%      65.8%       70.1%    ~40 MB
YOLO11-N    53.7%    28.0%      49.1%       67.1%    <10 MB
YOLO11-S    62.5%    29.3%      63.8%       71.5%    ~20 MB
RT-DETR-L   51.2%    23.0%      49.1%       71.5%    ~65 MB
RT-DETR-X   48.9%    21.8%      53.7%       67.7%    ~120 MB

The charts below break down these metrics visually. mAP50-95 (mean Average Precision averaged across IoU thresholds from 0.50 to 0.95) provides the strictest measure of localization accuracy, while mAP50 evaluates detection at a more lenient threshold. Together with Precision (how many detections are correct) and Recall (how many hotspots are found), these metrics reveal that YOLO11 variants consistently outperform RT-DETR.
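
These metrics can be read back from a trained checkpoint with an Ultralytics validation run along the lines of the sketch below (the weight and dataset paths are placeholders).

# Sketch: reading the reported metrics from a validation run
# (weight and dataset paths are placeholders).
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
metrics = model.val(data="hotspots.yaml", split="test")

print(f"mAP50:     {metrics.box.map50:.3f}")  # detection quality at IoU 0.50
print(f"mAP50-95:  {metrics.box.map:.3f}")    # averaged over IoU 0.50-0.95
print(f"Precision: {metrics.box.mp:.3f}")     # mean precision across classes
print(f"Recall:    {metrics.box.mr:.3f}")     # mean recall across classes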

Model Performance: RT-DETR vs YOLO11

Performance metrics across all evaluated models.

The scatter plot visualizes the precision/recall trade-off. Precision measures how many detected hotspots are actually real (avoiding false alarms), while Recall measures how many actual hotspots are found (avoiding missed detections). Models in the upper-right quadrant achieve the best balance. YOLO11-S and YOLO11-M cluster in this optimal region, while RT-DETR variants show lower precision despite comparable recall.

Precision vs Recall trade-off. Models in the upper-right quadrant achieve the best balance.

The grouped bar chart provides a comprehensive side-by-side comparison. A key finding is that the performance gap between YOLO11 variants is relatively small, suggesting that deployment considerations—such as inference speed and storage footprint—may be more important than raw accuracy differences when selecting a model.

Comprehensive view of all metrics simultaneously for each model.

Detection Performance Summary

  • YOLO11 outperforms RT-DETR for thermal hotspot detection across all metrics
  • YOLO11-M achieves the best accuracy with 66.2% mAP50 and 30.7% mAP50-95
  • RT-DETR's transformer architecture does not provide advantages for this task's simple, localized features

Speed & Efficiency

The transition from CPU to GPU inference yields significant performance gains across all models. The GPU speedup factor ranges from approximately 10× for the lightweight YOLO11-N to over 45× for larger models like YOLO11-X. While RT-DETR and YOLO11-X achieve high accuracy, their mean inference time on CPU exceeds 2,500 ms, making them unsuitable for real-time video processing without a dedicated GPU.
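
A simple way to reproduce this kind of comparison with the Ultralytics API is sketched below; the test-image directory is a placeholder, and each result object exposes per-stage timings in milliseconds.

# Sketch: comparing mean per-image inference time on CPU vs GPU
# (the test-image directory is a placeholder).
from ultralytics import YOLO

model = YOLO("yolo11n.pt")

for device in ("cpu", 0):  # 0 = first CUDA GPU
    results = model.predict("test_images/", device=device, verbose=False)
    # Each Result carries a `speed` dict with preprocess/inference/postprocess times (ms).
    mean_ms = sum(r.speed["inference"] for r in results) / len(results)
    print(f"{device}: {mean_ms:.1f} ms per image (~{1000 / mean_ms:.1f} FPS)")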

Inference Performance: CPU vs GPU

Inference performance and memory footprint comparison.

Conversely, YOLO11-N maintains a frame rate above 5 FPS on CPU, providing a viable path for deployment on low-power devices lacking hardware acceleration.

Efficiency Summary

  • YOLO11-N is optimal for edge deployment with <10 MB storage and >5 FPS on CPU
  • GPU acceleration provides 10–45× speedup, making larger models viable for server deployments
  • RT-DETR models are slower on CPU (1,400–2,700 ms), making GPU acceleration essential for real-time use

Can I Run This?

Deployment Scenario    Recommended Model       Requirements
Drone (edge device)    YOLO11-N                CPU only, <1 GB RAM, <10 MB storage
Laptop/workstation     YOLO11-S or YOLO11-M    Any modern CPU, or GPU for real-time
Server/cloud           YOLO11-M or YOLO11-L    GPU recommended, ~0.3 GB VRAM
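
For the drone scenario, the trained checkpoint can be exported to a lightweight runtime format before deployment; the sketch below uses ONNX as one common choice, and the weight path is a placeholder.

# Sketch: exporting the nano model for edge deployment (ONNX shown as an example).
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
model.export(format="onnx", imgsz=640)  # writes an .onnx file next to the weights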

Live Demo

Watch one of our trained models in action. This video shows the YOLO11-N variant processing thermal footage of a solar farm, with detected hotspots highlighted in real time.
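
A minimal sketch of this kind of video inference is shown below; the weight and video file names are hypothetical.

# Sketch: running the nano model on thermal drone footage and showing
# annotated frames (weight and video file names are hypothetical).
import cv2
from ultralytics import YOLO

model = YOLO("yolo11n_hotspots.pt")  # hypothetical fine-tuned weights

for result in model.predict("solar_farm_thermal.mp4", stream=True, conf=0.25):
    frame = result.plot()  # draw detected hotspot boxes on the frame
    cv2.imshow("Hotspot detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cv2.destroyAllWindows()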

Try It Yourself

Upload your own thermal images and see the model detect hotspots in real-time.

View source files on Hugging Face

Citation

If you use this work in your research, please cite it as follows:

@misc{valda2026thermal,
  title        = {Thermal Hotspots Detection in Photovoltaic Modules},
  author       = {Valda-Pe{\~n}aranda, Jos{\'e} D. and
                  Gaona, Alvaro J. and
                  G{\'o}mez-Velayos, Lucas and
                  Pasuy-Quevedo, Peter and
                  {\'A}lvarez-Monteagudo, Pedro},
  year         = {2026},
  month        = {January},
  howpublished = {Course project, Visi{\'o}n por Computador, Universidad Polit{\'e}cnica de Madrid},
  note         = {Available at: \url{https://github.com/alvgaona/vxc}}
}

Acknowledgements

We would like to thank Prof. Ramón Suárez Fernández and the Universidad Politécnica de Madrid for their guidance throughout this project.