Deep Learning Method for Heliostat Instance Segmentation

Heliostat instance segmentation (HST-IS) is a crucial component of the heliostat tracking system at Heliogen’s Lancaster test facility. The system estimates the mirror normal of each heliostat by performing a nonlinear optimization-based fitting strategy using approximations of the non-shaded, non-blocked sunlit pixels on each heliostat, and the tracking system uses these estimates to improve performance. HST-IS is fundamentally challenging due to variability in lighting conditions and heliostat size relative to the capturing camera. Deep learning-based convolutional neural networks (CNNs) have emerged in recent years, demonstrating noteworthy precision in tasks such as object recognition, detection, and segmentation. CNN-based methods offer a robust augmentation to HST-IS methods as they capture a context-less hierarchy of image features. In this study, we developed deep learning models to automatically segment heliostat instances from elevated images taken from the field. We study various image parameters and architectural customizations to optimize for scalability, robustness, and accuracy in our predictions. We perform robust evaluations of our best model to quantify gaps between model development and real-world deployment and provide evidence for utility in the field.


Introduction

Related Work
HST-IS is useful in many concentrated solar power measuring contexts that require a pixel-precision understanding of heliostat positioning, particularly in soiling detection and calibration [1,2,3]. In many of these downstream applications, automation and a consistent recalibration cadence are crucial. This is due to the tendency for field conditions to change key parameters such as soiling conditions and heliostat orientation [4].
Several previous studies leverage computer vision techniques to study the properties of heliostats. Roger et al. developed an edge detection technique that extracts heliostat vertices to calculate the surface normal, demonstrating results comparable to manual photogrammetry at a fraction of the required time [5]. Coventry et al. leveraged computer vision thresholding methods to determine the soiling levels of heliostat mirrors from aerial imagery by segmenting mirrors from background pixels [6]. Ydrissi et al. designed and trained a convolutional neural network on images of soiled mirrors to optimize the task of predicting reflectivity, dust density, and cleanliness values [7]. Computer vision-based approaches for HST-IS frequently assume input images that feature a single heliostat, which are difficult to collect at scale [5,6,7]. This limitation highlights the value of generalizable, automated HST-IS approaches that require minimal context.

Heliostat Instance Segmentation
Heliostat instance segmentation (HST-IS) is a significant underlying process in the heliostat tracking system at the Lancaster test facility. This system includes a module called SOHOT [1] ("System for Observing Heliostat Orientations While Tracking") that estimates the mirror normal of each heliostat by performing a nonlinear optimization to find the mirror orientations that best fit radiance-proportional values for regions of interest (ROIs) corresponding to heliostats in images. These estimates are used to train per-heliostat kinematic models.
HST-IS from SOHOT imagery is fundamentally challenging for two main reasons. Firstly, shading and blocking patterns are inconsistent since heliostats appear in a variety of lighting conditions (Fig. 1). With these inconsistencies, naive patterns and shapes cannot be used to distinguish ROI pixels from non-ROI pixels. Secondly, sizes of heliostats in each image vary drastically due to variable distances from the capturing camera. These differences in scale present significant challenges to many standard computer vision methods that do not have parameters corresponding directly to an image's receptive field. In SOHOT, errors in ROIs contribute to errors in normal estimates, and existing ROI calculation techniques are susceptible to subtle errors in camera calibration and positional assumptions. These sources of error can render ROI calculation results inconsistent or inaccurate.

Deep Convolutional Neural Networks
Deep neural networks have advanced significantly in recent years, demonstrating success in many real-world prediction tasks and capturing the attention of several diverse scientific fields [8]. Neural networks are parameter-dense functions that are structured in a progressive hierarchy of layers to model complex sets of features from underlying inputs. Convolutional neural networks (CNNs), which are the basis for the models leveraged in this study, are neural networks designed specifically for image data. Parameters of CNNs correspond directly to progressively wider areas of an image's receptive field and learn low- and high-level features from local structures in the image to form a more linearly interpretable set of feature representations. CNNs have been applied successfully in many real-world contexts to perform diverse tasks such as object tracking, instance segmentation, and image classification [9,10,11].
Deep CNN-based models for instance segmentation have emerged in recent years with the development of foundational image models trained on several thousands of images of commonly encountered objects such as cars, balloons, pets, and houses [12]. These models have been applied in many challenging problem contexts. For example, models such as U-Net [13] and YOLOv3 [14] have been adapted for real-time segmentation of pedestrians and vehicles to service autonomous driving tasks [15].
In this paper, we design, develop, and test a novel deep learning-based approach to perform HST-IS from images taken of our Lancaster test facility by adapting the Mask R-CNN architecture proposed by He et al. [16]. We perform studies on the effect of data-centric parameters and architectural components on model performance and perform a stratified analysis of our best model with respect to object instance size to quantify gaps between model development and model deployment. Finally, we discuss the implications of our results for improving the robustness and scalability of HST-IS approaches using deep learning-based computer vision techniques and for offering a feasible solution to low-stakes, automated heliostat monitoring needs.

Dataset
Images of the full field were taken at six different camera positions in four distinct exposures for a collection of timestamps. For a given timestamp, images taken at different exposures for the same camera position were registered using phase correlation and subsequently stitched together into a resultant HDR image. Heliostat instance mask labels for each HDR image were generated using SOHOT's ROI generation process (Fig. 2) and used for the training and evaluation results in this paper [1]. Ablation studies were performed on several image-specific parameters relative to HST-IS performance, such as bit depth, image stitching technique, and sampling of seasonal, time-of-day periods. For our best model, HDR images from all six cameras were randomly sampled from a full-year period and full-day distribution of times.
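For readers unfamiliar with the registration step, the standard form of phase correlation is summarized below as a sketch. The symbols are assumptions introduced here and do not appear in the original text: F_1 and F_2 denote the 2D Fourier transforms of two exposures from the same camera position, R is the normalized cross-power spectrum, and (\Delta x, \Delta y) is the estimated translation. The sign of the recovered shift depends on which image is conjugated, and the exact variant used in this study is not specified.

```latex
R(u,v) = \frac{F_1(u,v)\,\overline{F_2(u,v)}}{\left|F_1(u,v)\,\overline{F_2(u,v)}\right|},
\qquad
r(x,y) = \mathcal{F}^{-1}\{R\}(x,y),
\qquad
(\Delta x, \Delta y) = \arg\max_{(x,y)} r(x,y)
```

Because R keeps only phase information, the peak location of r is largely insensitive to global brightness differences, which is one reason the technique is a common choice for aligning images taken at different exposures.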

Neural Network Architecture and Training
To address the unique set of challenges in HST-IS, we conducted ablation studies to tune hyperparameters and customize model subcomponent architectures in Mask R-CNN. Specifically, we study the choice of CNN backbone, depth of CNN backbone, choice of pretrained baseline, region proposal network (RPN) loss function, RPN thresholding parameters, RPN anchor parameters, number of linear layers in projection heads, random data augmentation strategies, and image preprocessing and postprocessing strategies. Each ablation was studied in the context of multiple evaluation benchmarks stratified by distinct heliostat size categories and separated between bounding box and mask generation performance. Average precision metrics at average recalls above 90% were evaluated alongside qualitative observations of predictions to draw conclusions. Our best model leverages a ResNet-50 feature pyramid network (FPN) backbone with pretrained ImageNet weights and is fine-tuned on 50K image tiles with a batch size and learning rate of 8 and 0.005, respectively. We use a momentum and weight decay of 0.9 and 0.0001, respectively, with a multistep learning-rate warmup scheduler. Furthermore, we leverage the multi-task training loss function defined by the original Mask R-CNN paper [16], which features a binary cross-entropy classification loss, binary cross-entropy mask loss, and smooth L1 bounding-box localization loss.
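As a reference for the loss terms named above, the multi-task loss from the Mask R-CNN paper [16] can be written as follows. This is a sketch in notation chosen here, not taken from the original text: p is a predicted probability, y the binary target, x a box-regression residual, and m the side length of the predicted mask. Any per-term weighting used in this study is not stated and is therefore omitted.

```latex
L = L_{\mathrm{cls}} + L_{\mathrm{box}} + L_{\mathrm{mask}}

% Binary cross-entropy, used for the classification term
\mathrm{BCE}(p, y) = -\bigl[\, y \log p + (1 - y) \log (1 - p) \,\bigr]

% Mask term: per-pixel binary cross-entropy averaged over an m x m mask
L_{\mathrm{mask}} = \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{BCE}(p_{ij}, y_{ij})

% Smooth L1, applied to each bounding-box regression residual x
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^2 & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```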

Evaluation Metrics
We evaluate model performance based on average precision metrics at both segmentation mask and bounding box levels at intersection-over-union (IOU) thresholds ranging from 0.5 to 0.75. The IOU between a prediction and its associated ground truth label is calculated by dividing the number of intersecting pixels between both objects by the number of pixels comprising the union of both objects (Fig. 4). We conform to definitions outlined in the Microsoft COCO object detection metric standards [17], in which predictions merged by a non-max suppression procedure and deemed to be associated with a ground truth label based on an IOU threshold are classified as correct instance predictions. The Microsoft COCO object detection metric standards are commonly used in deep learning-based object detection and instance segmentation studies [16,18,19].
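The IOU definition above can be illustrated with a minimal sketch. The function names `mask_iou` and `is_correct`, the nested-list mask representation, and the toy masks are assumptions made for illustration; they are not the evaluation code used in this study, which follows the COCO tooling [17].

```python
def mask_iou(pred, truth):
    """IOU of two binary masks given as equally sized nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1  # pixel belongs to both masks
            if p or t:
                union += 1  # pixel belongs to at least one mask
    # Two empty masks have no union; treat that case as zero overlap.
    return intersection / union if union else 0.0


def is_correct(pred, truth, threshold=0.5):
    """A prediction counts as correct when its IOU meets the threshold."""
    return mask_iou(pred, truth) >= threshold


# Toy 3x4 masks: two 2x2 squares that share one column of two pixels.
pred = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
truth = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 0, 0]]

print(mask_iou(pred, truth))         # 2 shared pixels out of 6 in the union
print(is_correct(pred, truth, 0.5))  # below the 0.5 threshold
```

The same ratio applies to bounding boxes by treating each box as a filled rectangular mask.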

Data Preparation and Post-Processing
Full-field images were split into 256 pixel x 256 pixel (256x256) image subcomponents overlapping by a fixed pixel threshold in each direction. Instances contained within each subcomponent were included as labels if their centroid was within a fixed pixel distance from each image boundary. This labeling filter step is included to conform to the modeling task of segmenting fully present heliostat instances. Horizontal and vertical flips were performed during training to improve model robustness. Following model instance segmentation, predicted instances were mapped to their associated trackers with a nearest neighbors-based algorithm using the centroids of predicted instances and centroids of assumed tracker locations. As in our preprocessing labeling filter step, instance predictions whose centroids were not within a fixed pixel distance from image boundaries were discarded to eliminate duplicate predictions for each tracker.
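The tiling, centroid filter, and tracker matching described above can be sketched as follows. This is an illustrative reading, not the study's implementation: the function names, the 32-pixel overlap, and the 16-pixel margin are hypothetical values (the text only says "fixed"), the centroid filter is interpreted as keeping centroids that lie at least `margin` pixels inside every tile edge, and the matching is a brute-force nearest-neighbor search.

```python
import math


def tile_origins(length, tile=256, overlap=32):
    """Start offsets of tiles covering `length` pixels along one axis.

    `overlap` is a hypothetical value; the source only states it is fixed.
    """
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] != length - tile:
        origins.append(length - tile)  # final tile sits flush with the edge
    return origins


def keep_instance(centroid, tile_origin, tile=256, margin=16):
    """True if the centroid lies at least `margin` pixels inside the tile.

    `margin` is a hypothetical value standing in for the fixed distance.
    """
    cx, cy = centroid
    ox, oy = tile_origin
    inside_x = margin <= cx - ox < tile - margin
    inside_y = margin <= cy - oy < tile - margin
    return inside_x and inside_y


def match_to_trackers(pred_centroids, tracker_centroids):
    """Index of the nearest assumed tracker location for each prediction."""
    matches = []
    for px, py in pred_centroids:
        nearest = min(
            range(len(tracker_centroids)),
            key=lambda i: math.hypot(px - tracker_centroids[i][0],
                                     py - tracker_centroids[i][1]),
        )
        matches.append(nearest)
    return matches


print(tile_origins(600))                       # tile offsets along a 600-pixel axis
print(keep_instance((234, 100), (0, 0)))       # centroid well inside the first tile
print(keep_instance((234, 100), (224, 0)))     # same centroid, too close to the next tile's edge
print(match_to_trackers([(10, 10), (100, 100)], [(98, 101), (12, 9)]))
```

With these example values the overlap equals twice the margin, so for regularly strided tiles each centroid in an overlap region is kept by exactly one tile, which is the duplicate-suppression behavior the text describes.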

Results
Our best model achieves an average bounding box and segmentation precision of 93.7% and 92.7%, respectively, averaged over all IOU thresholds at a 90% average recall, compared against ROIs generated by the previous technique. Throughout the course of our ablation studies, we discovered unique training paradigms specific to HST-IS. Firstly, we find that training with a wide distribution of region proposal network (RPN) anchor size parameters, which correspond in proportion to the diverse image footprints of captured heliostats, improves robustness to object scale variance. We demonstrate successful segmentation of heliostats that occupy a diverse set of image footprints by tuning the aforementioned RPN anchor size parameters (Fig. 6). Secondly, we find that training our model with a pretrained ResNet-50 FPN backbone results in superior overall performance compared to training with other object detection backbone model architectures. Lastly, we find that aggregating top-performing dataset parameters, such as introducing random data augmentation in training, representing our images in a high dynamic range, and selecting from a wide distribution of seasonal, time-of-day periods, yields a 10% improvement in average precision compared to model baselines.

Discussion and Future Work
In this study, we demonstrate the capability of deep learning models to perform instance segmentation of heliostats from images taken of our Lancaster test facility. Evidence of the approach's robustness is indicated by the high average precision of generated instance bounding boxes and masks relative to ground truth labels produced by our existing SOHOT module. In addition, we qualitatively observe robust instance segmentation performance on images that contain heliostats at varying distances from the capturing camera, suggesting that our model adapts to object size and scale (Fig. 6). Within our manual inspection of model predictions, we observe evidence of mask generation that distinguishes non-blocked and non-shaded heliostat pixels from blocked or shaded heliostat pixels (Fig. 7). Although the results of this distinction among model predictions are inconsistent, they indicate promise in the modelling approach's ability to interpret higher-level image features and reconcile a nuanced task.
There are limitations of our study that we hope to address in future work. Firstly, training our deep learning model requires several hundreds of images and associated labels constituting diverse training, validation, and test sets. Acquiring such volumes of imagery and detailed instance labels can be time-intensive, complex, and expensive. In our study, our Lancaster test facility and SOHOT module are configured to naturally satisfy these requirements, which provides a unique advantage. Secondly, the labels used in our training dataset were generated using our SOHOT module, which is known to be associated with a margin of error due to positional and mathematical assumptions. Relative to human-annotated labels, the extent to which these inaccuracies in dataset labels affect the performance of the model is unclear and will be a subject of future work. Despite this limitation, our model offers a standalone solution that can produce predictions similar to those of our SOHOT module without any contextual field information. This capability offers significant value in ROI calibration. Thirdly, the results of our study are closely associated with field conditions in our Lancaster test facility. We anticipate that future challenges will arise for our modelling approach when applied to fields in which heliostats are more distant from the capturing camera, more numerous in count, or positioned more compactly. In future work, we aim to better simulate these conditions and quantify these gaps with the generation of synthetic training data using 3D rendering tools.

Conclusion
In this paper, we demonstrate a deep CNN-based approach to performing heliostat instance segmentation on full CSP field images. We study architectural and data-centric parameters to optimize model performance and provide evidence of robustness to object scale variance and semantic understanding of shaded pixel distinction. The method shows promise in being

Figure 1. Graphical summary illustrating the process of the deep learning HST-IS method.

Figure 2. Examples of images in which heliostat segmentation proves challenging due to lighting conditions (right) and differences in heliostat scale and orientation (left, middle).

Figure 3. Example of an HDR-stitched full-field image (left) and its mask-overlay counterpart (right).

Figure 4. Illustration of prediction-object pairs at various associated IOU thresholds.

Figure 5. Examples of predicted HST-IS masks and their original counterparts at various positions of the field. As seen, predictions are robust across front (A), middle (C, D), and back (B) areas of the field. Evidence of semantic understanding of shading and blocking constraints is observed as well (A).

Figure 6. Examples of predicted HST-IS masks and their original counterparts selected from closer (A) to farther (C) distances from the capturing camera. Consistent results provide evidence that our instance segmentation model is robust to variance in object size and scale relative to the component image.

Figure 7. Examples of predicted HST-IS masks and their original counterparts selected from images that contain significant shading and blocking.Our instance segmentation model is able to distinguish blocked heliostat pixels (B, C) and account for non-aiming heliostats (A).

Figure 8. Examples of predicted HST-IS masks and their original counterparts selected from images taken of the farthest sections of the field.Due to loss in image context and fewer pixels making up each heliostat instance at greater distances, performing HST-IS on these sections of the field tends to be more challenging for traditional optimization and computer vision approaches.