Detecting Trophozoites

Winning the Zindi Lacuna Malaria Challenge

Introduction

Firstly, we would like to thank Zindi and the Lacuna Fund for hosting this competition. We had a lot of fun doing this, and learned a lot. We hope our model has a positive impact on the world.

Despite modern medical advances, malaria diagnostics in resource-constrained regions still rely heavily on manual microscopic examination - a process bottlenecked by the scarcity of trained technicians. The Lacuna Malaria Detection Challenge, hosted on Zindi, aimed to address this problem. The practical implications are substantial: this technology can serve as a screening tool in high-volume settings, potentially enabling earlier intervention in regions with limited access to trained microscopists.

Our solution achieved strong and consistent performance, with scores of 0.928 and 0.925 on the public and private leaderboards respectively.

In this post, we'll detail our technical approach, from initial data preprocessing challenges through model architecture decisions to our final ensemble strategy. We'll cover:

  • An overview of the problem and evaluation metric

  • Our analysis of the provided image dataset

  • Our model architecture choices and their empirical justification

  • Key optimizations that improved model performance

  • Observations and methods that gave us a winning edge

  • Technical limitations and areas for future improvement

Problem Overview

Before digging into the evaluation metric of this challenge, it is important to understand what bounding boxes are and how they relate to this problem.

Bounding boxes are rectangular regions defined by four coordinates: (x_min, y_min) for the top-left corner and (x_max, y_max) for the bottom-right corner. Think of them as drawing a rectangle around an object of interest - in our case, either a malaria parasite (trophozoite) or a healthy blood cell - that completely encloses the object while staying as tight as possible to its boundaries. An example of bounding boxes is presented below:
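In code, a single annotation of this kind is often represented as just four pixel coordinates plus a class label; the values below are made up purely for illustration:

```python
# (x_min, y_min, x_max, y_max) in pixel coordinates - illustrative values only
box = (412, 230, 455, 271)   # a tight rectangle around one cell
label = "Trophozoite"        # or "WBC" for a white blood cell
```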

Zindi provided us with mean Average Precision (mAP) as our evaluation metric, specifically mAP@0.5. This metric is the standard for object detection tasks, and understanding it was crucial for optimizing our solution.

Mean Average Precision combines two critical aspects of object detection:

  1. How well we locate objects (how accurately the predicted bounding boxes are placed)

  2. How accurate we are at classifying what we've found (infected vs. healthy cells)

The "@0.5" in mAP@0.5 refers to an Intersection over Union (IoU) threshold of 0.5. IoU measures how well a predicted bounding box aligns with the ground-truth box: it is the area where the two rectangles overlap divided by the total area they cover together. An IoU of at least 0.5 - meaning the overlap makes up at least half of the combined area of the two boxes - is required for a prediction to count as a correct detection.
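As a minimal sketch (not the competition's official scoring code), IoU for two axis-aligned boxes in (x_min, y_min, x_max, y_max) format can be computed like this:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: two partially overlapping boxes
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.14, well below the 0.5 threshold
```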

Digging deeper, each predicted bounding box is counted as a true positive (TP, a correct prediction) or a false positive (FP, a prediction that does not correspond to an actual object), and each missed ground-truth object counts as a false negative (FN). A prediction is a true positive only if it meets two criteria: its IoU with a ground-truth box is greater than 0.5, and its class prediction is correct.

We can now calculate precision and recall, two fundamental metrics used to evaluate the performance of predictive models in machine learning and computer vision.

Precision measures how accurate our positive predictions are by calculating the ratio of correct positive predictions (true positives) to all positive predictions made (true positives + false positives). The formula is given by:

Precision = TP / (TP + FP)

Recall measures how well we identify all relevant cases by calculating the ratio of correct positive predictions (true positives) to all actual positive cases (true positives + false negatives). The formula is given by:

Recall = TP / (TP + FN)

There exists an important inverse relationship between precision and recall, known as the precision-recall trade-off. When we adjust our model to achieve higher precision, we see a decrease in recall. Conversely, when we optimize for higher recall, precision tends to decrease. Making a system more sensitive to catch all true cases (avoiding false negatives) leads to more false alarms (false positives) and vice versa. This relationship demonstrates that improving one metric often comes at the expense of the other, requiring careful consideration of which metric is more important for your specific use case.

The mAP equation is given by:

mAP = (1/N) · Σ_c AP_c

where AP_c is the average precision for class c and N is the number of classes.

Each per-class AP can actually be visualised as the area under that class's precision-recall curve! This is vitally important for understanding what matters in this competition, and how confidence values and thresholds affect our score.

Consequently, the objective of this competition can be thought of as maximising the area under this precision-recall curve, while classifying the objects in each image as accurately as possible.
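To make this concrete, here is a simplified sketch of how average precision for a single class could be computed: predictions are sorted by confidence, each is marked TP or FP against the ground truth (using an IoU check like the one above), and the area under the resulting precision-recall curve is accumulated. Real evaluators (e.g. COCO-style tools) add interpolation and matching details, so this is only illustrative:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class.

    scores : confidence of each prediction
    is_tp  : 1 if the prediction matched a ground-truth box (IoU > 0.5,
             correct class, not already matched), else 0
    num_gt : total number of ground-truth boxes for this class
    """
    order = np.argsort(-np.asarray(scores))        # highest confidence first
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp

    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / (cum_tp + cum_fp)

    # Step-wise integration of precision over recall
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# mAP@0.5 is then the mean of these per-class AP values.
```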

The Dataset

It quickly became apparent that the data annotation quality required improvement. Many samples appear to have incomplete annotations, with approximately 80% of trophozoites left unlabelled based on our assessment.

We also noticed that some samples (specifically, those with resolution 3120x4160) never contained any WBC labels. As a result, most of our efforts were directed towards working around these problems with the dataset.

However, since relabelling the images ourselves was not feasible and would certainly have introduced a distribution shift relative to the test data, we were hesitant to make any major changes.

Finally, there was a mistake in how the negative training data was gathered: the NEG images come from a different distribution than the training images.

All of the images labelled NEG have the same image size and a similarly sized black edge around the image, something that never appears in the actual training data. This meant that one could take any YOLO image classification model, train it for a single epoch, and get 100% accuracy at determining whether an image is NEG or not.
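For instance, a one-epoch Ultralytics classification run along these lines would be enough to separate NEG from non-NEG images (the dataset path and folder layout here are hypothetical):

```python
from ultralytics import YOLO

# Hypothetical folder layout: neg_vs_train/{train,val}/{NEG,OTHER}/*.jpg
model = YOLO("yolov8n-cls.pt")                  # small pretrained classification model
model.train(data="neg_vs_train", epochs=1, imgsz=224)

# The black border gives the NEG images away, so even after a single
# epoch the validation accuracy is effectively 100%.
metrics = model.val()
print(metrics.top1)
```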

Data Preprocessing

To help our models generalize better, we considered preprocessing the dataset. However, since this is box-annotated image data, we did not find many consistently working preprocessing steps.

We also looked into removing some of the worst-annotated images; however, doing this dramatically increased the variability of both our local and public scores. It would later turn out that this model scored poorly on the private dataset (about 0.915), even though it was our top submission for a while. We decided not to pursue this approach because of the distribution shift it introduced, which turned out to be the right call.

One thing we did end up doing for the final submission was removing the doubly labelled boxes in the training dataset, of which there were about 60.
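If these duplicates are exact copies of the same annotation row, a one-liner along these lines is enough (the file and column names are assumptions about the annotation CSV layout); near-duplicates would instead need an IoU check like the one shown earlier:

```python
import pandas as pd

train = pd.read_csv("Train.csv")  # hypothetical annotation file

# Drop boxes that are labelled twice with identical class and coordinates
deduped = train.drop_duplicates(subset=["Image_ID", "class", "xmin", "ymin", "xmax", "ymax"])
print(len(train) - len(deduped), "duplicate boxes removed")
```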

Implementation

Model Selection Process

We spent a lot of time trying various models in order to find the one that generalized best with noisy data.

Firstly, we tried a segmentation approach, the idea being that segmentation in theory gives the model more information to learn from. However, it is also a much harder approach, and since the training data consisted only of bounding boxes - with no masks following the actual cell edges - we gave up on it in favour of object detection.

For this task, we experimented with three models: Faster R-CNN, DETR and Ultralytics YOLO. We kept trying to improve all three in the hope of ensembling them at the end; however, Faster R-CNN failed to keep up with the other two approaches, which led to us dropping it.

We tried several different YOLO models, from YOLOv5 to YOLOv10. Of these, YOLOv8l performed the best. However, during the competition YOLO11 was released, and it turned out to be the best-performing model.

Other YOLO models (especially YOLOv9) would more often choose boxes that were too big, which we speculate is due to the poor annotation quality; YOLO11 handled this significantly better. Because of the small amount of training data, we swept heavily over augmentations and model sizes. From this sweeping we concluded that YOLO11m with relatively heavy augmentations was optimal (any bigger model ran into more overfitting).

The final important step was to set the confidence threshold to 0, as the Ultralytics default is higher. This matters because low-confidence boxes can still provide new data points on the precision-recall curve and thus increase the mAP@0.5 score.
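A sketch of the kind of Ultralytics training and inference calls this involves; the specific hyperparameter values and file names below are illustrative, not our final sweep results:

```python
from ultralytics import YOLO

# Train YOLO11-medium with relatively heavy augmentation
# (values here are placeholders, not our exact sweep results).
model = YOLO("yolo11m.pt")
model.train(
    data="malaria.yaml",        # hypothetical dataset config
    epochs=100,
    imgsz=1024,
    degrees=15,                 # rotation
    scale=0.5,                  # random scaling
    fliplr=0.5, flipud=0.5,     # horizontal / vertical flips
    mosaic=1.0,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,
)

# At inference time, keep essentially every box: low-confidence detections
# still add points to the precision-recall curve and can raise mAP@0.5.
results = model.predict("test_images/", conf=0.001, iou=0.7, max_det=1000)
```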

DETR was an important part of our submission. We leveraged pre-trained models from the Hugging Face model library, recognizing its potential for our malaria detection challenge. Initially, we encountered a common limitation with standard DETR models: poor performance on small object detection, which was critical for identifying small parasites in blood cell images.

To address this, we turned to DETR variants with multi-scale deformable attention modules, which are designed for detecting small, variable-sized objects. We discovered several promising pre-trained models on Hugging Face that had been trained on blood cell detection datasets similar to those in our challenge. After experimenting with multiple variants, including Deformable DETR, DETA and Co-DETR, we found DETA (Detection Transformer with Assignment) to be the most promising. Our optimization strategy involved extensive sweeping across multiple dimensions: model architecture elements such as the number of queries in the attention blocks, training parameters, loss coefficients, and augmentation techniques. An important addition was exponential moving averaging (EMA) of the weights, which significantly stabilized our training process and improved model convergence.
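As an illustration of what this setup involves, the sketch below loads a public DETA checkpoint from the Hugging Face Hub (the name shown is a generic DETA checkpoint, not necessarily the blood-cell-pretrained one we started from, and the exact import path may vary by transformers version) and pairs it with a minimal EMA of the model weights:

```python
import copy
import torch
from transformers import AutoImageProcessor, DetaForObjectDetection

# Load a DETA model; our actual starting checkpoints were blood-cell
# detection models found on the Hugging Face Hub.
processor = AutoImageProcessor.from_pretrained("jozhang97/deta-swin-large")
model = DetaForObjectDetection.from_pretrained("jozhang97/deta-swin-large")

class ModelEMA:
    """Exponential moving average of model parameters."""
    def __init__(self, model, decay=0.999):
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model):
        msd = model.state_dict()
        for k, v in self.ema.state_dict().items():
            if v.dtype.is_floating_point:
                v.mul_(self.decay).add_(msd[k].detach(), alpha=1.0 - self.decay)

# In the training loop, call ema.update(model) after each optimizer step
# and use ema.ema for validation and inference.
ema = ModelEMA(model)
```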

Even though DETR's performance on the public leaderboard was worse than YOLO's (the best submission we got with it scored 0.916), its key advantage was its high recall. Where YOLO might miss small, indistinct smudges or barely perceptible parasite fragments, DETR was able to identify these near-invisible traces. YOLO11's higher precision together with DETR's higher recall meant the two models complemented each other well.

During inference, we used test-time augmentation (TTA) to improve the robustness of our predictions. Each image is flipped vertically and horizontally, giving us four outputs per model per image. Non-maximum suppression (NMS) with an IoU threshold of 0.6 is then applied to the DETR output.
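A simplified sketch of this flip-based TTA and the NMS step, using torchvision's NMS; boxes are assumed to be in pixel (x_min, y_min, x_max, y_max) format, and detect_fn is a placeholder for a model's forward pass:

```python
import torch
from torchvision.ops import nms

def flip_boxes(boxes, img_w, img_h, horizontal, vertical):
    """Map boxes predicted on a flipped image back to the original frame."""
    boxes = boxes.clone()
    if horizontal:
        boxes[:, [0, 2]] = img_w - boxes[:, [2, 0]]
    if vertical:
        boxes[:, [1, 3]] = img_h - boxes[:, [3, 1]]
    return boxes

def tta_predict(detect_fn, image):
    """Run a detector on the original image plus its three flipped copies.

    detect_fn(image) -> (boxes [N,4], scores [N], labels [N]); stands in
    for either the YOLO or DETR forward pass.
    """
    h, w = image.shape[-2:]
    all_boxes, all_scores, all_labels = [], [], []
    for hflip in (False, True):
        for vflip in (False, True):
            view = image
            if hflip:
                view = torch.flip(view, dims=[-1])
            if vflip:
                view = torch.flip(view, dims=[-2])
            boxes, scores, labels = detect_fn(view)
            all_boxes.append(flip_boxes(boxes, w, h, hflip, vflip))
            all_scores.append(scores)
            all_labels.append(labels)
    return torch.cat(all_boxes), torch.cat(all_scores), torch.cat(all_labels)

# For the DETR branch, overlapping boxes are then suppressed with NMS at IoU 0.6:
# keep = nms(boxes, scores, iou_threshold=0.6)
```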

All of these predictions were ensembled using our own weighted boxes fusion (WBF) implementation, which has an internal confidence split: boxes below and above this threshold only influence boxes on the same side. The final submission uses three WBF modules: one per model to combine the test-time augmentation predictions, and another to combine the YOLO and DETR predictions.
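Our WBF code is custom, but the confidence-split idea can be sketched with the open-source ensemble-boxes package as a stand-in: boxes below and above the split are fused separately so they never merge with each other. Boxes are expected in normalised [0, 1] coordinates, and the split and IoU values here are illustrative:

```python
import numpy as np
from ensemble_boxes import weighted_boxes_fusion

def split_wbf(boxes_list, scores_list, labels_list, conf_split=0.3, iou_thr=0.55):
    """Fuse predictions from several models/TTA views, letting boxes below and
    above the confidence split only merge with boxes on the same side."""
    fused_b, fused_s, fused_l = [], [], []
    for low_side in (True, False):
        b_sel, s_sel, l_sel = [], [], []
        for b, s, l in zip(boxes_list, scores_list, labels_list):
            s = np.asarray(s)
            mask = s < conf_split if low_side else s >= conf_split
            b_sel.append(np.asarray(b)[mask])
            s_sel.append(s[mask])
            l_sel.append(np.asarray(l)[mask])
        b, s, l = weighted_boxes_fusion(b_sel, s_sel, l_sel, iou_thr=iou_thr)
        fused_b.append(b)
        fused_s.append(s)
        fused_l.append(l)
    return np.concatenate(fused_b), np.concatenate(fused_s), np.concatenate(fused_l)
```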

Winning Edges

Our competitive edge emerged through a combination of strategic technical additions:

  • We recognized the poor annotation quality of the dataset, and addressed it by removing doubly labelled boxes and implementing heavy regularization to improve model generalization.

  • The successful implementation of the DETR model contributed greatly to our victory. We presume most teams relied heavily on YOLO, whereas DETR's ability to detect subtle, hard-to-identify parasites gave us a substantial advantage.

  • Our four-fold Test Time Augmentation (TTA) strategy enhanced prediction stability, which we believe is why our public scores corresponded well to our private scores across most of our submissions.

Click here to see our solution on GitHub.
