This project began as an entry in the Metis Bee Challenge on drivendata.org. The goal of the challenge was to classify photos of bees by genus: honey bee or bumble bee. Challengers were given thousands of 200 x 200 pixel training and test images to classify. The training images were labeled as honey bees or bumble bees.
I took this opportunity to develop a unique approach that judged the entire test set of images at once based on data collected from the training set. Instead of using more traditional methods, I applied common techniques used in astronomy image analysis combined with an interesting evolutionary algorithm that I had seen applied to video games. In other words, I had fun experimenting.
Beginning with the two sets of training images (honey bees and bumble bees), each image was broken up into matrices of its red, green, and blue components. Within each set, every color matrix was median combined with every other matrix of the same color. For example, if there were 1000 images of honey bees, each of those 1000 images would be broken into its three components, producing 3000 single-color images. The 1000 red images would then be median combined together, as would the 1000 green images and the 1000 blue images. This process produces three images: one representing the collective red color of a honey bee across the 1000 images, one representing the collective green color, and one representing the collective blue color.
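The median combine step can be sketched in a few lines of NumPy. This is only a minimal illustration, not the code used for the challenge: the function name `median_combine`, the array layout (images stacked along the first axis, color channels along the last), and the random stand-in images are all assumptions.

```python
import numpy as np

def median_combine(images):
    """Median-combine a stack of RGB images, channel by channel.

    images: array-like of shape (n_images, height, width, 3).
    Returns an array of shape (height, width, 3).
    """
    stack = np.asarray(images, dtype=np.float64)
    # The median along axis 0 combines all images pixel by pixel,
    # independently for each color channel.
    return np.median(stack, axis=0)

# Random stand-in for a set of labeled honey bee images (assumed data).
rng = np.random.default_rng(0)
honey_bees = rng.integers(0, 256, size=(10, 200, 200, 3))

honey_median = median_combine(honey_bees)
# Split the combined image into its red, green, and blue matrices.
red, green, blue = (honey_median[..., c] for c in range(3))
```

Here `red`, `green`, and `blue` each play the role of one of the three median combined images described above.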
I then took a histogram of each of these median combined images. The histograms give a set of three very distinct spikes, one each for red, green, and blue. One set of spikes represents a honey bee and the other represents a bumble bee, and the two sets are noticeably different. These histograms are the standard against which the pseudo-randomly drawn populations described next are compared.
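As a rough sketch of the histogram step, assuming 8-bit pixel values and one bin per intensity level (neither is stated above), the three histograms of a median combined image could be computed as follows. The name `channel_histograms` and the random stand-in image are hypothetical.

```python
import numpy as np

def channel_histograms(combined, bins=256):
    """Return a (3, bins) array holding one histogram per color channel.

    combined: median combined image of shape (height, width, 3),
    with pixel values assumed to lie in the range 0 to 255.
    """
    hists = []
    for c in range(3):
        counts, _ = np.histogram(combined[..., c], bins=bins, range=(0, 256))
        hists.append(counts)
    return np.stack(hists)

# Random stand-in for a median combined image (assumed data).
rng = np.random.default_rng(0)
combined = rng.normal(loc=(120, 100, 60), scale=5, size=(200, 200, 3)).clip(0, 255)

hists = channel_histograms(combined)
# Position of the spike in each of the red, green, and blue histograms.
spikes = hists.argmax(axis=1)
```

One such `(3, bins)` array per genus would serve as the reference that later populations are scored against.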
The analysis of the test data begins by randomly dividing the test images into two sets, one treated as bumble bees and one as honey bees. These two sets are put through the same process to produce histograms, and the histograms of the two random sets are compared to the training histograms using statistics such as the sum of differences. The two test sets are then randomly ‘mutated’ into several different sets and randomly ‘bred’ into each other by exchanging images between the two sets. Histograms are created from all of the resulting sets and are again compared to the training histograms. The ‘fittest’ individuals are selected to mutate and breed again, while the ‘weakest’ are discarded. Often, some ‘weak’ individuals are added back into the population to encourage genetic diversity. This generational process is repeated until convergence.
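The generational loop can be sketched as below. This is a toy illustration under several assumptions, not the original implementation: an individual is a boolean array that marks each test image as honey bee or bumble bee, fitness is the sum of absolute histogram differences (lower is better), breeding is a uniform crossover of two labelings, and the loop runs for a fixed number of generations instead of testing for convergence. All function names, parameters, and the synthetic images are hypothetical.

```python
import numpy as np

def histograms(images, bins=32):
    """Median-combine a set of RGB images, then histogram each channel."""
    combined = np.median(images, axis=0)
    return np.stack([
        np.histogram(combined[..., c], bins=bins, range=(0, 256))[0]
        for c in range(3)
    ])

def fitness(labels, images, honey_ref, bumble_ref):
    """Sum of absolute histogram differences; lower means fitter."""
    if labels.all() or not labels.any():
        return float("inf")  # one set is empty, so it has no histogram
    honey = histograms(images[labels])
    bumble = histograms(images[~labels])
    return float(np.abs(honey - honey_ref).sum()
                 + np.abs(bumble - bumble_ref).sum())

def mutate(labels, rng, rate=0.05):
    """Move a random fraction of the images into the other set."""
    return labels ^ (rng.random(labels.size) < rate)

def breed(a, b, rng):
    """Uniform crossover: each image takes its label from one parent."""
    return np.where(rng.random(a.size) < 0.5, a, b)

def evolve(images, honey_ref, bumble_ref, pop_size=12, keep=4, n_weak=2,
           generations=20, seed=0):
    rng = np.random.default_rng(seed)
    score = lambda labels: fitness(labels, images, honey_ref, bumble_ref)
    # Start from random divisions of the test images (True = honey bee).
    population = [rng.random(len(images)) < 0.5 for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=score)
        history.append(score(ranked[0]))
        # Keep the fittest, plus a few weak individuals for diversity.
        weak = rng.choice(np.arange(keep, pop_size), size=n_weak, replace=False)
        parents = ranked[:keep] + [ranked[i] for i in weak]
        children = []
        while len(parents) + len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            children.append(mutate(breed(parents[i], parents[j], rng), rng))
        population = parents + children
    return min(population, key=score), history

# Synthetic stand-ins: dark "honey bee" and bright "bumble bee" images.
rng = np.random.default_rng(1)
train_honey = rng.integers(40, 80, size=(20, 8, 8, 3))
train_bumble = rng.integers(160, 200, size=(20, 8, 8, 3))
test_images = np.concatenate([
    rng.integers(40, 80, size=(6, 8, 8, 3)),
    rng.integers(160, 200, size=(6, 8, 8, 3)),
])
honey_ref, bumble_ref = histograms(train_honey), histograms(train_bumble)
best, history = evolve(test_images, honey_ref, bumble_ref)
```

Because the fittest individuals are carried over unchanged, the best score recorded in `history` can never get worse from one generation to the next. A real run would replace the fixed generation count with a stopping rule, such as halting once that score stops improving.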