Efficient Feature Selection via Genetic Algorithms

Using evolutionary algorithms for fast feature selection with large datasets

Florin Andrei
Towards Data Science

--

This is the final part of a two-part series about feature selection. Read part 1 here.

Brief recap: when fitting a model to a dataset, you may want to select a subset of the features (as opposed to using all features), for various reasons. But even if you have a clear objective function to search for the best combination of features, the search may take a long time if the number of features N is very large. Finding the best combination is not always easy. Brute-force search generally does not work beyond several dozens of features. Heuristic algorithms are needed to perform a more efficient search.

If you have N features, what you’re looking for is an N-length vector [1, 1, 0, 0, 0, 1, ...] with values from {0, 1} . Each vector component corresponds to a feature. 0 means the feature is rejected, 1 means the feature is selected. You need to find the vector that minimizes the cost / objective function you’re using.

In the previous article, we’ve looked at a classic algorithm, SFS (sequential feature search), and compared it with an efficient evolutionary algorithm called CMA-ES. We’ve started with the House Prices dataset on Kaggle which, after some processing, yielded 213 features with 1453 observations each. The model we’ve tried to fit was statsmodels.api.OLS() and the objective function was the model’s BIC — Bayesian Information Criterion, a measure of information loss. Lower BIC means a better fit, so we’re trying to minimize the objective.

In this article, we will look at another evolutionary technique: genetic algorithms. The context (dataset, model, objective) remains the same.

GA — Genetic Algorithms

Genetic algorithms are inspired by biological evolution and natural selection. In nature, living beings are (loosely speaking) selected for the genes (traits) that facilitate survival and reproductive success, in the context of the environment where they live.

Now think of feature selection. You have N features. You’re trying to find N-length binary vectors [1, 0, 0, 1, 1, 1, ...] that select the features (0 = feature rejected, 1= feature included) so as to minimize a cost / objective function.

Each such vector can be thought of as an “individual”. Each vector component (value 0 or value 1) becomes a “gene”. By judiciously applying evolution and selection, it might be possible to evolve a population of individuals in such a way as to get near the best value for the objective function we’re interested in.

Here’s GA in a nutshell. Start by generating a population of individuals (vectors), each vector of length N. The vector component values (genes) are randomly chosen from {0, 1}. In the diagram below, N=12, and the population size is 8.

GA population

After the population is created, evaluate each individual via the objective function.

Now perform selection: keep the individuals with the best objective values, and discard those with the worst values. There are many possible strategies here, from naive ranking (which, counterintuitively, doesn’t work very well), to stochastic tournament selection, which is very efficient in the long run. If you remember the explore-exploit dilemma, with GA, it’s very easy to fall into naive exploit traps that slow exploration down. GA is all about exploration. Here’s a short list of selection techniques, and check the links at the end for more info.

Once the best individuals have been selected, and the less fit ones have been discarded, it’s time to introduce variation in the gene pool via two techniques: crossover and mutation.

Crossover works exactly like in nature, when two living creatures are mating and producing offspring: genetic material from both parents is “mixed” in the descendants, with some degree of randomness.

GA crossover

Mutation, again, is pretty much what happens in nature when random mutations occur in the genetic material, and new values are introduced in the gene pool, increasing its diversity.

GA mutation

After all that, the algorithm loops back: individuals are again evaluated via the objective function, selection occurs, then crossover, mutation, etc.

Various stopping criteria can be used: the loop may break if the objective function stops improving for some number of generations. Or you may set a hard stop for the total number of generations evaluated. Or do something time-based, or watch for an external signal, etc. Regardless, the individuals with the best objective values should be considered to be the solutions to the problem.

A few words about elitism: with stochastic selection techniques such as tournament, the best, absolute top individuals in a generation may actually be discarded by pure chance — it’s unlikely, but it does happen. Elitism bypasses this, and simply decrees that the best must survive, no matter what. Elitism is an exploit technique. It may cause the algorithm to fall into local extremes, missing the global solution. Again, GA is all about exploration. My rather limited experience with GA seems to confirm the idea that an exploit bias is not beneficial for GA. But your mileage may vary; if you like to experiment with algorithm variants, GA gives you many opportunities to do so.

GA has several hyperparameters you can tune:

  • population size (number of individuals)
  • mutation probabilities (per individual, per gene)
  • crossover probability
  • selection strategies, etc.

Running trials by hand with various hyperparameter values is one way to figure out the best code. Or you could encapsulate GA in Optuna and let Optuna find the best hyperparameters — but this is computationally expensive.

GA for feature selection, in code

Here’s a simple GA code that can be used for feature selection. It uses the deap library, which is very powerful, but the learning curve may be steep. This simple version, however, should be clear enough.

# to maximize the objective
# fitness_weights = 1.0
# to minimize the objective
fitness_weights = -1.0

# copy the original dataframes into local copies, once
X_ga = X.copy()
y_ga = y.copy()

# 'const' (the first column) is not an actual feature, do not include it
X_features = X_ga.columns.to_list()[1:]

try:
del creator.FitnessMax
del creator.Individual
except Exception as e:
pass

creator.create("FitnessMax", base.Fitness, weights=(fitness_weights,))
creator.create(
"Individual", array.array, typecode='b', fitness=creator.FitnessMax
)

try:
del toolbox
except Exception as e:
pass

toolbox = base.Toolbox()
# Attribute generator
toolbox.register("attr_bool", random.randint, 0, 1)
# Structure initializers
toolbox.register(
"individual",
tools.initRepeat,
creator.Individual,
toolbox.attr_bool,
len(X_features),
)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)


def evalOneMax(individual):
# objective function
# create True/False selector list for features
# and add True at the start for 'const'
cols_select = [True] + [i == 1 for i in list(individual)]
# fit model using the features selected from the individual
lin_mod = sm.OLS(y_ga, X_ga.loc[:, cols_select], hasconst=True).fit()
return (lin_mod.bic,)


toolbox.register("evaluate", evalOneMax)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)

random.seed(0)
pop = toolbox.population(n=300)
hof = tools.HallOfFame(1)
pop, log = algorithms.eaSimple(
pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=10, halloffame=hof, verbose=True
)

best_individual_ga_small = list(hof[0])
best_features_ga_small = [
X_features[i] for i, val in enumerate(best_individual_ga_small) if val == 1
]
best_objective_ga_small = (
sm.OLS(y_ga, X_ga[['const'] + best_features_ga_small], hasconst=True)
.fit()
.bic
)
print(f'best objective: {best_objective_ga_small}')
print(f'best features: {best_features_ga_small}')

The code creates the objects that define an individual and the whole population, along with the strategies used for evaluation (objective function), crossover / mating, mutation, and selection. It starts with a population of 300 individuals, and then calls eaSimple() (a canned sequence of crossover, mutation, selection) which runs for only 10 generations, for simplicity. Hall of fame with a size of 1 is defined, where the absolute best individual is preserved against being accidentally mutated / skipped during selection, etc.

Hall of fame is not elitism. Hall of fame copies the best individual from the population, and only keeps an inactive copy in a tin can. Elitism preserves the best individual in the active population from one generation to the next.

This simple code is easy to understand, but inefficient. Check the notebook in the repository for a more complex version of the GA code, which I am not going to quote here. However, running the more complex, optimized code from the notebook for 1000 generations produces these results:

best objective:  33705.569572544795
best generation: 787
objective runs: 600525
time to best: 158.027 sec

Again, the baseline BIC before any feature selection is:

baseline BIC: 34570.166173470934

And here’s the entire history of the full, optimized GA code from the notebook, running for 1000 generations, trying to find the best features. From left to right, the heatmap indicates the popularity of each feature within the population across generations (brighter shade = more popular). You can see how some features are always popular, others are rejected quickly, while yet others may become more popular or less popular as time goes by.

GA optimization history

Comparison between methods

We’ve tried three different techniques: SFS, CMA-ES, and GA. How do they compare in terms of the best objective found, and the time it took to find it?

These tests were performed on an AMD Ryzen 7 5800X3D (8/16 cores) machine, running Ubuntu 22.04, and Python 3.11.7. SFS and GA are running the objective function via a multiprocessing pool with 16 workers. CMA-ES is single-process — running it multi-process did not seem to provide significant improvements, but I’m sure that could change if more work is dedicated to making the algorithm parallel.

These are the run times. For SFS it’s the total run time. For CMA-ES and GA it’s the time to the best solution. Less is better.

SFS:    42.448 sec
GA: 158.027 sec
CMA-ES: 48.326 sec

The number of times the objective function was invoked — less is better:

SFS:    22791
GA: 600525
CMA-ES: 20000

The best values found for the objective function, compared to the baseline — less is better:

baseline BIC: 34570.1662
SFS: 33708.9860
GA: 33705.5696
CMA-ES: 33703.0705

GA was able to beat SFS at the objective function, running the objective function on as many CPU cores as were available, but it’s by far the slowest. It invoked the objective function more than an order of magnitude more times than the other methods. Further hyperparameter optimizations may improve the outcome.

SFS is quick (running on all CPU cores), but its performance is modest. It’s also the simplest algorithm by far.

If you just want a quick estimate of the best feature set, using a simple algorithm, SFS is not too bad.

OTOH, if you want the optimal objective value, CMA-ES seems to be the best.

This is the final part of a two-part series about feature selection. Read part 1 here.

Notes and links

All images were created by the author.

Repository with all code: https://github.com/FlorinAndrei/fast_feature_selection

The House Prices dataset (MIT license): https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

The deap library: https://github.com/DEAP/deap

A free tutorial for genetic algorithms: https://www.tutorialspoint.com/genetic_algorithms/index.htm

--

--

BS in Physics. MS in Data Science. Over a decade experience with cloud computing.