Stock Portfolio Construction: a proof of concept using Apache Spark

Recently I came across a conference paper by Joglekar (2014) who uses a two-stage approach to constructing low risk and stable return stock portfolios. The idea is simple:

Step 1: Perform correlation-based clustering on a set of financial instruments.

Step 2: Use a genetic algorithm to build an optimal portfolio.

Why not implement this on a massive scale using Apache Spark? In this post I will explain how (and why) to do so based on ~5 years of daily closing price histories of 2,000 stocks (NASDAQ constituents; the dataset from Chapter 9 of Advanced Analytics with Spark). 


What has clustering got to do with this?

A widely used risk management technique is portfolio diversification. This basically means that you want the stocks in your portfolio to be “different”. From a statistical point of view, one of the ways to measure this difference is correlation. Take a moment and think about the following (simplified) scenarios:

  • Most stocks in a portfolio are (highly) positively correlated.

In such situations stock prices are expected to move in the same direction - so if your forecast is correct, the return is going to be high. However, if your model is wrong, you are going to lose everything. This could be an attractive approach depending on your investment strategy, but it is also high risk.

  • Stocks show no correlation.

In this situation your winnings will be counteracted by losses so, at least in the long term, you should roughly break even (if the market as a whole does not crash). The likely return is lower, but then so is the risk.

  • Some stocks show negative correlation with the others.

Now you can exploit these negative relationships to hedge against the risk. If stock weights are selected properly and your strategy fails, high losses will be smoothed by assets whose prices move in the opposite direction.

So the take away is that we want to avoid positive correlations in the portfolio. This is where clustering comes into play. If we perform clustering in such a way that stocks are positively correlated within the same cluster but are independent or negatively correlated with stocks in other clusters, the problem boils down to avoiding stocks from the same cluster in your portfolio. In particular, the distance measure (which quantifies how far away two data points are from each other) in the clustering algorithm will be 1-cor(x,y), where x and y are historical prices of two stocks. Furthermore, we should take into account that correlation patterns change over time: that is, use time-weighted correlation putting more weight on more recent price movements as described in the Joglekar paper.


Ok, I get the clustering part, but what's with the genetic algorithm?

Elementary my dear Watson! It will be used to select portfolio size, composition, and stock weights.

Genetic algorithms beautifully mimic the process of biological evolution: you begin with a population of individuals (a set of portfolios in our setting), who are represented by a set of genes (stocks). The individuals 'reproduce' by mixing their genes and an outcome of this process is a new generation of individuals, who are more fit than their parents (that’s right, we will have a fitness function to score each portfolio).

This is only a mountain top view of the method and a more detailed introduction can be found here.

The reader with one eye on implementation practicalities will probably be asking at this point: does it scale? The following sections will explain in detail how these algorithms can be implement on an array of machines (or - in the computer science jargon - distributed) in order to reduce the running time and throughput.


Distributing the clustering algorithm

Spark’s MLlib (= machine learning library) has a few widely used clustering algorithms, which are good enough in most cases. However, the desired distance measure is quite specific so I had to implement a clustering algorithm from scratch. I chose agglomerative hierarchical clustering because it explores a range of possible clusterings and offers ways to select the number of clusters based on the data. The figure below explains how the algorithm works.

Illustration of agglomerative hierarchical clustering by Tangible Auditory Interfaces.

Illustration of agglomerative hierarchical clustering by Tangible Auditory Interfaces.

As depicted on the right hand side, it begins with single-stock clusters and at each iteration the two closest clusters are merged (I use average linkage) until a single cluster containing all stocks is reached.

Hierarchical clustering is tough to distribute, however attempts have been made (e.g., Jin et al. (2015)). My approach is to have the lower triangle of the between-stock correlation matrix sitting on a cluster as an RDD and iteratively merging (averaging) its elements. This way the correlation matrix, which we have to operate on, gets smaller and smaller during the algorithm run (so the computation also becomes faster). The sketch below shows how two hypothetical clusters, x3 and x5, are merged.

Lower triangular correlation matrix between 6 clusters stored as a Spark RDD. The diagram illustrates how the matrix has to be modified when clusters x3 and x5 are merged together. Dashed arcs indicate which elements (correlations) have to be replaced with their mean.

Lower triangular correlation matrix between 6 clusters stored as a Spark RDD. The diagram illustrates how the matrix has to be modified when clusters x3 and x5 are merged together. Dashed arcs indicate which elements (correlations) have to be replaced with their mean.

It is worth noting that each iteration requires communication between the nodes and a significant part of the dataset to be sent over the network so this particular implementation might face performance issues when the number of stocks reaches 10s of thousands. Moreover, the computing time and memory complexity of such algorithms increases quadratically (or even faster) when the dataset grows. Nonetheless, this algorithm is still useful when we need to determine the number of clusters in the dataset (i.e. take a subsample of your data, identify the number of clusters in the subsample by using hierarchical clustering, and run, for example, K-means on the full dataset), which is what I'm going to talk about next.

So now I run the algorithm on the dataset, and produce a range of possible clusterings. I still need to decide on the number of clusters that should be used.

In general, you want the number of clusters to be high enough so that only stocks with considerably high positive correlation end up in the same cluster and, on the other hand, the number of clusters should be small enough so that there are no high positive correlations between stocks from different clusters. The graph below, which visualizes correlation between clusters during the last 100 iterations of the algorithm, helps to solve this tradeoff. I chose to group the stocks into 25 clusters, i.e. the point in the graph where the average correlation hits 0.

Max (green line) and weighted average correlation between clusters (blue line) under different clusterings. The numbers of stocks in the smaller clusters are used as weights.

Max (green line) and weighted average correlation between clusters (blue line) under different clusterings. The numbers of stocks in the smaller clusters are used as weights.

High max correlation and low weighted (based on cluster size) average correlation at this point indicate that the positive correlations involve only small clusters while correlations between stocks in large clusters are low. Alternatively, one could use clustering quality measures to obtain the number of clusters based on data, however this would not necessarily mean that resulting clusters would show no positive correlation (I leave this for further improvements).

Finally, let's assess the size of the clusters (figure below). The graph shows that we have arrived at a few large clusters while the majority of them are small (22-24 are single-stock clusters). 

Number of stocks in each cluster.

Number of stocks in each cluster.

Distributing the genetic algorithm

Now the fun part. First of all, I initiate the portfolio population by generating 2000 random portfolios:

  • The number of stocks in each portfolio is chosen randomly from the set {5,6,7,..,18,19,20}
  • Stocks for each portfolio are sampled with equal probabilities without replacement
  • Stock weights are chosen randomly ensuring that they sum up to 1 (the lowest allowed weight is 1%)

Each portfolio P is scored using the following fitness function (for more details see the Joglekar paper).

f(P) = Expected Return * Shannon Entropy

In short, this fitness function prefers portfolios with higher return and portfolios where more stocks come from different clusters.

Here is the strategy, which I will use for reproduction:

  • Two parent portfolios will produce a single child portfolio
  • The parents will be sampled randomly with fitter portfolios having a higher probability to be selected
  • The size of the child will be selected randomly from the parent sizes. The probability will depend on fitness value of the parents.
  • The child portfolio is constructed by randomly selecting stocks from the parent portfolios (stocks from the fitter parent have a higher probability to be included in the child). Weights of the selected stocks are rescaled to sum up to 1 and smoothed so that no weights are lower than 1%

Since genetic algorithms are mimicking biological evolution, mutations are performed with a 20% chance. If it’s chosen to perform a mutation on a child portfolio, one or more (some random salt is involved) of the following might happen:

  • Child portfolio size is randomly changed.
  • Random stocks are removed and new ones, not present in the parents, added.
  • Stock weights are randomly changed.

Finally, we apply elitism to the fittest portfolios and move them to the next generation (other than that, only child portfolios make it through to the next generation). This way I make sure that the best portfolio found so far is not lost, and in fact is carried all the way to the final generation. All the randomness mentioned above also allows me to widely explore the space of possible portfolios and move in the direction of the global optimum.

As you may have already noticed, this method is embarrassingly parallel because the child portfolios can be constructed independently of one another. This also means that we can use Apache Spark (there are other tools of course) to construct thousands of child portfolios in parallel and reduce the running time of the algorithm. The obvious way to do this is to distribute the population of portfolios across a desired number of partitions, perform one iteration of the genetic algorithm in each one, randomly shuffle resulting child portfolios across the cluster, starting the reproduction all over again and repeating these steps the desired number of times. The diagram below illustrates this approach.

The approach of distributing the genetic algorithm.

The approach of distributing the genetic algorithm.

One of the beautiful aspects of this method is that if you already have an implementation of the genetic algorithm which works on a single node, scaling it with Spark is as simple as (using Python...)

for (i in range(N_iterations)):
  population = (population.mapPartitions(lambda x: genetic_alg_iteration(x))

Now, when everything is set, I run the genetic algorithm. The graph below plots the development of average and max fitness in the population as the algorithm progresses. There is an initial jump in both of the measures during the first iterations because it is fairly easy to do better than the initial random portfolios. However, the growth quickly slows down and shows little improvement after iteration 150, which is (hopefully) a sign of getting close to the optimum. Convergence of the fitness function depends on the design of the algorithm (mutation probability, mutation types, shuffling of parent stocks when producing a child portfolio, etc.) so further tuning its parameters will increase the performance (we leave this for future improvements).

Max and average value of the fitness function in the population of portfolios during the run of the genetic algorithm.

Max and average value of the fitness function in the population of portfolios during the run of the genetic algorithm.

Similarly, we can look at the expected return and average size of the portfolios over time (figures below). During the initial iterations expected return of the fittest portfolio fluctuates but it stabilizes quickly and then continues to grow gradually. The average return is close to the return of the fittest portfolio which (in combination with the graph above) means that we’ve found not just a single well-performing portfolio but a whole population of them. 

Expected return in the population during the run of the genetic algorithm.

Expected return in the population during the run of the genetic algorithm.

Average measures being close to the measures of the fittest portfolio might also mean that the algorithm is not exploring the solution space wide enough so further tweaking the parameters might help. Also note that expected return of the fittest portfolio is rather high (~65% at the end of the run) which might be a result of the fitness function putting too much weight on the return (again, parameter tweaking should help) or the way that stock returns were estimated. Since the goal of this post is to do a test drive of the two-stage approach, I do not concentrate on estimating the returns (which would require a model of its own and a separate blog post). Instead I use a time-weighted average of the 40-day return as an estimate.

Finally, the graph below shows that large portfolios are preferred to small ones, which helps to better diversify them. The next step in this analysis would be to perform backtesting for the fittest portfolio in order to validate this strategy, which I will do in one of my future posts.

Average size (in terms of the number of stocks) of a portfolio during the run of the genetic algorithm.

Average size (in terms of the number of stocks) of a portfolio during the run of the genetic algorithm.



Thank you for your patience. This was a long one. And in case you did not follow along, here is a summary of what was done:

  • The goal was to demonstrate the capabilities of distributed computing in constructing a low risk and stable return portfolio given a set of 2000 stocks.
  • The risk can be managed by diversifying the portfolio, which from a statistical point of view can be interpreted as avoiding positively correlated stocks in the same portfolio.
  • The first step was to use agglomerative hierarchical clustering running on Spark to cluster positively correlated stocks together. This simplifies the risk management challenge to simply avoiding stocks from the same cluster in a portfolio.
  • Finally we ran a distributed version of a genetic algorithm to optimize portfolio size, composition, and stock weights.
  • The results show that the approach taken is computationally promising and might find its applications in the financial sector.
  • To further improve the method one would test automatic selection of the number of stock clusters, tune the parameters of the genetic algorithm, and build a set of models which will yield estimates of expected stock returns.
  • Finally, to get rid of the scalability issues that come with my implementation of agglomerative hierarchical clustering, one could either use the implementation described in Jin et al. (2015) or turn to other well-known algorithms (for example, use K-means to explore reasonable values of K and choose the number of clusters based on some metric, e.g. the one described in Jung et. al. (2003))



  1. Jin, Chen, et al. "A Scalable Hierarchical Clustering Algorithm Using Spark." Big Data Computing Service and Applications (BigDataService), 2015 IEEE First International Conference on. IEEE, 2015.
  2. Joglekar, Sachin R. "Two-Stage Stock Portfolio Construction: Correlation Clustering and Genetic Optimization." The Twenty-Seventh International Flairs Conference. 2014.
  3. Jung, Yunjae, et al. "A decision criterion for the optimal number of clusters in hierarchical clustering." Journal of Global Optimization 25.1 (2003): 91-111.