How to simulate big datasets in Python using parallel computing - Part III : Generating real world data using parallel computing

Photo by Nikunj Maheshwari

This post picks up from the last post where we ended up simulating rate of change of a parameter using parallel computing. It focuses on the last step of this series - generating real world data using the simulated rates via parallel computing in Python.

Simulating Conditions

Before you start the actual simulation, you need to ask yourself -

  • How much data (number of observations) do you need to simulate?
  • If it is a batch process that generates real world data (for example, a manufacturing pipeline), how long does each batch lasts for?
  • As this would be a time-dependent data, what is the required frequency that mimics the real world data?

For example, a company has been manufacturing a chemical compound for 2 years using multiple batches. Each batch runs for 10 days, and data is collected for several key performance indicators (KPIs) at a frequency of 1 min. In this case, a batch will contain 14,400 observations (10 days x 24 hours x 60 minutes x 1). In our example, we will simulate a batch run of 14 days with the frequency of data collection set at 30 seconds. To solve this problem, we will use parallel computing in Python and our custom-made simulating function.

Parallel Computing in Python - Pool Process

To carry out the simulation on a powerful server, we can utilise Python’s parallel computing function called pool. We can launch several pool workers (processes) simultaneously, each simulating a batch for 2 weeks at a frequency of 30 seconds. For our purpose, we will launch 75 workers, each running 15 batches, one at a time. The following code explains it well -

sim_df = pool.map(simulate_corr, runs, chunksize=15)

Here, pool size is 75, chunksize is 15, runs is a list of 1125 (it tells to run each pool worker 15 times), and simulate_corr is the function which is passed to each pool worker. Global variables samples and freq have been initialised with 40320 (14 days x 24 hours x 60 min x 2 for a frequency of 30 seconds) and 30 respectively. Once all pool workers have finished their assigned tasks, the output is consolidated into a single df called sim_df. As each worker is executed independently, there is no dependency between any workers, and this simulates a ‘batch’ production environment (i.e. each batch is run independently of each other).

Simulating Function

As we want to simulate observations which are comparable to the original dataset, one of the key parameters to keep in check is correlation. In a manufacturing environment, several KPIs would be dependent on each other, thereby having a correlation among them. For example, an increase in oxygen levels is correlated with a decrease in nitrogen levels in a bioreactor. To simulate real world data, we must try to keep the correlation as close as possible to the correlation in the original dataset.

To add randomness in how we pick a column to simulate against, we follow the algorithm as described - to simulate a row of observation, pick a random column, and simulate other columns with respect to the random column, keeping the correlation the same as in the original dataset. This is done by the line below -

nextValue = start + (sdev[col]/sdev[randomCol])*corr*delta*freq

Here, start is the starting value (mean) or the last value simulated, randomCol is a random column, col is one of the other columns, corr is the correlation between randomCol and col in the original dataset, delta is a random simulated rate value of col (we simulated this in Part II), and freq is in seconds. sdev is the standard deviation of columns calculated from the original dataset. This formula gives the next simulated value (nextValue) for col and further checks are done to ensure this value remains within the upper and lower bounds. The above process is repeated until all columns have been simulated with respect to randomCol. The new row is appended to a data.frame and the loop continues until the Pool process reaches the maximum number of iterations.

Conclusion

In this example, the simulations ran for ~40 days on a 80 core server, each core with 4GB RAM. The resulting dataset contained ~44 million time series simulated observations, all created from scratch and using the original small dataset. The complete code and detailed explanation is available on my Github here.

Thank you for reading. I will be starting a new series on how to analyse big data and build machine learning models using Spark, and we will be using part of this dataset for the analyses.

Avatar
Nikunj Maheshwari, PhD
Senior Manager - Data Science

My research interests include application of machine learning to Big Data analytics, and teaching data science with R and Python.

Related