Chapter 3: Stochastic Simulations in pymc2

This tutorial will show you how to use GemPy for stochastic simulation of geological models. We will address two approaches: (i) Monte Carlo forward simulation, treating input data as uncertain parameter distributions; (ii) Bayesian inference, where we extend the approach with likelihood functions to constrain the stochastic modeling results with additional data.

Preparation

Import GemPy, matplotlib for plotting, numpy and pandas for data handling.

import sys, os
sys.path.append("../..")

# import gempy
import gempy as gp

# inline figures in jupyter notebooks
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pn
import theano
WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.

Initialize an example model

First we define the cube size and model extent of our model and initialize the GemPy data object:

# set cube size and model extent
cs = 50
extent = (3000, 200, 2000)  # (x, y, z)
res = (120, 4, 80)
# initialize geo_data object
geo_data = gp.create_data([0, extent[0],
                           0, extent[1],
                           0, extent[2]],
                          resolution=[res[0],  # number of voxels
                                      res[1],
                                      res[2]])

Then we use pandas to load the example data stored as csv files:

geo_data.set_interfaces(pn.read_csv("../input_data/tutorial_ch3_interfaces",
                                    index_col="Unnamed: 0"))
geo_data.set_orientations(pn.read_csv("../input_data/tutorial_ch3_foliations",
                                    index_col="Unnamed: 0"))
# let's have a look at the upper five interface data entries in the dataframe
gp.get_data(geo_data, 'interfaces', verbosity=1).head()
X X_std Y Y_std Z Z_std annotations formation formation_number group_id isFault order_series series
0 250 0 0 0 996 0 ${\bf{x}}_{\alpha \,{\bf{1}},0}$ Layer 2 1 l2_a False 1 Default serie
1 2500 0 200 0 1149 0 ${\bf{x}}_{\alpha \,{\bf{1}},1}$ Layer 2 1 l2_b False 1 Default serie
2 2250 0 100 0 1298 0 ${\bf{x}}_{\alpha \,{\bf{1}},2}$ Layer 2 1 l2_b False 1 Default serie
3 2750 0 0 0 995 0 ${\bf{x}}_{\alpha \,{\bf{1}},3}$ Layer 2 1 l2_b False 1 Default serie
4 500 0 200 0 1149 0 ${\bf{x}}_{\alpha \,{\bf{1}},4}$ Layer 2 1 l2_a False 1 Default serie

We can visualize (and, using an interactive backend, modify) the stratigraphic pile. Since the formation order is arbitrary, we can set it explicitly:

# Original pile
gp.get_sequential_pile(geo_data)
<gempy.sequential_pile.StratigraphicPile at 0x7fda8cfcb160>
../_images/ch3_10_1.png
# Ordered pile
gp.set_order_formations(geo_data, ['Layer 2', 'Layer 3', 'Layer 4','Layer 5'])
gp.get_sequential_pile(geo_data)
<gempy.sequential_pile.StratigraphicPile at 0x7fd9fe0bdac8>
../_images/ch3_11_1.png
# and look at all of the orientation (foliation) data
gp.get_data(geo_data, 'orientations', verbosity=0)
X Y Z G_x G_y G_z dip azimuth polarity formation series
0 500 100 1148 -0.516992 -0.00855937 0.855947 31.1355 269.051 1 Layer 2 Default serie
1 2500 100 1147.33 0.516122 -0.0142732 0.856397 31.0857 91.5841 1 Layer 2 Default serie

Visualize the input data

Now let’s have a look at the data in the xz-plane:

gp.plot_data(geo_data, direction="y")
plt.xlim(0,3000)
plt.ylim(0,2000);
../_images/ch3_14_0.png

At this point we should store the input data object as a pickle, for future reference:

gp.data_to_pickle(geo_data, "./pickles/ch3-pymc2_tutorial_geo_data")

Compile the interpolator function

Now that we have some input data, the next step is to compile the interpolator function of GemPy with the imported model setup and data:

interp_data = gp.InterpolatorData(geo_data, u_grade=[1], compile_theano=True)
Compiling theano function...
Compilation Done!
Level of Optimization:  fast_compile
Device:  cpu
Precision:  float32
Number of faults:  0

Afterwards we can compute the geological model:

lith_block, fault_block = gp.compute_model(interp_data)

And plot a section:

gp.plot_section(geo_data, lith_block[0], 2, plot_data = True)
../_images/ch3_22_0.png

Setting up the pymc-Functions

pymc has two distinct types of objects: deterministic and stochastic objects. As the pymc documentation puts it: “A Stochastic object represents a variable whose value is not completely determined by its parents, and a Deterministic object represents a variable that is entirely determined by its parents.” Stochastic objects can essentially be seen as parameter distributions or likelihood functions, while Deterministic objects can be seen as functions that take a specific input and return a specific (determined) output for this input. An example of the latter is the modeling function of GemPy, which takes a specific set of input parameters and always creates the same model from them.
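The distinction can be sketched in plain Python, independent of pymc (the function names here are purely illustrative):

```python
import random

# A "stochastic" object: its value is not fully determined by its parents --
# repeated evaluation with the same parameters yields different results.
def stochastic_z(mean=0.0, std=20.0):
    return random.gauss(mean, std)

# A "deterministic" object: same input, same output, every time.
def deterministic_model(z_offset):
    # stand-in for a modeling function such as gp.compute_model
    return 1000.0 + z_offset

random.seed(42)
a = stochastic_z()
b = stochastic_z()
print(a == b)  # False: two draws from the same distribution differ
print(deterministic_model(5.0) == deterministic_model(5.0))  # True
```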

import pymc

Setting up the parameter distributions

For conducting a stochastic simulation of the geological model, we need to consider our input data (dips and layer interfaces) as uncertain - i.e. as distributions.

# Checkpoint in case you did not execute the cells above
geo_data = gp.read_pickle("./pickles/ch3-pymc2_tutorial_geo_data.pickle")
gp.get_data(geo_data, 'orientations', verbosity=1).head()
G_x G_y G_z X X_std Y Y_std Z Z_std annotations ... azimuth_std dip dip_std formation formation_number group_id isFault order_series polarity series
0 -0.516992 -0.00855937 0.855947 500 NaN 100 NaN 1148 NaN ${\bf{x}}_{\beta \,{\bf{1}},0}$ ... NaN 31.1355 NaN Layer 2 1 l2_a False 1 1 Default serie
1 0.516122 -0.0142732 0.856397 2500 NaN 100 NaN 1147.33 NaN ${\bf{x}}_{\beta \,{\bf{1}},1}$ ... NaN 31.0857 NaN Layer 2 1 l2_b False 1 1 Default serie

2 rows × 21 columns

So let’s assume the vertical location of our layer interfaces is uncertain, and we want to represent this uncertainty with a normal distribution. To define a normal distribution, we need a mean and a measure of deviation (e.g. the standard deviation). For convenience the input data is already grouped by a “group_id” value, which allows us to collectively modify data that belongs together. In this example we want to treat the vertical position of each layer interface, on each side of the anticline, as uncertain. Therefore, we want to perturb the respective three points on each side of the anticline collectively.

These are our unique group ids; the number represents the layer, and a/b the side of the anticline.

group_ids = np.unique(geo_data.interfaces["group_id"])
print(group_ids)
['l2_a' 'l2_b' 'l3_a' 'l3_b' 'l4_a' 'l4_b' 'l5_a' 'l5_b']
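Before defining the pymc distributions below, note that pymc2's Normal is parameterized by a precision tau = 1/std² rather than by the standard deviation itself. A small numpy sketch of what one such prior represents (the values here are illustrative):

```python
import numpy as np

std = 20.0            # standard deviation of the Z perturbation (model units)
tau = 1.0 / std**2    # pymc2's Normal takes a precision, tau = 1 / std**2

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=std, size=100_000)

# the empirical moments should match the chosen parameters
print(samples.mean())  # close to 0
print(samples.std())   # close to 20
```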

As a reminder, GemPy stores data in two main objects: an InputData object (called geo_data in the tutorials) and an InterpolatorData object (called interp_data). geo_data contains the original data, while interp_data holds the data prepared (and compiled) to compute the 3D model.

Since we do not want to recompile our code at every new stochastic realization, from here on we will work with interp_data. Remember that to improve float32 stability we work with rescaled data (between 0 and 1), so all the stochastic data needs to be rescaled accordingly. The interp_data object contains a property with the rescaling factor (used below; by default it depends on the model extent). Alternatively, it is possible to add the stochastic data to the pandas dataframe of geo_data, in which case the rescaling happens under the hood when the InterpolatorData object is created.
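The rescaling itself amounts to a division by this factor (GemPy additionally applies a centering shift, which we ignore here; the factor value below is purely illustrative, not the one GemPy derives from the extent):

```python
import numpy as np

z = np.array([996.0, 1149.0, 1298.0])  # original Z values (model units)
rescaling_factor = 2000.0              # illustrative value only

# coordinates and any additive perturbation scale by the same factor
z_res = z / rescaling_factor
std_res = 20.0 / rescaling_factor      # a 20 m standard deviation, rescaled

print(z_res)
print(std_res)  # 0.01
```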

interface_Z_modifier = []

# We rescale the standard deviation
std = 20./interp_data.rescaling_factor

# loop over the unique group id's and create a pymc.Normal distribution for each
for gID in group_ids:
    stoch = pymc.Normal(gID+'_stoch', 0, 1./std**2)
    interface_Z_modifier.append(stoch)

Our list of parameter distributions:

interface_Z_modifier
[<pymc.distributions.new_dist_class.<locals>.new_class 'l2_a_stoch' at 0x7fd9e04e31d0>,
 <pymc.distributions.new_dist_class.<locals>.new_class 'l2_b_stoch' at 0x7fd9fd2e9240>,
 <pymc.distributions.new_dist_class.<locals>.new_class 'l3_a_stoch' at 0x7fd9fdfdff28>,
 <pymc.distributions.new_dist_class.<locals>.new_class 'l3_b_stoch' at 0x7fd9e04e3208>,
 <pymc.distributions.new_dist_class.<locals>.new_class 'l4_a_stoch' at 0x7fd9e04e36a0>,
 <pymc.distributions.new_dist_class.<locals>.new_class 'l4_b_stoch' at 0x7fd9e04e3978>,
 <pymc.distributions.new_dist_class.<locals>.new_class 'l5_a_stoch' at 0x7fd9e04e3748>,
 <pymc.distributions.new_dist_class.<locals>.new_class 'l5_b_stoch' at 0x7fd9e04e37b8>]

Let’s have a look at one:

# sample from a distribution
samples = [interface_Z_modifier[3].rand() for i in range(10000)]
# plot histogram
plt.hist(samples, bins=24, normed=True);
plt.xlabel("Z modifier")
plt.vlines(0, 0, 0.01)
plt.ylabel("n");
../_images/ch3_36_0.png

Now we need to sample from these distributions and feed the samples into GemPy.

Input data handling

First we need to write a function which modifies the input data for each iteration of the stochastic simulation. As this process is highly dependent on the simulation (e.g. which input parameters you want modified, and in which way), it generally can’t be automated.

The idea is to change the column Z (in this case) of the rescaled dataframes in our interp_data object (which can be found in interp_data.geo_data_res). First we simply create the pandas DataFrames we are interested in:

import copy
# First we extract from our original interp_data object the numerical data that is necessary for the interpolation.
# geo_data_stoch is a pandas DataFrame

# This is the initial model, so it has to be defined outside the stochastic frame
geo_data_stoch_init = copy.deepcopy(interp_data.geo_data_res)
gp.get_data(geo_data_stoch_init, numeric=True).head()
X Y Z G_x G_y G_z dip azimuth polarity
interfaces 0 0.2501 0.4801 0.5299 NaN NaN NaN NaN NaN NaN
1 0.7001 0.5201 0.5605 NaN NaN NaN NaN NaN NaN
2 0.6501 0.5001 0.5903 NaN NaN NaN NaN NaN NaN
3 0.7501 0.4801 0.5297 NaN NaN NaN NaN NaN NaN
4 0.3001 0.5201 0.5605 NaN NaN NaN NaN NaN NaN
@pymc.deterministic(trace=True)
def input_data(value = 0,
               interface_Z_modifier = interface_Z_modifier,
               geo_data_stoch_init = geo_data_stoch_init,
               verbose=0):
    # First we extract from our original interp_data object the numerical data that is necessary for the interpolation.
    # geo_data_stoch is a pandas DataFrame

    geo_data_stoch = gp.get_data(geo_data_stoch_init, numeric=True)
    # Now we loop over each group id that shares the same uncertainty variable. In this case, each layer.
    for e, gID in enumerate(group_ids):
        # First we obtain a boolean array with True where the group id coincides
        sel = gp.get_data(interp_data.geo_data_res, verbosity=2)['group_id'] == gID

        # We add to the original Z value (its mean) the stochastic bit in the corresponding group id
        geo_data_stoch.loc[sel, 'Z'] += np.array(interface_Z_modifier[e])

    if verbose > 0:
        print(geo_data_stoch)

    # Then return the input data to be fed into the modeling function. Due to the way pymc2
    # stores the traces, we need to save the data as numpy arrays
    return [geo_data_stoch.xs('interfaces')[["X", "Y", "Z"]].values, geo_data_stoch.xs('orientations').values]
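The boolean-mask pattern used inside input_data can be isolated in a small pandas sketch (the column values and offsets here are illustrative, not taken from the model):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group_id": ["l2_a", "l2_a", "l2_b"],
    "Z":        [0.53,   0.56,   0.53],
})

# one Z offset per group, e.g. drawn from the priors above
offsets = {"l2_a": 0.01, "l2_b": -0.02}

for gID, dz in offsets.items():
    sel = df["group_id"] == gID   # boolean mask: True where the group id coincides
    df.loc[sel, "Z"] += dz        # perturb all members of the group together

print(df["Z"].tolist())  # approximately [0.54, 0.57, 0.51]
```

All rows sharing a group_id move by the same amount, which is exactly how the three interface points on each flank of the anticline are perturbed collectively.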

Modeling function

Second, we need a function that takes the modified input data output by the above function and creates our geological model from it. Although we could store the model itself, the resulting file tends to become too large once we run thousands of iterations. For this reason it is preferable to keep the input data and the geological model separate, and to store only the input data (which we can use to reconstruct each geological model, since that operation is deterministic).
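This trade-off can be sketched without GemPy: we tally only the small input arrays and rebuild any realization on demand through a deterministic function (forward_model is a hypothetical stand-in for gp.compute_model):

```python
import numpy as np

def forward_model(z):
    # stand-in for gp.compute_model: deterministic in its input
    return np.full(1000, z.mean())  # a "large" model block

rng = np.random.default_rng(1)
# the trace holds only the small perturbed inputs, one entry per iteration
trace = [rng.normal(0.0, 1.0, size=3) for _ in range(100)]

# reconstruct iteration 42 later -- identical every time, so nothing is lost
m1 = forward_model(trace[42])
m2 = forward_model(trace[42])
print(np.array_equal(m1, m2))  # True
```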

@pymc.deterministic(trace=False)
def gempy_model(value=0,
                input_data=input_data, verbose=True):

    # modify input data values accordingly
    interp_data.geo_data_res.interfaces[["X", "Y", "Z"]] = input_data[0]

    # G_x, G_y, G_z are only used for visualization; the theano function gets azimuth, dip and polarity!
    interp_data.geo_data_res.orientations[["G_x", "G_y", "G_z", "X", "Y", "Z",  'dip', 'azimuth', 'polarity']] = input_data[1]

    try:
        # try to compute model
        lb, fb = gp.compute_model(interp_data)
        if verbose:
            gp.plot_section(interp_data.geo_data_res, lb[0], 0, plot_data=True)

        return lb, fb

    except np.linalg.linalg.LinAlgError as err:
        # if it fails (e.g. some input data combinations could lead to
        # a singular matrix and thus break the chain) return an empty model
        # with same dimensions (just zeros)
        if verbose:
            print("Exception occurred.")
        return np.zeros_like(lith_block), np.zeros_like(fault_block)
../_images/ch3_43_0.png
interp_data.geo_data_res.orientations
G_x G_y G_z X X_std Y Y_std Z Z_std annotations ... azimuth_std dip dip_std formation formation_number group_id isFault order_series polarity series
0 0.3001 0.5001 0.555743 -0.516992 NaN -0.008559 NaN 0.855947 NaN ${\bf{x}}_{\beta \,{\bf{1}},0}$ ... NaN 31.135451 NaN Layer 2 1 l2_a False 1 1.0 Default serie
1 0.7001 0.5001 0.559481 0.516122 NaN -0.014273 NaN 0.856397 NaN ${\bf{x}}_{\beta \,{\bf{1}},1}$ ... NaN 31.085652 NaN Layer 2 1 l2_b False 1 1.0 Default serie

2 rows × 21 columns

We then create a pymc model with the two deterministic functions (input_data and gempy_model), as well as all the prior parameter distributions stored in the list interface_Z_modifier:

params = [input_data, gempy_model, *interface_Z_modifier]
model = pymc.Model(params)

Then we set the number of iterations:

iterations = 100

Then we create an MCMC chain (in pymc, an MCMC chain without a likelihood function is essentially a Monte Carlo forward simulation) and specify an HDF5 database to store the results in:

RUN = pymc.MCMC(model, db="hdf5", dbname="./pymc-db/ch3-pymc2_tutorial-db")

and we are finally able to run the simulation:

RUN.sample(iter=100, verbose=0)
[-----------------100%-----------------] 100 of 100 complete in 6.3 sec
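Since no likelihood is involved, no proposal is ever rejected, and the chain above amounts to plain forward sampling. The same idea in a self-contained numpy sketch (the dimensions and the stand-in model are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_iter, n_groups = 100, 8  # iterations and group ids, as in the tutorial

results = []
for _ in range(n_iter):
    # draw one independent Z offset per group id -- the "priors"
    z_offsets = rng.normal(0.0, 20.0, size=n_groups)
    # push them through a deterministic model (trivial stand-in here)
    realization = 1000.0 + z_offsets
    results.append(realization)

results = np.array(results)
print(results.shape)  # (100, 8)
```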

Analyzing the results

When we want to analyze the results, we first have to load the stored geo_data object:

(this part is only necessary if the cells above were not executed)

geo_data = gp.read_pickle("./pickles/ch3-pymc2_tutorial_geo_data.pickle")

Check the stratigraphic pile for correctness:

gp.get_sequential_pile(geo_data)
<gempy.sequential_pile.StratigraphicPile at 0x7fd9e04e3a90>
../_images/ch3_56_1.png

Then we can compile the GemPy modeling function:

interp_data = gp.InterpolatorData(geo_data, u_grade=[1])

Now we can reproduce the original model:

lith_block, fault_block = gp.compute_model(interp_data)
gp.plot_section(geo_data, lith_block[0], 0)
Compiling theano function...
Compilation Done!
Level of Optimization:  fast_compile
Device:  cpu
Precision:  float32
Number of faults:  0
../_images/ch3_60_1.png

But of course we want to look at the perturbation results. We have a class for that:

(in the mid term, the most important methods of this class will be moved to the main gempy framework, i.e. gp)

import gempy.posterior_analysis
import importlib
importlib.reload(gempy.posterior_analysis)
<module 'gempy.posterior_analysis' from '../../gempy/posterior_analysis.py'>

This allows us to load the stored pymc2 database:

dbname = "./pymc-db/ch3-pymc2_tutorial-db"
post = gempy.posterior_analysis.Posterior(dbname)

Alright, it tells us that we did not tally any GemPy models (we set the trace flag for the gempy_model function to False!). But we can simply replace the input data with the values stored at each iteration. So let’s plot the model result of iteration 80:

post.change_input_data(interp_data, 80)
<gempy.interpolator.InterpolatorData at 0x7fd9dff67c50>

Then we compute the model and plot it:

lith_block, fault_block = gp.compute_model(interp_data)
gp.plot_section(interp_data.geo_data_res, lith_block[0], 2, plot_data=True)
../_images/ch3_68_0.png

or iteration 15:

post.change_input_data(interp_data, 15)
lith_block, fault_block = gp.compute_model(interp_data)
gp.plot_section(interp_data.geo_data_res, lith_block[0], 2, plot_data=True)
../_images/ch3_70_0.png

or iteration 95:

post.change_input_data(interp_data, 95)
lith_block, fault_block = gp.compute_model(interp_data)
gp.plot_section(geo_data, lith_block[0], 2)
../_images/ch3_72_0.png

As you can see, we have successfully perturbed the vertical layer interface positions, albeit only 100 times. While some models represent reasonable, geologically meaningful systems, others may not. This is due to the stochastic selection of input parameters and an inherent problem of Monte Carlo forward simulation: results do not get validated in any sense. This is where Bayesian inference comes into play, as a method to constrain modeling outcomes through the use of geological likelihood functions. We will introduce the approach and how to use it with GemPy and pymc in the following chapter.

Finally, we can extract the modeled surfaces and explore the posterior realizations in the real-time 3D viewer:

ver, sim = gp.get_surfaces(interp_data, lith_block[1], None, original_scale=False)
gp.plot_surfaces_3D_real_time(interp_data, ver, sim, posterior=post, alpha=1)