Heatmap of Mean Values in 2D Histogram Bins

22 Jan 2019

Download heatmapBins.py Here

In this post we will look at how to use the pandas and seaborn Python modules to create a heatmap of the mean values of a response variable over the 2-dimensional bins of a histogram.

The final product will be

Final heatmap of z vs Features 0 and 1

Let’s get started by including the modules we will need in our example.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Use a seed to have reproducible results.

np.random.seed(20190121)

Simulate Data

Now we simulate some data. We will have two features, each drawn independently from a standard normal distribution (mean 0, standard deviation 1). The response variable z will simply be a linear function of the features: z = x - y.

nSamples = 1000 
nCut = 10

def zFunction(X):
    # z = x - y
    return X[:, 0] - X[:, 1]

print('Generating Data')
data = np.random.normal(size = (nSamples, 2))
data = pd.DataFrame(data)
data['z'] = zFunction(data.values)
print(data.info())

Here is the output of the data’s information.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
0    1000 non-null float64
1    1000 non-null float64
z    1000 non-null float64
dtypes: float64(3)
memory usage: 23.5 KB
None
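
As a quick sanity check on the simulated data, here is a minimal sketch that regenerates the features with the same seed and looks at two summary numbers. Because the features are independent standard normals, their correlation should be near 0 and the variance of z = x - y should be near 1 + 1 = 2. The names check, feature_corr and z_variance are made up for this sketch and are not part of heatmapBins.py.

```python
import numpy as np
import pandas as pd

np.random.seed(20190121)

# Hypothetical names for this check only; the post's own variables are untouched.
check = pd.DataFrame(np.random.normal(size=(1000, 2)))
check['z'] = check[0] - check[1]

feature_corr = check[0].corr(check[1])   # expected to be close to 0
z_variance = check['z'].var()            # expected to be close to 2

print('correlation between features:', round(feature_corr, 3))
print('sample variance of z:', round(z_variance, 3))
```

With only 1000 samples both numbers fluctuate a little around their ideal values, so treat them as rough checks rather than exact targets.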

Let’s take a look at a scatter plot.

plt.clf()
plt.title('Feature Data')
plt.scatter(data.loc[:, 0], data.loc[:, 1])
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.tight_layout()
plt.savefig('graphs/features.svg')

Here is the output.

Scatter Plot of Features
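
One practical note: plt.savefig does not create missing folders, so the graphs/ directory used in the file names has to exist before a plot is saved. A minimal sketch using only the standard library (the variable name output_dir is just for illustration):

```python
import os

# Create the folder that the savefig calls write into.
# exist_ok=True means no error is raised if the folder is already there.
output_dir = 'graphs'
os.makedirs(output_dir, exist_ok=True)
```

Running this once near the top of the script should be enough for all of the plots in this post.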

Let’s also take a look at a density plot using seaborn.

plt.clf()
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally.
sns.jointplot(x = data[0], y = data[1], kind = 'kde')
plt.gcf().suptitle('Density of Features')
plt.tight_layout()
plt.savefig('graphs/density.svg')

Here is the output.

Density Plot of Features

Make Cuts for Using Pandas Groupby

Next, let us use pandas.cut() to make cuts for our 2d bins.

cuts = pd.DataFrame({str(feature) + 'Bin' : pd.cut(data[feature], nCut) for feature in [0, 1]})
print('The cut values are pandas Interval objects stored as categories.')
print(cuts.head())
print(cuts.info())

Each bin value is a pandas.Interval, and the full set of bins for a column is stored as a pandas.IntervalIndex. Here is the head of the cuts dataframe.

              0Bin              1Bin
0  (-0.476, 0.148]  (-1.012, -0.387]
1   (-1.1, -0.476]    (0.237, 0.861]
2   (0.148, 0.773]    (0.237, 0.861]
3   (0.773, 1.397]    (0.237, 0.861]
4   (0.773, 1.397]  (-1.012, -0.387]
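
To see what pandas.cut() does when it is given an integer number of bins, here is a small sketch on made-up values (the names values and binned are only for this illustration). The range of the data is split into equal-width bins that are closed on the right, and the lowest edge is nudged slightly below the minimum so that the minimum value still falls into the first bin.

```python
import pandas as pd

# Five evenly spaced toy values cut into 4 equal-width bins.
values = pd.Series([0.0, 2.5, 5.0, 7.5, 10.0])
binned = pd.cut(values, 4)

# The categories are the bins themselves; the codes say which bin each value fell in.
print(binned.cat.categories)
print(list(binned.cat.codes))
```

Since the bins are closed on the right, 2.5 lands in the first bin together with 0.0, so the first two codes are both 0.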

Here is the information on the cuts dataframe.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
0Bin    1000 non-null category
1Bin    1000 non-null category
dtypes: category(2)
memory usage: 2.3 KB
None

Note that the columns have dtype category, but their categories form a pandas.IntervalIndex, so one should use the attributes of pandas.Interval and pandas.IntervalIndex (such as left and right) to work with them.
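
Here is a minimal sketch of how one might reach the interval information behind a category column, again on made-up values. The names values, binned, categories and first are hypothetical and only serve this example.

```python
import pandas as pd

values = pd.Series([0.0, 2.5, 5.0, 7.5, 10.0])
binned = pd.cut(values, 4)

# The categories of the column are a pandas.IntervalIndex ...
categories = binned.cat.categories
print(type(categories).__name__)
print(list(categories.right))

# ... while a single entry of the column is a pandas.Interval.
first = binned.iloc[0]
print(first.left, first.right, first.closed)
```

The same .cat.categories access should work on cuts['0Bin'] and cuts['1Bin'] from the example above.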

Do the Groupby and Make Heatmap

Now, let’s find the mean of z for each 2d feature bin; we will be doing a groupby using both of the bins for Feature 0 and Feature 1.

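# observed = False keeps empty bin combinations (as NaN) whatever the pandas default is.
means = data.join(cuts).groupby(list(cuts), observed = False).mean()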
means = means.unstack(level = 0) # Use level 0 to put 0Bin as columns.

# Reverse the order of the rows as the heatmap will print from top to bottom.
means = means.iloc[::-1]
print(means.head())
print(means['z'])

Let’s take a look at the head of means.

                              0      ...                  z
0Bin           (-2.979, -2.349]      ...      (2.646, 3.27]
1Bin                                 ...                   
(2.735, 3.359]              NaN      ...                NaN
(2.11, 2.735]               NaN      ...                NaN
(1.486, 2.11]               NaN      ...           0.971420
(0.861, 1.486]              NaN      ...           1.590939
(0.237, 0.861]        -2.973171      ...           2.311033

[5 rows x 30 columns]

As we can see, we need to specify means['z'] to get the means of the response variable z. This gives

0Bin              (-2.979, -2.349]      ...        (2.646, 3.27]
1Bin                                    ...                     
(2.735, 3.359]                 NaN      ...                  NaN
(2.11, 2.735]                  NaN      ...                  NaN
(1.486, 2.11]                  NaN      ...             0.971420
(0.861, 1.486]                 NaN      ...             1.590939
(0.237, 0.861]           -3.355331      ...             2.311033
(-0.387, 0.237]          -2.616789      ...             2.746465
(-1.012, -0.387]         -1.733825      ...                  NaN
(-1.636, -1.012]               NaN      ...             4.001416
(-2.26, -1.636]                NaN      ...                  NaN
(-2.891, -2.26]                NaN      ...                  NaN

[10 rows x 10 columns]
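
The NaN entries are bin combinations that contain no samples. To make that concrete, here is a small sketch on a toy dataframe, with all names (toy, toy_cuts, grid, counts) invented for the illustration. It also shows one possible variation: selecting the z column before aggregating, so that the result has a single level of columns.

```python
import pandas as pd

# Four toy points; no point has a high x together with a low y.
toy = pd.DataFrame({
    'x': [0.0, 0.0, 1.0, 1.0],
    'y': [0.0, 1.0, 1.0, 1.0],
    'z': [1.0, 2.0, 4.0, 6.0],
})
toy_cuts = pd.DataFrame({
    'xBin': pd.cut(toy['x'], 2),
    'yBin': pd.cut(toy['y'], 2),
})
grouped = toy.join(toy_cuts).groupby(list(toy_cuts), observed=False)

# Mean of z per 2d bin: xBin as columns, yBin as rows, rows reversed for plotting.
grid = grouped['z'].mean().unstack(level=0).iloc[::-1]

# Number of samples per 2d bin, arranged the same way.
counts = grouped.size().unstack(level=0).iloc[::-1]

print(grid)
print(counts)
```

The cell with a count of 0 is the cell whose mean is NaN. In the real data, looking at the counts next to the means is a cheap way to spot bins whose mean rests on only one or two samples.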

Let’s now graph a heatmap for the means of z.

plt.clf()
sns.heatmap(means['z']) 
plt.title('Means of z vs Features 0 and 1')
plt.tight_layout()
plt.savefig('graphs/means1.svg')

This gives the graph:

Heatmap with Interval Labels

As we can see, the x and y labels are intervals; this makes the graph look cluttered. Let us now use the left endpoint of each interval as a label instead. We will use the left attribute of each pandas.Interval.

plt.clf()
sns.heatmap(means['z'], xticklabels = means['z'].columns.map(lambda x : x.left),
                        yticklabels = means['z'].index.map(lambda x : x.left))
plt.title('Means of z vs Features 0 and 1')
plt.tight_layout()
plt.savefig('graphs/means2.svg')

This gives our final graph:

Final graph
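
Left endpoints are only one possible choice of label. As a sketch of an alternative, the midpoint of each bin can be read off the IntervalIndex directly. The explicit bin edges below are an assumption made to keep the numbers tidy, and the names edges, bins, left_labels and mid_labels are hypothetical.

```python
import pandas as pd

# Explicit, evenly spaced edges chosen only to keep the example readable.
edges = [0.0, 2.0, 4.0, 6.0]
values = pd.Series([1.5, 2.5, 5.5])
bins = pd.cut(values, edges).cat.categories

left_labels = list(bins.left)   # left endpoint of each bin
mid_labels = list(bins.mid)     # midpoint of each bin

print(left_labels)
print(mid_labels)
```

Either list could then be passed as xticklabels or yticklabels to sns.heatmap, in the same way as the left endpoints above.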
