Download heatmapBins.py Here
In this post we will look at how to use the pandas and seaborn Python modules to create a heatmap of the mean values of a response variable over the 2-dimensional bins of a histogram.
The final product will be a heatmap of the mean response in each 2-dimensional bin.
Let’s get started by including the modules we will need in our example.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Use a seed to have reproducible results.
np.random.seed(20190121)
Simulate Data
Now, we simulate some data. We will have two features, both drawn from a standard normal distribution. The response variable z will simply be a linear function of the features: z = x - y.
nSamples = 1000
nCut = 10
def zFunction(X):
    # z = x - y, where x and y are the first and second columns of X.
    return X[:, 0] - X[:, 1]
print('Generating Data')
data = np.random.normal(size = (nSamples, 2))
data = pd.DataFrame(data)
data['z'] = zFunction(data.values)
print(data.info())
Here is the output of the data’s information.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
0 1000 non-null float64
1 1000 non-null float64
z 1000 non-null float64
dtypes: float64(3)
memory usage: 23.5 KB
None
Let’s take a look at a scatter plot.
plt.clf()
plt.title('Feature Data')
plt.scatter(data.loc[:, 0], data.loc[:, 1])
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.tight_layout()
plt.savefig('graphs/features.svg')
Here is the output.
Let’s also take a look at a density plot using seaborn.
plt.clf()
# Recent seaborn versions require x and y to be passed as keyword arguments.
sns.jointplot(x = data[0], y = data[1], kind = 'kde')
plt.gcf().suptitle('Density of Features')
plt.tight_layout()
plt.savefig('graphs/density.svg')
Here is the output.
Make Cuts for Using Pandas Groupby
Next, let us use pandas.cut() to make cuts for our 2d bins.
cuts = pd.DataFrame({str(feature) + 'Bin' : pd.cut(data[feature], nCut) for feature in [0, 1]})
print('At first, the cuts are categorical columns of pandas Interval values.')
print(cuts.head())
print(cuts.info())
Each bin value is a pandas.Interval, stored in a categorical column. Here is the head of the cuts dataframe.
0Bin 1Bin
0 (-0.476, 0.148] (-1.012, -0.387]
1 (-1.1, -0.476] (0.237, 0.861]
2 (0.148, 0.773] (0.237, 0.861]
3 (0.773, 1.397] (0.237, 0.861]
4 (0.773, 1.397] (-1.012, -0.387]
Here is the information on the cuts dataframe.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
0Bin 1000 non-null category
1Bin 1000 non-null category
dtypes: category(2)
memory usage: 2.3 KB
None
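Note that the dtype of each bin column is reported as category. The categories themselves form a pandas.IntervalIndex, and each value is a pandas.Interval, so one should use the attributes of those classes (such as left and right) to work with them.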
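To make the distinction between the categorical column, its IntervalIndex of categories, and the individual Interval values concrete, here is a minimal sketch. The toy values and the variable names (values, binned, categories) are made up for illustration and are not part of the post's script.

```python
import pandas as pd

# Hypothetical toy values, not the simulated data from this post.
values = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
binned = pd.cut(values, 2)  # two equal-width bins

# The column is categorical; each element is a pandas.Interval.
first = binned.iloc[0]
print(binned.dtype)
print(type(first).__name__)

# The categories of the column form a pandas.IntervalIndex.
categories = binned.cat.categories
print(type(categories).__name__)

# Endpoints are available on the whole index or on a single interval.
print(categories.left.tolist(), categories.right.tolist())
print(first.left, first.right)
```

The same accessors work on the 0Bin and 1Bin columns of cuts, for example cuts['0Bin'].cat.categories.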
Do the Groupby and Make Heatmap
Now, let’s find the mean of z for each 2d feature bin; we will do a groupby using the bins of both Feature 0 and Feature 1.
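# observed = False keeps empty bin combinations (as NaN), whatever the pandas default is.
means = data.join(cuts).groupby(list(cuts), observed = False).mean()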
means = means.unstack(level = 0) # Use level 0 to put 0Bin as columns.
# Reverse the order of the rows as the heatmap will print from top to bottom.
means = means.iloc[::-1]
print(means.head())
print(means['z'])
Let’s take a look at the head of means.
0 ... z
0Bin (-2.979, -2.349] ... (2.646, 3.27]
1Bin ...
(2.735, 3.359] NaN ... NaN
(2.11, 2.735] NaN ... NaN
(1.486, 2.11] NaN ... 0.971420
(0.861, 1.486] NaN ... 1.590939
(0.237, 0.861] -2.973171 ... 2.311033
[5 rows x 30 columns]
As we can see, the result holds the means of all three columns (0, 1, and z) for each bin, so we need to specify means['z'] to get the means of the response variable z. This gives
0Bin (-2.979, -2.349] ... (2.646, 3.27]
1Bin ...
(2.735, 3.359] NaN ... NaN
(2.11, 2.735] NaN ... NaN
(1.486, 2.11] NaN ... 0.971420
(0.861, 1.486] NaN ... 1.590939
(0.237, 0.861] -3.355331 ... 2.311033
(-0.387, 0.237] -2.616789 ... 2.746465
(-1.012, -0.387] -1.733825 ... NaN
(-1.636, -1.012] NaN ... 4.001416
(-2.26, -1.636] NaN ... NaN
(-2.891, -2.26] NaN ... NaN
[10 rows x 10 columns]
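The groupby, unstack, and row-reversal steps above can be checked by hand on a tiny table. The following is a minimal sketch; the column names (xBin, yBin) and the values are hypothetical and chosen so the means are easy to verify. It also selects z before taking the mean, which avoids carrying the feature columns along.

```python
import pandas as pd

# Hypothetical toy table: two already-binned features and a response z.
toy = pd.DataFrame({
    'xBin': pd.Categorical(['a', 'a', 'b', 'b'], categories=['a', 'b']),
    'yBin': pd.Categorical(['lo', 'lo', 'lo', 'hi'], categories=['lo', 'hi']),
    'z': [1.0, 3.0, 5.0, 7.0],
})

# observed=False keeps bin combinations with no samples; their mean is NaN.
grid = toy.groupby(['xBin', 'yBin'], observed=False)['z'].mean()

# Move xBin (level 0) to the columns, leaving yBin as the rows.
grid = grid.unstack(level=0)

# Reverse the rows so the highest yBin is drawn at the top of a heatmap.
grid = grid.iloc[::-1]
print(grid)
```

The bin (a, hi) contains no rows, so it shows up as NaN, just as the sparsely populated outer bins do in the output above.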
Let’s now graph a heatmap for the means of z.
plt.clf()
sns.heatmap(means['z'])
plt.title('Means of z vs Features 0 and 1')
plt.tight_layout()
plt.savefig('graphs/means1.svg')
This gives the graph:
As we can see, the x and y labels are intervals; this makes the graph look cluttered. Let us now use the left endpoint of each interval as a label, taken from the left attribute of each pandas.Interval.
plt.clf()
sns.heatmap(means['z'],
            xticklabels = means['z'].columns.map(lambda x : x.left),
            yticklabels = means['z'].index.map(lambda x : x.left))
plt.title('Means of z vs Features 0 and 1')
plt.tight_layout()
plt.savefig('graphs/means2.svg')
This gives our final graph:
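As a possible variation on the tick labels, the endpoints can be rounded, or the interval midpoints used instead of the left endpoints. The sketch below only builds the label lists from a hypothetical binned series; the names (x, bins, left_labels, mid_labels) are illustrative and not part of the original script.

```python
import numpy as np
import pandas as pd

# Hypothetical data, binned the same way as the features in this post.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
bins = pd.cut(x, 5)

# Each category is a pandas.Interval with left, right, and mid attributes.
intervals = bins.cat.categories
left_labels = [round(iv.left, 1) for iv in intervals]
mid_labels = [round(iv.mid, 1) for iv in intervals]

print(left_labels)
print(mid_labels)
```

Either list could then be passed to sns.heatmap through xticklabels or yticklabels, in the same way as the left endpoints above. Rounding keeps the labels short, and midpoints arguably describe the centre of each cell better than left endpoints do.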