In this post we will look at how to use the pandas and seaborn Python modules to
create a heatmap of the mean values of a response variable over the 2-dimensional bins of a histogram.
The final product will be a heatmap of the binned means with readable axis labels.
Let’s get started by including the modules we will need in our example.
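The original import list is not shown here, so the following is a minimal sketch of the modules the rest of the post relies on. The aliases (np, pd, plt, sns) are the conventional ones and are an assumption.

```python
# Assumed imports: numpy for simulation, pandas for binning and groupby,
# matplotlib and seaborn for plotting.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```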
Simulate Data
Now, we simulate some data. We will have two features, x and y, both drawn from standard normal distributions. The
response variable z will simply be a linear function of the features: z = x - y.
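A minimal sketch of the simulation, assuming the features are stored in columns named Feature 0 and Feature 1 (the names used later in the post). The sample size and the random seed are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility
n = 2_000                       # assumed sample size

# Two features drawn from a standard normal distribution.
data = pd.DataFrame({
    "Feature 0": rng.standard_normal(n),
    "Feature 1": rng.standard_normal(n),
})

# Response variable: z = x - y.
data["z"] = data["Feature 0"] - data["Feature 1"]

data.info()
```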
Here is the output of the data’s information.
Let’s take a look at a scatter plot.
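One possible way to draw the scatter plot is sketched below. Coloring the points by z is an assumption made here to show the response variable; the sample size and seed are also assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # assumed seed
n = 2_000                       # assumed sample size
data = pd.DataFrame({
    "Feature 0": rng.standard_normal(n),
    "Feature 1": rng.standard_normal(n),
})
data["z"] = data["Feature 0"] - data["Feature 1"]

# Scatter plot of the two features, colored by the response z (an assumption).
fig, ax = plt.subplots(figsize=(6, 5))
points = ax.scatter(data["Feature 0"], data["Feature 1"],
                    c=data["z"], s=8, cmap="viridis")
fig.colorbar(points, ax=ax, label="z")
ax.set_xlabel("Feature 0")
ax.set_ylabel("Feature 1")
# In a script, call plt.show() to display the figure.
```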
Here is the output.
Let’s also take a look at a density plot using seaborn.
Here is the output.
Make Cuts for Use with Pandas Groupby
Next, let us use pandas.cut() to make cuts for our 2d bins.
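A minimal sketch of the binning step follows. The bin edges (12 bins of width 0.5 between -3 and 3) and the column names ending in "bin" are assumptions for illustration; points outside the edges become NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed
n = 2_000                       # assumed sample size
data = pd.DataFrame({
    "Feature 0": rng.standard_normal(n),
    "Feature 1": rng.standard_normal(n),
})
data["z"] = data["Feature 0"] - data["Feature 1"]

edges = np.linspace(-3, 3, 13)  # assumed bin edges

# One categorical column of intervals per feature.
cuts = pd.DataFrame({
    f"{col} bin": pd.cut(data[col], bins=edges)
    for col in ["Feature 0", "Feature 1"]
})

print(cuts.head())
cuts.info()
```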
Each bin is a pandas.Interval, and the bins of each feature together form a pandas.IntervalIndex. Here is the head of the cuts dataframe.
Here is the information on the cuts dataframe.
Note that the dtype of the bin columns is reported as category, but the categories themselves form a
pandas.IntervalIndex, so one should use its methods to work with the bins.
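To illustrate the point, here is a small sketch on toy values (not the post's data) showing how the intervals behind a category column can be reached and queried.

```python
import numpy as np
import pandas as pd

values = pd.Series([-1.2, 0.1, 0.4, 2.7])  # toy values for illustration
binned = pd.cut(values, bins=np.linspace(-3, 3, 13))

print(binned.dtype)  # category

# The categories of the column are the bins, stored as an IntervalIndex.
intervals = binned.cat.categories
print(type(intervals).__name__)

# IntervalIndex exposes the endpoints and midpoints of all bins at once.
print(intervals.left[:3])
print(intervals.right[:3])
print(intervals.mid[:3])

# A single entry is a pandas.Interval with its own endpoints.
first = binned.iloc[0]
print(first.left, first.right)
```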
Do the Groupby and Make Heatmap
Now, let’s find the mean of z for each 2d feature bin; we will be doing a groupby using both of the bins
for Feature 0 and Feature 1.
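The groupby can be sketched as follows. The bin edges, the column names, and the unstack() call are assumptions; unstack() is used here so that selecting 'z' from the result yields a 2D table with one row per Feature 0 bin and one column per Feature 1 bin. Passing observed=False keeps empty bins in the result as NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # assumed seed
n = 2_000                       # assumed sample size
data = pd.DataFrame({
    "Feature 0": rng.standard_normal(n),
    "Feature 1": rng.standard_normal(n),
})
data["z"] = data["Feature 0"] - data["Feature 1"]

edges = np.linspace(-3, 3, 13)  # assumed bin edges
cuts = pd.DataFrame({
    f"{col} bin": pd.cut(data[col], bins=edges)
    for col in ["Feature 0", "Feature 1"]
})

# Group by both bin columns and average every numeric column per 2D bin.
means = (
    data.join(cuts)
        .groupby(list(cuts.columns), observed=False)
        .mean()
        .unstack()  # move the Feature 1 bins into the columns
)

print(means.head())
print(means["z"].head())
```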
Let’s take a look at the head of means.
As we can see, we need to specify means['z'] to get the means of the response variable z. This gives
Let’s now graph a heatmap for the means of z.
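A minimal sketch of the heatmap call, rebuilding the table of means under the same assumptions as before (bin edges, column names, seed, and the unstack() step are all illustrative choices).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)  # assumed seed
n = 2_000                       # assumed sample size
data = pd.DataFrame({
    "Feature 0": rng.standard_normal(n),
    "Feature 1": rng.standard_normal(n),
})
data["z"] = data["Feature 0"] - data["Feature 1"]

edges = np.linspace(-3, 3, 13)  # assumed bin edges
cuts = pd.DataFrame({
    f"{col} bin": pd.cut(data[col], bins=edges)
    for col in ["Feature 0", "Feature 1"]
})
means = (
    data.join(cuts)
        .groupby(list(cuts.columns), observed=False)
        .mean()
        .unstack()
)

# Heatmap of the mean of z per 2D bin; empty bins (NaN) are left blank.
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(means["z"], ax=ax)
# In a script, call plt.show() to display the figure.
```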
This gives the graph:
As we can see, the x and y labels are intervals; this makes the graph look cluttered. Let us
now use the left endpoint of each interval as a label. We will use pandas.IntervalIndex.left.
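One way to do the relabeling is sketched below: the interval labels on both axes are replaced by their left endpoints before plotting. The bin edges, column names, seed, and the inverted y axis are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)  # assumed seed
n = 2_000                       # assumed sample size
data = pd.DataFrame({
    "Feature 0": rng.standard_normal(n),
    "Feature 1": rng.standard_normal(n),
})
data["z"] = data["Feature 0"] - data["Feature 1"]

edges = np.linspace(-3, 3, 13)  # assumed bin edges
cuts = pd.DataFrame({
    f"{col} bin": pd.cut(data[col], bins=edges)
    for col in ["Feature 0", "Feature 1"]
})
means = (
    data.join(cuts)
        .groupby(list(cuts.columns), observed=False)
        .mean()
        .unstack()
)

# Replace the interval labels with the left endpoint of each interval.
z_means = means["z"].copy()
z_means.index = pd.IntervalIndex(list(z_means.index)).left
z_means.columns = pd.IntervalIndex(list(z_means.columns)).left

fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(z_means, ax=ax)
ax.set_xlabel("Feature 1 (left bin edge)")
ax.set_ylabel("Feature 0 (left bin edge)")
ax.invert_yaxis()  # presentation choice: Feature 0 increases upward
# In a script, call plt.show() to display the figure.
```

Using the midpoint of each interval (the mid attribute) instead of the left endpoint would work the same way.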