The other day (which also just happened to be my birthday), I had a complete and utter brain cramp as I searched for the name of what is perhaps the most commonly used graph to display distributions of discrete and continuous data. Of course my loving family lost NO time in observing that, given my advanced age, it was a wonder that I remembered much of anything at all. The word that I was searching for was probably (they added helpfully) in the same place as my often-misplaced glasses. Ah, yes — they are a hilarious bunch.
Days later, when out of the blue the name — HISTOGRAM!!! — finally came to me, I decided to treat it as a sign that I should write about it in this week’s newsletter.
The graph whose name eluded me that day, the histogram, is perhaps the type most commonly used to display distributions. While a bar chart pictures a frequency distribution for either nominal or ordinal data, a histogram does the same for discrete or continuous data. The horizontal axis displays the true limits of the various intervals, that is, the points that separate each interval from its neighbors. For example, the boundary between the first two classes in the serum cholesterol levels shown below is 119.5 mg/100ml; that boundary is the true upper limit of the interval 80-119 and the true lower limit of 120-159.
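For the code-minded, here is a minimal Python sketch of how true limits fall out of the stated class intervals. The function name `true_limits` is my own invention for illustration, and the sketch assumes the same gap between every pair of adjacent intervals (here, 1 mg/100ml).

```python
def true_limits(intervals):
    """Return the true limits for a list of adjacent (low, high) class intervals.

    Each interior limit sits halfway between the stated upper limit of one
    interval and the stated lower limit of the next.
    """
    # Gap between adjacent stated limits, e.g. 120 - 119 = 1 (assumed constant).
    gap = intervals[1][0] - intervals[0][1]

    # Outer lower limit: extend the first interval down by half a gap.
    limits = [intervals[0][0] - gap / 2]

    # Interior limits: midpoint between one upper limit and the next lower limit.
    for (_, high), (next_low, _) in zip(intervals, intervals[1:]):
        limits.append((high + next_low) / 2)

    # Outer upper limit: extend the last interval up by half a gap.
    limits.append(intervals[-1][1] + gap / 2)
    return limits


intervals = [(80, 119), (120, 159), (160, 199)]
print(true_limits(intervals))  # [79.5, 119.5, 159.5, 199.5]
```

The second value, 119.5, is the boundary shared by the intervals 80-119 and 120-159.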
The vertical axis of a histogram depicts either the frequency or the relative frequency of observations within each interval. Here are two histograms that I created to display the relative frequency of serum cholesterol levels in men of two different age groups.
You’ll notice that, unlike a simple bar chart, where each bar is clearly separate from every other bar, the bars in a histogram touch to remind us that the data being displayed on the x axis is continuous. To emphasize this unity — and in another difference from bar charts, which often rely on sorting and ranking for visual interpretation of the data — the bars of a histogram cannot be rearranged.
It is also important to note that the frequency associated with each interval in a histogram is represented not by the height of the bar above it but by the bar’s area. In the first histogram above, 1% of the total area corresponds to the 13 observations lying between 80 and 119 mg/100ml, and 14% of the area corresponds to the 150 observations between 120 and 159 mg/100ml. The area of the entire histogram equals 100%.
I divided the data into these two age groups and created two different displays to help you see how well a histogram shows the distribution of the data. In the first graph, we can see that the younger men’s serum cholesterol distribution sits further to the left (that is, their levels are lower) than that of the men in the older age group. It’s also clear that the average for the younger men is in the 160-199 range, versus a range 40 mg/100ml higher for the older men (or, it’s hell to get old).
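To make the area rule concrete, here is a rough Python sketch. Only the first two counts (13 and 150) come from the text; the other six counts are illustrative placeholders, and the eight equal-width intervals running from 79.5 to 399.5 are an assumption on my part.

```python
# True limits of eight equal-width intervals: 79.5, 119.5, ..., 399.5 (assumed).
edges = [79.5 + 40 * i for i in range(9)]

# Only 13 and 150 are taken from the text; the rest are placeholders.
counts = [13, 150, 442, 299, 115, 34, 9, 5]

total = sum(counts)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

# Bar height as a density: relative frequency divided by interval width.
heights = [count / total / width for count, width in zip(counts, widths)]

# Bar area = height * width, which recovers the relative frequency.
areas = [height * width for height, width in zip(heights, widths)]

for lo, hi, area in zip(edges, edges[1:], areas):
    print(f"{lo:5.1f} to {hi:5.1f}: {area:6.1%} of the total area")

print(f"Total area: {sum(areas):.0%}")
```

When every interval has the same width, as here, height and area are proportional, so reading heights gives the same picture. The distinction starts to matter once the intervals have unequal widths.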
If we want to compare the distribution of the two groups of cholesterol levels directly, we can create a frequency polygon — similar to a histogram in many ways, but an easier method for comparing two sets of data and highlighting the shape of the distribution.
A frequency polygon, which uses the same two axes as a histogram, is constructed by placing a point at the center of each interval such that the height of the point is equal to the frequency or relative frequency associated with that interval. Points are also placed on the horizontal axis at the midpoints of the intervals immediately preceding and immediately following the intervals that contain observations (that is why in this graph you will see the value 59.5 immediately preceding the first midpoint value of 99.5 that was displayed on the histograms above). The frequency polygon makes it much easier to see how the distributions of the two sets of data differ. (And it’s still hell to get old.)
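The construction just described can be sketched in a few lines of Python. The function name `polygon_points` and the relative frequencies below are hypothetical, and the sketch assumes all intervals share the same width.

```python
def polygon_points(edges, freqs):
    """Return the (x, y) vertices of a frequency polygon.

    edges: true limits of equal-width intervals (one more value than freqs).
    freqs: frequency or relative frequency for each interval.
    """
    width = edges[1] - edges[0]

    # One point at the center of each interval that contains observations.
    midpoints = [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

    # Anchor the polygon on the horizontal axis at the midpoints of the
    # intervals just before and just after the data.
    xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))


edges = [79.5, 119.5, 159.5, 199.5]
freqs = [0.1, 0.6, 0.3]  # hypothetical relative frequencies
points = polygon_points(edges, freqs)
print(points[0], points[1])  # (59.5, 0) (99.5, 0.1)
```

The first vertex lands at 59.5 with a height of zero, just ahead of the first real midpoint at 99.5, which matches the values on the graph.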
I have no idea why I couldn’t think of the word “histogram” the other day. Advancing age (after all, I was a year older), even increased cholesterol levels, may have had something to do with it. On the other hand, the slip-up inspired me to write about histograms, and perhaps doing so burned the term into my brain.
Now…where did I leave my glasses?
1 Comment
Meic Goodyear · May 9, 2014 at 6:48 am
The colour coding and series labelling is exemplary, but I have deep reservations about the polygons as presented here. Yes, they make the comparison easier, but they imply things that the data used do not support. The histograms make it clear that there are only eight values in each, and each value refers to a defined range. The use of continuous numeric scales for the horizontal axes, combined with thick lines without point markers, invites the reader to interpolate an (x,y) value for any point on the line, and this is surely wrong.
I suggest making the lines faint or dotted, and adding prominent data markers, and using a line chart rather than x-y scatterplot.