Linear Regression
The Physics Hypertextbook™
© 1998-2008 by Glenn Elert -- A Work in Progress
All Rights Reserved -- Fair Use Encouraged
Discussion
introduction
Given a pile of data …
$$x_1, x_2, x_3, \ldots, x_n$$
and
$$y_1, y_2, y_3, \ldots, y_n$$
Find the line of best fit.
y = mx + b
We are shooting for a minimal amount of error as measured by the sum of the
squares of the residuals. This method is called a least squares fit and is
probably the most common form of best fit line.
$$R^2 = \sum_{i=1}^{n} (\Delta y_i)^2 = \sum_{i=1}^{n} \left[(m x_i + b) - y_i\right]^2 = \text{minimum}$$
This occurs where each of the partial derivatives is zero. (The limits on
the summations will be omitted out of laziness from now on.)
$$\frac{\partial}{\partial m} R^2 = 2 \sum \left\{\left[(m x_i + b) - y_i\right] x_i\right\} = 0$$
and
$$\frac{\partial}{\partial b} R^2 = 2 \sum \left[(m x_i + b) - y_i\right] = 0$$
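For anyone who wants to see the first step of that algebra: dividing each condition by 2 and pulling m and b out of the sums leaves two simultaneous linear equations in m and b. (They are commonly called the normal equations, a name this page does not otherwise use.) Solving the pair by elimination is the rest of the work.

```latex
\begin{aligned}
m \sum x_i^2 + b \sum x_i &= \sum x_i y_i \\
m \sum x_i + n b &= \sum y_i
\end{aligned}
```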
After a bit of algebra, you get these equations …
$$m = \frac{n \sum (x_i y_i) - \sum x_i \sum y_i}{n \sum (x_i^2) - \left(\sum x_i\right)^2}$$
and
$$b = \frac{\sum (x_i^2) \sum y_i - \sum x_i \sum (x_i y_i)}{n \sum (x_i^2) - \left(\sum x_i\right)^2}$$
or, if you work a bit more algebra, these equations …
$$m = \frac{\sum (x_i y_i) - n \bar{x} \bar{y}}{\sum (x_i^2) - n \bar{x}^2}$$
and
$$b = \frac{\bar{y} \sum (x_i^2) - \bar{x} \sum (x_i y_i)}{\sum (x_i^2) - n \bar{x}^2}$$
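The summation formulas translate almost word for word into code. Here is a minimal Python sketch of the first pair (the ones written with n and the raw sums). The function name `fit_line` and the four sample points are made up for illustration; they do not come from any of the data files on this page.

```python
def fit_line(xs, ys):
    """Least squares slope m and intercept b from the summation formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    # Common denominator of both formulas: n*sum(x^2) - (sum(x))^2
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_xx * sum_y - sum_x * sum_xy) / denom
    return m, b


# Made-up points that lie exactly on y = 2x + 1
m, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)
```

Because the sample points sit exactly on a line, the fit should hand back that line's slope and intercept, which makes a handy sanity check before trying real data.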
How good is the line of best fit? Are some bests better than others? Here's
one way to decide. Swap the explanatory and response variables.
x = m'y + b'
The slope of this new linear equation is the same as the old one with all
the x's replaced by y's and vice versa. (Note that the numerator hasn't
really changed.)
$$m' = \frac{n \sum (y_i x_i) - \sum y_i \sum x_i}{n \sum (y_i^2) - \left(\sum y_i\right)^2}$$
Now, multiply this slope by the old slope. Don't ask why, just do it.
$$m\,m' = \left(\frac{n \sum (x_i y_i) - \sum x_i \sum y_i}{n \sum (x_i^2) - \left(\sum x_i\right)^2}\right) \left(\frac{n \sum (y_i x_i) - \sum y_i \sum x_i}{n \sum (y_i^2) - \left(\sum y_i\right)^2}\right)$$
This product is known as the coefficient of determination
$$r^2 = \frac{\left(n \sum (x_i y_i) - \sum x_i \sum y_i\right)^2}{\left(n \sum (x_i^2) - \left(\sum x_i\right)^2\right) \left(n \sum (y_i^2) - \left(\sum y_i\right)^2\right)}$$
and its square root is called the coefficient of correlation.
$$r = \frac{n \sum (x_i y_i) - \sum x_i \sum y_i}{\sqrt{n \sum (x_i^2) - \left(\sum x_i\right)^2}\, \sqrt{n \sum (y_i^2) - \left(\sum y_i\right)^2}}$$
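The "multiply the two slopes" recipe can be checked numerically. The Python sketch below computes m and m' from the same sums, multiplies them to get the coefficient of determination, and gives the square root the sign of the shared numerator so that it agrees with the formula for r. The function name `correlation` and both sample data sets are invented for illustration.

```python
import math


def correlation(xs, ys):
    """Return (r, r_squared), with r_squared formed as the product m * m'."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)

    # The numerator is shared by both slopes
    numerator = n * sum_xy - sum_x * sum_y
    m = numerator / (n * sum_xx - sum_x ** 2)        # slope of y on x
    m_prime = numerator / (n * sum_yy - sum_y ** 2)  # slope of x on y

    r_squared = m * m_prime
    # r carries the sign of the numerator, as in the formula for r
    r = math.copysign(math.sqrt(r_squared), numerator)
    return r, r_squared


# Made-up scattered data
r, r_squared = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(r, r_squared)

# Made-up data on a perfectly straight, downward-sloping line
r_line, r_squared_line = correlation([0, 1, 2], [3, 2, 1])
print(r_line, r_squared_line)
```

Data lying exactly on a falling line should give a correlation of negative one and a coefficient of determination of one; anything messier lands in between.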
Summary
Problems
practice
- electric-energy.txt
In the United States, electric energy is measured in kilowatt hours and purchased with dollars.
- Determine the equation of the best fit straight line.
- Explain the significance of the coefficients m, b, and r².
Solutions …
- Here's what the graph looks like.
- The slope (m) of a linear function is the rate of change of the vertical
quantity (y) with respect to the horizontal quantity (x). It should be apparent
that the slope of this graph is the average price for electricity per kilowatt
hour. The real question should be, why doesn't this graph intercept the vertical
axis at the origin? Surely, if I were to use no energy I should pay no money.
When I don't go to a restaurant, I don't get charged. Why should electricity
be any different? Well, there are two answers to this question. One is that
utility companies as legal monopolies are trying to extract every penny they
can from their captive customers. The second, which is the contention of the
utilities themselves, is that there are fixed expenses associated with every
customer regardless of how much energy they consume: maintenance, administration,
insurance, etc. Such fixed expenses are gathered under the umbrella term "basic
service charges". Thus, in this bill …
- m (the slope) is the average price for electric energy: 14.7¢
per kilowatt hour;
- b (the y-intercept) is the average basic service
charge: $9.81 per month. Note the lone data point on the left hand side
of the graph that illustrates this policy. The entire house was on vacation
for the entire month, the electricity was shut off at the circuit breaker,
and yet still there was a charge.
- r (the correlation coefficient) shows the correlation between
the energy and price: 0.96. The more important quantity for this analysis
is the square of this number -- the coefficient of determination: r² = 0.92.
This number shows that 92% of the variation in these electric bills is
due to the amount of energy consumed. The remaining 8% variation is due
to seasonal effects (electricity is always more expensive in summer when
air conditioner use drives up demand) and bracket billing policies (the
first quarter megawatt hour
or so is cheaper than the rest with this utility). These secondary variations
are evident in the data point in the extreme upper right hand corner of
the graph that lies well above the line of best fit. It occurred during
the summer when rates and consumption were at their highest.
- standard-atmosphere.txt
This text file provides standard meteorological data for the earth's atmosphere as a function of altitude above sea level.
- Find the transformation that will relate the pressure to altitude with a linear equation.
- Write the nonlinear equation that results.
Solutions …
- Start by examining a graph of the raw data.
[Graph: unadjusted data]
Looks like it could be some sort of inverse relationship, but none of
them work. "Inverse this power. Inverse that power." It seems as if nothing can straighten it out.
[Graphs: invert; invert and square; invert and square root]
Did I say it looks like some sort of inverse relationship? Then why does
it intercept the y-axis? An inverse relationship would be infinite at zero. It would never
cross the vertical axis. So what's going on?
This graph shows exponential decay. The cure for this deviation from
linearity is a logarithmic function. Base 10, base e, it doesn't matter.
Now we have a straight line.
[Graph: log base 10]
- Rearranging the variables gives …
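$$\begin{aligned}
y &= mx + b \\
\log(P) &= m \cdot h + b \\
10^{\log(P)} &= 10^{m \cdot h + b} = 10^{m \cdot h} \times 10^{b} = 10^{b} \times 10^{m \cdot h} \\
P &= 10^{2.04} \times 10^{-0.0641 \cdot h} = 110 \times 10^{-0.0641 \cdot h} \\
P &= 110\ \text{kPa} \times 10^{-(h \div 16\ \text{km})}
\end{aligned}$$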
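The whole transform-then-fit procedure is only a few lines of code. This Python sketch takes the base 10 logarithm of the pressure, fits a straight line, and converts the slope and intercept back into the two constants of the exponential. The data here are synthetic: they are generated from a made-up model (100 kPa at sea level, falling tenfold every 16 km) rather than read from standard-atmosphere.txt, so the numbers that come out are not the ones quoted in this solution.

```python
import math


def fit_line(xs, ys):
    """Least squares slope and intercept from the summation formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_xx * sum_y - sum_x * sum_xy) / denom
    return m, b


# Synthetic data (an assumption for this sketch, not the real file)
heights_km = [0, 4, 8, 12, 16, 20]
pressures_kpa = [100 * 10 ** (-h / 16) for h in heights_km]

# Transform: log(P) is linear in h
log_pressures = [math.log10(p) for p in pressures_kpa]
m, b = fit_line(heights_km, log_pressures)

# Undo the transform: P = 10^b * 10^(m*h)
sea_level_pressure = 10 ** b   # leading coefficient, in kPa
tenfold_height = -1 / m        # altitude gain per factor-of-ten drop, in km
print(sea_level_pressure, tenfold_height)
```

With noise-free synthetic data the fit should recover the constants that went in. Real data will not be so obliging, which is part of the point of the commentary on the coefficients.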
Some commentary on the values of the coefficients
- After transformation, the slope of the linear fit becomes a multiplier in
an exponent. The magnitude of the reciprocal of this value is an
important number. Whenever the altitude has this value or multiples
of this value the exponent will be a whole number. The slope calculated
was −0.0641 km⁻¹. (Inverse kilometers are used as the unit to cancel the kilometers in the
height.) The magnitude of the reciprocal of this value is about 16 km, which is conveniently close to 10 miles for the Americans. At this altitude, the exponent in our function would
equal negative one and the atmospheric pressure would be one-tenth
of its value at sea level. At twice this altitude, roughly 32 km, the exponent would equal negative two and the pressure would be one-one-hundredth
its value at sea level. At three times this altitude, 48 km, the exponent would equal negative three and the pressure would be one-one-thousandth
its sea level value. And so on, getting ever smaller, but never reaching
zero. This is what it means for a quantity to decay exponentially.
| altitude (km) | pressure (atm) | comment |
| 304 | 10⁻¹⁹ | space shuttle orbit |
| … | … | … |
| 96 | 0.000,001 | highest airplane flight |
| 80 | 0.000,01 | |
| 64 | 0.000,1 | |
| 48 | 0.001 | highest unmanned balloon flight |
| 32 | 0.01 | highest manned balloon flight |
| 16 | 0.1 | 50% higher than most commercial flights |
| 0 | 1 | sea level |
- The leading coefficient should equal the atmospheric pressure at sea level;
however, the value calculated (110 kPa) is significantly different from the value of the standard atmosphere
(101.325 kPa). Such is the nature of statistical analysis.
- Once again, it's really r² we're
interested in. This number is very close but not equal to one (r² = 0.999).
The atmosphere behaves quite simply when it comes to pressure and
can be adequately described by an exponential decay model.
- dash-world-records.txt (utf-8)
The text file referenced above has data on the world records for the 100 m dash. The data are broken up into four groups:
- men's electronically-timed world records,
- men's hand-timed world records,
- women's electronically-timed world records, and
- women's hand-timed world records
Analyze this data.
- Perform a linear regression on both men's and women's world record times.
- Explain the significance of the numerical results.
- Make an interesting prediction.
Source: International Association of Athletics Federations (IAAF)
Solutions …
- The graph …
- The numbers …
| men | women |
| y = mx + b, m = −0.008998 s/yr, b = +27.74 s, r = −0.9491 | y = mx + b, m = −0.02399 s/yr, b = +58.32 s, r = −0.9199 |
| The slope of this graph shows us that men's times are decreasing at approximately 0.01 seconds each year. | Women's times are decreasing faster, at about 0.024 seconds per year, more than twice the men's rate. |
| The y-intercept would be the world record in the year zero (a year that does not exist, by the way). Extrapolating this linear fit back 20 centuries would be a stupid thing to do. Surely there was someone around at the turn of the first millennium who could run a hundred meters in under 27 seconds. | The y-intercept for women is extra foolish. Nearly a minute to run 100 m? I don't think so. Linear regression is nice, but it isn't a religion. You don't have to believe everything it says. |
| The r value gives us an indication of how well the data can be explained by a linear model. Squaring −0.9491 gives us 0.9008, which means 90% of the variation in men's world record 100 m dash times is accounted for by the linear model. That's quite a reasonable fit to an artificial model. | The fit is not quite as tight for the women's times. Squaring r = −0.9199 yields a coefficient of determination of 0.8462. Thus a linear model only explains 85% of the variation in women's world record 100 m dash times. Still pretty good, as far as I'm concerned, for a messy data set like this one. |
- I find it somewhat surprising that the trends in world record times can
be so well explained by a linear model. I would have expected that the data
would show the athletes approaching some limit. Surely, humans can't keep
running faster and faster indefinitely. There must be some performance "wall" ahead
of them -- something to keep them from running faster than a speeding bullet.
As far as the last century goes, this appears not to be the case. Times have
been shrinking at a steady rate. Assuming they keep up like this, women sprinters
will eventually outrun their male counterparts some time in the middle of
the Twenty-first Century. We can even predict the year at which the transition
will occur. Set the two regression equations equal and see what happens.
$$\begin{aligned}
(mx + b)_\text{men} &= (mx + b)_\text{women} \\
-0.008998\,x + 27.74 &= -0.02399\,x + 58.32 \\
(0.02399 - 0.008998)\,x &= 58.32 - 27.74 \\
0.014992\,x &= 30.58 \\
x &= 30.58 \div 0.014992 \\
x &\approx 2040
\end{aligned}$$
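The same intersection can be found in a couple of lines of Python, which makes it painless to repeat with a different set of coefficients. The slopes and intercepts below are the ones quoted in this solution; the variable names are just for illustration.

```python
# Regression coefficients quoted in this solution (s/yr and s)
m_men, b_men = -0.008998, 27.74
m_women, b_women = -0.02399, 58.32

# Set m_men*x + b_men equal to m_women*x + b_women and solve for x
crossover_year = (b_women - b_men) / (m_men - m_women)

# Both lines should give the same time at that year
time_men = m_men * crossover_year + b_men
time_women = m_women * crossover_year + b_women

print(round(crossover_year))
```

Nothing here makes the extrapolation any more trustworthy. It only saves the arithmetic.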
If you really felt that world record times would follow a linear progression
you might even try determining the day in 2040 when the women catch up to
the men. But since I recognize the limitations of this model, I won't
be entering the office "men-vs.-women-hundred-meter-dash" pool.
In fact, if we choose a slightly different data set, we'll end up predicting
a significantly different transition year. These calculations are left
as an exercise for the reader.
- anscombe-data.txt
Source: Anscombe, F.J. "Graphs in Statistical Analysis," The American Statistician. Vol. 27, No. 1 (1973): 19.
This problem is not finished. These data sets have been rigged to have the same slope (0.50), y-intercept (3.00), and correlation (0.82).
- A linear fit is useful here.
- A linear fit is not useful here. This is probably a quadratic.
- That one outlier should be removed and a linear fit tried again.
- The linear fit is strongly affected by that one outlier. Without it, however, there isn't enough variation to see a trend. There isn't much that can be done with this one.
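The "rigged" claim is easy to test. The Python sketch below runs the summation formulas on the first two of Anscombe's four data sets, which share one list of x values. The numbers are typed in as they are commonly reproduced from the 1973 paper, not read from anscombe-data.txt (whose layout isn't shown here), and the assumption that they correspond to the first two graphs above is mine. The function name `line_and_r` is made up.

```python
import math


def line_and_r(xs, ys):
    """Return slope, intercept, and correlation from the summation formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denom_x = n * sum_xx - sum_x ** 2
    denom_y = n * sum_yy - sum_y ** 2

    m = numerator / denom_x
    b = (sum_xx * sum_y - sum_x * sum_xy) / denom_x
    r = numerator / math.sqrt(denom_x * denom_y)
    return m, b, r


# Anscombe's first two data sets (shared x values), as commonly reproduced
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_first = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_second = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

results = [line_and_r(x_shared, ys) for ys in (y_first, y_second)]
for m, b, r in results:
    print(f"m = {m:.2f}, b = {b:.2f}, r = {r:.2f}")
```

If the data were entered correctly, both lines of output should agree with each other and with the values quoted above to two decimal places, even though the graphs look nothing alike. That is the lesson of the exercise: the numbers alone don't tell you whether a linear fit makes sense.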
statistical
- For each of the following data sets …
- determine the equation of the best fit straight line(s) and
- explain the significance of the coefficients m, b, and r².
Here are the data sets …
- braking-distance.txt
In this road test, braking distances were measured for different
cars traveling at 60 mph and 80 mph. Graph these distances against one another.
Source: "Road Test Summary." Road & Track. (July 1998): 186-87.
- satellite-failures.txt
Satellites in low earth orbit (LEO) operate between 250 and 1500 km above the ground. Because Earth's atmosphere extends hundreds of miles
into space, LEOs eventually experience enough friction that they
fall back to earth and burn up. The accompanying text file gives
the number of low earth orbit satellites that reentered the earth's
atmosphere and the number of sunspots for each year since 1969. Graph
the number of reentered satellites vs. the number of sunspots.
Source:
NASA Goddard Space Flight Center.
- soap.txt
Two bars of soap in a bathroom shower were weighed almost every day
for about two weeks. Graph the mass of each as a function of time.
- standard-atmosphere.txt
This text file provides standard meteorological data for the earth's
atmosphere as a function of altitude above sea level. Graph temperature
as a function of altitude for the tropospheric portion of the atmosphere
from sea level to 11 km. (Do not analyze the entire data set. The atmosphere above 11 km behaves much differently.)
- toaster.txt
The duration of the toast cycle was measured for different light-dark
settings of a two-slot electric bread toaster. Graph cycle time as
a function of light-dark setting for this toaster when it held one
and two slices of bread.
- wavelength-of-light.txt
In this experiment the wavelengths of the visible line spectra for
an excited gas were measured using two different methods. Graph these
trials against one another.
- For each of the following data sets …
- find the transformation that will relate the two variables with a linear
equation and
- write the nonlinear equation that results.
Here are the data sets …
- aerodynamic-drag.txt
In this experiment students measured the aerodynamic drag on a weighted
party balloon falling at different speeds.
- constant-force.txt
In this experiment different masses were subject to the same force
and their accelerations recorded.
- milk-freshness.txt
The following data were taken from a milk carton sold in North Carolina.
Source: Greenler, Robert. Chasing the Rainbow. Milwaukee, WI: Elton-Wolf, 2000: 140.
- moore-law.txt
This data set shows the number of switches in a computer for various
years in the Twentieth Century.
Source: IBM Gallery of Science & Art, 590 Madison Avenue, Second Floor, New York, NY 10022 (July 1992).
- resonance-tube.txt
In this experiment various tuning forks of known frequency were held
above a resonance tube, which was used to determine the wavelength
of the sound emitted.
- Answer the questions associated with the following data sets.
- Determine the year when women sprinters will run as fast as their male counterparts
in the 100 m dash using …
- dash-electronic-timing.txt (utf-8)
only those world records that were timed electronically (as opposed
to manually).
- dash-olympic-gold-medals.txt (utf-8)
olympic gold medal winners (as opposed to world record setters).
Compare your results with those obtained in practice problem 3.
Source: International Association of Athletics Federations (IAAF)
- co2-mauna-loa.txt
Mauna Loa Observatory on the "Big Island" of Hawaii has been recording atmospheric carbon dioxide concentrations for
nearly half a century beginning in the year 1958. Readings are taken
continuously, but only the monthly averages are reported. Values
are reported in parts per million (ppm).
- Construct a graph of atmospheric CO2 concentration vs. time.
- What two obvious behaviors are revealed in your graph?
- Split the data set in half and perform a linear regression analysis on the
data for the years …
- 1958-1980 and
- 1981-2006.
- Compare the behavior of CO2 levels in the first half of the data set to the second half.
Source: Scripps Institution of Oceanography
- gw-vardo.txt
Global warming is most easily observed in long term temperature measurements
taken at high latitudes (near the poles). Vardø is a village in the extreme northeast of Norway on the Barents Sea. Despite
being a few degrees north of the Arctic Circle, its harbor remains
ice free due to the warm North Atlantic drift current (an extension
of the Gulf Stream). Vardø's climate is mild for its latitude,
which means it varies from a few °C above freezing in the summer
to a few °C below freezing in the winter. A location with such
a stable climate is a good place to check for human induced climate
change.
- Construct a graph of monthly average temperature vs. time for the period
1881 to 2006.
- Using linear regression, determine the following quantities for the whole
data set …
- the rate of change of temperature in °C per century
- the uncertainty in this value
- the coefficient of determination
- the root-mean-square error (if you have the ability to calculate this number)
- Divide the data set up into four equal intervals of roughly 378 months (31.5 years) and repeat.
- Compile your results in a table like the one below and comment on the manner
in which temperatures have changed at Vardø in this 125 year
period. (Use the results of all four calculated columns in your
analysis, not just the rate of temperature change.)
| time interval | ΔT/Δt (°C/100 y) | uncertainty (°C/100 y) | r² | rmse (°C) |
| overall (1881-2006) | | | | |
| 1st quarter (1881-1912) | | | | |
| 2nd quarter (1912-1943) | | | | |
| 3rd quarter (1944-1975) | | | | |
| 4th quarter (1975-2006) | | | | |
Source: NASA Goddard Institute for Space Studies
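If your software doesn't report all four quantities, they can be computed by hand. The Python sketch below is one way to do it, under two assumptions that are mine rather than the problem's: the uncertainty in the slope is taken to be its standard error (residual variance with n − 2 degrees of freedom, divided by the spread in x), and the root-mean-square error is the square root of the mean squared residual. The five data points are invented, so the output says nothing about Vardø.

```python
import math


def regression_summary(xs, ys):
    """Slope, its standard error, r squared, and rmse for a straight line fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squared deviations from the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    m = sxy / sxx
    b = mean_y - m * mean_x

    # Sum of squared residuals about the fitted line
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

    return {
        "slope": m,
        "intercept": b,
        "slope_uncertainty": math.sqrt(ss_res / (n - 2) / sxx),  # assumed: standard error
        "r_squared": sxy ** 2 / (sxx * syy),
        "rmse": math.sqrt(ss_res / n),                           # assumed definition
    }


# Made-up data for illustration only
summary = regression_summary([0, 1, 2, 3, 4], [0.1, 0.9, 2.2, 2.8, 4.0])
for name, value in summary.items():
    print(name, value)
```

The slope and its uncertainty come out in y units per x unit. With temperatures in °C and time counted in months, for example, multiply both by 1200 to get °C per century.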
- gw-central-park.txt
[Note: This is an extension of the previous problem, but it can be worked
on independently with little loss of meaning.]
Surface air temperatures have increased in New York City on the order
of one degree Celsius in the Twentieth Century -- consistent with
the trend of global warming. New York is the largest city in the
United States and the fourth largest metropolitan area on the planet.
8.5 million people live within the city limits and an additional
10 million are within commuting distance. With a gross metropolitan
product approaching one trillion dollars ($10¹²), the economy
of New York City is larger than that of all but a dozen or so nations.
This geographic concentration of people and economic power
must certainly have an effect on the local climate. Repeat the analysis
described in the previous problem using 125 years worth of temperature
measurements taken in Central
Park in New York City.
- Construct a graph of monthly average temperature vs. time for the period
1881 to 2006.
- Using linear regression, determine the following quantities for the whole
data set …
- the rate of change of temperature in °C per century
- the uncertainty in this value
- the coefficient of determination
- the root-mean-square error (if you have the ability to calculate this number)
- Divide the data set up into four equal intervals of roughly 378 months (31.5 years) and repeat.
- Compile your results in a table like the one below and comment on the manner
in which temperatures have changed at New York City in this 125
year period. (Use the results of all four calculated columns
in your analysis, not just the rate of temperature change.)
| time interval | ΔT/Δt (°C/100 y) | uncertainty (°C/100 y) | r² | rmse (°C) |
| overall (1881-2006) | | | | |
| 1st quarter (1881-1912) | | | | |
| 2nd quarter (1912-1943) | | | | |
| 3rd quarter (1944-1975) | | | | |
| 4th quarter (1975-2006) | | | | |
Source: NASA Goddard Institute for Space Studies
algebraic
- Idea for a problem set. Transform the following nonlinear equations into
linear equations by the appropriate change of variables. For each transformed
equation, identify …
- the new x variable,
- the new y variable,
- the slope, m, and
- the y intercept.
Here are the equations …
- equation one
- equation two
- and so on
Resources
- gender gap in the 100 m dash
- anscombe data set