Linear Regression

The Physics Hypertextbook
© 1998-2008 by Glenn Elert -- A Work in Progress
All Rights Reserved -- Fair Use Encouraged



Discussion

introduction

Given a pile of data …

$x_1, x_2, x_3, \ldots, x_n$ and $y_1, y_2, y_3, \ldots, y_n$

Find the line of best fit.

y = mx + b

We are shooting for a minimal amount of error as measured by the sum of the squares of the residuals. This method is called a least squares fit and is probably the most common form of best fit line.

$$R^2 = \sum_{i=1}^{n} (\Delta y_i)^2 = \sum_{i=1}^{n} [(m x_i + b) - y_i]^2 = \text{minimum}$$

This occurs where each of the partial derivatives is zero. (The limits on the summations will be omitted out of laziness from now on.)

$$\frac{\partial R^2}{\partial m} = 2 \sum \{[(m x_i + b) - y_i]\, x_i\} = 0$$
and
$$\frac{\partial R^2}{\partial b} = 2 \sum [(m x_i + b) - y_i] = 0$$
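
The step from these two conditions to explicit formulas for m and b can be sketched as follows. This is a reconstruction of the intermediate algebra, not the author's own working. Dividing each condition by 2 and distributing the sums gives two linear equations in the two unknowns m and b (often called the normal equations), which can then be solved by elimination.

```latex
\begin{aligned}
m \sum x_i^2 + b \sum x_i &= \sum x_i y_i \\
m \sum x_i + n\,b &= \sum y_i
\end{aligned}
```

Multiplying the first equation by n, the second by $\sum x_i$, and subtracting eliminates b and leaves m multiplied by $n \sum x_i^2 - (\sum x_i)^2$, which is where the common denominator in the results comes from.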

After a bit of algebra, you get these equations …

$$m = \frac{n \sum (x_i y_i) - \sum x_i \sum y_i}{n \sum (x_i^2) - (\sum x_i)^2}$$
and
$$b = \frac{\sum (x_i^2) \sum y_i - \sum x_i \sum (x_i y_i)}{n \sum (x_i^2) - (\sum x_i)^2}$$

or, if you work a bit more algebra, these equations …

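$$m = \frac{\sum (x_i y_i) - n\,\bar{x}\,\bar{y}}{\sum (x_i^2) - n\,(\bar{x})^2}$$
and
$$b = \frac{\bar{y} \sum (x_i^2) - \bar{x} \sum (x_i y_i)}{\sum (x_i^2) - n\,(\bar{x})^2}$$

where $\bar{x}$ and $\bar{y}$ are the means of the x and y values.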
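
As a rough illustration, the sums in the first pair of formulas (the ones written with n and the raw sums) translate almost directly into code. This is a minimal Python sketch, not part of the original text. The function name `least_squares_fit` and the sample points are made up for the example.

```python
def least_squares_fit(xs, ys):
    """Return (m, b) for the least squares line y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two points")

    # The four sums that appear in the formulas for m and b
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)

    # Common denominator: n*sum(x^2) - (sum x)^2
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical, so the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_xx * sum_y - sum_x * sum_xy) / denom
    return m, b


# Made-up points that lie exactly on y = 2x + 1
m, b = least_squares_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(m, b)  # 2.0 1.0
```

Because the test points sit exactly on a line, the fit should return that line's slope and intercept. Real data will not be so cooperative.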

How good is the line of best fit? Are some bests better than others? Here's one way to decide. Swap the explanatory and response variables.

x = m'y + b'

The slope of this new linear equation is the same as the old one with all the x's replaced by y's and vice versa. (Note that the numerator hasn't really changed.)

$$m' = \frac{n \sum (y_i x_i) - \sum y_i \sum x_i}{n \sum (y_i^2) - (\sum y_i)^2}$$

Now, multiply this slope by the old slope. Don't ask why, just do it.

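$$m\,m' = \left( \frac{n \sum (x_i y_i) - \sum x_i \sum y_i}{n \sum (x_i^2) - (\sum x_i)^2} \right) \left( \frac{n \sum (y_i x_i) - \sum y_i \sum x_i}{n \sum (y_i^2) - (\sum y_i)^2} \right)$$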

This product is known as the coefficient of determination

$$r^2 = \frac{\left( n \sum (x_i y_i) - \sum x_i \sum y_i \right)^2}{\left( n \sum (x_i^2) - (\sum x_i)^2 \right) \left( n \sum (y_i^2) - (\sum y_i)^2 \right)}$$

and its square root is called the coefficient of correlation.

$$r = \frac{n \sum (x_i y_i) - \sum x_i \sum y_i}{\sqrt{n \sum (x_i^2) - (\sum x_i)^2}\, \sqrt{n \sum (y_i^2) - (\sum y_i)^2}}$$
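
The claim that the product of the two slopes equals the square of the correlation coefficient is easy to check numerically. The following is a minimal Python sketch under stated assumptions: the function name `slopes_and_correlation` and the five data points are invented for illustration and do not come from any of the data files.

```python
from math import sqrt


def slopes_and_correlation(xs, ys):
    """Return (m, m_prime, r) for y-on-x slope, x-on-y slope, and correlation."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y   # shared by m, m' and r
    spread_x = n * sum_xx - sum_x ** 2       # denominator of m
    spread_y = n * sum_yy - sum_y ** 2       # denominator of m'

    m = numerator / spread_x
    m_prime = numerator / spread_y
    r = numerator / (sqrt(spread_x) * sqrt(spread_y))
    return m, m_prime, r


# Made-up, slightly scattered data
m, m_prime, r = slopes_and_correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(m * m_prime, r ** 2)  # the two numbers should agree
```

Note that r keeps the sign of the numerator, so a downward trend gives a negative r even though r² is always positive.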

Summary

Problems

practice

  1. electric-energy.txt
    In the United States, electric energy is measured in kilowatt hours and purchased with dollars.
    1. Determine the equation of the best fit straight line.
    2. Explain the significance of the coefficients m, b, and r².
    Solutions …
    1. Here's what the graph looks like.
       
       
    2. The slope (m) of a linear function is the rate of change of the vertical quantity (y) with respect to the horizontal quantity (x). It should be apparent that the slope of this graph is the average price for electricity per kilowatt hour. The real question should be, why doesn't this graph intercept the vertical axis at the origin? Surely, if I were to use no energy I should pay no money. When I don't go to a restaurant, I don't get charged. Why should electricity be any different? Well, there are two answers to this question. One is that utility companies as legal monopolies are trying to extract every penny they can from their captive customers. The second, which is the contention of the utilities themselves, is that there are fixed expenses associated with every customer regardless of how much energy they consume: maintenance, administration, insurance, etc. Such fixed expenses are gathered under the umbrella term "basic service charges". Thus, in this bill …
      1. m (the slope) is the average price for electric energy: 14.7¢ per kilowatt hour;
      2. b (the y-intercept) is the average basic service charge: $9.81 per month. Note the lone data point on the left hand side of the graph that illustrates this policy. The entire house was on vacation for the entire month, the electricity was shut off at the circuit breaker, and yet still there was a charge.
      3. r (the correlation coefficient) shows the correlation between the energy and price: 0.96. The more important quantity for this analysis is the square of this number -- the coefficient of determination: r² = 0.92. This number shows that 92% of the variation in these electric bills is due to the amount of energy consumed. The remaining 8% variation is due to seasonal effects (electricity is always more expensive in summer when air conditioner use drives up demand) and bracket billing policies (the first quarter megawatt hour or so is cheaper than the rest with this utility). These secondary variations are evident in the data point in the extreme upper right hand corner of the graph that lies well above the line of best fit. It occurred during the summer when rates and consumption were at their highest.
  2. standard-atmosphere.txt
    This text file provides standard meteorological data for the earth's atmosphere as a function of altitude above sea level.
    1. Find the transformation that will relate the pressure to altitude with a linear equation.
    2. Write the nonlinear equation that results.
    Solutions …
    1. Start by examining a graph of the raw data.
       
      unadjusted data
       
      Looks like it could be some sort of inverse relationship, but none of them work. "Inverse this power. Inverse that power." It seems as if nothing can straighten it out.
           
      [graphs: invert | invert and square | invert and square root]
           
      Did I say it looks like some sort of inverse relationship? Then why does it intercept the y-axis? An inverse relationship would be infinite at zero. It would never cross the vertical axis. So what's going on?

      This graph shows exponential decay. The cure for this deviation from linearity is a logarithmic function. Base 10, base e, it doesn't matter. Now we have a straight line.
       
      log base 10
       
    2. Rearranging the variables gives …
         
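      $$y = mx + b$$
      $$\log(P) = m \cdot h + b$$
      $$10^{\log(P)} = 10^{m \cdot h + b} = 10^{m \cdot h} \times 10^{b} = 10^{b} \times 10^{m \cdot h}$$
      $$P = 10^{2.04} \times 10^{-0.0641 \cdot h} = 110 \times 10^{-0.0641 \cdot h}$$
      $$P = 110\ \text{kPa} \times 10^{-(h \div 16\ \text{km})}$$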
         
      Some commentary on the values of the coefficients
      1. After transformation, the slope of the linear fit becomes a multiplier in an exponent. The magnitude of the reciprocal of this value is an important number. Whenever the altitude has this value or multiples of this value the exponent will be a whole number. The slope calculated was −0.0641 km⁻¹. (Inverse kilometers are used as the unit to cancel the kilometers in the height.) The reciprocal of this value is about 16 km, which is conveniently close to 10 miles for the Americans. At this altitude, the exponent in our function would equal negative one and the atmospheric pressure would be one-tenth of its value at sea level. At twice this altitude, roughly 32 km, the exponent would equal negative two and the pressure would be one-one-hundredth its value at sea level. At three times this altitude, 48 km, the exponent would equal negative three and the pressure would be one-one-thousandth its sea level value. And so on, getting ever smaller, but never reaching zero. This is what it means for a quantity to decay exponentially.
         
        altitude (km)   pressure (atm)   comment
        304             10⁻¹⁹            space shuttle orbit
        96              0.000,001        highest airplane flight
        80              0.000,01
        64              0.000,1
        48              0.001            highest unmanned balloon flight
        32              0.01             highest manned balloon flight
        16              0.1              50% higher than most commercial flights
        0               1                sea level
         
      2. The coefficient 10^b should equal the atmospheric pressure at sea level; however, the value calculated (110 kPa) is significantly different from the value of the standard atmosphere (101.325 kPa). Such is the nature of statistical analysis.
      3. Once again, it's really r² we're interested in. This number is very close but not equal to one (r² = 0.999). The atmosphere behaves quite simply when it comes to pressure and can be adequately described by an exponential decay model.
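
The transformation used in this solution (take the logarithm of the pressure, fit a straight line, then undo the logarithm) can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the helper `fit_line`, the variable names, and above all the data, which are synthetic values generated from an exact exponential rather than read from standard-atmosphere.txt. With the real file the numbers will differ.

```python
from math import log10


def fit_line(xs, ys):
    """Least squares slope and intercept for y = m*x + b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_xx * sum_y - sum_x * sum_xy) / denom
    return m, b


# SYNTHETIC data (assumption): P = 101.325 kPa * 10^(-h / 16 km)
heights_km = [0, 4, 8, 12, 16, 20]
pressures_kpa = [101.325 * 10 ** (-h / 16) for h in heights_km]

# Step 1: transform the response variable so the relationship becomes linear
log_pressures = [log10(p) for p in pressures_kpa]

# Step 2: ordinary linear fit of log10(P) against h
m, b = fit_line(heights_km, log_pressures)

# Step 3: undo the transformation to get P = p0 * 10^(m*h)
p0 = 10 ** b            # pressure at h = 0, in kPa
decade_height = -1 / m  # altitude gain that cuts the pressure by a factor of 10

print(f"P = {p0:.1f} kPa x 10^(-h / {decade_height:.1f} km)")
```

Since the synthetic points were built from an exact exponential, the fit recovers the parameters that generated them. Measured data would scatter about the line and give an r² slightly below one.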
  3. dash-world-records.txt (utf-8)
    The text file referenced above has data on the world records for the 100 m dash. The data are broken up into four groups:
    1. men's electronically-timed world records,
    2. men's hand-timed world records,
    3. women's electronically-timed world records, and
    4. women's hand-timed world records
    Analyze these data.
    1. Perform a linear regression on both men's and women's world record times.
    2. Explain the significance of the numerical results.
    3. Make an interesting prediction.
    Source: International Association of Athletics Federations (IAAF)

    Solutions …
    1. The graph …
       
       
    2. The numbers …
           
      y = mx + b

                    men          women
      m (s/yr)      −0.008998    −0.02399
      b (s)         +27.74       +58.32
      r             −0.9491      −0.9199
           
      The slope of this graph shows us that men's times are decreasing at approximately 0.01 seconds each year. Women's times are decreasing faster, at about 0.024 seconds per year, which is more than twice the men's rate.
           
      The y-intercept would be the world record in the year zero (a year that does not exist, by the way). Extrapolating this linear fit back 20 centuries would be a stupid thing to do. Surely there was someone around at the turn of the first millennium who could run a hundred meters in under 27 seconds. The y-intercept for women is extra foolish. Nearly a minute to run 100 m? I don't think so. Linear regression is nice, but it isn't a religion. You don't have to believe everything it says.
           
      The r value gives us an indication of how well the data can be explained by a linear model. Squaring −0.9491 gives us 0.9008, which means 90% of the variation in men's world record 100 m dash times can be explained by a linear trend. That's quite a reasonable fit to an artificial model. The fit is not quite as tight for the women's times. Squaring r = −0.9199 yields a coefficient of determination of 0.8462. Thus a linear model only explains 85% of the variation in women's world record 100 m dash times. Still pretty good, as far as I'm concerned, for a messy data set like this one.
           
    3. I find it somewhat surprising that the trends in world record times can be so well explained by a linear model. I would have expected that the data would show the athletes approaching some limit. Surely, humans can't keep running faster and faster indefinitely. There must be some performance "wall" ahead of them -- something to keep them from running faster than a speeding bullet. As far as the last century goes, this appears not to be the case. Times have been shrinking at a steady rate. Assuming they keep up like this, women sprinters will eventually outrun their male counterparts some time in the middle of the Twenty-first Century. We can even predict the year at which the transition will occur. Set the two regression equations equal and see what happens.
           
      (mx + b)men  =  (mx + b)women
      (−0.008998 x + 27.74)  =  (−0.02399 x + 58.32)
      (0.02399 − 0.008998) x  =  (58.32 − 27.74)
      0.014992 x  =  30.58
      x  =  (30.58 ÷ 0.014992)
      x  =  2040
           
      If you really felt that world record times would follow a linear progression you might even try determining the day in 2040 when the women catch up to the men. But since I recognize the limitations of this model, I won't be entering the office "men-vs.-women-hundred-meter-dash" pool. In fact, if we choose a slightly different data set, we'll end up predicting a significantly different transition year. These calculations are left as an exercise for the reader.
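
The crossover calculation above is just the intersection of two straight lines, so it is simple to script and to rerun with other coefficients. This is a minimal Python sketch. The slopes and intercepts are the ones quoted in this solution, while the function name `crossing_point` is an assumption made for the example.

```python
def crossing_point(m1, b1, m2, b2):
    """Return (x, y) where y = m1*x + b1 meets y = m2*x + b2."""
    if m1 == m2:
        raise ValueError("parallel lines never cross")
    x = (b2 - b1) / (m1 - m2)
    return x, m1 * x + b1


# Regression coefficients quoted in the solution (s/yr and s)
m_men, b_men = -0.008998, 27.74
m_women, b_women = -0.02399, 58.32

year, time_s = crossing_point(m_men, b_men, m_women, b_women)
print(f"crossover year: {year:.0f}, predicted record: {time_s:.2f} s")
```

Swapping in coefficients from a different data set (electronically timed records only, say) is a quick way to see how sensitive the predicted year is to the choice of data.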
  4. anscombe-data.txt
    Source: Anscombe, F.J. "Graphs in Statistical Analysis," The American Statistician. Vol. 27, No. 1 (1973): 19. This problem is not finished. These data sets have been rigged to have the same slope (0.50), y-intercept (3.00), and correlation (0.82).
    1. A linear fit is useful here.
         
       
         
    2. A linear fit is not useful here. This is probably a quadratic.
         
         
    3. That one outlier should be removed and a linear fit tried again.
         
         
    4. The linear fit is strongly affected by that one outlier. Without it, however, there isn't enough variation to see a trend. There isn't much that can be done with this one.
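
The statement that all four data sets share the same slope, intercept, and correlation can be checked with a short script. The following Python sketch is offered under assumptions: the numbers are the values of Anscombe's four data sets as they are commonly reproduced from the 1973 paper, typed in by hand, and should be verified against anscombe-data.txt before being relied on. The function name `fit_with_r` is made up for this example.

```python
from math import sqrt


def fit_with_r(xs, ys):
    """Return (m, b, r) for the least squares line through the points."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    num = n * sum_xy - sum_x * sum_y
    spread_x = n * sum_xx - sum_x ** 2
    spread_y = n * sum_yy - sum_y ** 2
    m = num / spread_x
    b = (sum_xx * sum_y - sum_x * sum_xy) / spread_x
    r = num / (sqrt(spread_x) * sqrt(spread_y))
    return m, b, r


# ASSUMPTION: values as commonly reproduced from Anscombe (1973)
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "1": (x_shared,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "2": (x_shared,
          [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "3": (x_shared,
          [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: fit_with_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (m, b, r) in results.items():
    print(f"set {name}: m = {m:.2f}, b = {b:.2f}, r = {r:.2f}")
```

Identical summary numbers for four very different-looking graphs is the whole point of the exercise: the regression coefficients alone can't tell you whether a linear fit is sensible.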
         
         

statistical

  1. For each of the following data sets …
      1. determine the equation of the best fit straight line(s) and
      2. explain the significance of the coefficients m, b, and r².
    Here are the data sets …
    1. braking-distance.txt
      In this road test, braking distances were measured for different cars traveling at 60 mph and 80 mph. Graph these distances against one another.
      Source: "Road Test Summary." Road & Track. (July 1998): 186-87.
    2. satellite-failures.txt
      Satellites in low earth orbit (LEO) operate between 250 and 1500 km above the ground. Because Earth's atmosphere extends hundreds of miles into space, satellites in LEO eventually experience enough friction that they fall back to earth and burn up. The accompanying text file gives the number of low earth orbit satellites that reentered the earth's atmosphere and the number of sunspots for each year since 1969. Graph the number of reentered satellites vs. the number of sunspots.
      Source: NASA Goddard Space Flight Center.
    3. soap.txt
      Two bars of soap in a bathroom shower were weighed almost every day for about two weeks. Graph the mass of each as a function of time.
    4. standard-atmosphere.txt
      This text file provides standard meteorological data for the earth's atmosphere as a function of altitude above sea level. Graph temperature as a function of altitude for the tropospheric portion of the atmosphere from sea level to 11 km. (Do not analyze the entire data set. The atmosphere above 11 km behaves much differently.)
    5. toaster.txt
      The duration of the toast cycle was measured for different light-dark settings of a two-slot electric bread toaster. Graph cycle time as a function of light-dark setting for this toaster when it held one and two slices of bread.
    6. wavelength-of-light.txt
      In this experiment the wavelengths of the visible line spectra for an excited gas were measured using two different methods. Graph these trials against one another.
  2. For each of the following data sets …
      1. find the transformation that will relate the two variables with a linear equation and
      2. write the nonlinear equation that results.
    Here are the data sets …
    1. aerodynamic-drag.txt
      In this experiment students measured the aerodynamic drag on a weighted party balloon falling at different speeds.
    2. constant-force.txt
      In this experiment different masses were subject to the same force and their accelerations recorded.
    3. milk-freshness.txt
      The following data were taken from a milk carton sold in North Carolina.
      Source: Greenler, Robert. Chasing the Rainbow. Milwaukee, WI: Elton-Wolf, 2000: 140.
    4. moore-law.txt
      This data set shows the number of switches in a computer for various years in the Twentieth Century.
      Source: IBM Gallery of Science & Art, 590 Madison Avenue, Second Floor, New York, NY 10022 (July 1992).
    5. resonance-tube.txt
      In this experiment various tuning forks of known frequency were held above a resonance tube, which was used to determine the wavelength of the sound emitted.
  3. Answer the questions associated with the following data sets.
    1. Determine the year when women sprinters will run as fast as their male counterparts in the 100 m dash using …
      1. dash-electronic-timing.txt (utf-8)
        only those world records that were timed electronically (as opposed to manually).
      2. dash-olympic-gold-medals.txt (utf-8)
        olympic gold medal winners (as opposed to world record setters).
      Compare your results with those obtained in practice problem 3.
      Source: International Association of Athletics Federations (IAAF)
    2. co2-mauna-loa.txt
      Mauna Loa Observatory on the "Big Island" of Hawaii has been recording atmospheric carbon dioxide concentrations for nearly half a century beginning in the year 1958. Readings are taken continuously, but only the monthly averages are reported. Values are reported in parts per million (ppm).
      1. Construct a graph of atmospheric CO2 concentration vs. time.
      2. What two obvious behaviors are revealed in your graph?
      3. Split the data set in half and perform a linear regression analysis on the data for the years …
        1. 1958-1980 and
        2. 1981-2006.
      4. Compare the behavior of CO2 levels in the first half of the data set to the second half.
      Source: Scripps Institution of Oceanography
    3. gw-vardo.txt
      Global warming is most easily observed in long term temperature measurements taken at high latitudes (near the poles). Vardø is a village in the extreme northeast of Norway on the Barents Sea. Despite being a few degrees north of the Arctic Circle, its harbor remains ice free due to the warm North Atlantic drift current (an extension of the Gulf Stream). Vardø's climate is mild for its latitude, which means it varies from a few °C above freezing in the summer to a few °C below freezing in the winter. A location with such a stable climate is a good place to check for human induced climate change.
      1. Construct a graph of monthly average temperature vs. time for the period 1881 to 2006.
      2. Using linear regression, determine the following quantities for the whole data set …
        1. the rate of change of temperature in °C per century
        2. the uncertainty in this value
        3. the coefficient of determination
        4. the root-mean-square error (if you have the ability to calculate this number)
      3. Divide the data set up into four equal intervals of roughly 378 months (31.5 years) and repeat.
      4. Compile your results in a table like the one below and comment on the manner in which temperatures have changed at Vardø in this 125 year period. (Use the results of all four calculated columns in your analysis, not just the rate of temperature change.)
                 
      time interval              ΔT/Δt (°C/100 y)   uncertainty (°C/100 y)   r²   rmse (°C)
      overall (1881-2006)
      1st quarter (1881-1912)
      2nd quarter (1912-1943)
      3rd quarter (1944-1975)
      4th quarter (1975-2006)
      Source: NASA Goddard Institute for Space Studies
       
    4. gw-central-park.txt
      [Note: This is an extension of the previous problem, but it can be worked on independently with little loss of meaning.]
      Surface air temperatures have increased in New York City on the order of one degree Celsius in the Twentieth Century -- consistent with the trend of global warming. New York is the largest city in the United States and the fourth largest metropolitan area on the planet. 8.5 million people live within the city limits and an additional 10 million are within commuting distance. With a gross metropolitan product approaching one trillion dollars ($10¹²) the economy of New York City is larger than that of all but a dozen or so nations. This geographic concentration of people and economic power must certainly have an effect on the local climate. Repeat the analysis described in the previous problem using 125 years' worth of temperature measurements taken in Central Park in New York City.
      1. Construct a graph of monthly average temperature vs. time for the period 1881 to 2006.
      2. Using linear regression, determine the following quantities for the whole data set …
        1. the rate of change of temperature in °C per century
        2. the uncertainty in this value
        3. the coefficient of determination
        4. the root-mean-square error (if you have the ability to calculate this number)
      3. Divide the data set up into four equal intervals of roughly 378 months (31.5 years) and repeat.
      4. Compile your results in a table like the one below and comment on the manner in which temperatures have changed in New York City in this 125 year period. (Use the results of all four calculated columns in your analysis, not just the rate of temperature change.)
                 
      time interval              ΔT/Δt (°C/100 y)   uncertainty (°C/100 y)   r²   rmse (°C)
      overall (1881-2006)
      1st quarter (1881-1912)
      2nd quarter (1912-1943)
      3rd quarter (1944-1975)
      4th quarter (1975-2006)
      Source: NASA Goddard Institute for Space Studies
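
The two global warming problems ask for an uncertainty in the slope and a root-mean-square error, neither of which is defined in the discussion. Here is one common way to compute them, as a hedged Python sketch rather than a prescribed method. The assumptions: the slope uncertainty is taken to be the standard error of the slope (residual sum of squares divided by n − 2, divided by the spread in x, then square rooted), the rmse divides by n (some texts divide by n − 2 instead), and the function name `regression_stats` and the five data points are invented for the example.

```python
from math import sqrt


def regression_stats(xs, ys):
    """Slope, intercept, slope standard error, r^2 and rmse for y = m*x + b."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points to estimate an uncertainty")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squared deviations about the means
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    m = s_xy / s_xx
    b = mean_y - m * mean_x

    # Residuals: measured value minus the value predicted by the line
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

    return {
        "slope": m,
        "intercept": b,
        "slope_se": sqrt(ss_res / (n - 2) / s_xx),  # ASSUMED definition
        "r2": 1 - ss_res / s_yy,
        "rmse": sqrt(ss_res / n),                   # ASSUMED definition
    }


# Made-up data, not from either temperature file
stats = regression_stats([0, 1, 2, 3, 4], [1, 3, 2, 5, 4])
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

If the x values are in years, the slope comes out in °C per year, so multiply the slope and its uncertainty by 100 to fill in the °C/100 y columns.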

algebraic

  1. Idea for a problem set. Transform the following nonlinear equations into linear equations by the appropriate change of variables. For each transformed equation, identify …
      1. the new x variable,
      2. the new y variable,
      3. the slope, m, and
      4. the y intercept.
    Here are the equations …
    1. equation one
    2. equation two
    3. and so on

Resources



Another quality webpage by

Glenn Elert