Simple Linear Regression

Learning Objectives Coverage

LO1: Describe a simple linear regression model, how the least squares criterion is used to estimate regression coefficients, and the interpretation of these coefficients

Core Concept

Simple linear regression extends the correlation analysis from Topic 9 into a predictive framework. Where correlation tells us whether and how strongly two variables are related, regression tells us the specific nature of that relationship and allows us to make predictions. The model Y = b₀ + b₁X + ε forms the basis for beta estimation in equity analysis, factor models in portfolio management, and yield curve modeling in fixed income.

Key components:

  • Dependent variable (Y): The outcome being predicted
  • Independent variable (X): The predictor variable
  • Intercept (b₀): Expected value of Y when X equals zero
  • Slope (b₁): Change in Y for one-unit change in X
  • Error term (ε): Unexplained variation

Formulas & Calculations

  • Main formula: Y = b₀ + b₁X + ε
    • b₁ = Σ[(X - X̄)(Y - Ȳ)] / Σ[(X - X̄)²]
    • b₀ = Ȳ - b₁X̄
  • HP 12C steps:
    1. Clear the statistics registers with [f] CLEAR [Σ]
    2. Enter each pair as y-value [ENTER] x-value [Σ+]
    3. Display b₀: 0 [g] [ŷ,r], then [STO] [0]; display b₁: 1 [g] [ŷ,r], then [RCL] [0] [−]
  • Common variations: Standardized coefficients, natural log transformations
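The slope and intercept formulas above translate directly into a few lines of code. The following is a minimal sketch in plain Python; the function name `fit_line` and the five data points are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def fit_line(xs, ys):
    """Return (b0, b1) using the least squares formulas for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator of b1: sum of cross-deviations of X and Y from their means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator of b1: sum of squared deviations of X from its mean
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical market (X) and stock (Y) returns, in percent
market = [1.0, 2.0, 3.0, 4.0, 5.0]
stock = [3.1, 4.5, 5.4, 7.0, 8.0]

b0, b1 = fit_line(market, stock)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
```

For this data X̄ = 3 and Ȳ = 5.6, so b₁ = 12.3 / 10 = 1.23 and b₀ = 5.6 - 1.23 × 3 = 1.91.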

Practical Examples

  • Traditional Finance Example: Predicting stock returns (Y) based on market returns (X)
    • Given: Stock returns = 2% + 1.2 × Market returns
    • If market return = 5%, predicted stock return = 2% + 1.2(5%) = 8%
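  • Calculation walkthrough: Least squares chooses b₀ and b₁ to minimize the sum of squared residuals, Σe² = Σ(Y - Ŷ)²
  • Interpretation: The slope estimates the stock's beta (systematic risk); the intercept is interpreted as alpha when both return series are measured in excess of the risk-free rate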
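One way to see what "minimizes the sum of squared residuals" means is to compare the least squares line with slightly shifted lines on the same data. This sketch uses hypothetical returns whose least squares estimates work out to b₀ = 1.91 and b₁ = 1.23; the helper name `sse` is an assumption made for the illustration.

```python
def sse(xs, ys, b0, b1):
    """Sum of squared residuals for the line Y-hat = b0 + b1 * X."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical returns in percent (illustrative only)
market = [1.0, 2.0, 3.0, 4.0, 5.0]
stock = [3.1, 4.5, 5.4, 7.0, 8.0]

# Least squares estimates for this particular data set
b0_ls, b1_ls = 1.91, 1.23

best = sse(market, stock, b0_ls, b1_ls)
# Shift the intercept or the slope a little in each direction
nudged = [
    sse(market, stock, b0_ls + 0.1, b1_ls),
    sse(market, stock, b0_ls - 0.1, b1_ls),
    sse(market, stock, b0_ls, b1_ls + 0.05),
    sse(market, stock, b0_ls, b1_ls - 0.05),
]
print(f"SSE at the least squares line: {best:.4f}")
print(f"SSE at the nudged lines: {[round(v, 4) for v in nudged]}")
```

Every nudged line produces a larger sum of squared residuals than the least squares line, which is the defining property of the estimates.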

DeFi Application

  • Protocol example: Predicting Uniswap V3 TVL based on ETH price movements
  • Implementation: Smart contracts can use linear regression oracles for dynamic fee adjustments
  • Advantages/Challenges: Real-time data availability vs. gas costs for on-chain calculations

LO2: Explain the assumptions underlying the simple linear regression model, and describe how residuals and residual plots indicate if these assumptions may have been violated

Core Concept

  • Definition: The model rests on four assumptions about the relationship and its errors: linearity, homoscedasticity, independence, and normality
  • Why it matters: Violations undermine inference. A nonlinear relationship biases the fitted line, while heteroskedastic or correlated errors make standard errors and test statistics unreliable
  • Key components:
    • Linearity: Relationship is truly linear
    • Homoscedasticity: Constant error variance
    • Independence: Errors are uncorrelated
    • Normality: Errors follow normal distribution

Formulas & Calculations

  • Residual calculation: e = Y - Ŷ = Y - (b₀ + b₁X)
  • Durbin-Watson test: DW = Σ(eₜ - eₜ₋₁)² / Σeₜ²
  • HP 12C steps: Calculate residuals by subtracting predicted values from actual values
  • Common variations: Breusch-Pagan test for homoscedasticity, Jarque-Bera test for normality
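The residual and Durbin-Watson formulas can be checked with a short script. This is a sketch on hypothetical data (the same five points would give b₀ = 1.91 and b₁ = 1.23 under least squares); the function names are illustrative. The statistic lies between 0 and 4, and values near 2 suggest no first-order serial correlation.

```python
def residuals(xs, ys, b0, b1):
    """e = Y - (b0 + b1 * X) for each observation."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def durbin_watson(e):
    """Sum of squared successive differences over sum of squared residuals."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(r ** 2 for r in e)
    return num / den

# Hypothetical returns in percent, ordered in time
market = [1.0, 2.0, 3.0, 4.0, 5.0]
stock = [3.1, 4.5, 5.4, 7.0, 8.0]

e = residuals(market, stock, b0=1.91, b1=1.23)
dw = durbin_watson(e)
print("residuals:", [round(r, 2) for r in e])
print(f"DW = {dw:.2f}")
```

Here the residuals alternate in sign, so DW comes out near 3.6, toward the negative-autocorrelation end of the range. With only five observations this is purely illustrative; a real test needs a larger sample and tabulated critical values.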

Practical Examples

  • Traditional Finance Example: Testing beta stability in CAPM model
  • Calculation walkthrough: Plot residuals vs. fitted values to check for patterns
  • Interpretation: Random scatter indicates valid assumptions; patterns suggest violations

DeFi Application

  • Protocol example: Validating assumptions in yield farming return predictions
  • Implementation: Automated residual analysis in DeFi analytics platforms
  • Advantages/Challenges: Continuous monitoring vs. computational complexity

LO3: Calculate and interpret measures of fit and formulate and evaluate tests of fit and of regression coefficients in a simple linear regression

Core Concept

  • Definition: Measures of fit assess how well the regression model explains the variation in the dependent variable
  • Why it matters: Determines model reliability for investment decisions and risk management
  • Key components:
    • R²: Coefficient of determination
    • Adjusted R²: R² adjusted for degrees of freedom
    • F-statistic: Overall model significance
    • t-statistics: Individual coefficient significance

Formulas & Calculations

  • R² formula: R² = SSR/SST = 1 - (SSE/SST)
    • SST = Σ(Y - Ȳ)² (Total Sum of Squares)
    • SSR = Σ(Ŷ - Ȳ)² (Regression Sum of Squares)
    • SSE = Σ(Y - Ŷ)² (Error Sum of Squares)
  • F-statistic: F = MSR/MSE = [SSR/1] / [SSE/(n-2)]
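  • t-statistic: t = (b₁ - 0) / sb₁, where sb₁ = SEE / √Σ(X - X̄)² and the test has n-2 degrees of freedom
  • HP 12C steps: After entering the data, key in any x-value, press [g] [ŷ,r], then [x≷y] to display r; square it for R²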
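The fit measures above can be computed together once the sums of squares are available. The sketch below uses hypothetical data and an illustrative function name, `fit_stats`; the standard error of the slope is taken as SEE / √Σ(X - X̄)², the usual expression for a one-predictor regression.

```python
import math

def fit_stats(xs, ys):
    """Return R-squared, the F-statistic and the slope t-statistic."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)                # total
    ssr = sum((f - y_bar) ** 2 for f in fitted)            # explained
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # unexplained
    see = math.sqrt(sse / (n - 2))
    s_b1 = see / math.sqrt(sxx)                            # std. error of slope
    return {"r2": ssr / sst, "f": (ssr / 1) / (sse / (n - 2)), "t": b1 / s_b1}

# Hypothetical returns in percent
market = [1.0, 2.0, 3.0, 4.0, 5.0]
stock = [3.1, 4.5, 5.4, 7.0, 8.0]

stats = fit_stats(market, stock)
print(f"R2 = {stats['r2']:.4f}, F = {stats['f']:.2f}, t = {stats['t']:.2f}")
```

In a regression with a single independent variable the F-statistic equals the square of the slope t-statistic, so the two tests always agree; the output here shows F of about 498.8 and t of about 22.3.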

Practical Examples

  • Traditional Finance Example: A regression of portfolio returns on market returns has R² = 0.64, meaning 64% of the variation in portfolio returns is explained by market movements
  • Calculation walkthrough: F-test determines if the relationship is statistically significant
  • Interpretation: Higher R² indicates better fit, but consider overfitting risks

DeFi Application

  • Protocol example: Measuring goodness of fit in automated market maker price prediction models
  • Implementation: Dynamic R² monitoring for rebalancing strategies
  • Advantages/Challenges: Real-time model validation vs. gas optimization

LO4: Describe the use of analysis of variance (ANOVA) in regression analysis, interpret ANOVA results, and calculate and interpret the standard error of estimate in a simple linear regression

Core Concept

  • Definition: ANOVA decomposes total variation into explained (regression) and unexplained (error) components
  • Why it matters: Provides framework for testing overall model significance and quantifying prediction accuracy
  • Key components:
    • Total variation (SST)
    • Explained variation (SSR)
    • Unexplained variation (SSE)
    • Standard error of estimate (SEE)

Formulas & Calculations

  • ANOVA identity: SST = SSR + SSE
  • Standard error of estimate: SEE = √[SSE/(n-2)]
  • Mean square regression: MSR = SSR/1
  • Mean square error: MSE = SSE/(n-2)
  • HP 12C steps: Enter the residuals with [Σ+], press [g] [s] for their sample standard deviation, then multiply by √[(n-1)/(n-2)] because [s] divides by n-1 rather than n-2
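The ANOVA quantities above can be assembled into a small table in code. This is a sketch on hypothetical data; `anova_table` is an illustrative name. The last lines also show that a sample standard deviation of the residuals, which divides by n-1, must be rescaled by √[(n-1)/(n-2)] to match SEE.

```python
import math
import statistics

def anova_table(xs, ys):
    """Decompose SST into SSR + SSE for a one-predictor regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]
    ssr = sum((f - y_bar) ** 2 for f in fitted)
    sse = sum(e ** 2 for e in resid)
    sst = sum((y - y_bar) ** 2 for y in ys)
    msr, mse = ssr / 1, sse / (n - 2)
    table = {"SSR": ssr, "SSE": sse, "SST": sst,
             "MSR": msr, "MSE": mse, "F": msr / mse, "SEE": math.sqrt(mse)}
    return table, resid

# Hypothetical returns in percent
market = [1.0, 2.0, 3.0, 4.0, 5.0]
stock = [3.1, 4.5, 5.4, 7.0, 8.0]

table, resid = anova_table(market, stock)
for name, value in table.items():
    print(f"{name:>3}: {value:.4f}")

# Sample standard deviation divides by n-1, so rescale to n-2
n = len(resid)
see_from_stdev = statistics.stdev(resid) * math.sqrt((n - 1) / (n - 2))
print(f"SEE via rescaled sample std dev: {see_from_stdev:.4f}")
```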

Practical Examples

  • Traditional Finance Example: ANOVA table shows F = 25.6, p < 0.001, indicating significant relationship
  • Calculation walkthrough: SEE = 2.50 means actual values typically differ from the fitted line by about 2.50, measured in the units of Y
  • Interpretation: Lower SEE indicates more precise predictions

DeFi Application

  • Protocol example: ANOVA analysis of liquidity provision returns in automated market makers
  • Implementation: Standard error bands for impermanent loss prediction
  • Advantages/Challenges: Continuous recalibration vs. transaction costs

LO5: Calculate and interpret the predicted value for the dependent variable, and a prediction interval for it, given an estimated linear regression model and a value for the independent variable

Core Concept

  • Definition: Prediction intervals provide range estimates for future observations, accounting for both parameter uncertainty and inherent variability
  • Why it matters: Enables risk-aware decision making in volatile DeFi and traditional markets
  • Key components:
    • Point prediction: Single best estimate
    • Prediction interval: Range with specified confidence level
    • Confidence interval: Range for mean response
    • Standard error of prediction

Formulas & Calculations

  • Point prediction: Ŷ = b₀ + b₁X
  • Prediction interval: Ŷ ± t(α/2,n-2) × SEE × √[1 + 1/n + (X-X̄)²/Σ(X-X̄)²]
  • Confidence interval for mean: Ŷ ± t(α/2,n-2) × SEE × √[1/n + (X-X̄)²/Σ(X-X̄)²]
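  • HP 12C steps: Compute Ŷ and the square-root term on the calculator; take the critical value from a t-table with n-2 degrees of freedom, since the HP 12C has no built-in t-distribution (the normal value is a close approximation only for large samples)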
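The two interval formulas differ only by the extra "1 +" under the square root. The sketch below computes both half-widths for hypothetical inputs: five X values, SEE = 0.1742, and a critical value of 3.182 (the two-sided 95% t-value for 3 degrees of freedom, taken from a table). The function and variable names are illustrative.

```python
import math

def interval_half_widths(xs, x0, see, t_crit):
    """Return (confidence, prediction) half-widths at X = x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # Shared term: grows as x0 moves away from the sample mean of X
    distance_term = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci = t_crit * see * math.sqrt(distance_term)       # mean response
    pi = t_crit * see * math.sqrt(1 + distance_term)   # single new observation
    return ci, pi

# Hypothetical sample and fitted model
market = [1.0, 2.0, 3.0, 4.0, 5.0]
b0, b1, see, t_crit = 1.91, 1.23, 0.1742, 3.182

for x0 in (3.0, 6.0):
    y_hat = b0 + b1 * x0
    ci, pi = interval_half_widths(market, x0, see, t_crit)
    print(f"X = {x0}: Y-hat = {y_hat:.2f}, CI +/- {ci:.3f}, PI +/- {pi:.3f}")

ci_mid, pi_mid = interval_half_widths(market, 3.0, see, t_crit)
ci_far, pi_far = interval_half_widths(market, 6.0, see, t_crit)
```

Both intervals are narrowest at X = X̄ and widen as X moves away from it, and the prediction interval is always the wider of the two.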

Practical Examples

  • Traditional Finance Example: Predict a stock price with a 95% prediction interval of Ŷ ± 4.20
  • Calculation walkthrough: Wider intervals for X values far from mean
  • Interpretation: Prediction intervals are wider than confidence intervals

DeFi Application

  • Protocol example: Predicting token price ranges for automated rebalancing
  • Implementation: Dynamic prediction intervals in yield optimization strategies
  • Advantages/Challenges: Adaptive confidence levels vs. computational overhead

LO6: Describe different functional forms of simple linear regressions

Core Concept

  • Definition: Transformations of variables to capture non-linear relationships while maintaining linear regression framework
  • Why it matters: Many financial relationships are non-linear but can be linearized through transformations
  • Key components:
    • Log-linear models
    • Linear-log models
    • Log-log models
    • Polynomial terms

Formulas & Calculations

  • Log-linear: ln(Y) = b₀ + b₁X (exponential growth)
  • Linear-log: Y = b₀ + b₁ln(X) (diminishing returns)
  • Log-log: ln(Y) = b₀ + b₁ln(X) (elasticity model)
  • HP 12C steps: Use [LN] function to transform variables before regression
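Transforming the variables and then running an ordinary least squares fit is all a log-log model requires. In this sketch the data are constructed, purely for illustration, to follow Y = 3 × X^0.5 exactly, so the regression on logs should recover a slope (elasticity) of 0.5 and an intercept of ln(3). The variable names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Least squares intercept and slope for one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical liquidity (X) and volume (Y), built so that Y = 3 * X ** 0.5
liquidity = [10.0, 20.0, 40.0, 80.0, 160.0]
volume = [3.0 * x ** 0.5 for x in liquidity]

# Transform both variables, then fit a straight line to the logs
log_x = [math.log(x) for x in liquidity]
log_y = [math.log(y) for y in volume]
b0, b1 = fit_line(log_x, log_y)

print(f"elasticity b1 = {b1:.3f}")
print(f"exp(b0) = {math.exp(b0):.3f}")
```

For a log-linear model only Y is transformed, and for a linear-log model only X; the fitting step is unchanged.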

Practical Examples

  • Traditional Finance Example: Stock returns vs. log of market cap (size effect)
  • Calculation walkthrough: ln(Price) = 2.5 + 0.3×Time captures compound growth
  • Interpretation: Log transformations often stabilize variance and linearize relationships
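To read the log-linear example ln(Price) = 2.5 + 0.3×Time in price terms, exponentiate the fitted value. The sketch below treats the coefficients as given and shows that a slope of 0.3 is a continuously compounded growth rate, equivalent to roughly 35% per period in simple terms. The function name is illustrative.

```python
import math

b0, b1 = 2.5, 0.3  # coefficients from the example ln(Price) = 2.5 + 0.3 * Time

def predicted_price(t):
    """Back-transform the fitted log price to a price level."""
    return math.exp(b0 + b1 * t)

# Each one-unit step in Time multiplies the predicted price by exp(b1)
growth = math.exp(b1) - 1

for t in range(4):
    print(f"Time = {t}: predicted price = {predicted_price(t):.2f}")
print(f"growth per period = {growth:.2%}")
```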

DeFi Application

  • Protocol example: Log-log model for liquidity vs. trading volume in AMMs
  • Implementation: Transformed variables in yield curve modeling
  • Advantages/Challenges: Better model fit vs. interpretation complexity

Core Concepts Summary (80/20 Principle)

Must-Know Concepts

  1. Least Squares Estimation: Minimizes sum of squared residuals to find best-fit line
  2. R² Interpretation: Proportion of variation in Y explained by X
  3. Assumption Violations: Check residual plots for patterns indicating problems
  4. Statistical Significance: Use t-tests for coefficients, F-test for overall model
  5. Prediction vs. Confidence Intervals: Prediction intervals are wider due to additional uncertainty

Quick Reference Table

Concept             | Formula            | When to Use          | DeFi Equivalent
Simple Regression   | Y = b₀ + b₁X + ε   | Linear relationships | Token price vs. TVL
R²                  | SSR/SST            | Model fit assessment | AMM efficiency metrics
F-statistic         | MSR/MSE            | Overall significance | Protocol risk models
Prediction Interval | Ŷ ± t×SEE×√[…]     | Forecasting          | Yield range prediction

Comprehensive Formula Sheet

Essential Formulas

Formula 1: Simple Linear Regression
Y = b₀ + b₁X + ε
Where: Y = dependent variable, X = independent variable, 
       b₀ = intercept, b₁ = slope, ε = error term
Used for: Modeling linear relationships

Formula 2: Least Squares Coefficients
b₁ = Σ[(X - X̄)(Y - Ȳ)] / Σ[(X - X̄)²]
b₀ = Ȳ - b₁X̄
Where: X̄, Ȳ = sample means
Used for: Coefficient estimation

Formula 3: Coefficient of Determination
R² = SSR/SST = 1 - (SSE/SST)
Where: SST = total sum of squares, SSR = regression sum of squares,
       SSE = error sum of squares
Used for: Measuring goodness of fit

Formula 4: F-statistic
F = MSR/MSE = [SSR/1] / [SSE/(n-2)]
Where: MSR = mean square regression, MSE = mean square error
Used for: Testing overall model significance

Formula 5: Standard Error of Estimate
SEE = √[SSE/(n-2)]
Where: n = sample size
Used for: Measuring prediction accuracy

Formula 6: Prediction Interval
Ŷ ± t(α/2,n-2) × SEE × √[1 + 1/n + (X-X̄)²/Σ(X-X̄)²]
Used for: Forecasting with uncertainty bounds

HP 12C Calculator Sequences

Operation 1: Linear Regression Setup
RPN Steps: [f] CLEAR [Σ], then for each data pair: y-value [ENTER] x-value [Σ+]
Example: Calculate slope and intercept from data points

Operation 2: Correlation Coefficient
RPN Steps: any x-value [g] [ŷ,r], then [x≷y] (displays correlation)
Example: r = 0.85 indicates strong positive relationship

Operation 3: Prediction Calculation
RPN Steps: X value [g] [ŷ,r] (displays predicted Y from the stored data)
Example: Predict Y when X = 10 using stored coefficients

Practice Problems

Basic Level (Understanding)

  1. Problem: A regression of stock returns (Y) on market returns (X) yields: Y = 0.02 + 1.3X. The R² = 0.56.
    • Given: Regression equation and R²
    • Find: Interpret the coefficients and R²
    • Solution:
      • Intercept (0.02): Stock has 2% expected return when market return is 0%
      • Slope (1.3): For each 1% increase in market return, stock return increases 1.3%
      • R² (0.56): 56% of stock return variation is explained by market movements
    • Answer: The stock has above-market sensitivity (beta > 1) and modest explanatory power

Intermediate Level (Application)

  1. Problem: A DeFi protocol’s TVL (Y, in millions) is regressed against token price (X, in dollars): Y = 50 + 15X, SEE = $25M, n = 30.
    • Given: Regression equation, standard error, sample size
    • Find: 95% prediction interval when token price = $10
    • Solution:
      • Point prediction: Ŷ = 50 + 15(10) = $200M
      • t₀.₀₂₅,₂₈ ≈ 2.048
      • Prediction interval: 200 ± 2.048 × 25 × √[1 + 1/30 + (10-X̄)²/Σ(X-X̄)²]
      • Treating the square-root term as approximately 1 (large n, token price near X̄): 200 ± 2.048 × 25 = 200 ± 51.2
    • Answer: TVL is predicted to fall between $148.8M and $251.2M with 95% confidence
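The arithmetic in this problem can be verified in a few lines. Because X̄ and Σ(X-X̄)² are not given, the sketch reports two figures: the approximation that treats the square-root term as 1, and the smallest value the term can actually take, √(1 + 1/n), which occurs when the token price equals X̄. Variable names are illustrative.

```python
import math

# Inputs from the problem statement
b0, b1 = 50.0, 15.0
price = 10.0
see = 25.0        # in $ millions
n = 30
t_crit = 2.048    # two-sided 95% critical value, 28 degrees of freedom

y_hat = b0 + b1 * price

# Approximation: square-root term treated as 1
approx_half = t_crit * see
approx_low, approx_high = y_hat - approx_half, y_hat + approx_half

# Smallest possible square-root term, reached when price equals X-bar
min_half = t_crit * see * math.sqrt(1 + 1 / n)

print(f"point prediction: {y_hat:.1f}")
print(f"approximate interval: {approx_low:.1f} to {approx_high:.1f}")
print(f"half-width is at least: {min_half:.2f}")
```

The exact interval is therefore slightly wider than 200 ± 51.2, by an amount that depends on how far $10 lies from the sample mean price.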

Advanced Level (Analysis)

  1. Problem: Analyze a yield farming return model using log transformations. Original model: ln(Yield) = 2.5 + 0.8×ln(Risk), R² = 0.72, F = 45.6
    • Given: Log-log regression with goodness-of-fit measures
    • Find: Interpret the elasticity coefficient and evaluate model adequacy
    • Solution:
      • Elasticity interpretation: 1% increase in risk leads to 0.8% increase in yield
      • Model fit: 72% of yield variation explained by risk
      • F-test: Highly significant relationship (F = 45.6 far exceeds the 5% critical value, F₀.₀₅,₁,₂₈ ≈ 4.2, taking n = 30 as an assumed sample size)
      • Economic meaning: Diminishing returns to risk-taking
    • Answer: Model shows strong risk-return relationship with less than proportional yield increases for additional risk, consistent with efficient market theory
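The elasticity reading of the log-log slope can be checked numerically. In this sketch the slope 0.8 is taken from the problem, and the function name is illustrative: scaling risk by a factor k scales the predicted yield by k^0.8, regardless of the intercept.

```python
b1 = 0.8  # slope from ln(Yield) = 2.5 + 0.8 * ln(Risk)

def yield_multiplier(risk_multiplier):
    """Factor by which predicted yield changes when risk is scaled."""
    return risk_multiplier ** b1

pct_small = (yield_multiplier(1.01) - 1) * 100   # risk up 1%
pct_double = (yield_multiplier(2.0) - 1) * 100   # risk doubled

print(f"1% more risk   -> {pct_small:.3f}% more yield")
print(f"100% more risk -> {pct_double:.1f}% more yield")
```

A 1% increase in risk raises predicted yield by just under 0.8%, and doubling risk raises it by about 74% rather than 100%, which is the less-than-proportional response described in the answer.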

DeFi Applications & Real-World Examples

Traditional Finance Context

  • Institution Example: Banks use regression to model loan default rates based on credit scores
  • Market Application: Beta estimation for portfolio risk management
  • Historical Case: CAPM validation studies using market index regression

DeFi Parallels

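  • Protocol Implementation: Compound's interest rate models are linear or kinked-linear functions of pool utilization with governance-set parameters; regression is useful off-chain for studying how rates and utilization respond to market conditions
  • Smart Contract Logic: Automated market makers set prices through invariant formulas such as x × y = k rather than regression; regression is applied off-chain to analyze the resulting prices, volumes, and fees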
  • Advantages: Real-time recalibration, transparent algorithms, 24/7 operation
  • Limitations: Gas costs, oracle dependencies, model risk in volatile markets

Case Studies

  1. Case 1: Uniswap V3 Liquidity Prediction
    • Background: AMM needs to predict optimal liquidity ranges
    • Analysis: Regression of trading volume on price volatility and TVL
    • Outcomes: Improved capital efficiency through dynamic range adjustment
    • Lessons learned: Non-linear relationships require careful transformation

Common Pitfalls & Exam Tips

Frequent Mistakes

  • Mistake 1: Confusing correlation with causation - regression shows association, not causation
  • Mistake 2: Ignoring assumption violations - always check residual plots
  • Mistake 3: Over-interpreting R² - high R² doesn’t guarantee good predictions outside sample range

Exam Strategy

  • Time management: Allocate 4-5 minutes per regression problem
  • Question patterns: Often combined with hypothesis testing and confidence intervals
  • Quick checks: Verify R² is between 0 and 1, check units in predictions

Key Takeaways

Essential Points

  ✓ Simple linear regression models Y = b₀ + b₁X + ε where b₁ represents the marginal effect
  ✓ R² measures explained variation; higher values indicate better model fit
  ✓ Four key assumptions: linearity, homoscedasticity, independence, normality
  ✓ F-test evaluates overall model significance; t-tests evaluate individual coefficients
  ✓ Prediction intervals are wider than confidence intervals due to additional uncertainty

Memory Aids

  • Mnemonic: “LINE” for assumptions (Linearity, Independence, Normality, Equal variance)
  • Visual: Scatter plot with best-fit line and residual plots
  • Analogy: Regression is like finding the “average” relationship between variables

Cross-References & Additional Resources

Source Materials

  • Primary Reading: Volume 1, Chapter 10, Simple Linear Regression
  • Key Sections: Least squares estimation, assumption testing, ANOVA
  • Practice Questions: End-of-chapter problems 1-15

External Resources

  • Videos: Khan Academy statistics series on regression
  • Articles: “Regression Analysis in Finance” - Finance Research Foundation
  • Tools: Excel regression analysis, R statistical software, Python scipy

Review Checklist

Before moving on, ensure you can:

  • Explain each learning objective in your own words
  • Calculate regression coefficients using least squares method
  • Complete ANOVA table and interpret F-statistic
  • Check regression assumptions using residual analysis
  • Calculate and interpret prediction intervals
  • Identify appropriate functional form transformations
  • Apply concepts to both traditional finance and DeFi scenarios