To reach this goal, data is
- collected,
- analyzed,
- interpreted,
- presented and
- organized
so that conclusions can be drawn from it.
Descriptive vs inferential statistics
Statistics has two main branches:
- Descriptive statistics: summarize data using summary measures (mean, standard deviation, etc.)
- Inferential statistics: draw conclusions from data
Descriptive statistics summarizes and describes data with numbers. It does not make predictions from the data. Typical descriptive figures are:
- mean age
- highest income
- number of buyers of a product
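As a minimal sketch, the three figures above could be computed in R as follows. The data frame `survey` and all of its values are invented for illustration only.

```r
# Hypothetical survey data; every value is made up for this example.
survey <- data.frame(
  age    = c(23, 35, 41, 29, 52),
  income = c(31000, 48000, 52000, 39000, 75000),
  bought = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

mean_age       <- mean(survey$age)     # mean age
highest_income <- max(survey$income)   # highest income
n_buyers       <- sum(survey$bought)   # number of buyers (TRUE counts as 1)

print(c(mean_age = mean_age,
        highest_income = highest_income,
        n_buyers = n_buyers))
```

Each result only describes the observed rows; none of them says anything about people outside the sample.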
Inferential statistics is important for statistical hypothesis testing and
Statistical methods
All explanatory variables continuous: Regression
All explanatory variables categorical: Analysis of variance (ANOVA)
Explanatory variables both continuous and categorical: Analysis of covariance (ANCOVA)
Response variable:
- continuous: normal regression, ANOVA or ANCOVA
- proportion: logistic regression
- count: log-linear models
- binary: binary logistic analysis
- time-at-death: survival analysis
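The following sketch shows how some of these methods might be called in base R. It uses the built-in datasets `mtcars`, `PlantGrowth` and `InsectSprays` as stand-ins; the choice of variables is an assumption made for illustration and is not a recommended analysis of those data. Survival analysis is left out because it needs an add-on package.

```r
# Continuous response, continuous predictor: regression
fit_regression <- lm(mpg ~ wt, data = mtcars)

# Continuous response, categorical predictor: ANOVA
fit_anova <- aov(weight ~ group, data = PlantGrowth)

# Continuous response, continuous and categorical predictors: ANCOVA
fit_ancova <- lm(mpg ~ wt + factor(am), data = mtcars)

# Binary response (0/1): logistic regression via a binomial GLM
fit_binary <- glm(am ~ wt, family = binomial, data = mtcars)

# Count response: log-linear model via a Poisson GLM (log link)
fit_count <- glm(count ~ spray, family = poisson, data = InsectSprays)

print(summary(fit_anova))
```

In each call the formula has the form response ~ predictors, so switching method mostly means switching the fitting function or the `family` argument.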
Regression Analysis
Regression analysis describes the relationship between a response variable Y and one or more predictor variables X₁ … Xₙ (n = 1: simple regression, n > 1: multiple regression).
Y must be a continuous variable; the predictors can be continuous, discrete or categorical.
Model
A model describes the relationship between variables. Thus, it is the basis for making predictions.
The most basic model is the simple linear regression:
Y = β₀ + β₁X + ε
β₀ is the intercept, β₁ the slope and ε the error term.
A generalization of the simple linear regression is the multiple linear regression:
Y = β₀ + β₁U + β₂V + β₃W + ε
U, V and W are the predictors, Y the response.
Such a linear model is useful if the relationship between the predictors and the response is approximately linear.
In R, a linear model is built with lm (which returns a model object).
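A minimal sketch of both models, fitted with lm on simulated data. The coefficients used to generate the data (1, 2, -3 and 0.5), the sample size and the noise level are arbitrary assumptions chosen so that the estimates can be compared with known values.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated predictors and response with known (assumed) coefficients:
# y = 1 + 2*u - 3*v + 0.5*w + error
n <- 200
u <- rnorm(n)
v <- rnorm(n)
w <- rnorm(n)
y <- 1 + 2 * u - 3 * v + 0.5 * w + rnorm(n, sd = 0.5)
d <- data.frame(y, u, v, w)

# Simple linear regression: one predictor
simple_fit <- lm(y ~ u, data = d)

# Multiple linear regression: three predictors
multiple_fit <- lm(y ~ u + v + w, data = d)

# Estimated intercept and slopes (should be close to 1, 2, -3, 0.5)
print(coef(multiple_fit))

# The model object is the basis for predictions on new observations
new_obs <- data.frame(u = 1, v = 0, w = 2)
pred <- predict(multiple_fit, newdata = new_obs)
print(pred)
```

The returned model object can also be passed to functions such as summary or residuals to inspect the fit.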