statistics

An important goal of statistics is to investigate causality.
To reach this goal, data is collected and analyzed so that conclusions can be drawn from it.
Two major types of causal studies:

Statistical tests and procedures

Descriptive vs inferential statistics

Two main methods for statistics:
Descriptive statistics summarizes and describes data with numbers. It doesn't make predictions on these data. Typical descriptive numbers are:
Statistical inference uses data analysis to deduce properties of the probability distribution underlying the data.
Inferential statistics is important for statistical hypothesis testing and
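
To make the descriptive side concrete, here is a minimal sketch that summarizes a small sample with the Python standard library module statistics. The sample values are made up for illustration, and the choice of summary numbers is an assumption, not a list taken from these notes.

```python
import statistics

# Hypothetical sample; the values are made up for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "n": len(data),
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "stdev": statistics.stdev(data),    # sample standard deviation (divides by n - 1)
    "pstdev": statistics.pstdev(data),  # population standard deviation (divides by n)
    "min": min(data),
    "max": max(data),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

These numbers only describe the eight values at hand. Saying anything about the population they came from is the job of inferential statistics.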

Random sampling vs stratified sampling

Random sampling does not ensure that the composition of the sample matches the composition of the population.
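
The difference can be sketched in a few lines of Python. The population, the group labels "A" and "B", and the helper names simple_random_sample and stratified_sample are all hypothetical. Proportional allocation is only one of several ways to stratify.

```python
import random
from collections import Counter

def simple_random_sample(population, k, rng):
    """Draw k items so that every item has the same chance of selection."""
    return rng.sample(population, k)

def stratified_sample(population, k, key, rng):
    """Draw about k items, allocating to each stratum in proportion to its size."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding can make the total differ slightly from k.
        n = round(k * len(members) / len(population))
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical population: 70% belong to group "A", 30% to group "B".
population = [("A", i) for i in range(700)] + [("B", i) for i in range(300)]
rng = random.Random(0)  # fixed seed only to make the run reproducible

srs = simple_random_sample(population, 50, rng)
strat = stratified_sample(population, 50, key=lambda item: item[0], rng=rng)

srs_counts = Counter(group for group, _ in srs)      # composition varies from draw to draw
strat_counts = Counter(group for group, _ in strat)  # composition fixed by the allocation
print("simple random:", dict(srs_counts))
print("stratified:   ", dict(strat_counts))
```

The stratified sample reproduces the 70/30 split by construction. The simple random sample only does so on average.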

Statistical methods

All explanatory variables continuous: Regression
All explanatory variables categorical: Analysis of variance (ANOVA)
Explanatory variables both continuous and categorical: Analysis of covariance (ANCOVA)
Response variable:

Regression Analysis

Regression analysis is used to describe the relationship between a response variable Y and one or more predictor variables X₁ … Xₙ (n=1: simple regression, n>1: multiple regression).
Y must be a continuous variable; X can be continuous, discrete or categorical.

Sampling

The test of the random sample is this: Does every name or thing in the whole group have an equal chance to be in the sample?
… how do you get a random sample within the stratification? The obvious thing is to start with a list of everybody and go after names chosen from it at random; but that is too expensive. So you go into the streets - and bias your sample against stay-at-homes. You go from door to door by day - and miss most of the employed people. You switch to evening interviews - and neglect the movie-goers and night-clubbers.

Bias

Suppose you were to send to a group … a questionnaire that included this query: »Do you like to answer questionnaires?«
… bias introduced by unknown factors. It seems likely that the most effective factor was a tendency that must always be allowed for in reading poll results, a desire to give a pleasing answer.

Model

A model describes the relationship between variables. Thus, it is the basis for making predictions.
The most basic model is the simple linear regression:
Y = β₀ + β₁X + ε
ε is the error term
A generalization of the simple linear regression is the multiple linear regression:
Y = β₀ + β₁U + β₂V + β₃W + ε
U, V and W are the predictors, Y the response.
Such a linear model is useful if there is an approximately linear relationship between the predictors and the response.
In R, a linear model is built with lm (which returns a model object).
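
For comparison, the simple linear regression above can be fitted by hand in Python with the closed-form least-squares estimates. This is a minimal sketch and does not reproduce R's lm. The function name fit_simple_linear_regression and the data are hypothetical. Python 3.10 and later also offer statistics.linear_regression for the same one-predictor case.

```python
from statistics import mean

def fit_simple_linear_regression(x, y):
    """Return (b0, b1), the least-squares estimates for y = b0 + b1*x + error."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # estimated slope
    b0 = y_bar - b1 * x_bar   # estimated intercept
    return b0, b1

# Hypothetical data with a roughly linear trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_linear_regression(x, y)

# Residuals are the observed counterpart of the error term.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Using the fitted model to predict the response for a new x value.
prediction = b0 + b1 * 6.0
print(f"intercept={b0:.3f} slope={b1:.3f} prediction at x=6: {prediction:.3f}")
```

With an intercept in the model, the least-squares residuals sum to zero up to rounding, which is a quick sanity check on the fit.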

Test of relationship

Types of errors

Software

TODO

Simpson's Paradox
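
Simpson's paradox describes a trend that appears in every subgroup but reverses when the subgroups are pooled. The following sketch shows the arithmetic. The counts follow the frequently quoted kidney-stone treatment example and should be read as illustrative. The labels "A", "B", "small" and "large" and the helper names are assumptions.

```python
# (successes, trials) for each treatment within each subgroup.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, trials):
    """Success rate as a fraction between 0 and 1."""
    return successes / trials

def pooled_rate(subgroups):
    """Success rate after adding up all subgroups of one treatment."""
    total_successes = sum(successes for successes, _ in subgroups.values())
    total_trials = sum(trials for _, trials in subgroups.values())
    return rate(total_successes, total_trials)

for subgroup in ("small", "large"):
    rate_a = rate(*counts["A"][subgroup])
    rate_b = rate(*counts["B"][subgroup])
    print(f"{subgroup}: A={rate_a:.3f} B={rate_b:.3f}")

overall_a = pooled_rate(counts["A"])
overall_b = pooled_rate(counts["B"])
print(f"overall: A={overall_a:.3f} B={overall_b:.3f}")
```

Treatment A has the higher rate in both subgroups, yet B has the higher pooled rate, because A was applied mostly to the harder "large" cases.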

Econometrics

Econometrics applies statistical methods to economic data.
The basic tool for econometrics is the multiple linear regression.

Links

How to lie with statistics
http://stats.stackexchange.com

See also

probability, probability distributions
null hypothesis
Data mining applies statistics to discover knowledge from data.
Levels of measurement
The Python standard library module statistics.