Statistics Reference¶
One-Way ANOVA (Analysis of Variance) - Fisher, R.A. (1918)¶
One-way ANOVA is a statistical method used to test for differences among the means of three or more independent groups. It is based on the F-distribution and works by comparing the variance (variation) between the groups with the variance within the groups.
The formula for the F statistic in one-way ANOVA is:

\[ F = \frac{MSB}{MSW} \]

Where:

- MSB is the mean sum of squares between the groups
- MSW is the mean sum of squares within the groups
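As an illustrative sketch, the F statistic and its p-value can be computed with SciPy's `f_oneway`; the three groups below are invented for demonstration:

```python
from scipy.stats import f_oneway

# Hypothetical measurements from three independent groups
group_a = [23.1, 25.3, 24.8, 26.0, 24.2]
group_b = [27.5, 28.1, 26.9, 29.3, 27.7]
group_c = [22.0, 21.4, 23.3, 22.8, 21.9]

# f_oneway computes F = MSB / MSW and the corresponding p-value
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```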
Hypothesis Testing, Type I and Type II Errors, Significance Level - Neyman, J., & Pearson, E.S. (1928)¶
Hypothesis testing is a statistical method that is used to make decisions using data from a study. A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.
- A Type I error occurs when we reject a true null hypothesis (also known as a "false positive").
- A Type II error occurs when we fail to reject a false null hypothesis (also known as a "false negative").
The significance level (commonly denoted as α) is the probability of rejecting the null hypothesis when it is true (i.e., our tolerance for a Type I error).
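A quick simulation makes the significance level concrete: when the null hypothesis is actually true and we test at α = 0.05, we should reject it (a Type I error) in roughly 5% of repeated experiments. A minimal sketch, using a one-sample t-test on simulated data:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
alpha = 0.05          # significance level: tolerated Type I error rate
n_trials = 10_000
false_positives = 0

for _ in range(n_trials):
    # H0 is true here: the samples really come from a mean-0 population
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p_value = ttest_1samp(sample, popmean=0.0)
    if p_value < alpha:
        false_positives += 1   # rejecting a true H0 is a Type I error

print(f"Type I error rate: {false_positives / n_trials:.3f}")  # ~0.05
```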
Practical Significance - Cohen, J. (1994)¶
Practical significance refers to the magnitude of the difference, not just whether the difference between groups is statistically significant.
For example, a study might find a statistically significant difference between two groups, but the difference might be so small that it has no practical relevance.
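One common way to quantify practical significance is an effect size such as Cohen's d; it is not defined above, so the sketch below assumes the usual pooled-standard-deviation formula and uses invented data:

```python
import numpy as np
from scipy.stats import ttest_ind

def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled SD."""
    x, y = np.asarray(x), np.asarray(y)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# With very large samples, a tiny difference in means can reach
# statistical significance while the effect size stays negligible.
rng = np.random.default_rng(1)
a = rng.normal(loc=100.00, scale=15, size=1_000_000)
b = rng.normal(loc=100.08, scale=15, size=1_000_000)

print(f"p = {ttest_ind(a, b).pvalue:.4f}")  # likely below 0.05
print(f"d = {cohens_d(a, b):.4f}")          # yet trivially small
```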
Chi-Square Tests for One and Two Samples - Pearson, K. (1900) and Yates, F. (1934)¶
The chi-square test compares observed frequencies with expected frequencies. The one-sample version is a goodness-of-fit test, which checks whether observed frequencies follow a hypothesised distribution; the two-sample version is a test of independence, which checks whether two categorical variables are associated.
For a one-sample chi-square test, the test statistic is:

\[ \chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} \]

Where:

- \(O_i\) are the observed frequencies
- \(E_i\) are the expected frequencies
For a two-sample chi-square test, the test statistic is:

\[ \chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]

Where:

- \(O_{ij}\) are the observed frequencies in each cell of the contingency table
- \(E_{ij}\) are the expected frequencies in each cell of the contingency table
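Both variants are available in SciPy; a minimal sketch with hypothetical counts (note that `chi2_contingency` applies Yates' continuity correction by default for 2×2 tables):

```python
from scipy.stats import chisquare, chi2_contingency

# One-sample (goodness of fit): are 100 die rolls consistent with a fair die?
observed = [16, 18, 16, 14, 12, 24]
expected = [100 / 6] * 6
chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(f"goodness of fit: chi2 = {chi2:.2f}, p = {p:.3f}")

# Two-sample (independence): a 2x2 contingency table of invented counts
table = [[20, 30],
         [25, 25]]
chi2, p, dof, expected_counts = chi2_contingency(table)
print(f"independence: chi2 = {chi2:.2f}, p = {p:.3f}, dof = {dof}")
```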
Confidence Interval - Neyman, J. (1937)¶
A confidence interval provides an estimated range of values which is likely to include an unknown population parameter.
The formula for a confidence interval for a population mean, when the population standard deviation is known, is:

\[ \bar{X} \pm Z \frac{\sigma}{\sqrt{n}} \]

Where:

- \(\bar{X}\) is the sample mean
- \(Z\) is the Z-score from the standard normal distribution corresponding to the desired confidence level (for example, Z = 1.96 for a 95% confidence interval)
- \(\sigma\) is the population standard deviation
- \(n\) is the sample size
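A minimal sketch of this formula in Python, assuming σ is known (the data and σ below are invented):

```python
import numpy as np
from scipy.stats import norm

sample = [4.8, 5.2, 5.0, 5.5, 4.9, 5.1, 5.3, 5.0]  # hypothetical data
sigma = 0.3          # assumed known population standard deviation
confidence = 0.95

x_bar = np.mean(sample)
z = norm.ppf(1 - (1 - confidence) / 2)   # 1.96 for a 95% interval
margin = z * sigma / np.sqrt(len(sample))

print(f"{confidence:.0%} CI: ({x_bar - margin:.3f}, {x_bar + margin:.3f})")
```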
Correlation & Simple Linear Regression - Pearson, K. (1895)¶
Correlation measures the strength and direction of association between two continuous variables. Pearson's correlation coefficient \(r\) is the most widely used correlation coefficient.
The formula for Pearson's r is:

\[ r = \frac{\sum_{i}(x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_{i}(x_i - \bar{X})^2}\sqrt{\sum_{i}(y_i - \bar{Y})^2}} \]

Where:

- \(x_i\) and \(y_i\) are the individual sample points indexed with i
- \(\bar{X}\) and \(\bar{Y}\) are the means of the x and y variables
Simple linear regression is a statistical method that allows us to summarise and study the relationship between two continuous (quantitative) variables. The formula for simple linear regression is:

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

Where:

- \(y\) is the dependent variable
- \(\beta_0\) is the y-intercept
- \(\beta_1\) is the slope of the line
- \(x\) is the independent variable
- \(\varepsilon\) is the error term
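Both quantities are a one-liner in SciPy; the paired observations below are invented:

```python
from scipy.stats import pearsonr, linregress

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3]

# Pearson's r and its p-value
r, p = pearsonr(x, y)
print(f"r = {r:.3f} (p = {p:.4f})")

# Least-squares fit of y = beta_0 + beta_1 * x
fit = linregress(x, y)
print(f"y = {fit.intercept:.2f} + {fit.slope:.2f}x")
```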
T-Type and Z-Type Tests - Gosset, W.S. (1908)¶
T-tests and Z-tests are hypothesis tests that compare mean values. A t-test is used when the population standard deviation is not known and has to be estimated from the sample. The formula for a one-sample t-test is:

\[ t = \frac{\bar{X} - \mu}{s / \sqrt{n}} \]

Where:

- \(\bar{X}\) is the sample mean
- \(\mu\) is the population mean
- \(s\) is the sample standard deviation
- \(n\) is the sample size
A Z-test is used when the population standard deviation is known. The formula for a one-sample Z-test is:

\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]

Where:

- \(\bar{X}\) is the sample mean
- \(\mu\) is the population mean
- \(\sigma\) is the population standard deviation
- \(n\) is the sample size
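A sketch of both tests on invented data; SciPy has no one-sample Z-test helper, so the Z statistic is computed directly from the formula above:

```python
import numpy as np
from scipy.stats import ttest_1samp, norm

sample = [5.3, 5.1, 4.8, 5.6, 5.2, 5.4, 5.0]  # hypothetical data
mu = 5.0                                       # hypothesised population mean

# t-test: the standard deviation is estimated from the sample
t_stat, p_t = ttest_1samp(sample, popmean=mu)
print(f"t = {t_stat:.3f}, p = {p_t:.3f}")

# Z-test: sigma assumed known
sigma = 0.25
z = (np.mean(sample) - mu) / (sigma / np.sqrt(len(sample)))
p_z = 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value
print(f"Z = {z:.3f}, p = {p_z:.3f}")
```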
Mann-Whitney U Test (Wilcoxon Rank-Sum Test) - Mann, H.B., & Whitney, D.R. (1947)¶
The Mann-Whitney U test (also known as the Wilcoxon rank-sum test) is a nonparametric test used to compare two independent samples. It tests whether values from one sample tend to be larger than values from the other, and is often used as an alternative to the two-sample t-test when the data cannot be assumed to be normally distributed.
The formula to calculate U is:

\[ U = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1 \]

Where:

- \(n_1\) and \(n_2\) are the sample sizes
- \(R_1\) is the sum of ranks in the first sample
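A minimal sketch with SciPy, on invented samples that each contain an outlier which would trouble a t-test:

```python
from scipy.stats import mannwhitneyu

group_a = [1.2, 1.5, 1.1, 1.8, 5.0, 1.3]
group_b = [2.1, 2.6, 2.4, 6.5, 2.9, 2.2]

u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
```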
It's important to remember that these are very simplified explanations of these statistical methods and tests, and their use in practice can often be much more complex, depending on a study's specific design and goals.
Linear Regression (first invented in the 19th century)¶
This supervised learning algorithm models the relationship between a continuous target variable and one or more independent variables by fitting a linear equation to the data. The formula in MathJax format is:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon \]

where \(y\) is the target variable, \(x_1, x_2, ..., x_n\) are the independent variables, \(\beta_0, \beta_1, ..., \beta_n\) are the parameters of the model, and \(\epsilon\) is the error term.
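As an illustrative sketch, scikit-learn's `LinearRegression` fits the \(\beta\) coefficients by ordinary least squares; the toy data are invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: two independent variables, one continuous target
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]])
y = np.array([5.1, 4.9, 11.2, 10.8, 17.1, 16.9])

model = LinearRegression().fit(X, y)
print("beta_0 (intercept):", model.intercept_)
print("beta_1..n (coefficients):", model.coef_)
print("prediction for [7, 8]:", model.predict([[7, 8]]))
```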
Support Vector Machine (SVM) (first invented in 1964 by Vladimir N. Vapnik and Alexey Ya. Chervonenkis)¶
This supervised learning algorithm is used mainly for classification tasks but is also suitable for regression tasks. SVM separates classes by finding a decision boundary that maximises the margin between the closest points of different classes (called support vectors). In the case of linearly separable binary classification, the decision function is given by:

\[ f(\mathbf{x}) = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b) \]

where \(\mathbf{w}\) is the normal vector to the hyperplane, \(\mathbf{x}\) represents the data points, and \(b\) is the bias term. Note that the actual SVM algorithm involves more complexity, such as the use of a kernel function for non-linearly separable cases.
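A minimal sketch with scikit-learn's `SVC`, using a linear kernel so the fitted model corresponds directly to the \(\mathbf{w} \cdot \mathbf{x} + b\) form above (the 2D points are invented):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters in 2D
X = np.array([[1, 1], [1, 2], [2, 1],      # class 0
              [5, 5], [5, 6], [6, 5]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print("w:", clf.coef_[0], " b:", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)
print("prediction for [3, 3]:", clf.predict([[3, 3]]))
```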
Naive Bayes (based on Bayes' Theorem, formulated by Thomas Bayes in the 18th century)¶
This is a supervised learning algorithm used for classification tasks. The algorithm assumes that the features are conditionally independent of one another given the class.
Given a class variable \(y\) and dependent feature vector \(x_1\) through \(x_n\), the formula for Naive Bayes is:

\[ P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)} \]

However, because \(P(x_1, ..., x_n)\) is constant given the input, it can be removed to simplify the equation. So, we get:

\[ \hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y) \]
The class with the highest posterior probability is the output of the prediction.
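A minimal sketch with scikit-learn; `GaussianNB` is one concrete choice of likelihood (it assumes each \(P(x_i \mid y)\) is Gaussian, which the text above does not specify), and the data are invented:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: two continuous features, two classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.8, 2.0],   # class 0
              [3.0, 4.1], [3.2, 3.9], [2.8, 4.0]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)
print("posteriors P(y | x):", clf.predict_proba([[2.0, 3.0]]))
print("predicted class:", clf.predict([[2.0, 3.0]]))  # argmax of the posterior
```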
Logistic Regression (developed in the 1950s for use in biological sciences)¶
This is a supervised learning algorithm primarily used for binary classification problems. The basis of logistic regression is the logistic function, which takes in any real-valued number and maps it to a value between 0 and 1. Given an input vector \(x\), the logistic regression model predicts the probability \(p\) that \(y = 1\) as follows:

\[ p(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}} \]

where \(x_1, x_2, ..., x_n\) are the features of the data, and \(\beta_0, \beta_1, ..., \beta_n\) are the parameters of the model.
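A minimal sketch with scikit-learn's `LogisticRegression` on invented one-feature data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary classification data with a single feature
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print("beta_0:", clf.intercept_[0], " beta_1:", clf.coef_[0][0])
print("P(y = 1 | x = 2.0):", clf.predict_proba([[2.0]])[0][1])
```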