SUMMARY BBS1003
Week 1
Statistics works on the basis of creating a statistical model to predict what will happen.
The type of variable determines which statistical technique can be used in a proper way.
Variable: recorded piece of information or a characteristic about a person in a study
- Specific for each subject or individual, the opposite is a constant
o Binary variable: variables which only take two values (true or false, yes or no)
Identifiers: identify an individual (student number, employer ID) but are not variables
Categorical (qualitative) variable: people or things are placed in groups or categories
- Number of categories may be determined naturally or chosen by the researcher
Nominal, Ordinal
Numeric (quantitative) variables: numbers
o Discrete: takes whole-number values only (counts)
- e.g., number of births per day
o Continuous: measured on a continuous scale (with decimals)
- e.g., age, weight, income, temperature, time
Interval, Ratio
Numeric variables can always be converted into categorical variables
Levels of Data
1. Nominal: ordering is not based on magnitude or size (e.g., disease, sex, country of birth, colour)
- There is no ranking order and the space between the scores does not have any meaning
- Presented in bar chart
- The only average that can be taken is Mode
2. Ordinal: ordering based on magnitude or size, there is a ranking order (e.g., size of coffee, (dis)agree)
- The space between the rankings is not always the same
3. Ratio-scale: 0 is meaningful = 0 means absence
4. Interval-scale: 0 has no meaning = 0 does not mean absence
- Ratio and Interval data are summarized in a line graph.
Interval variables contain the same information as nominal and ordinal variables plus the extra information
that differences between scores can be meaningfully interpreted.
Summarizing Data
Frequency Table: percentages show distribution
- For a small group, frequency is more meaningful
- For a bigger group, proportion and percentages are more
meaningful
- Often the cumulative percentage is reported, which adds up the
percentages of all categories up to and including the current one.
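A frequency table with percentages and cumulative percentages can be sketched as follows. The coffee-size data and all variable names are made up for illustration and do not come from the course material.

```python
from collections import Counter

# Hypothetical sample: coffee sizes ordered by 10 customers (made-up data).
orders = ["small", "medium", "medium", "large", "small",
          "medium", "large", "medium", "small", "medium"]
order_of_categories = ["small", "medium", "large"]  # ordinal ranking

counts = Counter(orders)
total = len(orders)

table = []          # rows of (category, frequency, percentage, cumulative %)
cumulative = 0.0
for category in order_of_categories:
    frequency = counts[category]
    percentage = 100 * frequency / total
    cumulative += percentage          # running total of the percentages
    table.append((category, frequency, percentage, cumulative))

for row in table:
    print("%-8s %3d %6.1f%% %6.1f%%" % row)
```

The cumulative percentage of the last category is always 100%, which is a quick check that no observation was lost.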
Bar chart: separate bars to indicate separate categories
with no continuity between them
Qualitative, Nominal/ Ordinal Variables
- The frequency (counts) at the Y-axis and the score at the X-axis
- The bars are not connected and the distance between the bars does not have a meaning
Pie chart: total is represented by whole circle
Histogram: non-separate bars dealing with continuous data
- Before drawing a histogram, set up a table showing class intervals, boundaries, widths and
frequencies
- Do not look at the height of a bar, but at the surface area of the bar
- The width of each bar is meaningful and is chosen by the researcher
- The height of each bar is calculated as relative frequency / class width
- The total area under all bars will then be exactly equal to one
Quantitative, Interval and Ratio Variables
The left boundary of each class is included in the class interval, written [, and the right boundary is an open
boundary, meaning that it is not included in the class interval, written ). For example, [20, 30) contains 20 but not 30.
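The class table behind a histogram can be sketched as follows. The ages and class boundaries are made up, and the sketch assumes bar height = relative frequency / class width, which is the choice that makes the total area equal to one.

```python
# Hypothetical ages of 20 people (made-up data).
ages = [18, 19, 21, 22, 22, 24, 25, 27, 29, 30,
        31, 33, 35, 38, 41, 44, 47, 52, 58, 63]
# Class boundaries chosen by the researcher; widths may differ.
boundaries = [18, 25, 30, 40, 65]

n = len(ages)
classes = []   # rows of (left, right, frequency, height)
for left, right in zip(boundaries, boundaries[1:]):
    # [left, right): left boundary included, right boundary excluded.
    frequency = sum(1 for age in ages if left <= age < right)
    width = right - left
    height = (frequency / n) / width   # relative frequency per unit of width
    classes.append((left, right, frequency, height))

# Area of a bar = height * width; the areas add up to 1.
total_area = sum(height * (right - left)
                 for left, right, _, height in classes)
```

Because the classes have different widths, the tallest bar is not necessarily the class with the most observations; the area is what carries the information.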
Measures of Central Tendency
Average: a ‘typical’ or ‘middle’ number
Central tendency: central or typical value for a probability
distribution.
- Summarizes many numbers in one number (e.g., 4 3 1 6 1 7)
o Mean: sum of all numbers divided by how many numbers there are
- Used only for quantitative data (numbers) that are fairly symmetrical
- Can be seen as the centre of mass/gravity of a solid piece of
wood cut in the shape of the distribution (the area under the curve).
o Median: order all the numbers from low to high, find the middle
number
- Used only for quantitative data, when the data are skewed.
Uneven total of numbers: take the middle number
Even total of numbers: there are two middle numbers, take the mean of these.
o Mode: most common number in a data set, most frequent value
When there is no most common number, there is no mode.
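As a small illustration, the three averages can be computed with Python's standard `statistics` module. The first list is the example from these notes; the second list (one number removed) is added here only to show the uneven case.

```python
import statistics

scores = [4, 3, 1, 6, 1, 7]           # the example numbers from the notes

mean = statistics.mean(scores)        # sum / count = 22 / 6
median = statistics.median(scores)    # even count: mean of the two middle values (3 and 4)
mode = statistics.mode(scores)        # most frequent value

odd_scores = [4, 3, 1, 6, 1]          # uneven count: the single middle value
odd_median = statistics.median(odd_scores)
```

Sorted, the six scores are 1 1 3 4 6 7, so the median lies halfway between 3 and 4.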
Measures of Spread
The variance and Standard Deviation are
single measures which summarize
differences between scores.
It should be noted that both of these
summary measures represent exactly the
same information.
Variance: measure of how peaked/flat the
distribution is
- Average of the squared differences from the
mean (squared deviations)
- Where the variance gives a rough idea of the
spread, the Standard Deviation is more
concrete because it is in the same units as the data
Standard Deviation: just the square root of the
variance.
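The two definitions can be worked through step by step. This is a minimal sketch with made-up numbers, treating the list as a whole population (so the sum is divided by the number of values).

```python
import math

scores = [2, 4, 4, 4, 5, 5, 7, 9]     # made-up data, treated as a whole population

mean = sum(scores) / len(scores)
squared_differences = [(x - mean) ** 2 for x in scores]
variance = sum(squared_differences) / len(scores)   # average squared difference
standard_deviation = math.sqrt(variance)            # back in the original units
```

Here the mean is 5, the squared differences add up to 32, the variance is 32 / 8 = 4 and the Standard Deviation is 2.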
Changes to Variance and Mean
- If all values go through a change (e.g., all increase by 19 or are multiplied by 2), the Mean also changes by
this exact amount or factor.
- The Variance does not change by the addition or subtraction to values.
If values are multiplied by X, the variance is multiplied by X^2, as the values are squared in the
variance equation.
There is a difference in calculating the variance when you have a sample instead of a whole population.
In these cases, divide the sum of squared differences by N-1 instead of by N.
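These rules can be checked numerically. The sketch below uses made-up data and the standard `statistics` module, where `pvariance` divides by N (population) and `variance` divides by N-1 (sample).

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]            # made-up data

shifted = [x + 19 for x in scores]           # every value increased by 19
doubled = [x * 2 for x in scores]            # every value multiplied by 2

base_mean, base_var = statistics.mean(scores), statistics.pvariance(scores)
shift_mean, shift_var = statistics.mean(shifted), statistics.pvariance(shifted)
double_mean, double_var = statistics.mean(doubled), statistics.pvariance(doubled)

# Same data treated as a sample: sum of squared differences / (N - 1).
sample_var = statistics.variance(scores)
```

Adding 19 moves the mean from 5 to 24 and leaves the variance at 4; doubling moves the mean to 10 and the variance to 16 (4 times 2^2). The sample variance, 32 / 7, is slightly larger than the population variance, 32 / 8.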
Dividing by N-1 takes into account how different the sample may be from the whole population.
Normal Distribution
Normal Distribution: the bell-shaped curve that many large data sets follow when plotted
- Mean = Median = Mode, all at the centre of the normal distribution
- Unimodal: only 1 peak in distribution
- Symmetrical when divided into two
Standard Deviation: roughly the average distance between an observation and the Mean
- Spread of the normal distribution; observations can lie either above or below the Mean
- Used to measure how much variation exists in distribution
Low standard deviation: values are close to the mean = lower variability
High standard deviation: values are spread out over a large range = higher variability
$$SD = \sqrt{\frac{\sum (X - M)^2}{n - 1}}$$
For the Standard Deviation, there is the 68/95/99.7 – 1/2/3 SD rule.
- If we extend 1 SD above the mean (+) and 1 SD below the mean (-), approximately 68% of the
observations are contained within this interval.
- Approximately 95% of the population would be between 2 SD (Standard Deviations) above the mean
and 2 SD below the mean for a normal distribution.
- Approximately 99.7% would be between 3 SD above the mean and 3 SD below the mean.
How many Standard Deviations an observed value is from the Mean is represented in a Z-score.
Standardize = convert a score into SD units
Z-score: used to measure how many standard deviations above or below the mean a particular
score is
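A Z-score is usually computed as (score - mean) / SD. The sketch below shows this with made-up exam numbers, and also checks the 68/95/99.7 rule using `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The function name `z_score` is a helper introduced here for illustration.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical exam with mean 70 and SD 10 (made-up numbers).
z = z_score(85, mean=70, sd=10)      # 15 points above the mean = 1.5 SD

# Share of a normal distribution within k SD of the mean, for k = 1, 2, 3.
standard_normal = NormalDist(mu=0, sigma=1)
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}
```

The three shares come out at roughly 0.68, 0.95 and 0.997, which is where the rule gets its name.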
Modality, Skewness and Kurtosis
Modality: number of peaks in distribution
- Most distributions are unimodal (like the normal distribution), but they can also be bimodal or multimodal
Skewness: measure of symmetry of distributions
- When the peak is off-centre, one tail of the distribution is longer than the other
- Every time, the total area under the curve will be exactly equal to 1.
$$\text{Skewness} = \frac{\text{Mean} - \text{Median}}{\text{Standard Deviation}}$$
o Positive skewness: bulk of the graph lies to the left
- Leaving the tail pointing to the right, the positive direction
- Mode < Median < Mean
o Negative skewness: bulk of the graph lies to the right
- Leaving the tail pointing to the left, the negative direction
- Mean < Median < Mode
- Positive skewness is more common than negative skewness.
- Skewness affects the Central Tendency and therefore the mode/median/mean are not in the same place.
- Normal distributions are symmetrical and bell-shaped, with skewness = 0, because the mean equals the
median.
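The skewness formula from these notes, (Mean - Median) / Standard Deviation, can be tried on small made-up data sets. The sketch assumes the sample Standard Deviation (dividing by n - 1); other skewness measures exist and give different numbers, but the sign is read the same way.

```python
import statistics

def skewness(values):
    """(Mean - Median) / Standard Deviation, the formula given in these notes."""
    difference = statistics.mean(values) - statistics.median(values)
    return difference / statistics.stdev(values)

right_tail = [1, 2, 2, 3, 3, 3, 4, 10]    # made-up data with a long right tail
left_tail = [-x for x in right_tail]       # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

skew_right = skewness(right_tail)   # mean 3.5 > median 3, so positive
skew_left = skewness(left_tail)     # mean -3.5 < median -3, so negative
skew_none = skewness(symmetric)     # mean = median, so zero
```

The single large value 10 pulls the mean above the median, which is exactly what a tail pointing in the positive direction does.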
Kurtosis: measuring shape of the curve
o Mesokurtic: normal distribution, K = 0
o Leptokurtic (positive kurtosis): peaks sharply with heavy tails
- Less variability, K > 0
o Platykurtic (negative kurtosis): flattened graph, highly
dispersed, K < 0
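These notes give no formula for K. As an assumption, the sketch below uses the common moment-based definition of excess kurtosis (fourth central moment divided by the squared variance, minus 3), which is 0 for a normal distribution. The data sets and the function name are made up for illustration.

```python
def excess_kurtosis(values):
    """Fourth central moment / squared variance, minus 3 (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # population variance
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

flat = list(range(1, 11))                       # evenly spread values 1..10
peaked = [5, 5, 5, 5, 5, 5, 5, 5, 0, 10]        # sharp peak plus two extreme values

k_flat = excess_kurtosis(flat)       # negative: platykurtic
k_peaked = excess_kurtosis(peaked)   # positive: leptokurtic
```

For the peaked data the variance is 5 and the fourth moment is 125, so K = 125 / 25 - 3 = 2.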
Boxplot: shows median, IQR and range all in one diagram
o Median (Q2): point that cuts dataset in half
- The box contains the middle 50% of all observations (between the 25th and 75th percentiles)
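The numbers a boxplot is drawn from can be sketched with `statistics.quantiles` (Python 3.8 or later). The data are made up, and quartile conventions differ slightly between textbooks and software, so Q1 and Q3 may not match a hand calculation exactly.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]     # made-up data

# quantiles(..., n=4) returns the three cut points Q1, Q2 and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                                   # interquartile range: width of the box
data_range = max(data) - min(data)              # distance from lowest to highest value
```

Q2 is the median, so it should always agree with `statistics.median(data)`.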