| Title | Author | Created | Published | Tags |
| ----------------------------------------- | ---------- | ------------------ | ------------------ | -------------------------------------------------- |
| Central Tendency, Variability, & Position | Jon Marien | September 11, 2025 | September 11, 2025 | [[#classes\|#classes]], [[#MATH26367\|#MATH26367]] |
# Central Tendency, Variability, & Position
## General Terms & Definitions
### Population
The complete collection of measurements, objects or individuals under study.
### Parameter
Numerical characteristic of a population.
### Census
Survey of the entire population.
### Sample
Subset of a population.
### Statistic
Numerical characteristic of a sample.
### Sampling Methods
There are many methods of taking samples. Random sampling reduces the chance of introducing bias or sampling error. In random sampling, every element of the population has an equal chance of being selected for the sample.
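As a rough illustration of the equal-chance idea, the sketch below draws a random sample with Python's standard `random` module. The population of ID numbers, the sample size, and the seed are all assumptions made for the example.

```python
import random

# Hypothetical population: ID numbers 1..100 standing in for individuals.
population = list(range(1, 101))

# A fixed seed is used only so the sketch gives repeatable results.
rng = random.Random(42)

# Draw 10 distinct elements; each element is equally likely to be chosen.
sample = rng.sample(population, k=10)
print(sample)
```

Because `sample` draws without replacement, no individual appears twice in the sample.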
### Data Types
When collecting data, we can record attributes, measurements, counts, and so on. The type of data we end up with depends on what we record. Data can be classified as either qualitative or quantitative.
- Qualitative data involves categorical variables.
- Quantitative data involves numerical variables that are either discrete or continuous:
- A discrete numerical variable can be determined by counting a quantity.
- A continuous numerical variable can be determined by measuring a quantity.
### Frequency Distributions and Histograms
Frequency distributions organize data items into compressed form without obscuring essential facts and patterns. They also provide insight into patterns in data. A histogram is a graphical representation of the frequency distribution.
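A frequency distribution can be sketched in a few lines with `collections.Counter`. The scores below are a small set chosen for illustration, and the row of asterisks is only a crude text stand-in for a histogram.

```python
from collections import Counter

# Small illustrative set of scores (an assumption for this sketch).
scores = [2, 3, 1, 4, 3, 1, 4, 6, 3]

# Count how often each distinct score occurs.
frequency = Counter(scores)

# Print one row per score value, with one asterisk per occurrence.
for value in sorted(frequency):
    print(f"{value}: {'*' * frequency[value]}")
```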
---
# Central Tendency
## Measures of Central Tendency
- Measures of central tendency allow us to find a single score that:
- defines the centre of the distribution
- best represents the entire distribution
- There are 3 measures:
- Mean
- Median
- Mode
### Mean
- Also referred to as the average
$\text{Mean}=\frac{\text{Sum of scores}}{\text{Number of scores}}$
Population mean: $\mu=\frac{\Sigma X}{N}$
Sample mean: $\bar{x}=\frac{\Sigma X}{n}$
### Examples
1. What is the mean for the following set:
- 2, 3, 1, 4, 3, 1, 4, 6, 3
$ \begin{aligned}\text{Mean}=\frac{\sum x}{n}&=\frac{2+3+1+4+3+1+4+6+3}{9}\\&=\frac{27}9\\&=3\end{aligned}$
2. ![[image-955.png]]
- Step 1: Calculate $f\cdot x$ for each score
- Step 2: Calculate $\sum f\cdot x$
- Step 3: Find the total number of participants, $n$
- Step 4: Mean $=\left(\sum f\cdot x\right)\div n$
- $\begin{aligned}&\sum f\cdot x\:=\:10+18+24+21+24+10=107\\&\mathrm{n}=1+2+3+3+4+2=15\\&\mathrm{Mean}=107\div15=7.13\end{aligned}$
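Both worked examples can be checked with a short sketch. The function names are hypothetical, and the score values 10, 9, 8, 7, 6, 5 are inferred from the products and frequencies shown in the calculation above rather than read from the table image, so treat them as an assumption.

```python
def mean(scores):
    """Sum of scores divided by the number of scores."""
    return sum(scores) / len(scores)


def mean_from_frequencies(values, frequencies):
    """Mean of a frequency table: sum of f*x divided by n (the sum of f)."""
    total = sum(f * x for x, f in zip(values, frequencies))
    n = sum(frequencies)
    return total / n


# Example 1: raw scores, 27 / 9.
example_1 = mean([2, 3, 1, 4, 3, 1, 4, 6, 3])

# Example 2: score values inferred from the f*x products (assumption).
values = [10, 9, 8, 7, 6, 5]
frequencies = [1, 2, 3, 3, 4, 2]
example_2 = mean_from_frequencies(values, frequencies)

print(example_1, round(example_2, 2))
```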
### Weighted Mean
Combining information from two samples.
Example: One sample has a mean of M = 4 and a second sample has a mean of M = 8. A researcher wants to combine them into a single set of scores. What is the mean for the combined set of scores if sample 1 has n = 7 scores and sample 2 has n = 8 scores?
$\text{Overall mean } (M)=\frac{\Sigma x\text{ (overall sum of all scores in the combined samples)}}{n\text{ (total number of individuals in both samples)}}$
$\Sigma x=4\cdot7+8\cdot8=92,\quad n=7+8=15,\quad\text{Weighted mean}=92\div15\approx6.13$
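The same calculation can be sketched in code. The function name `combined_mean` is hypothetical. Each sample's sum is recovered as mean times size, because the individual scores are not needed.

```python
def combined_mean(means, sizes):
    """Weighted mean of several samples, given each sample's mean and size."""
    # Each sample contributes mean * n to the overall sum of scores.
    total_sum = sum(m * n for m, n in zip(means, sizes))
    total_n = sum(sizes)
    return total_sum / total_n


overall = combined_mean([4, 8], [7, 8])
print(round(overall, 2))  # 92 / 15, about 6.13
```

The result is not the simple average of 4 and 8 (which is 6), because the second sample has more scores and therefore carries more weight.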
## Median
- The median is:
- The middle value of a set of scores in ascending order
- Score that divides a distribution in half
![[image-958.png]]
![[image-964.png]]
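A minimal sketch of the median, handling both an odd and an even number of scores. The function name and the even-sized data set are assumptions for illustration; the odd-sized set reuses the scores from the mean example.

```python
def median(scores):
    """Middle value of the ordered scores (average of the two middle values if n is even)."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: a single middle score exists.
        return ordered[mid]
    # Even n: take the value halfway between the two middle scores.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median([2, 3, 1, 4, 3, 1, 4, 6, 3])   # ordered: 1 1 2 3 3 3 4 4 6
even_case = median([1, 2, 3, 4])                 # hypothetical even-sized set
print(odd_case, even_case)
```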
## Mode
- The mode is:
- The most frequent score in a distribution
![[image-959.png]]
- A sample of scores can have more than one mode.
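Because a sample can have more than one mode, a sketch of the mode is easiest to write so that it returns a list. The function name `modes` and the second data set are assumptions for illustration.

```python
from collections import Counter


def modes(scores):
    """Return every score that occurs with the highest frequency."""
    counts = Counter(scores)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


single_mode = modes([2, 3, 1, 4, 3, 1, 4, 6, 3])  # 3 occurs three times
two_modes = modes([1, 1, 2, 3, 3])                # hypothetical bimodal set
print(single_mode, two_modes)
```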
## How to Choose an Appropriate Measure of Central Tendency
![[image-961.png]]
### Shape of Distribution
![[image-962.png]]
### Skewed Data
- The difference between mean and median can be used to measure the amount of skew in the data.
- Skew is a measure of how asymmetrically the data is distributed.
Calculating the skew: $\text{skew}=\frac{3(\text{mean}-\text{median})}{\text{standard deviation}}=\frac{3(M-P_{50})}{s}$
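The skew formula above can be sketched with the standard `statistics` module. This assumes $s$ is the sample standard deviation (dividing by $n-1$). The function name and the second data set are hypothetical.

```python
import statistics


def pearson_skew(scores):
    """Skew estimated as 3 * (mean - median) / standard deviation."""
    m = statistics.mean(scores)
    med = statistics.median(scores)
    s = statistics.stdev(scores)  # sample standard deviation (assumption)
    return 3 * (m - med) / s


# Mean and median are both 3 here, so the skew is zero.
symmetric = pearson_skew([2, 3, 1, 4, 3, 1, 4, 6, 3])

# Hypothetical set with one large value pulling the mean above the median.
right_skewed = pearson_skew([1, 2, 2, 3, 12])
print(symmetric, right_skewed)
```

A positive result indicates that the mean lies above the median (a tail to the right), and a negative result indicates the opposite.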
# Measures of Variability
- A measure of variability allows us to:
- understand the extent to which scores in a distribution are close together (clustered) or far apart (spread out)
- Some measures of variability include:
- Range
- Variance
- Standard Deviation
## Range
- The distance between the smallest score and the largest score in a distribution.
- Range is the **simplest** measure of variability.
- It is used to decide how to group data for a frequency distribution.
$62, 76, 71, 80, 75, 87, 78, 70, 85, 86, 93, 59, 87, 69, 91, 88, 58, 64, 83, 60$
$\text{Range}=93-58=35$
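As a quick check of the range calculation, using the twenty scores listed above:

```python
scores = [62, 76, 71, 80, 75, 87, 78, 70, 85, 86,
          93, 59, 87, 69, 91, 88, 58, 64, 83, 60]

# Range = largest score minus smallest score.
value_range = max(scores) - min(scores)
print(value_range)  # 93 - 58 = 35
```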
## Standard Deviation
- **Measure of the average distance from the mean.**
- Uses mean as its reference point.
- Most common measure of variability.
![[image-965.png]]
### Population Standard Deviation
![[image-966.png]]
![[image-967.png]]![[image-968.png]]
![[image-969.png]]
### Sample Standard Deviation
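Sample variance: $s^2=\frac{\Sigma(x-M)^2}{n-1}$
Also written as:
$s^2=\frac{SS}{n-1}=\frac{SS}{df}$
where $SS=\Sigma(x-M)^2$ is the sum of squared deviations from the sample mean and $df=n-1$ is the degrees of freedom. The sample standard deviation is the square root of the sample variance: $s=\sqrt{s^2}$.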
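A sketch of the sample variance computed step by step from the sum of squared deviations, then cross-checked against the standard `statistics` module. The function names are hypothetical; the scores are reused from the mean example.

```python
import math
import statistics


def sum_of_squares(scores):
    """SS: sum of squared deviations from the sample mean."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores)


def sample_variance(scores):
    """s^2 = SS / (n - 1), where n - 1 is the degrees of freedom."""
    return sum_of_squares(scores) / (len(scores) - 1)


scores = [2, 3, 1, 4, 3, 1, 4, 6, 3]   # mean is 3
ss = sum_of_squares(scores)            # 1+0+4+1+0+4+1+9+0 = 20
s2 = sample_variance(scores)           # 20 / 8 = 2.5
s = math.sqrt(s2)                      # sample standard deviation

# statistics.variance and statistics.stdev also divide by n - 1.
print(ss, s2, round(s, 3), statistics.variance(scores))
```

For a full population, `statistics.pvariance` and `statistics.pstdev` divide by $N$ instead of $n-1$.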
# Percentiles
Percentiles divide a data set into 100 equal parts. The $p^\text{th}$ percentile is a value such that at most $p\%$ of the observations in the data set are less than it and the remainder are greater.
$\text{Percentile}=\frac{(\text{number of values below }x)+0.5}{\text{total number of values}}\times100\%$
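The percentile formula above translates directly into code. The function name `percentile_rank` is hypothetical, and the scores are the twelve test scores used in this section's example.

```python
def percentile_rank(scores, x):
    """Percentile of x: (number of values below x + 0.5) / total * 100."""
    below = sum(1 for value in scores if value < x)
    return (below + 0.5) / len(scores) * 100


scores = [45, 51, 53, 59, 62, 64, 66, 76, 86, 88, 89, 91]

# Four scores (45, 51, 53, 59) lie below 62, so (4 + 0.5) / 12 * 100.
rank_of_62 = percentile_rank(scores, 62)
print(rank_of_62)  # 37.5
```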
To find the data value corresponding to a given percentile:
- Arrange the data in order from lowest to highest.
- Substitute into the formula $c=\frac{np}{100}$, where $n$ is the total number of data points and $p$ is the percentile.
- If $c$ is not a whole number, round up to the next whole number. Starting at the lowest value, count over to the value in that rounded-up position.
- If $c$ is a whole number, use the value halfway between the $c$th and $(c+1)$th values, counting up from the lowest value.
## Example
Find the 10th percentile and the 50th percentile of the following test scores: $45, 51, 53, 59, 62, 64, 66, 76, 86, 88, 89, 91$.
For $P_{10}$: $c=\frac{12\times10}{100}=1.2$
This is not an integer, so we round 1.2 up to 2. The second data value is 51, so $P_{10}=51$.
For $P_{50}$: $c=\frac{12\times50}{100}=6$
This is an integer, so we take the average of the 6th and 7th values:
$P_{50}=\frac{64+66}{2}=65$
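The procedure used in this example can be sketched as a small function. The name `value_at_percentile` is hypothetical, and the sketch assumes a whole-number percentile with $0<p<100$.

```python
import math


def value_at_percentile(scores, p):
    """Data value at the p-th percentile, using c = n * p / 100."""
    ordered = sorted(scores)
    n = len(ordered)
    c = n * p / 100
    if not c.is_integer():
        # Not a whole number: round up and take the value in that position.
        position = math.ceil(c)
        return ordered[position - 1]   # positions count from 1
    # Whole number: halfway between the c-th and (c+1)-th values.
    c = int(c)
    return (ordered[c - 1] + ordered[c]) / 2


scores = [45, 51, 53, 59, 62, 64, 66, 76, 86, 88, 89, 91]
p10 = value_at_percentile(scores, 10)   # c = 1.2, rounds up to position 2
p50 = value_at_percentile(scores, 50)   # c = 6, average of 6th and 7th
print(p10, p50)
```

Statistical software often uses different interpolation rules, so results from a library routine may differ slightly from this hand method.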
## Quartiles
![[image-977.png]]
![[image-978.png]]
$IQR=Q_3-Q_1=87-56=31$
## Boxplots
A boxplot (or box-and-whisker diagram) is a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile, Q1, the median, and the third quartile, Q3.
- Find the 5-number summary.
- Construct a scale with values that include the minimum and maximum data values.
- Construct a box (rectangle) extending from Q1 to Q3 and draw a line in the box at the value of Q2 (median).
- Draw lines extending outward from the box to the minimum and maximum values.
![[image-980.png]]
### Example
Draw a box-and-whisker plot for the following data set:
4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4
1. Order the set:
3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1
2. Find the median of the entire set. Since there are seventeen values in this list, we need the ninth value: The median is Q2 = 4.4
3. Next, we need the medians of the two halves. Since we used the "4.4" in the middle of the list, we can't re-use it, so the two remaining data sets are:
3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4 and 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1
4. The first half has eight values, so the median is the average of the middle two:
Q1 = (4.3 + 4.3)/2 = 4.3.
The median of the second half is Q3 = (4.7 + 4.8)/2 = 4.75.
![[image-987.png]]
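The five-number summary for this example can be reproduced with a short sketch. It follows the same convention as the steps above: when the number of values is odd, the median is left out of both halves. The function name is hypothetical.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value so it is not reused in either half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0],
            statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper),
            ordered[-1])


data = [4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1,
        4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4]
minimum, q1, q2, q3, maximum = five_number_summary(data)
print(minimum, q1, q2, q3, maximum)
```

Other quartile conventions exist (some include the median in both halves, others interpolate), so library routines may return slightly different values for Q1 and Q3.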
## Outliers
Values that are distant from the majority of the data.
They can have a greater effect on the mean than on the median.
How do we decide whether a value is an outlier? One way is:
- Check for any data value that is smaller than $Q_1-1.5(IQR)$ or larger than $Q_3+1.5(IQR)$.
![[image-989.png]]
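The 1.5(IQR) check can be sketched as follows. The function names are hypothetical. The data and quartiles are taken from the boxplot example, and the extra value 6.2 is an invented data point added only to show a flagged outlier, with the quartiles held at their earlier values for simplicity.

```python
def outlier_fences(q1, q3):
    """Lower and upper cut-offs: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """Values lying below the lower fence or above the upper fence."""
    low, high = outlier_fences(q1, q3)
    return [x for x in data if x < low or x > high]


data = [3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4,
        4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1]

# Quartiles from the boxplot example: fences are about 3.625 and 5.425.
no_outliers = find_outliers(data, q1=4.3, q3=4.75)

# Hypothetical extra value far above the upper fence.
with_outlier = find_outliers(data + [6.2], q1=4.3, q3=4.75)
print(no_outliers, with_outlier)
```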
### Rules
- There are no hard-and-fast rules on what to do with outliers, nor is there complete agreement among statisticians on ways to identify them.
- If they occurred as a result of an error, an attempt should be made to correct the error; failing that, the data value should be omitted entirely.
- When they occur naturally by chance, the statistician must make a decision about whether to include them in the data set.