WEBVTT
Kind: captions
Language: en
00:00:00.060 --> 00:00:06.180
We will next cover a couple of basic statistical
concepts that are related to data. So descriptive
00:00:06.180 --> 00:00:13.140
statistics are numbers that we
calculate from our data. They are like summaries
00:00:13.140 --> 00:00:19.860
of our data. To understand this basic concept
is important, because quite often our more
00:00:19.860 --> 00:00:27.000
complicated models try to explain differences in
means, or try to explain variation, or assume something
00:00:27.000 --> 00:00:33.570
about the variance and so on. So to understand
what these more complicated things do we have to
00:00:33.570 --> 00:00:40.800
understand the basics. And this is partly high
school mathematics but it's useful to revise it
00:00:40.800 --> 00:00:47.490
now before we go into more complicated things.
The first important thing to know is the
00:00:48.690 --> 00:00:55.770
concept of central tendency and dispersion.
So these are data about three thousand one
00:00:55.770 --> 00:01:02.700
hundred seventy one working age males from the
United States. We have data on their heights.
00:01:02.700 --> 00:01:10.950
This shows the distribution of the heights, so we
have some people that are very short, we have some
00:01:10.950 --> 00:01:18.090
people that are very tall, and most people fall
into these bins somewhere in the middle. So each
00:01:18.090 --> 00:01:26.670
bar here represents a group of people: how many
people fall into a category, for example 170
00:01:26.670 --> 00:01:32.910
to 175 centimeters of height. So the bar represents
the number of people. And this is a histogram,
00:01:32.910 --> 00:01:40.860
it presents how these heights are distributed.
Then we have the kernel density plot of
00:01:40.860 --> 00:01:47.430
the same data. The kernel density plot shows us
the distribution in another way. So we just have
00:01:47.430 --> 00:01:53.940
a line, and this is the probability density
function, that's what it's called formally,
00:01:53.940 --> 00:02:00.180
and the height here tells us the relative
probability of observing a person here,
00:02:00.180 --> 00:02:08.250
versus a person here for example. The area under
the curve is always one, so if the scale
00:02:08.250 --> 00:02:17.430
here is in the tens, then the scale here
must be 0.0-something. So that shows us that
00:02:17.430 --> 00:02:22.290
most people are in the middle,
and then there are a few short people and
00:02:22.290 --> 00:02:31.140
a few tall people. Now the concept of central
tendency tells us where this distribution
00:02:31.140 --> 00:02:41.580
is located. So are the people roughly distributed
around 175 centimeters or are they perhaps about
00:02:41.580 --> 00:02:51.300
160 centimeters or 180 centimeters. So it tells
us what the location is, which is another commonly
00:02:51.300 --> 00:03:00.030
used term for where this distribution
is actually located on the axis. We have two measures
00:03:00.030 --> 00:03:05.940
of central tendency that are the most important.
The mean is the most commonly used. So the mean is
00:03:05.940 --> 00:03:11.760
just the average: you take the sum
of all these people's heights and you divide
00:03:11.760 --> 00:03:19.560
it by the number of people. Then we have the
median, which is the height of a typical person.
00:03:19.560 --> 00:03:25.350
The median is calculated by putting these people
in a line, so that the shortest person is in front
00:03:25.350 --> 00:03:31.560
and then the tallest person is in the back, and
everyone is ordered based on their height,
00:03:31.560 --> 00:03:36.630
and then you take the person who is right in the
middle. So it's the middlemost observation's value.
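The two measures just described can be sketched in a few lines of Python. This is only an illustration: the `heights` values are made up, not the lecture's data, and the helper names `mean` and `median` are chosen here for readability.

```python
# Hypothetical heights in centimeters (illustrative values, not the lecture's data).
heights = [162.0, 168.5, 171.0, 174.0, 175.5, 179.0, 190.0]


def mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)


def median(values):
    """Value of the middle observation after ordering from shortest to tallest.

    With an even number of observations there is no single middle one,
    so the two middle values are averaged (a common convention).
    """
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print("mean:", mean(heights))
print("median:", median(heights))
```

Python's standard library offers `statistics.mean` and `statistics.median` for the same purpose; the hand-written versions are shown only to make the two definitions visible.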
00:03:36.630 --> 00:03:45.780
The median is a useful statistic for quantifying
what a typical person is like in the population,
00:03:45.780 --> 00:03:52.890
because it's not sensitive to some people that
are very tall or very short. For example if we
00:03:52.890 --> 00:03:58.440
had a person here that was 1 million centimeters
tall, which of course is impossible, then the mean
00:03:58.440 --> 00:04:06.000
would be affected but the median wouldn't. So the mean
and median tell us what a typical person
00:04:06.000 --> 00:04:13.500
or typical company, or whatever you're studying, is
like. The other important concept is dispersion.
00:04:13.500 --> 00:04:21.060
So dispersion tells us how wide this distribution
is: is everyone about the same size, is
00:04:21.060 --> 00:04:30.060
everyone between 174 and 176 centimeters, or are
people between 150 centimeters and 2 meters. So
00:04:30.060 --> 00:04:37.380
dispersion tells how widely these persons
are spread out. The most commonly used
00:04:37.380 --> 00:04:43.230
measure of dispersion is the standard deviation.
I'm not going to present you the definition,
00:04:43.230 --> 00:04:49.770
but it's important to know that
plus or minus one standard
00:04:49.770 --> 00:04:57.180
deviation covers about two thirds of the data,
and plus or minus two standard deviations
00:04:57.180 --> 00:05:05.250
cover about 95% of the data. So these green lines
show plus and minus two standard deviations, and about 95% of
00:05:05.250 --> 00:05:14.700
the people fit into this area. So if the standard
deviation was larger,
00:05:15.600 --> 00:05:21.570
here it's about 7.4 centimeters. If
it was 10 centimeters, it would mean that
00:05:21.570 --> 00:05:30.000
plus two standard deviations would be a bit less
than 2 meters, and minus two standard deviations
00:05:30.000 --> 00:05:36.360
would be a bit more than 150 centimeters. So
it would tell us that people's heights vary
00:05:36.360 --> 00:05:43.860
more. So standard deviation tells how much the
observations vary. There is this joke about
00:05:43.860 --> 00:05:52.440
why standard deviation is important. There are two
statisticians: one is 150 centimeters tall, one
00:05:52.440 --> 00:06:00.540
is 160 centimeters tall, and they are crossing a
river that has a mean depth of 120 centimeters,
00:06:00.540 --> 00:06:07.800
and they're debating whether they should cross
or not. They decide not to because the mean
00:06:07.800 --> 00:06:13.380
doesn't tell what the deepest part is.
So we have to understand also how much the
00:06:13.380 --> 00:06:18.720
depth of the river varies instead of
just knowing what is the average depth of
00:06:18.720 --> 00:06:23.310
the river. So standard deviation tells us how
much variation there is in the observations.
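The rule of thumb about one and two standard deviations can be checked with a small simulation. The sketch below assumes roughly normally distributed heights, which is when the "two thirds" and "95%" figures apply. The data are simulated rather than the lecture's; the mean of 175 cm and standard deviation of 7.4 cm are taken from the talk, and `share_within` is a helper name invented for this example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated heights (an assumption for illustration): normal distribution,
# mean 175 cm, standard deviation 7.4 cm, 3171 observations.
heights = [random.gauss(175.0, 7.4) for _ in range(3171)]

mean = statistics.fmean(heights)
sd = statistics.pstdev(heights)  # population standard deviation


def share_within(values, center, spread, k):
    """Fraction of values that lie within k * spread of center."""
    inside = [v for v in values if abs(v - center) <= k * spread]
    return len(inside) / len(values)


within_1sd = share_within(heights, mean, sd, 1)
within_2sd = share_within(heights, mean, sd, 2)

print(f"mean = {mean:.1f} cm, standard deviation = {sd:.1f} cm")
print(f"share within plus or minus 1 SD: {within_1sd:.2f}")
print(f"share within plus or minus 2 SD: {within_2sd:.2f}")
```

For data that are strongly skewed or have heavy tails, these shares can differ noticeably from two thirds and 95%.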
00:06:23.310 --> 00:06:31.260
Then there's the concept of standardization
that is also important. Standardization can
00:06:31.260 --> 00:06:38.460
be useful and it can be harmful depending on the
context but it's important to understand why we
00:06:38.460 --> 00:06:44.400
standardize and when. For example correlation,
which I have mentioned before, is a standardized
00:06:44.400 --> 00:06:50.670
measure, so it applies standardization. The idea of
standardization is that you take the observations
00:06:50.670 --> 00:06:58.470
which are distributed like that, and the mean is
at about 175, and the standard deviation is
00:06:58.470 --> 00:07:06.300
about seven centimeters. You subtract the mean
from every observation, and you divide by the
00:07:06.300 --> 00:07:12.900
standard deviation. That gives you a new variable
that has a mean of zero, and standard deviation
00:07:12.900 --> 00:07:22.290
of exactly one. So we are basically throwing away
the data about the location and dispersion,
00:07:22.290 --> 00:07:29.190
and we are just retaining the data on where
each individual is located relative to the other
00:07:29.190 --> 00:07:35.250
individuals, and we also retain the overall
shape of the distribution. This can sometimes
00:07:35.250 --> 00:07:47.430
make things easier to interpret. For example if I
say that I'm 176 centimeters tall, it may tell
00:07:47.430 --> 00:07:52.110
you something about my height if you know what
the heights in the rest of the population are, but
00:07:52.110 --> 00:08:00.690
if I say that my height is at the mean,
then everyone understands that typical Finnish
00:08:00.690 --> 00:08:07.140
males are about 50% of the time taller
than me and 50% of the time shorter than
00:08:07.140 --> 00:08:12.870
me, so I'm of average height. So standardization can
make things easier to interpret, but it can also
00:08:12.870 --> 00:08:18.510
make things harder to interpret depending on the
context. So standardization destroys information
00:08:18.510 --> 00:08:25.290
by eliminating the information about the
central tendency, or the location, and
00:08:25.290 --> 00:08:31.620
the dispersion from the data. Then there's also
variance which is another measure of dispersion,
00:08:31.620 --> 00:08:39.540
and variance is related to standard deviation. It's
used because it's more convenient for some
00:08:39.540 --> 00:08:45.510
computations and sometimes variance is easier to
interpret. For example in regression analysis we
00:08:45.510 --> 00:08:52.080
assess how much of the variance of the dependent variable
the model explains. We don't
00:08:52.080 --> 00:08:58.740
do that in the standard deviation metric, we do it in
the variance metric. So the standard deviation has the same
00:08:58.740 --> 00:09:03.900
unit as the original variable. So if standard
deviation is seven, then we know that these
00:09:03.900 --> 00:09:12.660
bars are 7 centimeters from the mean, and if we
multiply this variable by 2 then the standard
00:09:12.660 --> 00:09:19.890
deviation doubles, so that's convenient.
Then variance measures the same thing,
00:09:19.890 --> 00:09:25.020
it measures dispersion as well but on a different
metric, and variance is defined as the mean of
00:09:25.020 --> 00:09:29.640
squared differences from the mean. So we take each
observation, we subtract the mean, and we take the
00:09:29.640 --> 00:09:36.210
square, or raise it to the second power, and then
we take the mean of those squares. That gives us
00:09:36.210 --> 00:09:43.320
the variance. Variance and standard deviation
are related so that the standard deviation of
00:09:43.320 --> 00:09:48.300
the data is the square root of the variance, and
variance is the square of the standard deviation.
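The definition just given, the mean of squared differences from the mean, translates almost word for word into code. The sketch below uses made-up heights and divides by the number of observations, as in the spoken definition; many software packages divide by one less than that for sample data, so their results can differ slightly. The function names are chosen here for illustration.

```python
import math

# Hypothetical heights in centimeters (illustrative values only).
heights = [165.0, 170.0, 175.0, 180.0, 185.0]


def variance(values):
    """Mean of the squared differences from the mean (divides by n)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def standard_deviation(values):
    """Square root of the variance, in the same unit as the data."""
    return math.sqrt(variance(values))


doubled = [2 * v for v in heights]

print("variance:", variance(heights))                      # in squared centimeters
print("standard deviation:", standard_deviation(heights))  # in centimeters
print("standard deviation after doubling:", standard_deviation(doubled))
print("variance after doubling:", variance(doubled))
```

Multiplying the variable by 2 doubles the standard deviation but quadruples the variance, which is one reason the standard deviation is the easier of the two to read in the original unit.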
00:09:48.300 --> 00:09:58.260
We work with either. Typically, if you just want
to interpret how a variable is distributed, we
00:09:58.260 --> 00:10:03.480
look at the standard deviation because it is in
a metric that is easier to understand. So if the standard
00:10:03.480 --> 00:10:11.460
deviation is 7 centimeters, we can immediately
say that 60% or something
00:10:11.460 --> 00:10:18.840
of the people are between 170 and 185. So
that's how standard deviations are interpreted.
00:10:18.840 --> 00:10:24.570
The variance is 54.79, so that doesn't really
tell us where people are located,
00:10:24.570 --> 00:10:30.780
but variance is useful for some other purposes
and particularly in more complicated models we
00:10:30.780 --> 00:10:39.030
use variances. Sometimes you report both so that's
possible as well. The concept of variance is
00:10:39.030 --> 00:10:46.650
important to understand the concept of covariance.
So the idea was that the variance is the mean of
00:10:46.650 --> 00:10:53.040
the differences of each observation from the mean,
raised to the second power. So it's the
00:10:53.040 --> 00:10:59.910
same as the difference from the mean multiplied
by the difference from the mean. Then we have another
00:10:59.910 --> 00:11:09.550
statistic called covariance. Here we have data
on height and weight. Their
00:11:09.550 --> 00:11:15.280
covariance tells us how strongly a person's height
is related to the person's weight. So we can see
00:11:15.280 --> 00:11:20.500
here that those people who are
taller tend to also be heavier, so there's a
00:11:20.500 --> 00:11:28.120
covariance here. The covariance measures how much
two variables vary together and it's defined
00:11:28.120 --> 00:11:36.310
similarly to variance, except that you don't
multiply one variable with itself. Instead you
00:11:36.310 --> 00:11:43.510
multiply one variable with another and you take
a mean of that. Then the concept of correlation,
00:11:43.510 --> 00:11:50.860
which many of you probably know, is just the
covariance between standardized variables and
00:11:50.860 --> 00:11:58.780
correlation varies between minus 1 and plus 1. So
correlation is a standardized measure
00:11:58.780 --> 00:12:04.660
of linear association. When correlation is 1 then
you know that two things are perfectly related,
00:12:04.660 --> 00:12:11.020
when it's minus 1 you know that two things are
perfectly negatively related. When it's zero
00:12:11.020 --> 00:12:19.480
then they are linearly unrelated. So correlation
is a measure of linear association. That means
00:12:19.480 --> 00:12:27.340
that it measures how strongly observations are
clustered on a line. So these are scatter plots of
00:12:27.340 --> 00:12:36.730
two variables: a correlation of one is a line. At 0.8 the
observations are very closely clustered on the
00:12:36.730 --> 00:12:44.350
line, then 0.4 is something that we observe
with the plain eye. Zero means that there is no
00:12:44.350 --> 00:12:50.620
linear relationship, and then negative
correlations mean that when the value
00:12:50.620 --> 00:12:56.980
of one variable increases, the other one
decreases. So that's the same except the
00:12:56.980 --> 00:13:04.060
direction is opposite. The correlation doesn't
tell us what the magnitude of the change is,
00:13:04.060 --> 00:13:12.130
so we can say that this is a correlation of
1: there is a huge effect of the X variable on the
00:13:12.130 --> 00:13:17.560
Y variable. This is a correlation of 1 as well, but
there is a small effect of the X variable on the Y
00:13:17.560 --> 00:13:23.650
variable. So the Y variable here doesn't
increase as strongly with the X variable as here,
00:13:23.650 --> 00:13:29.950
so correlation doesn't tell us about the
magnitude of the effect. It just tells us
00:13:29.950 --> 00:13:35.830
how strong the association is. And this is zero
correlation, because the Y variable doesn't vary,
00:13:35.830 --> 00:13:42.760
and then we have the negative correlations
here. Importantly correlation is a measure
00:13:42.760 --> 00:13:49.600
of linear association, so here we have two
variables that are clearly associated. So
00:13:49.600 --> 00:13:56.170
there's a clear pattern but it's nonlinear. Here
is another pattern that's nonlinear, and
00:13:56.170 --> 00:14:03.820
this is a weak positive correlation and this
is a clear association but it's nonlinear. So
00:14:03.820 --> 00:14:11.560
correlation only tells us if we can describe the
data with a line. There could be some other kind
00:14:11.560 --> 00:14:18.460
of relationship as well. So saying that two
variables are uncorrelated doesn't mean that
00:14:18.460 --> 00:14:26.800
they are not related statistically, it just means
that the relationship cannot be expressed as a line.
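The points about covariance, correlation, and nonlinear relationships can be tried out with a short Python sketch. It follows the spoken definitions (covariance as a mean of products of differences from the means, correlation as the covariance of standardized variables). All data values and function names below are invented for illustration.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Mean of (x - mean of x) * (y - mean of y) over paired observations."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    m = mean(values)
    sd = math.sqrt(covariance(values, values))  # variance = covariance with itself
    return [(v - m) / sd for v in values]


def correlation(xs, ys):
    """Covariance between the standardized versions of the two variables."""
    return covariance(standardize(xs), standardize(ys))


# Made-up data: the same x paired with four different y variables.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
steep = [10.0 * v for v in x]          # large change in y per unit of x
shallow = [0.1 * v + 5.0 for v in x]   # small change in y per unit of x
falling = [-2.0 * v for v in x]        # y decreases as x increases
curved = [v ** 2 for v in x]           # clear but nonlinear pattern

print("steep line:  ", correlation(x, steep))
print("shallow line:", correlation(x, shallow))
print("falling line:", correlation(x, falling))
print("curve:       ", correlation(x, curved))
```

Both the steep and the shallow line give a correlation of about 1, so the correlation says nothing about the size of the effect. The curved pattern gives a correlation of about zero even though y is fully determined by x.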