WEBVTT 00:00:01.600 --> 00:00:04.720 In social science research we typically want to make causal claims. 00:00:05.600 --> 00:00:10.640 What makes causal claims challenging is that we need to eliminate rival explanations. 00:00:10.640 --> 00:00:17.360 We can't simply say that because x and y are correlated, x must be the cause of y. 00:00:18.240 --> 00:00:24.080 The gold standard for eliminating rival explanations for a correlation between x and y 00:00:24.080 --> 00:00:27.200 is the experiment. In experimental research 00:00:27.200 --> 00:00:31.920 we randomize our sample into two groups: the treatment group and the control group. 00:00:31.920 --> 00:00:35.200 The treatment group receives x; the control group does not. 00:00:36.080 --> 00:00:37.680 Then we observe the outcome y 00:00:38.240 --> 00:00:43.200 after an appropriate time delay. That allows us to make clean causal claims. 00:00:43.760 --> 00:00:48.400 Unfortunately, in most cases we can't study things experimentally. 00:00:48.960 --> 00:00:53.840 For example, if we want to study the effect of early internationalization on 00:00:53.840 --> 00:01:00.480 firm survival, we can't just take firms and tell them, at random, to internationalize early. 00:01:01.120 --> 00:01:03.600 We would have to buy all the firms, and that's impractical. 00:01:04.560 --> 00:01:09.280 In practice, most of the time we have to work with what we observe. 00:01:09.280 --> 00:01:14.640 We work with observational data, and we have to use statistical techniques 00:01:14.640 --> 00:01:19.280 to control for alternative explanations. One of the most commonly used statistical 00:01:19.280 --> 00:01:24.080 techniques is regression analysis. Different variations of regression analysis 00:01:24.640 --> 00:01:31.280 cover roughly 90% of what is published in social science journals. 
00:01:31.920 --> 00:01:36.000 Let's take a look at how statistical controlling for alternative explanations 00:01:36.000 --> 00:01:38.800 and regression analysis work, and what the key ideas are. 00:01:39.600 --> 00:01:46.480 The idea of controlling is that when we observe a correlation between x and y, 00:01:46.480 --> 00:01:51.920 let's say we observe a correlation between CEO gender and the profitability of a company, 00:01:52.640 --> 00:02:00.560 we should rule out the potential spuriousness of that correlation. Correlations can exist 00:02:00.560 --> 00:02:04.000 for multiple different reasons. Here Singleton & Straits 00:02:04.560 --> 00:02:07.840 give two examples. I like the example of firefighters. 00:02:08.400 --> 00:02:12.800 The number of firefighters at a fire scene is highly correlated 00:02:12.800 --> 00:02:15.360 with the amount of fire damage after the fire. 00:02:16.000 --> 00:02:22.160 Can we say that increasing the number of firefighters at the scene causes more damage? 00:02:22.880 --> 00:02:27.200 Probably not. There is a third variable: the size of the fire. 00:02:27.200 --> 00:02:31.200 It is the cause of the number of firefighters, and it's also the cause of the damage. 00:02:31.760 --> 00:02:35.520 If there's a big fire, then more firefighters are sent in. 00:02:35.520 --> 00:02:39.440 If there's a big fire, then there will be more damage. 00:02:39.440 --> 00:02:44.480 So the size of the fire creates a spurious correlation between the number of firefighters 00:02:44.480 --> 00:02:47.360 and the fire damage. With statistical controlling 00:02:48.160 --> 00:02:56.400 we try to eliminate these spurious correlations, or we try to recover the causal effect from an 00:02:56.400 --> 00:03:00.320 observed correlation that is partly causal, partly spurious. 00:03:00.960 --> 00:03:06.000 How do we do that? Let's take a look at a Talouselämä 500 example. 
00:03:06.000 --> 00:03:13.920 So this is an interesting finding: in 2005 the ROA of women-led companies among 00:03:13.920 --> 00:03:19.680 the largest 500 Finnish companies was 4.7 percentage points higher than that of men-led companies. 00:03:20.240 --> 00:03:25.040 Let's assume that we have already ruled out chance as an explanation. 00:03:25.040 --> 00:03:32.240 We want to understand whether it's actually the women CEOs that cause this profitability difference. 00:03:32.240 --> 00:03:37.360 How do we deal with that? We see from our observed data that there 00:03:37.360 --> 00:03:43.120 is an overlap between CEO gender and performance. The variables are correlated: if we draw a bar 00:03:43.120 --> 00:03:49.600 chart, it shows that female-led companies' ROA is 18.5 while male-led companies' is 14.1. 00:03:49.600 --> 00:03:55.520 These are just made-up numbers, but the difference is in roughly the same ballpark as the 4.7. 00:03:56.400 --> 00:04:00.480 How do we deal with these alternative possible explanations? 00:04:00.480 --> 00:04:05.680 Well, first of all, we need to have some kind of theory of why 00:04:05.680 --> 00:04:11.600 there would be a correlation that is not causal. We could, for example, say that there's a third 00:04:11.600 --> 00:04:15.200 variable, company size, that influences the correlation. 00:04:15.920 --> 00:04:22.640 We could say that small companies are more likely to have women CEOs 00:04:22.640 --> 00:04:27.680 and small companies are more profitable. Therefore company size causes a spurious 00:04:27.680 --> 00:04:33.920 correlation between gender and performance. Now, with statistical adjustments 00:04:33.920 --> 00:04:39.440 or statistical techniques, we want to understand how much of this correlation, marked here as 00:04:39.440 --> 00:04:45.840 a1, is due to the spurious part c and how much is the causal part a2. 
00:04:46.800 --> 00:04:51.520 This assumes that company size is the only factor causing a spurious correlation. 00:04:52.320 --> 00:04:58.080 The simplest possible strategy for eliminating the rival explanation of size 00:04:58.720 --> 00:05:02.400 is to make the companies more comparable by matching. 00:05:02.960 --> 00:05:08.400 So let's assume that most of the women-led companies have fewer than 250 employees. 00:05:09.520 --> 00:05:16.320 If men-led companies tend to be larger, then comparing large men-led companies and small 00:05:16.320 --> 00:05:19.760 women-led companies is not a fair comparison. We don't know whether it's the 00:05:19.760 --> 00:05:23.760 gender effect or the size effect. What we can do is matching, 00:05:23.760 --> 00:05:27.680 or analyzing a subsample. We could, for example, analyze 00:05:27.680 --> 00:05:34.000 a subsample of companies with 250 employees or fewer and compare. 00:05:34.880 --> 00:05:39.040 With this subsample we will find that men-led companies are still a bit less 00:05:39.040 --> 00:05:43.840 profitable than women-led companies, but the difference is not as great as before. 00:05:44.720 --> 00:05:52.960 Based on this kind of analysis, we could say that yes, there's a correlation between CEO gender 00:05:52.960 --> 00:06:00.080 and profitability, but it is mostly explained by size differences, with this artificial data. 00:06:01.120 --> 00:06:05.200 This strategy is very simple to understand and it is very simple to 00:06:05.200 --> 00:06:10.720 apply, but it's also fairly limited. It is limited because we typically 00:06:10.720 --> 00:06:18.080 have, let's say, five different explanations for the correlation. 00:06:18.080 --> 00:06:23.360 We have, let's say, an industry effect, we have past performance, we have a size effect, 00:06:23.920 --> 00:06:28.800 and we could have other effects as well. 
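The subsample comparison described above can be sketched in a few lines. This is a minimal simulation, not the actual Talouselämä 500 data: the firm sizes, the share of women CEOs, and the small direct gender effect are all invented assumptions.

```python
import numpy as np

# Simulated data, NOT the real Talouselämä 500 figures: size drives both
# ROA and the chance of having a woman CEO, plus a small true gender effect.
rng = np.random.default_rng(2)
n = 20_000
employees = rng.integers(10, 5000, n)
female = rng.random(n) < np.where(employees < 250, 0.4, 0.1)
roa = 20 - 0.002 * employees + 0.5 * female + rng.normal(0, 2, n)

# Raw comparison vs. comparison within the under-250-employee subsample
small = employees < 250
gap_all = roa[female].mean() - roa[~female].mean()
gap_small = roa[small & female].mean() - roa[small & ~female].mean()
print(gap_all, gap_small)  # the gap shrinks once size is held roughly constant
```

In the full sample the women-led firms look considerably more profitable, largely because they tend to be smaller; within the under-250 subsample only the small assumed direct effect remains.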
If we try to match on many different variables, 00:06:28.800 --> 00:06:33.040 then the problem is that we don't really find any matches anymore. 00:06:33.040 --> 00:06:38.800 If we want to find two sets of companies that are equal on five different characteristics, 00:06:39.680 --> 00:06:43.680 our sample will just run out. So in practice matching works 00:06:43.680 --> 00:06:48.800 for simple problems; for more complicated problems we use statistical modeling. 00:06:49.920 --> 00:06:54.320 This would be an example of a regression model. We would say that return on assets is some 00:06:54.320 --> 00:07:00.080 function of CEO gender plus company size. We would code CEO gender as 00:07:00.080 --> 00:07:06.160 1 being female and 0 being male, and then we tell our computer to estimate this model for us. 00:07:06.160 --> 00:07:10.720 The computer gives us estimates of these betas, called regression coefficients, 00:07:10.720 --> 00:07:13.360 and then we interpret them. I'll talk about regression 00:07:13.360 --> 00:07:17.360 a bit more later in this video. These are the two main strategies 00:07:17.360 --> 00:07:24.240 for statistical controlling. Either we try to make the samples more comparable by matching, 00:07:24.240 --> 00:07:29.840 or we build some kind of statistical model that adjusts for the differences. 00:07:31.680 --> 00:07:33.680 To do so we need to have control variables. 00:07:34.720 --> 00:07:38.880 Control variables are the alternative explanations for the correlation. 00:07:39.440 --> 00:07:46.080 If we see a correlation between CEO gender and profitability, and we want to make the claim 00:07:46.080 --> 00:07:52.960 that CEO gender actually influences profitability, that the difference in profitability is due to some 00:07:52.960 --> 00:07:56.400 of those companies having female CEOs and some having male CEOs, 00:07:57.200 --> 00:08:04.320 we need to consider what a skeptic would claim. 
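A regression model like the one just described, ROA as a function of CEO gender (1 = female, 0 = male) plus company size, can be estimated by ordinary least squares. Here is a minimal sketch with invented data; the coefficient values and the size–gender relationship are assumptions for illustration.

```python
import numpy as np

# Invented observational data: ROA = f(CEO gender, company size) + noise.
rng = np.random.default_rng(0)
n = 500
size = rng.uniform(50, 5000, n)                    # employees
p_female = np.clip(1000 / size, 0.05, 0.9)         # small firms more often women-led (assumption)
gender = (rng.random(n) < p_female).astype(float)  # 1 = female CEO, 0 = male CEO
roa = 15 - 0.002 * size + 1.0 * gender + rng.normal(0, 2, n)

# Design matrix [intercept, gender, size]; the betas are the regression coefficients
X = np.column_stack([np.ones(n), gender, size])
betas, *_ = np.linalg.lstsq(X, roa, rcond=None)
print(betas)  # [beta0, beta_gender, beta_size]; beta_gender is the size-adjusted gender effect
```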
So if someone does not buy our claim that naming 00:08:04.320 --> 00:08:11.040 a woman as CEO causes profitability to increase, that skeptic needs to develop a counterargument. 00:08:11.040 --> 00:08:15.760 For example: smaller companies are more profitable and more likely to have women CEOs. 00:08:16.320 --> 00:08:21.680 Companies in asset-heavy industries are more male-dominated, and they also have less return on assets 00:08:21.680 --> 00:08:25.680 because the return is divided by larger assets. And so on. 00:08:25.680 --> 00:08:31.760 We need to consider what the alternative explanations for the correlation are. 00:08:31.760 --> 00:08:35.440 Those alternative explanations will be our control variables. 00:08:36.160 --> 00:08:41.360 Quite often when you read an empirical study you see this kind of section about controls. 00:08:41.360 --> 00:08:45.680 This paper by Heckmann explains the controls quite well. 00:08:45.680 --> 00:08:49.760 They are saying that these are alternative explanations for the data, 00:08:49.760 --> 00:08:53.840 and then we rule out those alternative explanations using a statistical model. 00:08:54.640 --> 00:09:00.560 Importantly, an alternative explanation needs to be correlated with the explanatory variable. 00:09:00.560 --> 00:09:05.040 We say that company size correlates with CEO gender, 00:09:05.040 --> 00:09:10.160 and the control variable, company size, also needs to be a cause of the dependent variable. 00:09:11.040 --> 00:09:15.440 We say that it's not the woman CEO that causes the profitability 00:09:15.440 --> 00:09:20.320 difference; rather, it's the company size that causes the profitability difference. 00:09:20.320 --> 00:09:26.320 And company size is correlated with having a woman CEO, and that causes the spurious correlation. 00:09:27.360 --> 00:09:31.200 Let's take a look at an example of how this works. This is from Heckmann's paper again. 
00:09:31.200 --> 00:09:35.200 We have the correlations and we have the regression coefficients. 00:09:35.200 --> 00:09:41.680 Regression coefficients tell us what the effect is after controlling for the others. 00:09:41.680 --> 00:09:44.480 I'll talk a bit more about regression later in the video, 00:09:44.480 --> 00:09:47.520 but let's try to understand the idea of statistical controlling. 00:09:47.520 --> 00:09:53.360 Here we have a correlation between patient satisfaction and the age of a physician. 00:09:54.080 --> 00:09:58.880 That correlation is 0.09. It's not statistically significant, 00:09:58.880 --> 00:10:02.000 but we don't care about that now. It's a positive correlation. 00:10:02.720 --> 00:10:06.480 Then regression analysis tells us that there is a negative 00:10:06.480 --> 00:10:09.680 causal effect of age. How come? 00:10:11.360 --> 00:10:14.800 Well, the reason for this correlation is that there's 00:10:14.800 --> 00:10:21.760 actually a spurious correlation in the data. If we look at just tenure and age, they are 00:10:21.760 --> 00:10:30.080 highly correlated, at 0.69. Tenure is how long a person has been employed at this medical company, 00:10:30.080 --> 00:10:34.000 and age is the age of the person. Obviously, if you are 00:10:34.960 --> 00:10:41.040 in your late 20s and just graduated from medical school, you can't have much tenure. 00:10:41.040 --> 00:10:46.080 If you are closer to retirement, then you typically have a long tenure in the place where you work. 00:10:46.960 --> 00:10:50.640 Tenure and age are highly correlated for that reason. 00:10:50.640 --> 00:10:53.840 Older people tend to be more experienced than younger people. 00:10:54.640 --> 00:10:57.760 We can also see that tenure, or experience, 00:10:57.760 --> 00:11:02.000 strongly affects the dependent variable, customer satisfaction. 00:11:04.320 --> 00:11:07.680 Now we have a situation where there's a spurious correlation. 
00:11:08.320 --> 00:11:16.400 Older people are more experienced, and experience causes customer satisfaction scores to be higher. 00:11:17.760 --> 00:11:22.320 Age has a negative causal effect on customer satisfaction. 00:11:23.120 --> 00:11:27.200 How do we interpret that? We interpret it so that if two people 00:11:27.200 --> 00:11:31.840 have the same amount of experience, then the patients prefer the younger one. 00:11:32.480 --> 00:11:40.000 If two people are of the same age, patients strongly prefer the one with more experience. 00:11:40.800 --> 00:11:44.640 We can calculate the value of the spurious correlation from this diagram 00:11:44.640 --> 00:11:50.640 by simply multiplying the regression coefficient 0.33 and the correlation 0.69. 00:11:50.640 --> 00:11:56.000 We can see that there is a 0.23 spurious correlation between age 00:11:56.000 --> 00:12:03.200 and customer satisfaction due to tenure. And we can do this for all other variables. 00:12:03.200 --> 00:12:08.960 If we simply take this 0.23, the spurious part of the correlation between 00:12:10.320 --> 00:12:14.960 age and customer satisfaction, together with the estimated causal effect, 00:12:14.960 --> 00:12:19.600 and take the sum, we get pretty close to the observed correlation of 0.09. 00:12:20.160 --> 00:12:23.680 Of course there are other variables that can also cause a spurious 00:12:23.680 --> 00:12:28.960 correlation, but this is the most important one. So the idea of statistical controlling is that 00:12:28.960 --> 00:12:38.320 we take the observed correlation and then decompose that observed correlation into two parts. 00:12:38.320 --> 00:12:43.200 We have the part that is spurious and the part that corresponds to the causal effect. 00:12:43.760 --> 00:12:49.680 If we can eliminate all the spurious parts from the estimated correlation, then we have a clean causal effect. 
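The decomposition arithmetic can be checked directly. The 0.33 and 0.69 are the figures quoted above; the direct effect of age, -0.14, is a back-calculated assumption chosen so that the parts sum to the observed 0.09, not a number taken from the paper.

```python
# Spurious part: tenure-to-satisfaction coefficient times the age-tenure correlation
tenure_coef = 0.33
age_tenure_corr = 0.69
spurious = tenure_coef * age_tenure_corr
print(round(spurious, 2))   # → 0.23

# Adding an assumed direct (causal) effect of age recovers the raw correlation
age_effect = -0.14          # assumption, back-calculated for illustration
print(round(age_effect + spurious, 2))   # → 0.09, the observed correlation
```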
00:12:51.200 --> 00:12:57.600 Regression and other techniques typically use the term holding constant. 00:12:58.320 --> 00:13:03.760 Control variables are held constant, and this is what Singleton & Straits explain. 00:13:03.760 --> 00:13:11.040 Holding constant can mean two different things. If we have matching, or if we have some kind 00:13:11.040 --> 00:13:15.760 of experimental study where we have some control over our sample, 00:13:15.760 --> 00:13:20.880 then holding constant can mean that if we want to eliminate, for example, gender differences from 00:13:20.880 --> 00:13:27.680 the analysis, we study only men or only women. Quite often this kind of holding constant, by 00:13:27.680 --> 00:13:33.840 actually making a variable the same in our sample, is not possible. 00:13:34.560 --> 00:13:41.600 If we want to eliminate the effects of company size, we cannot sample companies that all have 00:13:41.600 --> 00:13:45.040 exactly, for example, 100 employees. That would not be possible. 00:13:45.840 --> 00:13:50.960 Another way to understand holding constant is statistical control. 00:13:51.680 --> 00:13:55.120 In statistical controlling, as in regression analysis, 00:13:55.680 --> 00:14:00.320 the term holding constant means that we statistically estimate 00:14:00.320 --> 00:14:06.000 what the effect of one variable would be if all other variables were the same. 00:14:06.880 --> 00:14:12.880 The other variables are not actually exactly the same, but with statistical analysis we 00:14:12.880 --> 00:14:17.840 can answer the question: what would the difference be if everything else were the same? 00:14:18.720 --> 00:14:25.120 For example, what would the difference in ROA between women-led and men-led companies be 00:14:25.120 --> 00:14:29.840 if the women-led and men-led companies were the same size and in the same industry? 
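What holding constant by statistical analysis does can be illustrated with a tiny simulation. Everything here is invented, and by construction gender has no true effect: company size causes both CEO gender and ROA, so the naive comparison shows a spurious effect that disappears once size is held constant.

```python
import numpy as np

# Simulated confounding: company size drives both CEO gender and ROA.
# The true direct effect of gender on ROA is ZERO by construction.
rng = np.random.default_rng(1)
n = 5000
size = rng.normal(0, 1, n)                               # standardized company size
gender = (size + rng.normal(0, 1, n) < 0).astype(float)  # smaller firms -> women CEOs
roa = -2.0 * size + rng.normal(0, 1, n)                  # only size matters

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

naive = ols(np.column_stack([np.ones(n), gender]), roa)
held_constant = ols(np.column_stack([np.ones(n), gender, size]), roa)
print(naive[1])          # spurious positive "gender effect"
print(held_constant[1])  # near zero once size is held constant
```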
00:14:30.480 --> 00:14:37.200 Statistical controlling allows us to answer these kinds of questions without 00:14:37.200 --> 00:14:42.960 actually having to observe companies that are all of the same size and in the same industry. 00:14:42.960 --> 00:14:45.920 And this is why statistical controlling is very useful. 00:14:47.200 --> 00:14:53.680 Linear regression analysis, as I mentioned already, is the basic tool for statistical controlling. 00:14:54.480 --> 00:15:02.320 Many other tools are variants of this technique, but a basic understanding of this 00:15:02.320 --> 00:15:09.120 technique takes you a long way in understanding quantitative research and analysis results. 00:15:09.840 --> 00:15:13.680 The idea of linear regression, in a two-variable case, is this: 00:15:13.680 --> 00:15:16.960 we have the x variable here, the explanatory variable. 00:15:16.960 --> 00:15:20.080 This is the one that we would manipulate in an experiment. 00:15:20.080 --> 00:15:24.560 We have the y variable, the outcome. This is the thing that we would observe 00:15:24.560 --> 00:15:29.280 after the experiment, but this is actually observed data, so we don't manipulate x. 00:15:30.160 --> 00:15:35.440 What regression analysis does is find the best line that explains the data. 00:15:35.440 --> 00:15:38.960 It tries to find a line that explains the mean of 00:15:38.960 --> 00:15:48.240 y, the dependent variable, for each value of x. In math we write the regression equation like that. 00:15:48.240 --> 00:15:53.440 If we have multiple different explanatory variables, then the expected value, or the 00:15:53.440 --> 00:16:00.160 mean, of the y variable is assumed to be the sum of the effects of all these different variables. 00:16:01.120 --> 00:16:04.640 Of course this is a critical assumption for regression analysis; 00:16:05.280 --> 00:16:11.120 different effects could act multiplicatively. 
In that case we would 00:16:11.120 --> 00:16:15.120 need a different kind of model. But regression analysis is a good starting point. 00:16:15.120 --> 00:16:19.440 We assume that the CEO-gender effect, 00:16:19.440 --> 00:16:24.560 the industry effect, and the company size effect are all added together, and that gives us 00:16:24.560 --> 00:16:32.400 some kind of expected value for the performance. In our Heckmann paper example we can, for example, 00:16:32.400 --> 00:16:37.680 say that patient satisfaction is a function, a weighted sum, of physician productivity, 00:16:37.680 --> 00:16:43.120 physician quality, and physician accessibility. And the regression analysis tells us what 00:16:43.120 --> 00:16:48.720 the effect of increasing productivity would be if quality and accessibility stayed the same. 00:16:48.720 --> 00:16:54.240 So regression analysis can be used to answer these kinds of what-if questions. 00:16:54.240 --> 00:16:58.160 What if one variable changes and the others stay the same? 00:16:58.960 --> 00:17:07.760 What if we name a woman as the CEO of a company and the industry and size of the company stay the same? 00:17:07.760 --> 00:17:13.920 That gives us an estimate of the causal effect. Regression analysis is not a magic wand 00:17:13.920 --> 00:17:18.000 or a magical tool that always gives us valid estimates of causal relations. 00:17:18.880 --> 00:17:24.160 If you study regression analysis further, if you want to become a professional researcher, 00:17:24.160 --> 00:17:29.840 then you will need to read econometrics books that tell you the 00:17:29.840 --> 00:17:35.760 six assumptions needed for regression. But there are really two important things that 00:17:35.760 --> 00:17:40.720 you need to understand if you want to get started in understanding what quantitative research 00:17:40.720 --> 00:17:45.840 using regression is about. 
The first assumption is that all 00:17:45.840 --> 00:17:54.400 relevant controls are included in the model. If we regress company profitability, ROA, on 00:17:54.400 --> 00:18:01.520 CEO gender and company size, a skeptic comes and says: "No, the correlation 00:18:01.520 --> 00:18:08.160 that you have is because of the industry." If we don't include industry in our analysis, 00:18:08.160 --> 00:18:11.440 but industry is actually correlated with CEO gender 00:18:11.440 --> 00:18:16.320 and an important cause of company performance, then the regression analysis is not trustworthy. 00:18:16.880 --> 00:18:22.000 This applies to all statistical analysis and quantitative research. 00:18:22.800 --> 00:18:27.920 Having the right controls, the most important alternative explanations, 00:18:27.920 --> 00:18:35.520 included in the model is critically important. Technically this refers to the MLR.4 assumption. 00:18:36.160 --> 00:18:39.280 Another important assumption is that all relationships are linear. 00:18:39.920 --> 00:18:46.480 So when we increase x by one unit, the effect on y, 00:18:46.480 --> 00:18:51.120 the dependent variable, is always the same regardless of the current value of x. 00:18:52.480 --> 00:18:57.760 That can be true, or approximately true, in some cases. 00:18:57.760 --> 00:19:02.160 For example, the relationship between the size of a fire and the amount of damage 00:19:02.160 --> 00:19:06.880 that the fire causes could be linear. If the building is twice as large 00:19:07.600 --> 00:19:10.880 and the fire is twice as large, there could be twice as much damage. 00:19:11.920 --> 00:19:16.160 But it's not always true, for example if we consider the effects of education. 
00:19:17.440 --> 00:19:23.360 Having elementary school education, middle school education, the first 9 years, 00:19:23.360 --> 00:19:25.760 probably does not make as much of a difference 00:19:26.400 --> 00:19:32.800 as the final years of your university degree when you're working toward your master's degree. 00:19:33.440 --> 00:19:41.920 So in that case the returns on education probably follow an exponential curve 00:19:41.920 --> 00:19:47.280 more closely than just a straight line. Not all years are equal. 00:19:47.280 --> 00:19:50.880 So linearity is another important assumption of regression analysis. 00:19:50.880 --> 00:19:55.920 In practice, when a professional researcher applies these techniques, there are 00:19:55.920 --> 00:20:02.480 diagnostics that allow us to test for linearity. And there are also diagnostics for 00:20:02.480 --> 00:20:05.840 testing whether all relevant variables are included in the model, and so on. 00:20:06.480 --> 00:20:10.400 But just to understand what regression does and when it is useful, it is important 00:20:10.400 --> 00:20:17.520 to understand these two assumptions. Regression analysis and its different extensions, 00:20:18.560 --> 00:20:25.280 for example for the nonlinear case, or for the case where we have observations that are clustered, 00:20:25.280 --> 00:20:30.880 for example students within classes, cover probably at least 90% 00:20:30.880 --> 00:20:37.040 of the research that is done in the social sciences. If you understand regression analysis, then you 00:20:37.040 --> 00:20:40.400 have a pretty good foundation for understanding other things, 00:20:40.400 --> 00:20:44.960 because they are simply extensions and variants of this simple technique. 00:20:44.960 --> 00:20:50.160 So the idea of regression is that the dependent variable y is a weighted sum of 00:20:50.160 --> 00:20:55.840 the independent x variables. 
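The education example can be sketched with a toy exponential returns curve. The 8% return per year of education is an arbitrary assumption for illustration, not an estimate from any study.

```python
import numpy as np

# Toy data: earnings grow by a constant 8% per year of education (assumed),
# so the effect of one extra year on the raw scale is NOT constant.
years = np.arange(0, 21)
earnings = 1000 * 1.08 ** years

# A straight line on the raw scale misses the curvature, but a linear fit
# on log(earnings) recovers the constant proportional return per year.
slope, intercept = np.polyfit(years, np.log(earnings), 1)
print(np.exp(slope) - 1)   # ≈ 0.08, the assumed per-year return
```

This is also why applied work often models the logarithm of the outcome: on the log scale the linearity assumption holds again.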
Let's summarize. In statistical controlling the 00:20:55.840 --> 00:21:00.800 important thing is that we have control variables. Control variables are the alternative explanations for 00:21:00.800 --> 00:21:06.240 a correlation between the x variable, for example CEO gender, and the y variable, 00:21:06.240 --> 00:21:12.080 for example company performance. Controls are held constant either 00:21:12.080 --> 00:21:18.320 by choosing our sample in a way that studies, for example, only companies of the same size, 00:21:19.040 --> 00:21:23.040 or, more typically, they are held constant using statistical techniques. 00:21:23.040 --> 00:21:28.160 So we use statistical techniques to answer the question of what the difference would be if all 00:21:28.160 --> 00:21:33.520 the companies were of the same size, in the same industry, and so on for whatever control variables we have. 00:21:34.560 --> 00:21:39.520 Controls must be justified instead of just throwing in a standard set. 00:21:39.520 --> 00:21:44.080 So you need to think through why there is a correlation between my important x variable 00:21:44.080 --> 00:21:48.960 and my important y variable, and then pick controls to rule out those alternative explanations. 00:21:49.600 --> 00:21:52.080 In practice the tool that we apply 00:21:52.960 --> 00:21:59.680 to study these kinds of x-y effects, controlling for other variables, is regression analysis. 00:21:59.680 --> 00:22:04.000 This is the workhorse of causal analysis in non-experimental research. 00:22:04.000 --> 00:22:06.560 Regression analysis makes two important assumptions: 00:22:07.360 --> 00:22:11.440 all relevant controls are included, and the effect of x on y is always constant, so everything is linear. 00:22:11.440 --> 00:22:19.680 It's not the case that a little x makes no difference while, when you have a large amount of x, having more 00:22:19.680 --> 00:22:22.880 of it makes more of a difference. Or it could work the other way. 
00:22:22.880 --> 00:22:29.920 Initial small increments of x could make a big difference, and then when you get further along x, 00:22:29.920 --> 00:22:34.720 the increments make smaller differences. Regression analysis assumes that the effect of x 00:22:34.720 --> 00:22:40.560 on y is always constant. Many commonly used analysis techniques are simply variations 00:22:40.560 --> 00:22:43.120 of regression analysis. So if you understand 00:22:43.680 --> 00:22:50.160 the basics of when regression analysis works and how regression analysis results are interpreted, 00:22:50.160 --> 00:22:52.320 then you have a pretty solid foundation 00:22:52.320 --> 00:22:55.840 for understanding other, more complicated techniques.