WEBVTT 00:00:01.300 --> 00:00:02.330 - During the course, 00:00:02.330 --> 00:00:03.370 you will need to complete 00:00:03.370 --> 00:00:06.150 at least one data analysis assignment. 00:00:06.150 --> 00:00:09.260 And I thought it's a good idea to start the course 00:00:09.260 --> 00:00:11.130 by discussing a little bit 00:00:11.130 --> 00:00:13.440 about the different software choices that you make. 00:00:13.440 --> 00:00:16.230 So you must choose at least one of these software 00:00:16.230 --> 00:00:17.990 to use for the first assignment, 00:00:17.990 --> 00:00:21.400 you can of course use multiple software 00:00:21.400 --> 00:00:23.510 for different assignments if you want to, 00:00:23.510 --> 00:00:25.870 and I have some people who have come to the course 00:00:25.870 --> 00:00:29.350 later on to complete it with a different software. 00:00:29.350 --> 00:00:31.730 We are using three different software on the course. 00:00:31.730 --> 00:00:35.727 We have our SPSS, Stata, and R. 00:00:35.727 --> 00:00:38.330 And these are fairly different. 00:00:38.330 --> 00:00:41.790 So they, you can complete the course fully 00:00:41.790 --> 00:00:43.510 using any of these software. 00:00:43.510 --> 00:00:46.420 I have very strong opinions on which of this software 00:00:46.420 --> 00:00:47.320 you should apply 00:00:47.320 --> 00:00:50.590 if you want to be a professional researcher. 00:00:50.590 --> 00:00:51.960 But let's take a look at first, 00:00:51.960 --> 00:00:53.110 what's a statistical software 00:00:53.110 --> 00:00:55.750 and how does it differ from Excel. 00:00:55.750 --> 00:00:56.760 So in Excel, 00:00:56.760 --> 00:00:59.740 your data and your analysis lives in one worksheet. 
00:00:59.740 --> 00:01:03.250 So some of the cells have data, and if you do calculations, 00:01:03.250 --> 00:01:05.810 then those calculations go to different cells 00:01:05.810 --> 00:01:09.120 or maybe a different sheet within the same file 00:01:09.120 --> 00:01:11.620 and then when inside cells in there. 00:01:11.620 --> 00:01:15.830 And also all calculation results appear in the same cells. 00:01:15.830 --> 00:01:20.610 So you have data, analysis specification, 00:01:20.610 --> 00:01:22.930 and results in the same file. 00:01:22.930 --> 00:01:25.570 And it's not very easy for anyone 00:01:25.570 --> 00:01:28.460 who has not done the sheet themselves, 00:01:28.460 --> 00:01:32.400 or to understand what is the logical sequence 00:01:32.400 --> 00:01:34.090 behind the analysis. 00:01:34.090 --> 00:01:35.630 So if you calculate the mean 00:01:35.630 --> 00:01:37.780 and then you calculate the standard deviation, 00:01:37.780 --> 00:01:40.720 it's not clear by looking at the Excel sheet, 00:01:40.720 --> 00:01:42.710 which one is calculated first. 00:01:42.710 --> 00:01:45.623 Some cases it doesn't matter, some cases it does. 00:01:46.860 --> 00:01:49.730 Statistical software is a different kind of tool. 00:01:49.730 --> 00:01:52.970 So statistical software has data, 00:01:52.970 --> 00:01:56.060 it has analysis specification and it has results. 00:01:56.060 --> 00:01:58.720 But these are typically in three separate files. 00:01:58.720 --> 00:02:02.330 And the data file is something that you hardly, 00:02:02.330 --> 00:02:03.240 you never edit. 00:02:03.240 --> 00:02:05.537 So your data file use what you have, 00:02:05.537 --> 00:02:08.160 and that's, you never edit it. 00:02:08.160 --> 00:02:12.150 Then the analysis file lists the sequence of operations 00:02:12.150 --> 00:02:15.540 or commands or analysis that you applied to the data. 
00:02:15.540 --> 00:02:19.110 And it's basically a text file or a document, 00:02:19.110 --> 00:02:21.020 and you read it from top to bottom 00:02:21.020 --> 00:02:24.193 and then the computer applies the commands 00:02:25.230 --> 00:02:28.220 to the data using that sequence. 00:02:28.220 --> 00:02:30.830 So data analysis using statistical software 00:02:30.830 --> 00:02:32.490 is command driven. 00:02:32.490 --> 00:02:35.600 And commands can do analysis, they can manipulate data, 00:02:35.600 --> 00:02:36.990 they can load data sets, they can save data sets, 00:02:36.990 --> 00:02:39.170 they can do all kinds of things. 00:02:39.170 --> 00:02:43.230 All of these programs are command driven. 00:02:43.230 --> 00:02:46.230 R is a bit different because it's not 00:02:46.230 --> 00:02:50.670 as much a statistical analysis software as Stata and SPSS. 00:02:50.670 --> 00:02:53.790 Instead it's a statistical programming environment. 00:02:53.790 --> 00:02:58.790 So it's much more focused on programming than Stata and SPSS 00:02:59.200 --> 00:03:00.610 which are more focused 00:03:00.610 --> 00:03:03.680 on just sequentially applying commands to the data. 00:03:03.680 --> 00:03:06.150 Of course, you can do that as well with R, 00:03:06.150 --> 00:03:07.950 but R is a much more general system. 00:03:08.860 --> 00:03:11.010 These also have different target audiences. 00:03:11.890 --> 00:03:14.520 SPSS is owned by IBM, and 00:03:14.520 --> 00:03:16.800 one of their main markets is corporations. 00:03:16.800 --> 00:03:18.860 So they want to target marketing departments 00:03:18.860 --> 00:03:20.900 and they have analysis techniques 00:03:20.900 --> 00:03:22.440 that are relevant for marketing, 00:03:22.440 --> 00:03:25.510 like customer segmentation analysis and things like that 00:03:25.510 --> 00:03:27.920 that are not relevant for social science research. 
00:03:27.920 --> 00:03:32.270 Then Stata has been developed first by a person 00:03:32.270 --> 00:03:33.760 with a background in university 00:03:33.760 --> 00:03:36.630 and is focused on social sciences. 00:03:36.630 --> 00:03:38.320 So it's focused on social sciences 00:03:38.320 --> 00:03:40.400 and nowadays life science research as well. 00:03:40.400 --> 00:03:41.950 But this is specifically designed 00:03:41.950 --> 00:03:43.800 for university researchers. 00:03:43.800 --> 00:03:45.900 And R is a programming environment 00:03:45.900 --> 00:03:47.730 so it's designed to be very general 00:03:47.730 --> 00:03:49.733 without any specific target audience. 00:03:50.620 --> 00:03:55.270 What this difference means is that with R, 00:03:55.270 --> 00:03:56.810 you can do the most things, 00:03:56.810 --> 00:04:01.300 but because R is general instead of focused specifically 00:04:01.300 --> 00:04:02.730 on certain tasks, 00:04:02.730 --> 00:04:05.150 it may not be the easiest to use tool 00:04:05.150 --> 00:04:07.500 or the most efficient tool for doing something. 00:04:07.500 --> 00:04:09.210 Then Stata has a more narrow scope 00:04:09.210 --> 00:04:13.580 and it's very good at social science research. 00:04:13.580 --> 00:04:16.070 So most of the things that a social science researcher wants 00:04:16.070 --> 00:04:17.740 to do Stata provides, 00:04:17.740 --> 00:04:21.730 and it's a fairly nice to use tool for that purpose. 00:04:21.730 --> 00:04:23.850 SPSS because of its focus, 00:04:23.850 --> 00:04:27.030 then lacks some of the tools that we apply on the course. 00:04:27.030 --> 00:04:29.440 So because it's not focused 00:04:29.440 --> 00:04:31.465 on the kind of research that we do, 00:04:31.465 --> 00:04:35.280 then you need to go through some extra steps to get 00:04:35.280 --> 00:04:36.750 some basic results. 00:04:36.750 --> 00:04:39.377 So it may be good for market segmentation, 00:04:39.377 --> 00:04:41.530 but it's not as easy to use as Stata. 
00:04:41.530 --> 00:04:43.270 Once you know how to use it. 00:04:43.270 --> 00:04:44.900 The documentation is also quite different. 00:04:44.900 --> 00:04:49.430 So SPSS documentation is about how to use SPSS, 00:04:49.430 --> 00:04:52.480 so that's the normal software documentation. 00:04:52.480 --> 00:04:53.313 It's not, 00:04:53.313 --> 00:04:56.550 it doesn't try to get you to understand regression analysis 00:04:56.550 --> 00:04:58.310 just tells you that if you understand regression, 00:04:58.310 --> 00:05:00.570 this is how you do it with SPSS. 00:05:00.570 --> 00:05:02.220 Stata on the other hand, 00:05:02.220 --> 00:05:04.430 their documentation explains the analysis as well. 00:05:04.430 --> 00:05:08.040 So this is a pretty good learning resource as well. 00:05:08.040 --> 00:05:11.440 So whereas SPSS manual tells you how to use SPSS, 00:05:11.440 --> 00:05:16.198 Stata tells you how certain analysis are used and why, 00:05:16.198 --> 00:05:18.740 and how you get things done with Stata. 00:05:18.740 --> 00:05:22.080 Then R documentation is not good for learning at all. 00:05:22.080 --> 00:05:24.904 Typically R documentation tells you how a certain command 00:05:24.904 --> 00:05:28.910 is specified and then it may point to an original source 00:05:28.910 --> 00:05:31.110 whoever first invented, 00:05:31.110 --> 00:05:33.670 let's say a regression analysis and tell the reader, 00:05:33.670 --> 00:05:35.830 tell the user to look at the details 00:05:35.830 --> 00:05:38.080 of regression analysis from the original source. 00:05:38.080 --> 00:05:41.000 So this is a less user-friendly documentation. 00:05:41.000 --> 00:05:44.120 The availability of this software differs as well. 00:05:44.120 --> 00:05:48.450 Most universities that I've worked with have SPSS, 00:05:48.450 --> 00:05:50.700 have a site license for SPSS. 
00:05:50.700 --> 00:05:52.520 Which means that SPSS is installed 00:05:52.520 --> 00:05:54.490 on all university computers, 00:05:54.490 --> 00:05:57.260 and typically the university also provides a way 00:05:57.260 --> 00:06:00.920 for students and staff to install SPSS 00:06:00.920 --> 00:06:02.070 on their home computer. 00:06:02.950 --> 00:06:04.300 Stata on the other hand, 00:06:04.300 --> 00:06:06.010 doesn't have such site licensing agreements. 00:06:06.010 --> 00:06:09.590 So Stata usually is installed in a computer lab, 00:06:09.590 --> 00:06:12.210 and if your university has a purchasing agreement, 00:06:12.210 --> 00:06:14.920 then typically it's fairly easy to get it 00:06:14.920 --> 00:06:16.400 on your work computer, 00:06:16.400 --> 00:06:19.020 but probably not for your home computer. 00:06:19.020 --> 00:06:21.020 R on the other hand is open-source 00:06:21.020 --> 00:06:24.000 and typically installed on all university computers, 00:06:24.000 --> 00:06:24.833 and it's free. 00:06:24.833 --> 00:06:28.130 You can just download R and the RStudio editor, 00:06:28.130 --> 00:06:31.367 which is highly recommended, on your own computer, 00:06:31.367 --> 00:06:33.690 and then just start using it because 00:06:33.690 --> 00:06:36.653 there's no cost attached. 00:06:38.040 --> 00:06:40.180 There are a couple of different ways 00:06:40.180 --> 00:06:41.580 to use the software 00:06:41.580 --> 00:06:45.980 and the different ways of use partially determine 00:06:45.980 --> 00:06:47.530 which software is best for you. 
00:06:48.540 --> 00:06:52.210 The data sets and commands are two separate things 00:06:52.210 --> 00:06:56.350 in a statistical analysis software as I explained, 00:06:56.350 --> 00:06:57.440 so the data file, 00:06:57.440 --> 00:07:00.340 that's whatever you got from your data collection efforts, 00:07:00.340 --> 00:07:04.090 you never edit that, it's columns and rows, 00:07:04.090 --> 00:07:05.700 and if you can, 00:07:05.700 --> 00:07:08.750 some software allow you to have multiple data sets open, 00:07:08.750 --> 00:07:10.380 some software don't, 00:07:10.380 --> 00:07:13.730 it could be viewed as an advantage to have multiple files 00:07:13.730 --> 00:07:15.270 or multiple data files open 00:07:15.270 --> 00:07:17.840 but then the problem is that when you execute a command, 00:07:17.840 --> 00:07:20.890 how do you know which data set you're actually working on? 00:07:20.890 --> 00:07:22.610 So with SPSS, 00:07:22.610 --> 00:07:25.860 I have multiple students who are really confused 00:07:25.860 --> 00:07:27.510 when they apply an analysis, 00:07:27.510 --> 00:07:29.500 and the analysis result is unexpected. 00:07:29.500 --> 00:07:31.520 The reason why it's unexpected 00:07:31.520 --> 00:07:33.680 is that they have two data files open. 00:07:33.680 --> 00:07:36.230 They thought that they were analyzing the first data file, 00:07:36.230 --> 00:07:38.680 but SPSS was actually analyzing the second data file. 00:07:40.530 --> 00:07:41.720 Then we have command files, 00:07:41.720 --> 00:07:45.210 which are a sequence of data manipulation and analysis commands, 00:07:45.210 --> 00:07:47.550 and these store the logic of your analysis. 00:07:47.550 --> 00:07:50.753 However you want to use your statistical software 00:07:50.753 --> 00:07:55.460 you should always at the end have an analysis file. 
00:07:55.460 --> 00:07:57.530 If you have a graphical user interface, 00:07:57.530 --> 00:07:59.730 which Stata and SPSS have, 00:07:59.730 --> 00:08:01.790 then those software, when you use them, 00:08:01.790 --> 00:08:03.990 they will produce a log file 00:08:03.990 --> 00:08:05.850 that contains all the analysis commands 00:08:05.850 --> 00:08:08.490 that you applied in that analysis session. 00:08:08.490 --> 00:08:12.450 When you are done at the end, then you save the log file, 00:08:12.450 --> 00:08:14.310 you extract the commands, 00:08:14.310 --> 00:08:16.490 you take out those that you don't actually need 00:08:16.490 --> 00:08:20.940 for the final paper and then when you write your paper, 00:08:20.940 --> 00:08:22.960 you store the analysis file. 00:08:22.960 --> 00:08:25.220 This is important because you need to be able 00:08:25.220 --> 00:08:27.120 to replicate your analysis later. 00:08:27.120 --> 00:08:30.200 If someone asks you, how did you get your results? 00:08:30.200 --> 00:08:32.340 Unless you have the analysis file, 00:08:32.340 --> 00:08:34.340 then you can't repeat your analysis. 00:08:34.340 --> 00:08:38.380 If a reviewer wants to have changes in your analysis, 00:08:38.380 --> 00:08:40.030 when you submit to a journal, 00:08:40.030 --> 00:08:43.160 then how are you supposed to do that 00:08:43.160 --> 00:08:44.480 if you have not kept track 00:08:44.480 --> 00:08:46.630 of what you actually did to the data. 00:08:46.630 --> 00:08:47.610 On this course, 00:08:47.610 --> 00:08:49.650 whenever you return a data analysis assignment, 00:08:49.650 --> 00:08:52.750 you must return a report and an analysis file as well. 
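What an analysis file amounts to can be sketched in a few lines. This is only an illustration written in Python (the course itself uses SPSS, Stata, and R), and the data values, the variable name `income`, and the function name `run_analysis` are all hypothetical. The point is that the raw data is only read, never edited, and the steps run in a fixed top-to-bottom order, so rerunning the file reproduces the same results.

```python
import csv
import io
import statistics

# Hypothetical raw data. In practice this would be a separate data file
# that the analysis only reads and never modifies.
RAW_DATA = "id,income\n1,1000\n2,2000\n3,4000\n"


def run_analysis(raw_text):
    """Apply the analysis steps in a fixed, documented order."""
    # Step 1: load the data (read-only).
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    incomes = [float(row["income"]) for row in rows]
    # Step 2: compute the mean.
    mean_income = statistics.mean(incomes)
    # Step 3: compute the sample standard deviation.
    sd_income = statistics.stdev(incomes)
    return {"n": len(incomes), "mean": mean_income, "sd": sd_income}


results = run_analysis(RAW_DATA)
print(results)
```

Because the whole sequence lives in one script, answering "how did you get your results?" a year later is a matter of rerunning it.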
00:08:52.750 --> 00:08:55.220 And this is very important because, 00:08:55.220 --> 00:08:59.830 I can point you to many examples where researchers clearly 00:08:59.830 --> 00:09:01.690 have not stored their results 00:09:01.690 --> 00:09:03.600 and when you ask them about the results, 00:09:03.600 --> 00:09:05.560 they have no idea how they did the calculation, 00:09:05.560 --> 00:09:07.700 because they could have done it a year ago 00:09:07.700 --> 00:09:09.170 and then they have forgotten. 00:09:09.170 --> 00:09:10.800 Keeping an analysis file ensures 00:09:10.800 --> 00:09:13.710 that you can always tell a person who wants to know 00:09:13.710 --> 00:09:15.863 about the research, how exactly you did it. 00:09:17.160 --> 00:09:20.890 So an analysis file is one way of storing the sequence 00:09:20.890 --> 00:09:21.723 of analysis, 00:09:21.723 --> 00:09:24.250 but there are basically three different ways 00:09:24.250 --> 00:09:25.130 of using this software. 00:09:25.130 --> 00:09:26.734 So we have first menus, 00:09:26.734 --> 00:09:30.700 so you can generate or do commands using menus. 00:09:30.700 --> 00:09:33.440 So you point and click and you choose 00:09:33.440 --> 00:09:35.430 from the menu regression analysis, 00:09:35.430 --> 00:09:36.760 then you have a list of variables, 00:09:36.760 --> 00:09:38.530 you choose one to be the dependent, 00:09:38.530 --> 00:09:40.410 a couple to be the independents, 00:09:40.410 --> 00:09:41.990 and then you click execute. 00:09:41.990 --> 00:09:44.180 Then the user interface of the software 00:09:44.180 --> 00:09:45.700 will generate the command, 00:09:45.700 --> 00:09:48.903 which the software will then run. 
00:09:50.100 --> 00:09:51.640 R doesn't have menus, 00:09:51.640 --> 00:09:55.290 which makes using R a bit difficult, 00:09:55.290 --> 00:09:57.010 at least for the very beginners, 00:09:57.010 --> 00:09:59.790 because you need to learn how the commands are typed 00:09:59.790 --> 00:10:00.790 in the very beginning. 00:10:00.790 --> 00:10:03.670 So R has a steep learning curve because of that. 00:10:03.670 --> 00:10:05.090 When you open it the first time, 00:10:05.090 --> 00:10:07.070 you may have no idea what to do. 00:10:07.070 --> 00:10:09.470 When you open SPSS or Stata the first time 00:10:09.470 --> 00:10:11.740 you can always see that there's an analysis menu, 00:10:11.740 --> 00:10:13.580 perhaps clicking on the analysis menu 00:10:13.580 --> 00:10:16.350 I can do some analysis and then there's regression analysis, 00:10:16.350 --> 00:10:19.040 perhaps clicking on that, you can do a regression analysis, 00:10:19.040 --> 00:10:21.240 and indeed that's the way you do regression. 00:10:22.520 --> 00:10:27.380 Stata and R also allow you to type commands interactively. 00:10:27.380 --> 00:10:31.540 So you can type commands and this is the way 00:10:31.540 --> 00:10:34.760 most professional researchers that I know use their software. 00:10:34.760 --> 00:10:37.470 Once you know the basic commands 00:10:37.470 --> 00:10:41.640 it's a lot easier to type regress 00:10:41.640 --> 00:10:45.050 or reg, short for regression analysis, 00:10:45.050 --> 00:10:47.530 and then type the names of variables instead of going 00:10:47.530 --> 00:10:49.490 and clicking through the user interface. 00:10:49.490 --> 00:10:52.160 So you're a lot quicker with the keyboard 00:10:52.160 --> 00:10:54.010 than you are with the menus. 
00:10:54.010 --> 00:10:57.590 So this is something that, for example, 00:10:57.590 --> 00:10:58.680 Stata documentation recommends 00:10:58.680 --> 00:11:00.955 that you should start learning 00:11:00.955 --> 00:11:02.300 and that's the first thing, 00:11:02.300 --> 00:11:03.310 but that's the second thing 00:11:03.310 --> 00:11:05.070 when you start using the software. 00:11:05.070 --> 00:11:06.610 And then you have the analysis file. 00:11:06.610 --> 00:11:08.360 So the analysis file is just the, 00:11:08.360 --> 00:11:11.020 a sequence of commands that reproduces all your analysis. 00:11:11.020 --> 00:11:13.810 Every time when you think that you did something stupid, 00:11:13.810 --> 00:11:15.207 they'd rerun your analysis file 00:11:15.207 --> 00:11:18.570 and that gives you a clean slate of the final analysis. 00:11:18.570 --> 00:11:20.120 That's how I use this software. 00:11:21.040 --> 00:11:23.550 It also, when we discuss, 00:11:23.550 --> 00:11:25.261 which of this software is the best. 00:11:25.261 --> 00:11:27.670 One thing that you need to consider 00:11:27.670 --> 00:11:29.500 is the capabilities of the software 00:11:29.500 --> 00:11:32.430 and what does the analysis file look like 00:11:32.430 --> 00:11:34.320 because you always have to produce that at least, 00:11:34.320 --> 00:11:37.340 regardless of how you, whether you use menus or typing, 00:11:37.340 --> 00:11:40.930 the analysis file is something that you will always have. 
00:11:40.930 --> 00:11:42.963 So here's an analysis file example, 00:11:43.920 --> 00:11:48.920 doing the same set of analyses in Stata and R, 00:11:49.340 --> 00:11:51.590 you don't have to understand what this means now, 00:11:51.590 --> 00:11:52.900 but basically, 00:11:52.900 --> 00:11:56.200 what I'm doing here is that this is a regression analysis, 00:11:56.200 --> 00:11:58.890 so we have the regress command here 00:11:58.890 --> 00:12:01.800 or lm command here for linear model 00:12:01.800 --> 00:12:05.720 and I have a data set about professions 00:12:05.720 --> 00:12:09.010 I'm explaining the logarithm of income, 00:12:09.010 --> 00:12:13.280 and I'm having an interaction term of prestige and women, 00:12:13.280 --> 00:12:15.620 I have a categorical variable here, 00:12:15.620 --> 00:12:17.740 and so this is the regression analysis. 00:12:17.740 --> 00:12:19.880 So in Stata, 00:12:19.880 --> 00:12:21.410 we create a log of income, 00:12:21.410 --> 00:12:24.560 Stata will automatically generate an interaction term for us, 00:12:24.560 --> 00:12:27.940 it'll automatically do categorical variables 00:12:27.940 --> 00:12:30.850 if we indicate them with the i prefix, 00:12:30.850 --> 00:12:32.960 R will automatically treat this type 00:12:32.960 --> 00:12:34.740 as a categorical variable, 00:12:34.740 --> 00:12:38.080 and then we have this regression here, 00:12:38.080 --> 00:12:41.270 interactions are always multiplying things together, 00:12:41.270 --> 00:12:44.440 R knows how to deal with that, same with Stata. 00:12:44.440 --> 00:12:49.200 Then we have our marginal predictions calculated here, 00:12:49.200 --> 00:12:51.350 and that's something that I will discuss 00:12:51.350 --> 00:12:53.379 on the course quite a lot because it's a highly useful 00:12:53.379 --> 00:12:55.460 and underutilized tool. 00:12:55.460 --> 00:12:58.760 And then we plot the marginal predictions. 
00:12:58.760 --> 00:13:03.760 So this is maybe a one, two, three commands 00:13:04.260 --> 00:13:07.253 to do a transformation of one variable, 00:13:07.253 --> 00:13:08.890 regression analysis, 00:13:08.890 --> 00:13:13.110 and then plotting the result using marginal predictions. 00:13:13.110 --> 00:13:16.330 In R, we need to load a package 00:13:16.330 --> 00:13:18.040 for the marginal prediction plot, 00:13:18.040 --> 00:13:21.890 we have two, three, four, five, 00:13:21.890 --> 00:13:25.190 six commands out of which one is loading a package, 00:13:25.190 --> 00:13:27.360 then two are just printing out the results, 00:13:27.360 --> 00:13:29.110 the summary commands. 00:13:29.110 --> 00:13:31.680 So you only type a small number of commands 00:13:31.680 --> 00:13:34.733 for a fairly impressive set of things. 00:13:36.900 --> 00:13:38.790 In SPSS, 00:13:38.790 --> 00:13:40.790 this is the regression part. 00:13:40.790 --> 00:13:43.500 So there's no marginal predictions, there's no plotting. 00:13:43.500 --> 00:13:45.450 You can't do that with SPSS. 00:13:45.450 --> 00:13:48.070 So with SPSS, 00:13:48.070 --> 00:13:51.200 SPSS doesn't know how to deal with interaction terms, 00:13:51.200 --> 00:13:53.500 it doesn't know how to deal with categorical variables 00:13:53.500 --> 00:13:54.960 in a regression analysis. 00:13:54.960 --> 00:13:58.400 So you have to dummy code manually. 
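What "dummy coding manually" involves can be sketched as follows. This is an illustration in Python rather than SPSS syntax, and the three rows, the variable names `type`, `prestige`, and `women`, and the helper `dummy_code` are hypothetical stand-ins for the data set on the slide. It builds the indicator columns and the interaction term by hand, which is roughly the preparation that Stata's i prefix and R's treatment of categorical variables take care of automatically.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator column per category, omitting the reference category."""
    categories = sorted(set(values) - {reference})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


# Hypothetical rows: an occupation type, a prestige score, and a percentage of women.
rows = [
    {"type": "bc", "prestige": 30.0, "women": 10.0},
    {"type": "wc", "prestige": 45.0, "women": 60.0},
    {"type": "prof", "prestige": 70.0, "women": 20.0},
]

# Categorical variable: one indicator per non-reference category.
dummies = dummy_code([row["type"] for row in rows], reference="bc")

# Interaction term: the product of the two variables, row by row.
interaction = [row["prestige"] * row["women"] for row in rows]

print(dummies)
print(interaction)
```

The new columns would then be entered into the regression as ordinary independent variables. If the coding of the categorical variable changes, for instance a different reference category, every indicator column has to be rebuilt.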
00:13:58.400 --> 00:14:00.350 So doing this, 00:14:00.350 --> 00:14:03.040 if you can type that's fairly quick to do, 00:14:03.040 --> 00:14:05.710 if you do this in the user interface, this plot, 00:14:05.710 --> 00:14:07.560 maybe it takes you 10 minutes to do, 00:14:07.560 --> 00:14:09.830 compared to just typing the variable name 00:14:09.830 --> 00:14:12.320 and allowing R to do it automatically for you 00:14:12.320 --> 00:14:14.500 or typing i period variable name 00:14:14.500 --> 00:14:17.280 and allowing Stata to automatically do it for you 00:14:17.280 --> 00:14:18.800 once Stata, 00:14:18.800 --> 00:14:21.560 once you tell Stata that this is a categorical variable. 00:14:21.560 --> 00:14:24.753 So in SPSS, 00:14:26.240 --> 00:14:28.360 you need to do a lot more data manipulation 00:14:28.360 --> 00:14:29.830 before the analysis, 00:14:29.830 --> 00:14:32.950 because the analysis command is actually less capable. 00:14:32.950 --> 00:14:36.310 Also, the regression command is fairly involved, 00:14:36.310 --> 00:14:38.160 you need to specify lots of things. 00:14:38.160 --> 00:14:41.050 It's not enough to specify just the dependent variable 00:14:41.050 --> 00:14:42.650 and the independent variables, 00:14:42.650 --> 00:14:45.180 but you need to specify all kinds of defaults, 00:14:45.180 --> 00:14:48.360 because for some reason the command doesn't work 00:14:48.360 --> 00:14:52.240 with empty options and default to some useful settings. 00:14:52.240 --> 00:14:54.900 And then once we have done the regression analysis 00:14:54.900 --> 00:14:57.890 then you will need to copy paste the results to Excel, 00:14:57.890 --> 00:15:00.377 to do the marginal predictions 00:15:00.377 --> 00:15:02.280 and to plot the marginal predictions. 00:15:02.280 --> 00:15:03.720 So, SPSS here, 00:15:03.720 --> 00:15:06.340 it's more work. 
00:15:06.340 --> 00:15:09.100 It's more stuff going in the analysis file, 00:15:09.100 --> 00:15:10.470 and it does less than R. 00:15:10.470 --> 00:15:13.600 That's about half of what these analysis files do. 00:15:13.600 --> 00:15:16.780 So which one do you think is the most convenient 00:15:16.780 --> 00:15:18.030 to work with in the long run? 00:15:18.030 --> 00:15:19.800 Well, that's a personal preference. 00:15:19.800 --> 00:15:24.530 Some people can get away with never editing 00:15:24.530 --> 00:15:27.430 their analysis files by hand. 00:15:27.430 --> 00:15:28.280 So instead, 00:15:28.280 --> 00:15:30.480 they just do a command and then they take 00:15:30.480 --> 00:15:32.330 what the command is using the menus 00:15:32.330 --> 00:15:35.600 and then they copy paste it to the analysis file. 00:15:35.600 --> 00:15:36.490 But for example, 00:15:36.490 --> 00:15:38.860 if you need to change how you code 00:15:38.860 --> 00:15:41.240 this categorical variable, at least for me, 00:15:41.240 --> 00:15:44.890 it's a lot simpler to just edit this syntax here 00:15:44.890 --> 00:15:48.310 instead of going and pointing and clicking around. 00:15:48.310 --> 00:15:51.290 So the SPSS syntax is not as user-friendly 00:15:51.290 --> 00:15:53.080 as Stata and R, 00:15:53.080 --> 00:15:54.940 but if you do not understand 00:15:54.940 --> 00:15:57.270 any of these software's basic syntaxes 00:15:57.270 --> 00:15:59.220 it's going to be fairly impossible to know 00:15:59.220 --> 00:16:00.800 what any of these does. 00:16:00.800 --> 00:16:02.270 But in Stata, less typing here, 00:16:02.270 --> 00:16:05.010 regress and then dependent variable, 00:16:05.010 --> 00:16:06.680 independent variable, same here. 00:16:06.680 --> 00:16:10.460 lm, dependent variable, independent variables, 00:16:10.460 --> 00:16:12.363 compared to this SPSS syntax here. 00:16:13.360 --> 00:16:15.150 So my take on software is pretty clear. 
00:16:15.150 --> 00:16:17.330 I don't think anyone should be using SPSS 00:16:17.330 --> 00:16:19.130 for serious research. 00:16:19.130 --> 00:16:24.130 If you want to be a professional, a construction worker, 00:16:24.210 --> 00:16:27.870 you don't go to the closest store and pick the cheapest drill, 00:16:27.870 --> 00:16:29.110 you go to a hardware store 00:16:29.110 --> 00:16:33.130 and pick a proper professional drill. 00:16:33.130 --> 00:16:34.160 That's the same thing here. 00:16:34.160 --> 00:16:36.470 We have different kinds of tools. 00:16:36.470 --> 00:16:38.020 SPSS is a good, 00:16:38.020 --> 00:16:40.060 it's a very good tool for getting started. 00:16:40.060 --> 00:16:43.060 So if you just want to do the first assignment 00:16:43.060 --> 00:16:43.893 of this course 00:16:43.893 --> 00:16:46.750 and never do any quantitative research yourself, 00:16:46.750 --> 00:16:49.020 you're gonna be fine with SPSS. 00:16:49.020 --> 00:16:51.510 If you want to do this for a living, 00:16:51.510 --> 00:16:55.050 then Stata is probably better choice for you. 00:16:55.050 --> 00:16:58.440 The R is also something that you could consider, 00:16:58.440 --> 00:17:00.930 but the problem is that R is a bit technical. 00:17:00.930 --> 00:17:03.360 So if you are a very non-technical person 00:17:03.360 --> 00:17:06.720 then R may not be the right tool for you. 00:17:06.720 --> 00:17:09.730 There also are some good reasons to use SPSS. 00:17:09.730 --> 00:17:13.940 So there are lots of very successful researchers 00:17:13.940 --> 00:17:16.070 who use SPSS as their main tool. 00:17:16.070 --> 00:17:19.050 Their main competence is probably something else 00:17:19.050 --> 00:17:20.380 than data analysis. 
00:17:20.380 --> 00:17:22.410 So if you specialize in theory, 00:17:22.410 --> 00:17:25.090 you just need basic tools for testing your theory, 00:17:25.090 --> 00:17:27.110 and then you have others 00:17:27.110 --> 00:17:28.850 who do the more advanced tests for you, 00:17:28.850 --> 00:17:30.540 you're gonna be fine with SPSS, 00:17:30.540 --> 00:17:35.520 but if you want to be very good in statistical analysis 00:17:35.520 --> 00:17:36.710 and quantitative research, 00:17:36.710 --> 00:17:40.738 then SPSS is probably going to be in your way at some point. 00:17:40.738 --> 00:17:43.160 I know quite a few people 00:17:43.160 --> 00:17:45.600 that have used SPSS in the past 00:17:45.600 --> 00:17:47.960 and have moved to Stata since, 00:17:47.960 --> 00:17:50.760 and I don't know anyone who has 00:17:50.760 --> 00:17:54.290 used Stata as their main tool and then moved to SPSS. 00:17:54.290 --> 00:17:57.260 There are some people who move from SPSS to R 00:17:57.260 --> 00:17:59.320 but that's a pretty big leap because the software 00:17:59.320 --> 00:18:00.560 are so different. 00:18:00.560 --> 00:18:03.522 That being said, the use of R is increasing. 00:18:03.522 --> 00:18:05.163 On the courses that I give, 00:18:06.140 --> 00:18:09.150 R tends to be the most popular option, 00:18:09.150 --> 00:18:11.630 because you can install R anywhere 00:18:11.630 --> 00:18:13.400 and it's always available for you. 00:18:13.400 --> 00:18:16.410 Stata comes next and then perhaps SPSS. 00:18:16.410 --> 00:18:19.380 You're gonna be fine with SPSS 00:18:19.380 --> 00:18:20.213 for this course, 00:18:20.213 --> 00:18:23.793 but it's just not an ideal tool in the long run for you. 00:18:25.260 --> 00:18:26.730 So how do you get started? 00:18:26.730 --> 00:18:28.590 First you need to familiarize yourself with the software. 
00:18:28.590 --> 00:18:30.530 So you need to have an understanding of their, 00:18:30.530 --> 00:18:33.140 of the basic feeling of how the software looks, 00:18:33.140 --> 00:18:34.750 how it works. 00:18:34.750 --> 00:18:36.340 There are, there's first Stata, 00:18:36.340 --> 00:18:38.660 the Stata's introductory manual 00:18:38.660 --> 00:18:41.070 is a very good getting started. 00:18:41.070 --> 00:18:43.577 So go and do the Stata's, 00:18:43.577 --> 00:18:45.510 "Introducing Stata Sample Session," 00:18:45.510 --> 00:18:48.950 open Stata, go to help menu, go to getting started, 00:18:48.950 --> 00:18:49.790 start working. 00:18:49.790 --> 00:18:50.740 They have, 00:18:50.740 --> 00:18:53.920 they explain how the software can be used by typing commands 00:18:53.920 --> 00:18:56.640 by doing things from the menus. 00:18:56.640 --> 00:19:00.069 If you want to use SPSS, I recommend that you do the same. 00:19:00.069 --> 00:19:04.380 There is a manual that you can access from the menus 00:19:04.380 --> 00:19:06.340 and then work through chapters one, four, seven, eight, 00:19:06.340 --> 00:19:07.560 and nine in that manual. 00:19:07.560 --> 00:19:10.190 Those are the ones that I have for my teaching. 00:19:10.190 --> 00:19:12.860 If you want to use R then I would recommend 00:19:12.860 --> 00:19:13.977 that you go through this, 00:19:13.977 --> 00:19:15.980 "Learn To Use R" from Computer World. 00:19:15.980 --> 00:19:17.450 And these are, 00:19:17.450 --> 00:19:18.920 these will give you roughly the idea 00:19:18.920 --> 00:19:20.700 of what these software are about. 00:19:20.700 --> 00:19:22.230 Then when you actually start learning 00:19:22.230 --> 00:19:23.270 how to use this software, 00:19:23.270 --> 00:19:25.020 then you need some other resources. 00:19:26.580 --> 00:19:28.583 I recommend some books for R, 00:19:30.660 --> 00:19:33.637 there's "R In Action" and "R for Data Science." 
00:19:33.637 --> 00:19:36.368 "R In Action" is a bit more old fashioned R, 00:19:36.368 --> 00:19:40.130 and "R for Data Science" is a more modern take on R. 00:19:40.130 --> 00:19:42.160 The problem with "R for Data Science" 00:19:42.160 --> 00:19:45.330 is that this book goes to... 00:19:46.281 --> 00:19:49.360 It gets to pretty advanced stuff pretty quickly. 00:19:49.360 --> 00:19:51.157 So "R In Action" is more basic, 00:19:51.157 --> 00:19:52.350 "R for Data Science" 00:19:52.350 --> 00:19:54.630 is something that you should definitely read at some point, 00:19:54.630 --> 00:19:57.173 if you want to be an efficient user of R. 00:19:58.340 --> 00:19:59.423 For SPSS, 00:20:00.450 --> 00:20:03.760 I recommend "Discovering Statistics Using SPSS," 00:20:03.760 --> 00:20:05.130 that's a pretty good book. 00:20:05.130 --> 00:20:08.020 The same person also has a book about R, 00:20:08.020 --> 00:20:11.330 if I remember correctly, and that may be a good book. 00:20:11.330 --> 00:20:12.910 I haven't read it myself. 00:20:12.910 --> 00:20:14.010 Then for Stata, 00:20:14.010 --> 00:20:16.680 I recommend that you start reading the "Stata User Manual" 00:20:16.680 --> 00:20:19.543 because that "Stata User Manual" is pretty excellent. 00:20:20.409 --> 00:20:22.670 Then search for online examples, 00:20:22.670 --> 00:20:24.500 there are lots of websites that tell you 00:20:24.500 --> 00:20:26.980 how to do certain analysis in R, SPSS, and Stata, 00:20:26.980 --> 00:20:28.563 and then you can compare. 00:20:29.830 --> 00:20:31.650 Ask for help online. 00:20:31.650 --> 00:20:34.100 For example, this course's data analysis forum. 00:20:34.100 --> 00:20:35.950 If you have a problem ask there, 00:20:35.950 --> 00:20:37.720 come to the computer lab. 00:20:37.720 --> 00:20:39.530 And for R specifically, 00:20:39.530 --> 00:20:41.120 there are some really good online courses. 
00:20:41.120 --> 00:20:41.953 For example, 00:20:41.953 --> 00:20:45.370 DataCamp has an interactive course, 00:20:45.370 --> 00:20:47.020 it takes you a couple of hours to do, 00:20:47.020 --> 00:20:48.920 it teaches you the basics of R, 00:20:48.920 --> 00:20:51.520 so you use R in a web browser there, 00:20:51.520 --> 00:20:52.900 the course tells you what to do, 00:20:52.900 --> 00:20:54.710 then you do it, and after you succeed, 00:20:54.710 --> 00:20:56.360 then it tells you the next thing. 00:20:57.510 --> 00:20:58.840 One of my favorite resources 00:20:58.840 --> 00:21:01.670 for learning how to get things done with these software 00:21:01.670 --> 00:21:04.550 is the University of California, Los Angeles, 00:21:04.550 --> 00:21:07.360 data analysis examples website. 00:21:07.360 --> 00:21:08.367 So 00:21:09.699 --> 00:21:14.100 the link is here, and this is an excellent source, 00:21:14.100 --> 00:21:17.410 because you can compare how certain things are accomplished 00:21:17.410 --> 00:21:18.610 with different software.