WEBVTT Kind: captions; Language: en 1 00:00:01.710 --> 00:00:06.290 Good afternoon from Jyväskylä, Finland, to all of you 2 00:00:06.290 --> 00:00:09.810 on site and all around the world online. 3 00:00:09.810 --> 00:00:14.440 My name is Raili Hildén, and I'm from the University of Helsinki, from the 4 00:00:14.440 --> 00:00:19.930 Faculty of Educational Sciences, where I work as an associate professor 5 00:00:19.930 --> 00:00:22.690 in language didactics and teacher education. 6 00:00:22.690 --> 00:00:27.070 It is my great pleasure to welcome you to the closing seminar of 7 00:00:27.070 --> 00:00:32.000 the DigiTala project, which has been running since 2019 8 00:00:32.000 --> 00:00:37.390 and is terminating at the end of August this year. 9 00:00:37.390 --> 00:00:44.650 DigiTala: the term comes from Swedish, and it's about digital talk. 10 00:00:44.650 --> 00:00:53.470 Our aim was to develop a digital tool for supporting language assessment, 11 00:00:53.470 --> 00:00:58.710 speaking assessment, in second and foreign languages. 12 00:00:58.710 --> 00:01:02.820 We started with the realities of Finnish 13 00:01:02.820 --> 00:01:08.060 language education, and perhaps the first 14 00:01:08.060 --> 00:01:14.520 aim, or the first idea that we got, was that we were missing 15 00:01:14.520 --> 00:01:20.760 a speaking part in the Finnish matriculation exam, which is practically 16 00:01:20.760 --> 00:01:26.140 the only high-stakes exam that we have in Finnish language 17 00:01:26.140 --> 00:01:30.720 education for all young people. 18 00:01:30.720 --> 00:01:36.160 Our second aim was to develop a pedagogical tool to provide feedback 19 00:01:36.160 --> 00:01:41.760 to students while they are practising their 20 00:01:41.760 --> 00:01:45.280 language skills autonomously or independently.
21 00:01:45.280 --> 00:01:49.700 That is, apart from school education proper. All these 22 00:01:49.700 --> 00:01:55.010 aims take collaboration between fields of science, 23 00:01:55.010 --> 00:02:02.610 and in our case, we approached our colleagues in speech technology 24 00:02:02.610 --> 00:02:06.470 and phonetics to complete these goals, 25 00:02:06.470 --> 00:02:11.580 to complete these ideas. This picture gallery shows who 26 00:02:11.580 --> 00:02:16.320 we are, the composition of the team. So we, 27 00:02:16.320 --> 00:02:20.960 from the University of Helsinki, are educationalists and language 28 00:02:20.960 --> 00:02:26.980 didactics specialists, and our colleagues at Aalto University are experts 29 00:02:26.980 --> 00:02:30.300 in speech technology and signal processing. 30 00:02:30.300 --> 00:02:39.350 And at the University of Jyväskylä, we have phonology 31 00:02:39.350 --> 00:02:51.670 and speech experts, and also language assessment. 32 00:02:51.670 --> 00:02:59.710 To meet our aim and our need to develop the tool, we 33 00:02:59.710 --> 00:03:06.020 started from the advancement of automatic speaking assessment, which 34 00:03:06.020 --> 00:03:10.560 is based on automatic speech recognition, a scoring system, 35 00:03:10.560 --> 00:03:16.720 and a system for providing diagnostic feedback. 36 00:03:16.720 --> 00:03:21.100 Of course, we were not the only or the first ones to invent this 37 00:03:21.100 --> 00:03:25.080 kind of device or this kind of idea. 38 00:03:25.080 --> 00:03:33.000 We have a lot of good examples all around the world, for example the TOEFL iBT, the Pearson 39 00:03:33.000 --> 00:03:36.980 Test of English, and Duolingo, just to mention a few of them, 40 00:03:36.980 --> 00:03:43.480 or perhaps the most referenced tests.
41 00:03:43.480 --> 00:03:49.740 As for the path that we have taken during these project years, we 42 00:03:49.740 --> 00:03:57.120 started by planning speaking tasks for the various levels of the Common 43 00:03:57.120 --> 00:04:03.370 European Framework of Reference, and then 44 00:04:03.370 --> 00:04:10.800 we asked language students and language learners to complete these 45 00:04:10.800 --> 00:04:17.420 tasks and record their speech in the Moodle system. 46 00:04:17.420 --> 00:04:22.960 We used the Moodle platform because it was familiar to all of us. 47 00:04:22.960 --> 00:04:29.930 Then we rated these samples, also in the same system, and analysed 48 00:04:29.930 --> 00:04:34.440 the ratings for their reliability and consistency. 49 00:04:34.440 --> 00:04:40.540 Then, with this material, we trained the machine. 50 00:04:40.540 --> 00:04:48.740 The samples were placed on the various separate levels according 51 00:04:48.740 --> 00:04:55.210 to the ratings, and we ended up with a user interface that 52 00:04:55.210 --> 00:05:03.310 somehow fulfilled our wish to give feedback as well, and that is something that 53 00:05:03.310 --> 00:05:10.950 people here on site also have the opportunity to try out today. 54 00:05:10.950 --> 00:05:18.390 As for the current functionalities of this Moodle application that we have created: 55 00:05:18.390 --> 00:05:23.910 it's not fully complete yet, but it's well on its way right now. 56 00:05:23.910 --> 00:05:28.130 At present, the teacher can create and edit their speaking tasks, 57 00:05:28.130 --> 00:05:33.890 and teachers can also explore student samples, listen to them, and 58 00:05:33.890 --> 00:05:36.720 view the automatic ratings given by the system. 59 00:05:36.720 --> 00:05:41.860 But the teacher is not entirely dependent on
60 00:05:41.860 --> 00:05:46.800 the machine: he or she can edit the ratings and give 61 00:05:46.800 --> 00:05:53.260 written feedback to their students, and teachers can also download 62 00:05:53.260 --> 00:05:58.060 the rating reports from the Moodle system. 63 00:05:58.060 --> 00:06:05.680 The learners, in turn, are able to test their microphone and to record their tasks. 64 00:06:05.680 --> 00:06:10.000 They view the task instructions and speak according to them, and 65 00:06:10.000 --> 00:06:13.050 when they are happy with their 66 00:06:13.050 --> 00:06:19.350 performance, after having listened to it a couple of times, they can submit 67 00:06:19.350 --> 00:06:27.570 the sample for rating. The system rates it, and after that the student 68 00:06:27.570 --> 00:06:34.280 can see the automatic rating given by the machine. 69 00:06:34.280 --> 00:06:43.920 They can also view the complementary feedback given by their teachers. 70 00:06:43.920 --> 00:06:50.370 The strengths and limitations are somewhat familiar from current research 71 00:06:50.370 --> 00:06:56.650 on other systems. The strengths, of course, 72 00:06:56.650 --> 00:07:04.070 cover independence of time and space, so that students can practise whenever they want and wherever 73 00:07:04.070 --> 00:07:10.790 they happen to be, and the system also saves teachers' rating time. 74 00:07:10.790 --> 00:07:14.310 The teacher doesn't have to be around all the time. 75 00:07:14.310 --> 00:07:20.700 We are also proud of having created the first automated rating tool for the domestic, 76 00:07:20.700 --> 00:07:26.770 national languages, Finnish and Swedish, both of which are considered 77 00:07:26.770 --> 00:07:34.430 so-called under-resourced languages compared to world languages like English, 78 00:07:34.430 --> 00:07:39.510 where speech technology is far more advanced.
79 00:07:39.510 --> 00:07:41.550 And 80 00:07:41.550 --> 00:07:47.850 our system is already capable of providing feedback on several components of oral proficiency, 81 00:07:47.850 --> 00:07:52.370 but there are still some limitations to our current system. 82 00:07:52.370 --> 00:07:59.710 The processing time is rather long, and users need to wait quite a 83 00:07:59.710 --> 00:08:05.050 while, perhaps a couple of minutes, for the results to come. 84 00:08:05.050 --> 00:08:12.810 And the construct that we can measure right now is of course not the entire construct of 85 00:08:12.810 --> 00:08:21.910 speaking; it's mostly restricted to measurable constructs, measurable 86 00:08:21.910 --> 00:08:28.120 traits, features of speech like fluency or pronunciation. 87 00:08:28.120 --> 00:08:33.600 The machine ratings and feedback are dependent on the amount and quality of the 88 00:08:33.600 --> 00:08:40.280 training data, which is rather limited, because in the course of this project we 89 00:08:40.280 --> 00:08:47.720 have not been able to record thousands of hours of speech. 90 00:08:47.720 --> 00:08:56.490 But this is a good start, and we are also looking into the maintenance of the tool 91 00:08:56.490 --> 00:09:01.850 and looking for some more financing 92 00:09:01.850 --> 00:09:07.320 and resourcing for the project output to be 93 00:09:07.320 --> 00:09:15.090 maintained and developed further. We have 94 00:09:15.090 --> 00:09:20.390 completed quite a number of studies together with the team, with 95 00:09:20.390 --> 00:09:26.730 the three universities, and you can find our references and 96 00:09:26.730 --> 00:09:41.290 our work in the articles that we have published. 97 00:09:41.290 --> 00:09:46.030 OK, good afternoon, everybody here in this hall and also online. 98 00:09:46.030 --> 00:09:47.870 My name is Ari Huhta.
99 00:09:47.870 --> 00:09:52.810 I come from the Centre for Applied Language Studies at the University 100 00:09:52.810 --> 00:09:56.770 of Jyväskylä, where I work as a professor of language assessment. 101 00:09:56.770 --> 00:10:03.350 In this project I'm a member of the Jyväskylä team, and it's my great 102 00:10:03.350 --> 00:10:10.290 pleasure to briefly introduce the three presentations in this 103 00:10:10.290 --> 00:10:15.990 second part of the closing seminar. The three 104 00:10:15.990 --> 00:10:21.150 presentations are from the three partner universities, Helsinki, Aalto and Jyväskylä, 105 00:10:21.150 --> 00:10:26.890 focusing on different aspects of the project, ranging from stakeholder 106 00:10:26.890 --> 00:10:33.190 beliefs and analysis of acoustic features of second language speech to, finally, 107 00:10:33.190 --> 00:10:39.090 deep learning methods in learner language 108 00:10:39.090 --> 00:10:44.100 recognition and rating. We will start with a 109 00:10:44.100 --> 00:10:49.720 presentation by Anna von Zansen and Raili Hildén from the University of Helsinki, who will 110 00:10:49.720 --> 00:10:55.190 be talking about stakeholder beliefs: biggest fear or dream come true? 111 00:10:55.190 --> 00:11:02.420 Please, the floor is yours. Thank you. 112 00:11:02.420 --> 00:11:07.980 All right, so my name is Anna von Zansen, and I have been working as a postdoctoral researcher 113 00:11:07.980 --> 00:11:16.760 in this project. I will talk to you about stakeholder beliefs, and in this talk stakeholder 114 00:11:16.760 --> 00:11:22.300 beliefs refer to L2 Finnish learners and their teachers. 115 00:11:22.300 --> 00:11:29.200 We have also had human raters as stakeholders. 116 00:11:29.200 --> 00:11:33.440 But first, I will start by describing the development 117 00:11:33.440 --> 00:11:37.990 stages that Raili mentioned already. Just a quick check:
118 00:11:37.990 --> 00:11:42.900 Are the online participants seeing and hearing OK? 119 00:11:42.900 --> 00:11:52.350 Please. 120 00:11:52.350 --> 00:12:01.230 Are the slides OK? OK, good. 121 00:12:01.230 --> 00:12:04.630 So. 122 00:12:04.630 --> 00:12:08.070 I will start by giving you an overview of the stages that we have 123 00:12:08.070 --> 00:12:12.010 taken when developing automated speaking assessment. 124 00:12:12.010 --> 00:12:16.070 First, we defined the speaking construct and drafted the 125 00:12:16.070 --> 00:12:20.130 speaking tasks and criteria to be used. 126 00:12:20.130 --> 00:12:25.150 The data collection took place during Finnish lessons, first remotely 127 00:12:25.150 --> 00:12:30.820 due to the pandemic and then live in schools, using Moodle. 128 00:12:30.820 --> 00:12:36.290 The speech samples were transcribed and rated by human raters. 129 00:12:36.290 --> 00:12:41.790 We then analysed the ratings using many-facet Rasch measurement in order to obtain fairer 130 00:12:41.790 --> 00:12:47.410 scores and to explore how the scales and raters function, so to say. 131 00:12:47.410 --> 00:12:55.110 Colleagues at Aalto developed a speech recognizer that converts speech into text, and 132 00:12:55.110 --> 00:13:01.790 several machine learning methods were then used to predict the human ratings, that 133 00:13:01.790 --> 00:13:07.130 is, to automatically score the speech samples or speaking 134 00:13:07.130 --> 00:13:09.830 performances. 135 00:13:09.830 --> 00:13:15.070 In the first version, we used the rating criteria that the human 136 00:13:15.070 --> 00:13:24.240 raters had used to produce automated feedback. 137 00:13:24.240 --> 00:13:28.260 And next, briefly about the speaking tasks that we used.
138 00:13:28.260 --> 00:13:35.160 We used the Moodle quiz module to collect the speech samples, first remotely, 139 00:13:35.160 --> 00:13:40.660 as I told you, and we had different test versions for the proficiency levels: 140 00:13:40.660 --> 00:13:47.780 A1 and A2 beginners, then one for B1 level speakers and one for B2 level 141 00:13:47.780 --> 00:13:54.400 speakers. For B1 and B2 level speakers, we 142 00:13:54.400 --> 00:14:00.110 used the curriculum for upper secondary education to define 143 00:14:00.110 --> 00:14:04.730 the goals, contents and target level. 144 00:14:04.730 --> 00:14:11.690 For the A1 and A2 versions, we followed the goals of one university 145 00:14:11.690 --> 00:14:17.950 course targeting beginners of Finnish. 146 00:14:17.950 --> 00:14:23.130 All the test formats included read-aloud tasks combined 147 00:14:23.130 --> 00:14:42.010 with semi-structured and 148 00:14:42.010 --> 00:14:43.810 open-ended tasks. 149 00:14:43.810 --> 00:14:48.020 In the semi-structured tasks, the learner was, for example, asked to answer a question during 150 00:14:48.020 --> 00:14:48.020 a simulated phone call, with a time limit of, for example, 15 seconds per response. 151 00:14:48.020 --> 00:14:48.020 The open-ended tasks included, for example, 152 00:14:48.020 --> 00:14:49.820 talking about a given topic for one minute 153 00:14:49.820 --> 00:14:56.560 or describing or comparing pictures. Is that clear now? 154 00:14:56.560 --> 00:15:01.100 Good. The speaking tasks are available online. 155 00:15:01.100 --> 00:15:06.280 So if you Google Zenodo and the community DigiTala, you will find all the 156 00:15:06.280 --> 00:15:12.960 materials, the criteria, the tasks and everything there online. 157 00:15:12.960 --> 00:15:16.340 And next, about the rating criteria.
158 00:15:16.340 --> 00:15:21.860 Here you see a list of the scales that the raters used. We 159 00:15:21.860 --> 00:15:27.960 used level descriptors of the previous national core curriculum, as they are local 160 00:15:27.960 --> 00:15:34.880 applications of the Common European Framework, and they suit assessment purposes 161 00:15:34.880 --> 00:15:40.660 as they describe learners' skills in sufficient detail, and their detailed and 162 00:15:40.660 --> 00:15:43.820 analytical nature makes them applicable 163 00:15:43.820 --> 00:15:51.950 to automated scoring and feedback. 164 00:15:51.950 --> 00:15:58.190 And finally, last year, together with four software engineering students, we designed 165 00:15:58.190 --> 00:16:03.430 a Moodle plugin for Finnish and Swedish learners and their teachers. 166 00:16:03.430 --> 00:16:08.010 And here, 167 00:16:08.010 --> 00:16:11.420 on the left, you see the workflow of the tool, and on the 168 00:16:11.420 --> 00:16:17.410 right, a screenshot of the report page. 169 00:16:17.410 --> 00:16:20.970 I will walk you through this workflow first, so 170 00:16:20.970 --> 00:16:23.410 have a look at the left side here. 171 00:16:23.410 --> 00:16:28.710 First, when the system receives a speech sample, it uses automatic 172 00:16:28.710 --> 00:16:33.550 speech recognition to produce a transcript of the sample. 173 00:16:33.550 --> 00:16:37.930 Then the system produces automatic scores on selected dimensions of 174 00:16:37.930 --> 00:16:43.100 speech and finally shows the results to the learner. 175 00:16:43.100 --> 00:16:50.600 That is what you see on the right side here, the result page. Teachers have the possibility to comment 176 00:16:50.600 --> 00:16:57.480 on the scores produced by the machine, and finally, teachers or researchers, as we are, are able 177 00:16:57.480 --> 00:17:04.210 to export the learners' speech samples together with the scores.
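The workflow described in cues 171 to 177 (sample in, transcript via speech recognition, scores on selected dimensions, result page, teacher comments, export) can be sketched roughly as follows. This is only an illustrative sketch and not the DigiTala plugin's code: every name in it (`Report`, `assess`, `fake_recognise`, `fake_fluency`) is hypothetical, and the recogniser and scorer are trivial stand-ins for the real models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Report:
    """Hypothetical result page: transcript, machine scores, teacher input."""
    transcript: str
    machine_scores: Dict[str, float]
    teacher_scores: Dict[str, float] = field(default_factory=dict)
    teacher_comment: Optional[str] = None

    def final_scores(self) -> Dict[str, float]:
        # A teacher's edited score replaces the machine score for that dimension.
        return {**self.machine_scores, **self.teacher_scores}

    def export_row(self, sample_id: str) -> Dict[str, object]:
        # One flat record per sample, e.g. for a downloadable rating report.
        return {
            "sample_id": sample_id,
            "transcript": self.transcript,
            **self.final_scores(),
            "comment": self.teacher_comment or "",
        }


def assess(
    audio: bytes,
    recognise: Callable[[bytes], str],
    scorers: Dict[str, Callable[[bytes, str], float]],
) -> Report:
    transcript = recognise(audio)  # step 1: speech recognition
    scores = {name: score(audio, transcript) for name, score in scorers.items()}  # step 2
    return Report(transcript, scores)  # step 3: shown to the learner


# Stand-ins for the real recogniser and scoring model (assumptions, not real APIs).
def fake_recognise(audio: bytes) -> str:
    return "hei olen opiskelija suomesta"


def fake_fluency(audio: bytes, transcript: str) -> float:
    return float(len(transcript.split()))  # placeholder: word count as a "score"


report = assess(b"\x00\x01", fake_recognise, {"fluency": fake_fluency})
report.teacher_scores["fluency"] = 3.0  # the teacher edits the rating
report.teacher_comment = "Good pace."   # and adds written feedback
print(report.export_row("sample-1"))
```

Keeping the machine scores and the teacher's edits in separate fields means the original automatic rating is still available after a teacher overrides it, which matches the point made earlier in the talk that the teacher is not entirely dependent on the machine.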
178 00:17:04.210 --> 00:17:11.430 Currently, the task types include read-aloud and spontaneous speech up to 3 minutes. 179 00:17:11.430 --> 00:17:15.350 More details are available in the user manual. 180 00:17:15.350 --> 00:17:23.560 I kindly ask Mike, our research assistant, to post the link to the user manual for 181 00:17:23.560 --> 00:17:33.400 our remote participants, and there's also a short video available on GitHub. 182 00:17:33.400 --> 00:17:35.430 Good. 183 00:17:35.430 --> 00:17:41.990 So moving on: as in the literature, students in our studies also highlighted 184 00:17:41.990 --> 00:17:47.070 the disadvantages and benefits of automated assessment. 185 00:17:47.070 --> 00:17:52.850 The benefits you see here include saving resources, providing immediate 186 00:17:52.850 --> 00:17:59.530 feedback, enabling self-regulated learning, and using the same criteria when 187 00:17:59.530 --> 00:18:07.550 assessing different speakers or learners. 188 00:18:07.550 --> 00:18:11.100 In our studies, the stakeholders listed the following challenges, 189 00:18:11.100 --> 00:18:12.900 as you see here. 190 00:18:12.900 --> 00:18:19.130 They worry whether others will hear what they say, or whether, in a high-stakes 191 00:18:19.130 --> 00:18:25.010 testing context, someone else would hear what they are saying, 192 00:18:25.010 --> 00:18:28.390 but also whether others would talk over them in the same room. 193 00:18:28.390 --> 00:18:34.470 That's one concern of the students, and it is also 194 00:18:34.470 --> 00:18:37.000 important to note that the students 195 00:18:37.000 --> 00:18:39.280 who participated in our studies 196 00:18:39.280 --> 00:18:44.400 were not used to talking to a computer, so that's something new and 197 00:18:44.400 --> 00:18:50.040 something strange that needs to be practised more.
198 00:18:50.040 --> 00:18:56.080 Also, this type of speaking task was new to them: the tasks are 199 00:18:56.080 --> 00:19:02.260 more limited in nature, and they were performed as individual tasks 200 00:19:02.260 --> 00:19:08.120 instead of having a pair to talk with. 201 00:19:08.120 --> 00:19:11.880 The learners also worried about data security and protection, 202 00:19:11.880 --> 00:19:16.160 as you see here, when recording their voices. 203 00:19:16.160 --> 00:19:19.970 So that's one thing to keep in mind. 204 00:19:19.970 --> 00:19:26.030 And on the right here we have some open issues listed, 205 00:19:26.030 --> 00:19:29.630 some issues that divided learners' opinions. 206 00:19:29.630 --> 00:19:37.940 They include questions such as: will the test give a true picture of my... 207 00:19:37.940 --> 00:19:43.940 OK, good. Some minor technical issues here. 208 00:19:43.940 --> 00:19:47.420 I will continue. Good. 209 00:19:47.420 --> 00:19:55.790 So yeah: will the test give a... 210 00:19:55.790 --> 00:20:01.710 Good, this is what happens with machines, but yeah: 211 00:20:01.710 --> 00:20:04.470 a true picture of my language skills? 212 00:20:04.470 --> 00:20:10.610 And also: does the machine have any prejudice against different speakers? 213 00:20:10.610 --> 00:20:16.270 Also, some learners prefer talking 214 00:20:16.270 --> 00:20:20.240 to the computer, but some feel anxious. 215 00:20:20.240 --> 00:20:27.200 So this is an issue that divides opinion. Good. 216 00:20:27.200 --> 00:20:31.880 Then moving on to teacher impressions. 217 00:20:31.880 --> 00:20:37.840 This is reported in detail in one of our articles. 218 00:20:37.840 --> 00:20:39.980 But 219 00:20:39.980 --> 00:20:45.820 the idea behind the automated diagnostic feedback is of course that it could help learners by 220 00:20:45.820 --> 00:20:50.920 providing information on the strengths and weaknesses of their performance.
221 00:20:50.920 --> 00:20:55.020 We interviewed some teachers in order to investigate their 222 00:20:55.020 --> 00:20:59.880 first impressions of the automated feedback. 223 00:20:59.880 --> 00:21:05.080 First of all, the teachers see many possibilities for using this kind of tool 224 00:21:05.080 --> 00:21:13.140 already, and they were, for example, concerned about big issues such as the negative 225 00:21:13.140 --> 00:21:21.430 washback that the automated feedback might have. 226 00:21:21.430 --> 00:21:25.110 The teachers were also interested in what's 227 00:21:25.110 --> 00:21:28.860 happening behind the scenes, so to speak: 228 00:21:28.860 --> 00:21:32.650 the viability of the tool and issues like that. 229 00:21:32.650 --> 00:21:38.030 All in all, they would already like to use this tool 230 00:21:38.030 --> 00:21:42.660 to support language learning and teaching. Good. 231 00:21:42.660 --> 00:21:48.010 Then Raili, briefly, about the perceptions. 232 00:21:48.010 --> 00:21:54.710 Yes, I'm giving you a very brief overview of a comparative study 233 00:21:54.710 --> 00:22:01.900 between our teacher survey conducted in 2012 and something 234 00:22:01.900 --> 00:22:04.730 that we did at the very beginning of this year. 235 00:22:04.730 --> 00:22:12.170 The issue of speaking assessment in upper secondary education in Finland has been around 236 00:22:12.170 --> 00:22:19.000 for several decades already, and once in a while we try to include 237 00:22:19.000 --> 00:22:23.260 that subtest in the matriculation exam, and then again we find 238 00:22:23.260 --> 00:22:28.950 that, OK, it's not really practical. One of those moments was about 10 239 00:22:28.950 --> 00:22:34.620 years ago, and we had another project focusing on 240 00:22:34.620 --> 00:22:37.920 assessment of oral proficiency at upper secondary level.
241 00:22:37.920 --> 00:22:46.340 In that context, we asked language teachers at that time what they think 242 00:22:46.340 --> 00:22:52.200 about oral language assessment in general, and also about the introduction 243 00:22:52.200 --> 00:22:55.720 of that subtest into the matriculation exam. 244 00:22:55.720 --> 00:23:01.300 Let's see what the landscape looks like regarding 245 00:23:01.300 --> 00:23:05.370 the construct of oral language at 246 00:23:05.370 --> 00:23:10.050 both points in time. 247 00:23:10.050 --> 00:23:15.250 We found that teachers value fluency, pronunciation, and interaction skills 248 00:23:15.250 --> 00:23:20.540 as the holy triad of criteria for speaking assessment. 249 00:23:20.540 --> 00:23:23.840 So that's something that is valued very highly by them, 250 00:23:23.840 --> 00:23:29.580 while grammar is considered less important, and the open-ended answers 251 00:23:29.580 --> 00:23:32.830 reflected the pertinent assets and dilemmas 252 00:23:32.830 --> 00:23:37.170 of speaking assessment in summative and high-stakes contexts: that people are afraid 253 00:23:37.170 --> 00:23:39.000 and they have anxiety, 254 00:23:39.000 --> 00:23:42.960 and what about interaction, and whatnot. 255 00:23:42.960 --> 00:23:46.540 I think that most of us are rather familiar with that 256 00:23:46.540 --> 00:23:50.020 discussion from different contexts. 257 00:23:50.020 --> 00:23:54.820 Then about the matriculation examination language test, 258 00:23:54.820 --> 00:23:58.480 whether it should include a speaking component. 259 00:23:58.480 --> 00:24:04.260 Here we have the comparison in parallel rows and columns. 260 00:24:04.260 --> 00:24:09.300 The columns represent the different syllabi in Finnish language education. 261 00:24:09.300 --> 00:24:15.200 English is the most advanced and most frequently studied.
262 00:24:15.200 --> 00:24:20.380 Then we have the mandatory second domestic language, 263 00:24:20.380 --> 00:24:27.730 and then some other languages that are common in our educational system, and with 264 00:24:27.730 --> 00:24:37.170 a very brief glance you can see that the percentage of yes responses has 265 00:24:37.170 --> 00:24:45.700 roughly doubled during these 10 years, so that language teachers are very positive 266 00:24:45.700 --> 00:24:50.730 about the idea that there should be a speaking component, because it is 267 00:24:50.730 --> 00:24:56.390 in the core curriculum, and we have the communicative language education 268 00:24:56.390 --> 00:25:03.950 ideal, which our current school-leaving examination fails to 269 00:25:03.950 --> 00:25:10.490 represent. And what about the technique, then, automated assessment 270 00:25:10.490 --> 00:25:15.190 of L2 speaking: how do language teachers feel about that? 271 00:25:15.190 --> 00:25:18.790 We didn't have the same question 10 years ago because technology 272 00:25:18.790 --> 00:25:21.430 was not that advanced at that time. 273 00:25:21.430 --> 00:25:28.350 But today we can see that people are mostly positive about that idea as well. 274 00:25:28.350 --> 00:25:32.690 So why not automated speaking assessment, if we can 275 00:25:32.690 --> 00:25:34.870 come up with a tool that is functional 276 00:25:34.870 --> 00:25:40.020 and user-friendly and everything that we are striving for. 277 00:25:40.020 --> 00:25:45.420 And now we can proudly announce that the DigiTala project has 278 00:25:45.420 --> 00:25:50.060 produced such a tool, a prototype of such a tool.
279 00:25:50.060 --> 00:25:55.080 But as I mentioned earlier, a lot of development is still needed, 280 00:25:55.080 --> 00:26:02.260 but today you will become more familiar with what has happened 281 00:26:02.260 --> 00:26:07.740 inside the machine and what kind of perspectives we have. 282 00:26:07.740 --> 00:26:16.180 I think that I will conclude my part here, and we welcome questions from far and near. 283 00:26:16.180 --> 00:26:22.510 So please, if there is anything we can complement our presentation with, or something 284 00:26:22.510 --> 00:26:29.620 that is on your minds, please let me know. How about the 285 00:26:29.620 --> 00:26:34.260 distant participants? Has anyone 286 00:26:34.260 --> 00:26:38.020 given a stance? 287 00:26:38.020 --> 00:26:43.040 And these QR codes lead to a video of the Moodle plugin that we have 288 00:26:43.040 --> 00:26:48.660 produced and also to the materials available on Zenodo. 289 00:26:48.660 --> 00:26:51.980 So if you want to scan one. 290 00:26:51.980 --> 00:27:00.560 Please. Questions? At the moment there is just one general question: 291 00:27:00.560 --> 00:27:07.700 those online are asking if the slides are going to be presented to them. 292 00:27:07.700 --> 00:27:12.910 Did I hear correctly that the question was whether the slides will be available later on? 293 00:27:12.910 --> 00:27:14.710 Yes, yeah, we will. 294 00:27:14.710 --> 00:27:19.760 We can publish the slides as a PDF on our website, in 295 00:27:19.760 --> 00:27:22.930 the same news item that we have online. 296 00:27:22.930 --> 00:27:28.780 Is that OK with everyone? Yes. 297 00:27:28.780 --> 00:27:33.560 Thank you. Any questions from the people here 298 00:27:33.560 --> 00:27:37.470 in the hall? 299 00:27:37.470 --> 00:27:41.750 Just checking: was the last question about the 300 00:27:41.750 --> 00:27:43.870 teachers'
301 00:27:43.870 --> 00:27:48.790 reactions or feelings about the technology and automated assessment: was that in connection 302 00:27:48.790 --> 00:27:52.230 with the matriculation examination, or more generally? More generally. 303 00:27:52.230 --> 00:27:56.750 But of course there was a connection, in that people usually are 304 00:27:56.750 --> 00:28:04.330 already aware, because of our publications and writings, that the only practical 305 00:28:04.330 --> 00:28:11.460 way to implement that kind of large-scale assessment of speaking must 306 00:28:11.460 --> 00:28:15.500 somehow be supported by technology, so 307 00:28:15.500 --> 00:28:24.490 they're well connected, but we were not directly asking about it. 308 00:28:24.490 --> 00:28:40.160 Any other questions? Please, Dimitri. 309 00:28:40.160 --> 00:28:45.720 This is a great study that we did on teachers' 310 00:28:45.720 --> 00:28:47.520 beliefs. 311 00:28:47.520 --> 00:28:52.450 And since you know me, being pedantic about stuff: 312 00:28:52.450 --> 00:28:55.810 do you differentiate 313 00:28:55.810 --> 00:29:01.180 between beliefs, perceptions, experiences? 314 00:29:01.180 --> 00:29:05.140 Because I've seen all these keywords in your slides, 315 00:29:05.140 --> 00:29:06.940 right, 316 00:29:06.940 --> 00:29:11.510 versus, you know, the title being beliefs. To me, 317 00:29:11.510 --> 00:29:16.080 these are pretty different things. 318 00:29:16.080 --> 00:29:23.200 Yes, that was a good question, and well, we have used those terms interchangeably 319 00:29:23.200 --> 00:29:28.740 here, so as not to focus too much on the conceptual differences between them.
320 00:29:28.740 --> 00:29:34.480 We are aware of the fact that there are many different definitions and that there 321 00:29:34.480 --> 00:29:42.580 are distinctions between those constructs, and the role of knowing and 322 00:29:42.580 --> 00:29:46.590 doing and having personal experience of something is very different. 323 00:29:46.590 --> 00:29:51.790 But we have just chosen to use them in a very popular, general way. 324 00:29:51.790 --> 00:29:56.070 So we can perhaps come back to that question in some of our 325 00:29:56.070 --> 00:29:59.210 future articles and treat it more accurately. 326 00:29:59.210 --> 00:30:03.460 So thank you for the reminder. OK. 327 00:30:03.460 --> 00:30:10.440 We may have time for one very quick question or comment before moving on to the next 328 00:30:10.440 --> 00:30:19.020 presentation. 329 00:30:19.020 --> 00:30:23.020 This is the second question online: 330 00:30:23.020 --> 00:30:32.450 How would you assess DigiTala's performance as regards the assessment of prosodic features? 331 00:30:32.450 --> 00:30:34.670 That might be... 332 00:30:34.670 --> 00:30:41.080 We might leave that for Heini to answer, since Heini's and Mikko's presentation is about 333 00:30:41.080 --> 00:30:47.670 acoustic features of L2 Swedish and Finnish, so maybe you could, 334 00:30:47.670 --> 00:30:52.720 you could answer this one also. Yeah. 335 00:30:52.720 --> 00:30:54.970 So we'll keep that question in mind for... 336 00:30:54.970 --> 00:30:57.160 Yeah. For afterwards. 337 00:30:57.160 --> 00:30:58.960 OK. Thank you. 338 00:30:58.960 --> 00:31:09.240 Thank you, and thank you. 339 00:31:09.240 --> 00:31:12.840 OK, so next we will have a
340 00:31:12.840 --> 00:31:18.650 team from the local University of Jyväskylä here, Heini Kallio and Mikko Kuronen. 341 00:31:18.650 --> 00:31:25.050 Heini will be giving you this talk about acoustic features of 342 00:31:25.050 --> 00:31:29.050 Swedish and Finnish second language speech. 343 00:31:29.050 --> 00:31:39.980 Please. 344 00:31:39.980 --> 00:31:49.120 No, I don't like that. OK. 345 00:31:49.120 --> 00:31:52.080 OK, hello everyone here and online. 346 00:31:52.080 --> 00:31:55.630 My name is Heini Kallio, and I've been working as a postdoc in 347 00:31:55.630 --> 00:31:58.540 the DigiTala project for a couple of years now. 348 00:31:58.540 --> 00:32:02.720 My background is in phonetics. 349 00:32:02.720 --> 00:32:09.680 Today I will briefly present some of our research results 350 00:32:09.680 --> 00:32:14.640 from the phonetic side, which we've been working on with Mikko Kuronen and Maria Kautonen 351 00:32:14.640 --> 00:32:18.460 and a couple of research assistants as well. 352 00:32:18.460 --> 00:32:20.860 And we'll see whether we can answer this question about 353 00:32:20.860 --> 00:32:25.380 the performance of prosody assessment. 354 00:32:25.380 --> 00:32:31.120 The acoustic features we've been studying include mainly 355 00:32:31.120 --> 00:32:34.560 features related to speech fluency and prosody. 356 00:32:34.560 --> 00:32:37.030 But yeah, OK, I will go forward, 357 00:32:37.030 --> 00:32:41.030 with a limited amount of time for this. 358 00:32:41.030 --> 00:32:46.830 First I will talk a little bit about predicting L2 Swedish fluency and 359 00:32:46.830 --> 00:32:51.630 proficiency, and also a little bit about pronunciation. 360 00:32:51.630 --> 00:32:58.880 Here's a freshly published article, which came out in the journal Speech Communication this 361 00:32:58.880 --> 00:33:04.770 year, about prosody and fluency of Finland Swedish as a second language:
362 00:33:04.770 --> 00:33:10.520 Investigating global parameters for automatic speaking assessment. And 363 00:33:10.520 --> 00:33:13.380 for those who are not familiar with Finland Swedish, 364 00:33:13.380 --> 00:33:19.280 it is a variety of Swedish spoken in Finland and one of the two official languages 365 00:33:19.280 --> 00:33:26.080 here. Compared to standard Swedish spoken in Sweden, Finland Swedish has 366 00:33:26.080 --> 00:33:31.720 its own segmental and prosodic characteristics, and as far as we know, this 367 00:33:31.720 --> 00:33:36.030 is the first study to investigate 368 00:33:36.030 --> 00:33:38.150 prosody and fluency of Finland 369 00:33:38.150 --> 00:33:43.650 Swedish as a second language and relate those acoustic features to assessments 370 00:33:43.650 --> 00:33:50.610 of fluency, proficiency and pronunciation. 371 00:33:50.610 --> 00:33:57.130 Because it's more common in phonetic studies that 372 00:33:57.130 --> 00:34:02.290 if you connect acoustic features to human assessment, you usually study 373 00:34:02.290 --> 00:34:06.430 only one language skill dimension at a time, like 374 00:34:06.430 --> 00:34:09.280 proficiency or fluency or pronunciation. 375 00:34:09.280 --> 00:34:15.770 But here our motivation was to check whether these parameters are related to 376 00:34:15.770 --> 00:34:22.250 any of these or all of these dimensions, in order to see what could be integrated in the automatic 377 00:34:22.250 --> 00:34:26.930 assessment or how we could explain the automatic assessment system. 378 00:34:26.930 --> 00:34:37.720 And the data was 265 speech samples. 379 00:34:37.720 --> 00:34:43.760 We have 30 L1, so native Finland Swedish speakers' samples as well. 380 00:34:43.760 --> 00:34:46.840 Most of the samples were from non-native speakers, 381 00:34:46.840 --> 00:34:51.680 Finnish high school students speaking Swedish, 382 00:34:51.680 --> 00:34:55.860 because it's a compulsory subject in schools here.
383 00:34:55.860 --> 00:34:59.420 They were semi-spontaneous narrative speech, and the task 384 00:34:59.420 --> 00:35:03.960 assignment had a picture and/or a written prompt. 385 00:35:03.960 --> 00:35:10.630 We decided to discard very short samples because 386 00:35:10.630 --> 00:35:20.610 these global features are not very reliable to compute from very short samples. 387 00:35:20.610 --> 00:35:25.870 Very briefly about the parameters: I will not go through these 388 00:35:25.870 --> 00:35:28.530 in detail because we don't have much time. 389 00:35:28.530 --> 00:35:35.600 I want to focus on some of the interesting results we've got from this, but we had a 390 00:35:35.600 --> 00:35:42.420 few features related to F0 change and a few features related to speech rhythm, 391 00:35:42.420 --> 00:35:48.300 or they have usually been connected to speech rhythm, and 392 00:35:48.300 --> 00:35:52.270 then a bunch of fluency-related features there. 393 00:35:52.270 --> 00:35:57.070 And we used multiple linear regression models for predicting 394 00:35:57.070 --> 00:36:02.300 human ratings of proficiency, fluency, and pronunciation. 395 00:36:02.300 --> 00:36:05.600 And the results were like this. 396 00:36:05.600 --> 00:36:12.440 So for the proficiency model, our features explained 397 00:36:12.440 --> 00:36:17.360 33% of the variation in the ratings. 398 00:36:17.360 --> 00:36:23.420 What was not very surprising is that speech rate was one of the most important 399 00:36:23.420 --> 00:36:27.490 features there, but 400 00:36:27.490 --> 00:36:32.250 what was interesting is that the frequency of silent pauses in the speech 401 00:36:32.250 --> 00:36:37.310 sample was even more significant than the speech rate itself. 402 00:36:37.310 --> 00:36:41.710 So the more silent pauses there were in the speech sample, 403 00:36:41.710 --> 00:36:47.890 the lower the proficiency rating was.
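NOTE Editorial sketch, not part of the talk. The speaker describes multiple linear regression models whose acoustic features "explained 33% of the variation" in the ratings, which is the coefficient of determination (R squared). The following minimal Python sketch shows how such a figure can be computed, assuming ordinary least squares via the normal equations. The function names, the feature columns and all numbers are hypothetical illustrations and are not the project's data or code.
```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x
#
def fit_ols(rows, y):
    """Return [intercept, b1, b2, ...] minimising the squared prediction error."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)
#
def r_squared(rows, y, coefs):
    """Share of the rating variance explained by the fitted model."""
    preds = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot
#
# Invented toy data: (speech rate, silent pauses per second) for six samples.
toy_features = [(3.1, 0.40), (2.2, 0.90), (4.0, 0.20), (2.8, 0.60), (3.5, 0.30), (1.9, 1.10)]
toy_ratings = [4.0, 2.5, 5.0, 3.0, 4.5, 2.0]  # invented human ratings
toy_coefs = fit_ols(toy_features, toy_ratings)
print("R squared on toy data:", round(r_squared(toy_features, toy_ratings, toy_coefs), 3))
```
An R squared of 0.33 would correspond to the 33% reported for the proficiency model. The sketch does not cover the significance tests of individual predictors that the speaker also refers to.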
404 00:36:47.890 --> 00:36:53.190 What was also interesting is that the F0 slope contributed to the prediction, 405 00:36:53.190 --> 00:37:00.650 although it was not statistically significant. 406 00:37:00.650 --> 00:37:02.450 Yes. 407 00:37:02.450 --> 00:37:07.650 And also the amount of wrong-language segments indicated lower proficiency. 408 00:37:07.650 --> 00:37:12.470 It's the WL ratio there. WL stands for wrong language. 409 00:37:12.470 --> 00:37:19.370 So these speech samples had quite a lot of these Finnish-speaking students speaking 410 00:37:19.370 --> 00:37:23.830 some other language than Swedish, because they were maybe more proficient 411 00:37:23.830 --> 00:37:27.110 in English or some other language than Swedish. 412 00:37:27.110 --> 00:37:32.600 So we computed that as well. 413 00:37:32.600 --> 00:37:40.200 OK, for the fluency model, what I find interesting here is that the ratio 414 00:37:40.200 --> 00:37:45.200 of filled pauses, which basically means hesitations in speech, 415 00:37:45.200 --> 00:37:50.240 so the relative amount of hesitations, was a significant predictor 416 00:37:50.240 --> 00:37:56.610 there together with speech rate, but also the. 417 00:37:56.610 --> 00:38:09.010 And delta S means the. 418 00:38:09.010 --> 00:38:13.960 Let me see. Just a moment. 419 00:38:13.960 --> 00:38:15.760 Yeah. 420 00:38:15.760 --> 00:38:19.370 So the standard deviation of syllable duration, 421 00:38:19.370 --> 00:38:24.830 which tells how much they varied their syllable durations 422 00:38:24.830 --> 00:38:29.850 in the speech samples, also became a significant predictor 423 00:38:29.850 --> 00:38:38.420 of fluency, and the model predicted 44% of the rating variation. 424 00:38:38.420 --> 00:38:41.780 And not so surprisingly, the pronunciation model was not very good. 425 00:38:41.780 --> 00:38:48.500 It only predicted 10% of the variation of the ratings, but there.
426 00:38:48.500 --> 00:38:54.900 The frequency of wrong-language segments was a very important feature. 427 00:38:54.900 --> 00:38:56.960 But still, yeah. 428 00:38:56.960 --> 00:39:00.720 So basically with these features we cannot predict the pronunciation, 429 00:39:00.720 --> 00:39:05.050 but they explain the fluency quite nicely. 430 00:39:05.050 --> 00:39:10.680 And so a couple of plots from the 431 00:39:10.680 --> 00:39:16.240 interesting features, which are the mean F0 slope and the mean 432 00:39:16.240 --> 00:39:18.600 standard deviation of syllable duration. 433 00:39:18.600 --> 00:39:26.740 So we also grouped the data by their proficiency level and fluency 434 00:39:26.740 --> 00:39:34.800 level, and from the left plot you 435 00:39:34.800 --> 00:39:41.340 can see that the speakers in fluency category 1 and fluency category 3 436 00:39:41.340 --> 00:39:50.650 could be distinguished by their mean F0 slope, so it's something that 437 00:39:50.650 --> 00:39:54.630 tells something about how they used their fundamental frequency when 438 00:39:54.630 --> 00:40:00.750 speaking Swedish, and the higher fluency speakers were closer to 439 00:40:00.750 --> 00:40:06.140 the native speakers, which is the rightmost 440 00:40:06.140 --> 00:40:08.560 boxplot in light green. 441 00:40:08.560 --> 00:40:11.040 Those are the native speakers. 442 00:40:11.040 --> 00:40:14.180 Then on the right plot there's the mean standard deviation of 443 00:40:14.180 --> 00:40:19.030 syllable duration, and there you can see that the 444 00:40:19.030 --> 00:40:25.810 least fluent speakers were distinguished from the
445 00:40:25.810 --> 00:40:30.020 more fluent speakers, interestingly, and also from the native 446 00:40:30.020 --> 00:40:34.430 speakers. But then again, fluency category 3 speakers were 447 00:40:34.430 --> 00:40:36.640 not so well distinguished from the other categories. 448 00:40:36.640 --> 00:40:43.510 So something interesting is happening in these syllable durations of L2 Swedish 449 00:40:43.510 --> 00:40:51.170 speakers, and we should look deeper into what is going on there. 450 00:40:51.170 --> 00:40:57.710 And these are probably more language-specific features than, for example, 451 00:40:57.710 --> 00:41:03.170 the fluency features, for which you can see some of the results here. 452 00:41:03.170 --> 00:41:08.180 On the left there's articulation rate by fluency category. 453 00:41:08.180 --> 00:41:14.000 And then there's speech rate, and also the rate of silent pauses in the sample. 454 00:41:14.000 --> 00:41:17.460 And you can see that all the speaker groups are very well 455 00:41:17.460 --> 00:41:22.030 distinguished from each other. 456 00:41:22.030 --> 00:41:24.210 And it's not very surprising. 457 00:41:24.210 --> 00:41:30.730 These are very universal parameters that are good characteristics 458 00:41:30.730 --> 00:41:35.430 of speech fluency. OK. 459 00:41:35.430 --> 00:41:42.100 Do you have any questions about the Swedish data so far? 460 00:41:42.100 --> 00:41:49.290 Yeah, I have a few minutes to go through our Finnish studies, which is unfortunate 461 00:41:49.290 --> 00:41:52.900 because there are more of those, but I will try to speak quickly. 462 00:41:52.900 --> 00:42:03.560 So with L2 Finnish data, we've mainly focused on measuring fluency and we've tried out 463 00:42:03.560 --> 00:42:08.890 several different models there. And 464 00:42:08.890 --> 00:42:11.770 yeah, mostly related to those fluency measures.
465 00:42:11.770 --> 00:42:16.800 And there has been a gap in research on Finnish as a second language 466 00:42:16.800 --> 00:42:19.730 when it comes to acoustically analysed fluency. 467 00:42:19.730 --> 00:42:24.690 And actually there is still a gap in research on native Finnish 468 00:42:24.690 --> 00:42:30.670 speech as well, but let's get to that a little bit later. 469 00:42:30.670 --> 00:42:37.120 And for this data we had 200 speech samples from two different 470 00:42:37.120 --> 00:42:38.920 sets. 471 00:42:38.920 --> 00:42:41.760 Basically we had adult learners of Finnish and then we had 472 00:42:41.760 --> 00:42:47.410 a smaller set of high school learners of Finnish. 473 00:42:47.410 --> 00:42:53.470 And we used Digitala human ratings for fluency and proficiency here, since we already 474 00:42:53.470 --> 00:42:58.170 saw that these fluency features cannot be used for predicting pronunciation, 475 00:42:58.170 --> 00:43:05.330 but some of them seem to be related to proficiency. 476 00:43:05.330 --> 00:43:10.590 And those who are familiar with Praat can see that 477 00:43:10.590 --> 00:43:14.270 we've done a lot of annotation work here. 478 00:43:14.270 --> 00:43:19.480 I will not go through these in detail here, but um. 479 00:43:19.480 --> 00:43:22.880 Yeah, we 480 00:43:22.880 --> 00:43:28.060 annotated a lot of different fluency features there and then computed 481 00:43:28.060 --> 00:43:34.830 a bunch of parameters, and here are some examples of them. 482 00:43:34.830 --> 00:43:39.210 These features here are related to language-specific syntax 483 00:43:39.210 --> 00:43:41.750 and they are more specific fluency parameters. 484 00:43:41.750 --> 00:43:47.390 So we wanted to see whether the location of pauses in speech samples 485 00:43:47.390 --> 00:43:52.980 has some effect on the prediction of fluency.
486 00:43:52.980 --> 00:43:57.840 And then we also used some more global fluency parameters 487 00:43:57.840 --> 00:43:59.960 to compete with the speech rate. 488 00:43:59.960 --> 00:44:05.190 And here we basically combined all pauses and hesitations 489 00:44:05.190 --> 00:44:09.010 and corrections and repetitions into the 490 00:44:09.010 --> 00:44:12.000 same parameter. 491 00:44:12.000 --> 00:44:15.680 And altogether we had 44 fluency-related parameters, and we 492 00:44:15.680 --> 00:44:20.880 also used multiple linear regression models for that. 493 00:44:20.880 --> 00:44:25.340 And yes, so about the contribution of pause location parameters. 494 00:44:25.340 --> 00:44:30.680 Interestingly, the pause location parameters improved the predictive power 495 00:44:30.680 --> 00:44:33.760 of the proficiency model more than that of the fluency model. 496 00:44:33.760 --> 00:44:39.880 So about half of the parameters in the proficiency model were related 497 00:44:39.880 --> 00:44:45.140 to pause locations, which was interesting, and 498 00:44:45.140 --> 00:44:54.260 the fluency model improved only a little bit. 499 00:44:54.260 --> 00:44:58.290 And I wanted to show you these plots very quickly because they say 500 00:44:58.290 --> 00:45:01.640 something about the contribution of the pause location parameters: 501 00:45:01.640 --> 00:45:04.540 their relevance and their 502 00:45:04.540 --> 00:45:08.120 significance in the models basically depend on the speech data. 503 00:45:08.120 --> 00:45:11.810 So the blue ones are the high school students and 504 00:45:11.810 --> 00:45:13.880 the red ones are the adult learners. 505 00:45:13.880 --> 00:45:18.220 And you can see that in the blue ones, which is the smaller data set, if 506 00:45:18.220 --> 00:45:25.150 you have one outlier, it affects a lot in the lower plot. 507 00:45:25.150 --> 00:45:26.960 But then
508 00:45:26.960 --> 00:45:31.960 the contribution of the global parameters is clearer. 509 00:45:31.960 --> 00:45:35.760 So the disfluency ratio is the lowest plot there, and you can 510 00:45:35.760 --> 00:45:40.730 see a clear tendency that the 511 00:45:40.730 --> 00:45:43.220 more disfluencies there are, 512 00:45:43.220 --> 00:45:47.360 the lower the proficiency, et cetera. 513 00:45:47.360 --> 00:45:54.780 But about the creakiness: we also wanted to try to measure creak in these L2 514 00:45:54.780 --> 00:45:59.420 Finnish samples, because it is said that Finnish people creak a lot, maybe. 515 00:45:59.420 --> 00:46:03.060 So I thought maybe the more proficient Finnish speakers also creak 516 00:46:03.060 --> 00:46:10.780 a lot, but as you can see, there were not many in the 517 00:46:10.780 --> 00:46:14.140 upper right corner, so there were not many speakers who 518 00:46:14.140 --> 00:46:22.700 used creaky voice among the Finnish learners. 519 00:46:22.700 --> 00:46:28.220 So here are my take-home messages from our studies. 520 00:46:28.220 --> 00:46:35.090 So the potential here is that if we could combine automatic pause or hesitation detection 521 00:46:35.090 --> 00:46:41.840 with some sort of syntactic parser, maybe we could define the location of pauses and hesitations 522 00:46:41.840 --> 00:46:48.200 and add that to the fluency model to make it more accurate. 523 00:46:48.200 --> 00:46:54.360 And also if we are able to recognise and separate target language words from all 524 00:46:54.360 --> 00:46:59.080 other stuff in the speech, like hesitations and repetitions and so on, 525 00:46:59.080 --> 00:47:04.520 maybe we can make a more efficient fluency model as well. 526 00:47:04.520 --> 00:47:09.280 And then, well, the issues are that if we have very low proficiency speakers, 527 00:47:09.280 --> 00:47:12.580 they may not produce linguistic phrases at all.
528 00:47:12.580 --> 00:47:18.290 So then the location of pauses is very difficult to define. 529 00:47:18.290 --> 00:47:23.510 And also the data-based bias is a very important issue. 530 00:47:23.510 --> 00:47:30.100 So whatever is present in the data we have may become relevant. 531 00:47:30.100 --> 00:47:33.060 Um, yeah. So. 532 00:47:33.060 --> 00:47:37.490 Maybe we can think about context-specific assessment models, 533 00:47:37.490 --> 00:47:43.140 um, which are used in some of the international systems. 534 00:47:43.140 --> 00:47:47.340 And also, yeah, I want to get back to this: in the Finnish L2 studies 535 00:47:47.340 --> 00:47:51.740 we did not have a reference group of L1 Finnish at all. 536 00:47:51.740 --> 00:47:55.760 So these are some phenomena that should also be studied in L1 Finnish. 537 00:47:55.760 --> 00:48:01.150 So we could say with. With. 538 00:48:01.150 --> 00:48:07.670 Yeah, so we could say that these phenomena are present in Finnish in general. 539 00:48:07.670 --> 00:48:10.550 OK, I'm sorry, I used a little bit more time than I should have, 540 00:48:10.550 --> 00:48:16.820 maybe, but uh, here are some references. 541 00:48:16.820 --> 00:48:30.170 And I'm ready to take some questions if we still have time for that. 542 00:48:30.170 --> 00:48:36.300 Thank you, Heini. We have time for one or two questions, 543 00:48:36.300 --> 00:48:39.800 perhaps. Yeah. 544 00:48:39.800 --> 00:48:45.540 One thing, because there was this question about the 545 00:48:45.540 --> 00:48:52.120 contribution of prosody. How was it actually phrased? 546 00:48:52.120 --> 00:48:59.040 How would you assess Digitala's performance as regards prosodic features assessment? 547 00:48:59.040 --> 00:49:04.060 Is it so that actually, because here you had sort of 548 00:49:04.060 --> 00:49:07.820 annotated the fluency features by hand, 549 00:49:07.820 --> 00:49:10.860 it's not yet implemented in Digitala?
550 00:49:10.860 --> 00:49:13.240 But can you say we have or? 551 00:49:13.240 --> 00:49:19.540 Yeah, well, I'm sure Yaroslav will have more to say about this, but we do have some 552 00:49:19.540 --> 00:49:27.900 prosodic features also integrated, like this pairwise variability index, which is usually 553 00:49:27.900 --> 00:49:30.930 related to speech rhythm, 554 00:49:30.930 --> 00:49:35.270 and also some F0 features, but I don't think we 555 00:49:35.270 --> 00:49:43.830 have F0 slope there, and I don't think we have 556 00:49:43.830 --> 00:49:47.630 this standard deviation of syllable duration either. And 557 00:49:47.630 --> 00:49:50.230 actually with the Finland Swedish L2 data 558 00:49:50.230 --> 00:49:55.870 we were expecting that the pairwise variability index would have become a significant 559 00:49:55.870 --> 00:50:00.470 predictor in the fluency or pronunciation model, but it did not. 560 00:50:00.470 --> 00:50:04.590 Instead, the standard deviation of syllable duration 561 00:50:04.590 --> 00:50:09.330 outran the nPVI, which was a surprise for us. 562 00:50:09.330 --> 00:50:14.150 So maybe, yeah, it should be studied further to figure 563 00:50:14.150 --> 00:50:16.020 out what is going on there in the syllable 564 00:50:16.020 --> 00:50:19.800 durations of the Finland Swedish speakers. Finland Swedish 565 00:50:19.800 --> 00:50:23.500 is different from Sweden Swedish and it's prosodically 566 00:50:23.500 --> 00:50:28.410 very close to Finnish, but still the 567 00:50:28.410 --> 00:50:36.530 word and sentence stress production is different due to Swedish grammar, so 568 00:50:36.530 --> 00:50:40.490 we should look more deeply into what's going on there.
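NOTE Editorial sketch, not part of the talk. The speaker compares two rhythm-related measures: the pairwise variability index and the standard deviation of syllable duration. The Python sketch below computes both from a list of syllable durations, assuming the normalised form of the index (nPVI) and the population standard deviation. The talk does not state which variants the project used, so both choices, the function names and the example durations are hypothetical.
```python
from statistics import pstdev
#
def npvi(durations):
    """Normalised pairwise variability index over successive durations."""
    if len(durations) < 2:
        raise ValueError("nPVI needs at least two durations")
    pairs = zip(durations, durations[1:])
    # Each term is the absolute difference of a neighbouring pair
    # divided by the mean of that pair.
    terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(terms) / len(terms)
#
def syllable_duration_sd(durations):
    """Standard deviation of syllable durations (population form assumed)."""
    return pstdev(durations)
#
# Invented syllable durations in seconds for one speech sample.
toy_durations = [0.18, 0.25, 0.14, 0.31, 0.20]
print("nPVI:", round(npvi(toy_durations), 1))
print("SD of syllable duration:", round(syllable_duration_sd(toy_durations), 3))
```
The nPVI only looks at neighbouring syllables, while the standard deviation summarises variation over the whole sample, which is one reason the two measures can behave differently as predictors.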
569 00:50:40.490 --> 00:50:45.550 So Digitala has taken some steps in that direction, but 570 00:50:45.550 --> 00:50:50.230 yeah, we are still using these kinds of standard parameters 571 00:50:50.230 --> 00:50:52.590 that have been used in other systems as well. 572 00:50:52.590 --> 00:50:56.440 But these are just new studies that 573 00:50:56.440 --> 00:51:03.320 could give us more ideas for the language-specific models, maybe. 574 00:51:03.320 --> 00:51:08.690 Any other questions? Quick ones please. 575 00:51:08.690 --> 00:51:13.170 I was wondering, if there are language teachers online or present here, 576 00:51:13.170 --> 00:51:16.730 what would you want to say to them? 577 00:51:16.730 --> 00:51:23.630 Any take-home messages for teachers? 578 00:51:23.630 --> 00:51:28.910 What would you like to say to them? Umm. 579 00:51:28.910 --> 00:51:35.690 Yeah, well, I would say that if you don't yet teach prosody in your target 580 00:51:35.690 --> 00:51:39.450 language, start doing that now because it's very important. 581 00:51:39.450 --> 00:51:44.410 It's not only the single words that we produce and pronounce, 582 00:51:44.410 --> 00:51:49.950 but it's how we combine the words together and how we produce the full 583 00:51:49.950 --> 00:51:54.640 sentences and speech utterances. 584 00:51:54.640 --> 00:52:00.130 And yeah, of course we cannot expect perfect pronunciation in terms 585 00:52:00.130 --> 00:52:03.650 of words or segments, or prosodically either. 586 00:52:03.650 --> 00:52:08.420 But it's, yeah, it's all about the intelligibility, 587 00:52:08.420 --> 00:52:12.760 which is integrated in our rating criteria, so. 588 00:52:12.760 --> 00:52:19.150 Thank you, Heini. Thank you.
589 00:52:19.150 --> 00:52:27.350 And next we have the final, third presentation in this section by a team from Aalto 590 00:52:27.350 --> 00:52:35.430 University: Yaroslav Getman, Ekaterina Voskoboinik and Mikko Kurimo, and 591 00:52:35.430 --> 00:52:37.680 it's going to be 592 00:52:37.680 --> 00:52:41.210 a talk with the title Deep learning methods in L2 593 00:52:41.210 --> 00:52:45.310 low-resource speech recognition and rating. Please. 594 00:52:45.310 --> 00:52:47.110 Good afternoon everyone. 595 00:52:47.110 --> 00:52:56.800 My name is Yaroslav Getman and I am working in Mikko Kurimo's group at Aalto University and 596 00:52:56.800 --> 00:52:58.600 Yeah. 597 00:52:58.600 --> 00:53:01.480 And I'll introduce you to the technical aspects of our 598 00:53:01.480 --> 00:53:05.580 automatic speaking assessment systems. 599 00:53:05.580 --> 00:53:12.190 So as you can see from the diagram, our automatic assessment systems consist 600 00:53:12.190 --> 00:53:19.620 of four individual evaluators covering different aspects of L2 speakers' 601 00:53:19.620 --> 00:53:26.170 proficiency, such as task completion, lexical 602 00:53:26.170 --> 00:53:30.050 range and accuracy, pronunciation and fluency. 603 00:53:30.050 --> 00:53:35.110 And we have a separate evaluator which predicts the overall 604 00:53:35.110 --> 00:53:39.070 proficiency level, or the CEFR holistic level. 605 00:53:39.070 --> 00:53:46.730 But apart from that, our system needs to know what the L2 speaker said. 606 00:53:46.730 --> 00:53:52.670 So for that purpose we developed automatic speech recognition systems 607 00:53:52.670 --> 00:53:58.990 for L2 speech of Finland Swedish and Finnish. 608 00:53:58.990 --> 00:54:02.430 As the model architecture, we used the recently 609 00:54:02.430 --> 00:54:06.550 developed wav2vec 2.0 models.
610 00:54:06.550 --> 00:54:13.030 Those models are pre-trained on large amounts of unlabelled or untranscribed speech, 611 00:54:13.030 --> 00:54:21.400 like tens of thousands to hundreds of thousands of hours, and 612 00:54:21.400 --> 00:54:27.100 at this stage these models are not able to perform automatic speech recognition, but instead 613 00:54:27.100 --> 00:54:35.580 they learn some general acoustic patterns or characteristics of speech, 614 00:54:35.580 --> 00:54:41.910 which makes it possible to fine-tune these models for automatic speech recognition 615 00:54:41.910 --> 00:54:47.580 using only a limited amount of transcribed speech data. 616 00:54:47.580 --> 00:54:53.180 And moreover, these systems are easy to adapt to a target domain. 617 00:54:53.180 --> 00:55:01.660 So in our experiments, we first fine-tuned our models on L1 speech 618 00:55:01.660 --> 00:55:06.560 and then continued fine-tuning them on the target L2 domain. 619 00:55:06.560 --> 00:55:11.410 And we didn't use any external language model, to prevent 620 00:55:11.410 --> 00:55:15.850 unintentionally correcting students' mistakes. 621 00:55:15.850 --> 00:55:20.190 So we were interested in what the students were actually saying 622 00:55:20.190 --> 00:55:25.490 rather than what the students tried to say. 623 00:55:25.490 --> 00:55:31.900 And in our experiments, we used 14 hours of Finnish data and almost six 624 00:55:31.900 --> 00:55:39.310 hours of L2 Finland Swedish data to train the ASR systems. 625 00:55:39.310 --> 00:55:45.110 And we used the k-fold cross-validation technique 626 00:55:45.110 --> 00:55:50.320 in order to have test results on the whole data set. 627 00:55:50.320 --> 00:55:56.900 For example, for ASR we trained four subsystems 628 00:55:56.900 --> 00:56:01.840 and aggregated the test set results.
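NOTE Editorial sketch, not part of the talk. The speaker says k-fold cross-validation was used "to have test results on the whole data set", with four subsystems trained for ASR and their test results aggregated. The Python sketch below shows only the splitting logic under that description: each item lands in exactly one test fold, so pooling the per-fold test outputs covers every sample once. The interleaved fold assignment and the function name are hypothetical, and the talk does not say how the project assigned samples or speakers to folds.
```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) pairs for k folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    # Interleaved assignment: item i goes to fold i % k.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx
#
# With four folds, four models would be trained, each tested on its held-out part.
tested = []
for train_idx, test_idx in k_fold_indices(10, 4):
    # A real pipeline would fine-tune a model on train_idx here
    # and store its predictions for test_idx.
    tested.extend(test_idx)
print("every item tested once:", sorted(tested) == list(range(10)))
```
In speech data it is common to keep all samples of one speaker in the same fold so that no speaker appears in both training and test data. Whether the project did so is not stated in the talk.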
629 00:56:01.840 --> 00:56:10.240 In total, our ASR systems achieved 24 percent, 22% word error rate and 7% character 630 00:56:10.240 --> 00:56:19.400 error rate for Finnish and 18% word error rate and 9% character error rate for Swedish. 631 00:56:19.400 --> 00:56:25.300 The diagrams on the right illustrate the distribution of samples depending 632 00:56:25.300 --> 00:56:29.860 on the proficiency level and the word and character error rate of our 633 00:56:29.860 --> 00:56:34.140 ASR systems for the corresponding levels. 634 00:56:34.140 --> 00:56:37.800 And as you can see from the curves, 635 00:56:37.800 --> 00:56:44.290 the more proficient the speaker is, the lower the word and character error rate is, 636 00:56:44.290 --> 00:56:47.170 for both Swedish and Finnish. 637 00:56:47.170 --> 00:56:51.410 The only exception is a slight increase in word and character 638 00:56:51.410 --> 00:56:55.470 error rate for Finnish at level C2. 639 00:56:55.470 --> 00:57:03.890 The most likely explanation for such behaviour is that C2 students 640 00:57:03.890 --> 00:57:10.050 use too complex grammatical and lexical structures and very rare words, making 641 00:57:10.050 --> 00:57:18.560 it difficult for the ASR system to correctly recognise those utterances. 642 00:57:18.560 --> 00:57:20.360 Next. 643 00:57:20.360 --> 00:57:25.120 Next, let's move to the holistic score predictor. 644 00:57:25.120 --> 00:57:30.400 By the way, as you can see from the diagrams, our data sets are heavily imbalanced 645 00:57:30.400 --> 00:57:36.090 towards the intermediate-level speakers, especially for the Swedish data. 646 00:57:36.090 --> 00:57:42.800 And as a result, we filtered out the underrepresented classes. 647 00:57:42.800 --> 00:57:44.830 And 648 00:57:44.830 --> 00:57:48.030 for Swedish, we trained the classifier to distinguish 649 00:57:48.030 --> 00:57:52.530 between four holistic levels, and for Finnish 650 00:57:52.530 --> 00:57:57.300 we had a six-class classification problem.
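NOTE Editorial sketch, not part of the talk. The speaker reports word error rate (WER) and character error rate (CER) for the ASR systems. Both are conventionally the edit distance between the reference transcript and the recogniser output, divided by the reference length in words or characters. The Python sketch below follows that standard definition. The function names and the example sentence are hypothetical, and counting spaces as characters in the CER is an assumption, since conventions differ between toolkits.
```python
def edit_distance(ref, hyp):
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]
#
def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
#
def character_error_rate(reference, hypothesis):
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
#
# Invented example: the recogniser drops one letter from the second word.
toy_reference = "mina olen opiskelija"
toy_hypothesis = "mina ole opiskelija"
print("WER:", word_error_rate(toy_reference, toy_hypothesis))
print("CER:", character_error_rate(toy_reference, toy_hypothesis))
```
One wrong letter makes a whole word count as an error, so the WER is usually higher than the CER on the same output, as in the figures reported in the talk.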
651 00:57:57.300 --> 00:58:02.140 As training inputs, we used some manually extracted lexical, 652 00:58:02.140 --> 00:58:05.080 fluency and pronunciation features. 653 00:58:05.080 --> 00:58:09.370 Some of them were introduced by Heini. 654 00:58:09.370 --> 00:58:16.970 And in addition, we used deep acoustic representations from our ASR model. 655 00:58:16.970 --> 00:58:21.310 In the diagram they are named hidden representations. 656 00:58:21.310 --> 00:58:29.010 As mentioned previously, those hidden representations learned by the wav2vec 2.0 network contain 657 00:58:29.010 --> 00:58:36.610 some general acoustic properties or acoustic patterns of speech. 658 00:58:36.610 --> 00:58:41.850 And based on our experiments, those deep representations improved 659 00:58:41.850 --> 00:58:46.460 the classification performance of the rating systems. 660 00:58:46.460 --> 00:58:49.850 So we concatenated them with the handcrafted ones. 661 00:58:49.850 --> 00:58:54.050 And as the classifier for each dimension and also for the holistic 662 00:58:54.050 --> 00:58:59.780 score, we used six-layer deep neural networks. 663 00:58:59.780 --> 00:59:08.400 And for Finnish, the CEFR rating model achieved 46% 664 00:59:08.400 --> 00:59:12.800 accuracy and 39 percent F1 score. 665 00:59:12.800 --> 00:59:18.220 And here you can see the confusion matrix of the results, where the horizontal 666 00:59:18.220 --> 00:59:24.520 axis represents the predicted class or predicted level and the vertical 667 00:59:24.520 --> 00:59:29.750 axis represents the reference or the true label. 668 00:59:29.750 --> 00:59:37.430 For example, take a look at the, let's say, B2 level. 669 00:59:37.430 --> 00:59:45.250 Our system predicts B2 samples, so it predicts 14% of 670 00:59:45.250 --> 00:59:51.630 them as B1, 45% as B2 and 33% as C1. 671 00:59:51.630 --> 00:59:55.790 So those numbers are just ratios. 672 00:59:55.790 --> 01:00:00.030 And as you can see from the confusion matrix,
673 01:00:00.030 --> 01:00:04.010 our model is mostly confusing neighbouring classes. 674 01:00:04.010 --> 01:00:09.730 For example, for level B1 samples, the predictions are almost 675 01:00:09.730 --> 01:00:14.620 equally distributed between levels A2, B1 and B2. 676 01:00:14.620 --> 01:00:22.100 And as you can see from the zeros in the corners, the system learned to perfectly 677 01:00:22.100 --> 01:00:28.830 or almost perfectly distinguish between the extreme levels. 678 01:00:28.830 --> 01:00:33.320 And similarly for Swedish, 679 01:00:33.320 --> 01:00:36.810 the model confuses neighbouring classes. 680 01:00:36.810 --> 01:00:43.760 But apart from that, if you look at the vertical axis you can see that 681 01:00:43.760 --> 01:00:52.250 this model mostly predicts the proficiency level as A2 or B1. 682 01:00:52.250 --> 01:00:57.130 And that's expected, though defective, behaviour, because going 683 01:00:57.130 --> 01:01:01.800 back to the distribution you can see that 684 01:01:01.800 --> 01:01:12.080 levels A2 and B1 are overrepresented compared to other levels. 685 01:01:12.080 --> 01:01:17.050 And to partially mitigate the 686 01:01:17.050 --> 01:01:22.750 imbalance issue for the Swedish data, we are currently collecting and annotating 687 01:01:22.750 --> 01:01:29.340 more Swedish samples of beginner students. 688 01:01:29.340 --> 01:01:38.010 Finally, here are the full results for all four dimensions and the holistic one. 689 01:01:38.010 --> 01:01:44.540 But let's focus on the human-to-human versus machine-to-human performance. 690 01:01:44.540 --> 01:01:50.780 So we measured how well humans agree with each other when rating the samples, and 691 01:01:50.780 --> 01:01:56.980 how well the machine agrees with the average score by multiple human raters. 692 01:01:56.980 --> 01:02:05.240 And we measured the Pearson correlation and Cohen's kappa scores. 693 01:02:05.240 --> 01:02:07.520 And
694 01:02:07.520 --> 01:02:13.900 our systems mostly outperform humans in terms of agreement for 695 01:02:13.900 --> 01:02:18.230 all dimensions except for the lexico-grammatical one. 696 01:02:18.230 --> 01:02:20.180 And 697 01:02:20.180 --> 01:02:29.340 we got extremely good scores for the general holistic proficiency predictor. 698 01:02:29.340 --> 01:02:34.180 And similarly for Swedish, 699 01:02:34.180 --> 01:02:39.040 all systems outperformed humans in terms of agreement, 700 01:02:39.040 --> 01:02:45.690 except for the lexico-grammatical dimension. 701 01:02:45.690 --> 01:02:54.090 To conclude, in this project we developed L2 ASR systems for Finland Swedish and 702 01:02:54.090 --> 01:02:57.810 Finnish. And 703 01:02:57.810 --> 01:03:00.820 we also developed four 704 01:03:00.820 --> 01:03:06.960 individual evaluator systems focusing on different aspects of L2 speakers' 705 01:03:06.960 --> 01:03:12.140 proficiency, as well as an overall rating system. 706 01:03:12.140 --> 01:03:16.280 And as can be noticed from the results, some systems might 707 01:03:16.280 --> 01:03:20.120 have quite low agreement with human raters. 708 01:03:20.120 --> 01:03:27.120 But for these dimensions, the degree of human-to-human agreement is often even lower. 709 01:03:27.120 --> 01:03:32.300 So as one of our future research directions, we are planning 710 01:03:32.300 --> 01:03:35.860 to apply some model compression techniques. 711 01:03:35.860 --> 01:03:41.620 Those of you who have tested our plugin have noticed that it's quite 712 01:03:41.620 --> 01:03:47.530 slow, since currently our neural models are quite huge. 713 01:03:47.530 --> 01:03:52.390 So we would like to apply some model compression or distillation 714 01:03:52.390 --> 01:03:56.200 techniques to make them smaller and faster.
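NOTE Editorial sketch, not part of the talk. The speaker names two agreement measures for comparing raters with each other and the machine with the human average: Pearson correlation and Cohen's kappa. The Python sketch below implements the textbook definitions of both, assuming unweighted kappa on categorical levels. The talk does not say whether a weighted variant was used, and the function names and example labels are hypothetical.
```python
from collections import Counter
from math import sqrt
#
def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both raters must label the same non-empty set of items")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1 - expected)
#
def pearson(x, y):
    """Pearson correlation coefficient between two equally long score lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)
#
# Invented CEFR labels from two raters for six samples.
rater_1 = ["A2", "B1", "B1", "B2", "B1", "A2"]
rater_2 = ["A2", "B1", "B2", "B2", "A2", "A2"]
print("kappa:", round(cohen_kappa(rater_1, rater_2), 3))
# Invented numeric scores, for example a machine score against a human average.
print("Pearson r:", round(pearson([1.0, 2.0, 2.5, 4.0], [1.5, 2.0, 3.0, 3.5]), 3))
```
Kappa treats the levels as unordered categories, so confusing A2 with B1 counts the same as confusing A2 with C1, whereas the correlation rewards near misses on an ordered scale.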
715 01:03:56.200 --> 01:04:02.560 And finally, we are planning to update the Swedish models after we receive 716 01:04:02.560 --> 01:04:08.820 the human raters' ratings for the beginner level samples. 717 01:04:08.820 --> 01:04:11.000 That concludes my presentation. 718 01:04:11.000 --> 01:04:13.640 If you have any questions, I'll be happy to answer. 719 01:04:13.640 --> 01:04:23.300 Thank you. 720 01:04:23.300 --> 01:04:30.420 Thank you. Any immediate questions or comments? 721 01:04:30.420 --> 01:04:36.210 Online or or here. Of. 722 01:04:36.210 --> 01:04:42.770 What do you what do you think the? Umm. 723 01:04:42.770 --> 01:04:50.710 What would be the effect of trying to increase human to human agreement here, I mean. 724 01:04:50.710 --> 01:04:54.680 Because you have compared the. 725 01:04:54.680 --> 01:04:57.740 Is there kind of direct sort of benefit from trying to 726 01:04:57.740 --> 01:05:02.510 get humans agree more with each each other? If. 727 01:05:02.510 --> 01:05:04.690 If. 728 01:05:04.690 --> 01:05:09.630 When the basis of comparison is the kind of an average human rating. 729 01:05:09.630 --> 01:05:11.720 Basically, that's that's simply one figure. 730 01:05:11.720 --> 01:05:15.730 But of course it can be problematic if if there's lot 731 01:05:15.730 --> 01:05:18.890 of variation among the humans, of course. 732 01:05:18.890 --> 01:05:25.640 Yeah, I think the variation between human scores was is problem for neural network. 733 01:05:25.640 --> 01:05:28.270 Because. 734 01:05:28.270 --> 01:05:34.680 If the humans do not agree to each other, what would be the gold label for the model? 735 01:05:34.680 --> 01:05:41.430 That's the question. So that's why we average the scores and. 736 01:05:41.430 --> 01:05:48.070 We make the model imitate some imaginary average human rater, but if the humans 737 01:05:48.070 --> 01:05:54.220 agree more, then we can choose just either of scores and.
738 01:05:54.220 --> 01:06:02.500 Let the model learn to imitate a particular human rater. 739 01:06:02.500 --> 01:06:04.830 So actually it would be. 740 01:06:04.830 --> 01:06:09.410 Not useful thing to do to actually try to include in the in the 741 01:06:09.410 --> 01:06:13.560 model all the different human raters, but rather. 742 01:06:13.560 --> 01:06:17.720 Like use some sort of average of of them? 743 01:06:17.720 --> 01:06:23.360 If I get it right, so it's it's better to use one sort of average rather than 744 01:06:23.360 --> 01:06:29.960 try to include wider range of different types of ratings there. 745 01:06:29.960 --> 01:06:35.400 In general, it's easier to teach the model to imitate some single human 746 01:06:35.400 --> 01:06:39.650 rater or average human rather than multiple humans. 747 01:06:39.650 --> 01:06:42.960 But it's kind of computational. Challenge. 748 01:06:42.960 --> 01:06:50.260 It's kind of computational challenge. 749 01:06:50.260 --> 01:06:55.260 So there are some questions of from our online participants. 750 01:06:55.260 --> 01:07:00.180 Do you think the fact that the system struggles with B1 and B2 is characteristic 751 01:07:00.180 --> 01:07:03.620 of all languages or just Finnish and Swedish? 752 01:07:03.620 --> 01:07:13.820 Because that's also something that human raters usually struggle with in English too. 753 01:07:13.820 --> 01:07:18.340 Hmm. Sorry, could you repeat that again, please? 754 01:07:18.340 --> 01:07:23.420 Do you think the fact that the system struggles with B1 and B2 is characteristic 755 01:07:23.420 --> 01:07:26.380 of all languages, or just Finnish and Swedish? 756 01:07:26.380 --> 01:07:29.160 And because that's also something that human raters 757 01:07:29.160 --> 01:07:34.700 usually struggle with in English? 758 01:07:34.700 --> 01:07:41.020 I think that's mostly related to the amount of data for that levels. 759 01:07:41.020 --> 01:07:42.820 So.
760 01:07:42.820 --> 01:07:48.430 For Finnish, we partially mitigated that issue by collecting data from students 761 01:07:48.430 --> 01:07:54.910 with various backgrounds. For Finnish, we collected the beginner level samples 762 01:07:54.910 --> 01:08:02.870 from university students, as well as bilingual students and immigrants and 763 01:08:02.870 --> 01:08:08.320 Swedes learning Finnish, so that's why. 764 01:08:08.320 --> 01:08:15.070 We have more balanced distribution and as result. 765 01:08:15.070 --> 01:08:22.630 We will get relatively high scores even for B1 and B2 samples. 766 01:08:22.630 --> 01:08:30.040 So it's more of question about the amount of training data. 767 01:08:30.040 --> 01:08:33.660 So it's not language specific. 768 01:08:33.660 --> 01:08:38.590 I could also comment that from the more general assessment perspective is that. 769 01:08:38.590 --> 01:08:43.090 Everybody who has done done assessment, rating of speaking or writing. 770 01:08:43.090 --> 01:08:47.970 Has noticed that it's fairly easy to spot particularly weak performance 771 01:08:47.970 --> 01:08:52.140 and very good one and it's in the middle that where you have struggled 772 01:08:52.140 --> 01:08:54.990 to actually think about whether this is B1 or B2. 773 01:08:54.990 --> 01:08:59.910 So I think that's kind of a genuine decision challenge. 774 01:08:59.910 --> 01:09:04.490 But, and of course here it happens to, so most of the data comes from from that. 775 01:09:04.490 --> 01:09:12.270 Great right? So. 776 01:09:12.270 --> 01:09:15.520 So. Yeah. 777 01:09:15.520 --> 01:09:20.270 Don't know. 778 01:09:20.270 --> 01:09:22.960 I was wondering it's not like the question to you, but actually 779 01:09:22.960 --> 01:09:29.030 the whole digital things or or that's I was kind of. 780 01:09:29.030 --> 01:09:34.750 Hesitant to ask it now. Like rating is one thing right? 781 01:09:34.750 --> 01:09:37.860 But when you when you train your the model.
782 01:09:37.860 --> 01:09:40.630 On whichever scale the other thing which which I I 783 01:09:40.630 --> 01:09:43.360 think could be bit more interesting. 784 01:09:43.360 --> 01:09:48.340 I don't know if that's interesting to you is to approach that from more diagnostic 785 01:09:48.340 --> 01:09:51.830 performance, because I I think you can train the model. 786 01:09:51.830 --> 01:09:56.040 Ohh, there there will be like more agreement amongst raters. 787 01:09:56.040 --> 01:09:59.750 What 3 gets? What could be? 788 01:09:59.750 --> 01:10:05.310 Trained with the particular speech sounds, you know what sort of challenges they have. 789 01:10:05.310 --> 01:10:07.110 Right. 790 01:10:07.110 --> 01:10:11.960 They let us have and and you know kind of for more diagnostic perspective. 791 01:10:11.960 --> 01:10:14.560 Right, rather than the scale. Right. 792 01:10:14.560 --> 01:10:17.920 And I think that would be quite, quite an interesting word. 793 01:10:17.920 --> 01:10:22.470 An interesting thing to look at also in terms of training model. 794 01:10:22.470 --> 01:10:24.720 You know. Right. 795 01:10:24.720 --> 01:10:28.240 Because there are particular features which you saw that would kind of separate, 796 01:10:28.240 --> 01:10:33.690 separate across, across, across the proficiency levels. 797 01:10:33.690 --> 01:10:39.950 So I want to. 1. 798 01:10:39.950 --> 01:10:43.820 Take. You know that. 799 01:10:43.820 --> 01:10:48.650 So, so rather than training to, to to. 800 01:10:48.650 --> 01:10:52.270 To differentiate across the proficiency level, whichever scale it 801 01:10:52.270 --> 01:10:56.570 is right, which I I think it would be like notoriously difficult also 802 01:10:56.570 --> 01:11:01.310 for human raters and whether we are and like. 803 01:11:01.310 --> 01:11:06.040 Are there are some someday you're not is is different question. 804 01:11:06.040 --> 01:11:09.700 Approach it or tackle the title or the challenge from different perspective.
805 01:11:09.700 --> 01:11:11.970 You know what is it that. 806 01:11:11.970 --> 01:11:17.720 Could be focused on in particular speech sample in terms of the. 807 01:11:17.720 --> 01:11:22.690 Teaching and learning training. Development. 808 01:11:22.690 --> 01:11:29.170 This one, and that's a different question than scale. 809 01:11:29.170 --> 01:11:32.420 Yeah. 810 01:11:32.420 --> 01:11:38.180 To some, I mean this is not the direct answer to that to some extent there are but 811 01:11:38.180 --> 01:11:44.680 there has been analytical rating of certain dimensions but that's on on more 812 01:11:44.680 --> 01:11:49.290 specific scale and and the there have been training of of the. 813 01:11:49.290 --> 01:11:53.670 Of the models for fluency and pronunciation and and so on. 814 01:11:53.670 --> 01:11:59.160 But if I think if you mean that, so looking at the. 815 01:11:59.160 --> 01:12:02.710 At the samples at at the more detailed. 816 01:12:02.710 --> 01:12:05.090 Way perhaps would would be. 817 01:12:05.090 --> 01:12:07.130 One way to make use of that. 818 01:12:07.130 --> 01:12:10.720 So perhaps the question is how to spot interesting. 819 01:12:10.720 --> 01:12:14.950 Cases for for further further analysis perhaps? 820 01:12:14.950 --> 01:12:16.850 Could be, yeah. But certainly. 821 01:12:16.850 --> 01:12:21.610 Well, we need to discuss that that more. Ohh. 822 01:12:21.610 --> 01:12:28.540 OK, if there are no questions online, I think it's it's time for coffee 823 01:12:28.540 --> 01:12:35.780 break before we have a panel discussion with people. 824 01:12:35.780 --> 01:12:40.750 From from here and also from abroad via via Internet. 825 01:12:40.750 --> 01:12:45.510 And and we have. 
826 01:12:45.510 --> 01:12:50.630 Well, Mikko Kurimo and myself will be in the panel, Henry will be chairing it and 827 01:12:50.630 --> 01:12:57.550 and then we will have Helmer Strik and Talia Isaacs as as international participants 828 01:12:57.550 --> 01:13:02.930 in in in the panel and we will be discussing the challenges and advantages of 829 01:13:02.930 --> 01:13:08.430 automated assessment of spoken skills in high stakes contexts but and and 830 01:13:08.430 --> 01:13:12.530 we'll we'll we'll start that at the. Exactly. 831 01:13:12.530 --> 01:13:15.730 Through. Yep, 5:00 o'clock. 832 01:13:15.730 --> 01:13:17.570 Yeah, five, 5:00 o'clock. 833 01:13:17.570 --> 01:13:20.620 So in in about half an hour. 834 01:13:20.620 --> 01:13:24.780 And and but so we we we will keep the. 835 01:13:24.780 --> 01:13:30.150 That the we won't interrupt the connexion, but we'll invite 836 01:13:30.150 --> 01:13:34.740 the online participants to join us again. 837 01:13:34.740 --> 01:13:39.680 At at 5:00 o'clock and and we'll have here coffee break before that. 838 01:13:39.680 --> 01:13:50.050 OK. Thank you everybody. 839 01:13:50.050 --> 01:13:57.570 It's 5:00 PM in Finland and so welcome to the Digitala seminar. 840 01:13:57.570 --> 01:14:02.160 The last part of it, we are going to have panel discussion. 841 01:14:02.160 --> 01:14:08.000 On the topic of that, what are the challenges and advantages of automatic 842 01:14:08.000 --> 01:14:12.000 assessment of spoken language skills in high stakes contexts? 843 01:14:12.000 --> 01:14:17.750 And we have. Online, we have two. 844 01:14:17.750 --> 01:14:25.830 Discussants: we have Talia Isaacs, Associate Professor of Applied Linguistics and teaching English 845 01:14:25.830 --> 01:14:32.440 to speakers of other languages from University College London, welcome. 846 01:14:32.440 --> 01:14:34.490 And we have Helmer.
847 01:14:34.490 --> 01:14:40.090 Strik, Associate Professor, Centre for Language Studies and Department 848 01:14:40.090 --> 01:14:44.490 of Language and Communication, Radboud University Nijmegen. 849 01:14:44.490 --> 01:14:47.830 I'm sorry about my Dutch pronunciation, not sure if that 850 01:14:47.830 --> 01:14:51.630 went right, but I warmly welcome both of you. 851 01:14:51.630 --> 01:14:58.190 And here on site we have Professor Ari Huhta from University of Jyväskylä, 852 01:14:58.190 --> 01:15:03.300 Professor in Language Assessment and Education, and Professor. 853 01:15:03.300 --> 01:15:08.610 Mikko Kurimo from Aalto University, Professor in Speech and Language Processing. 854 01:15:08.610 --> 01:15:17.450 And well, first, I would like our guest speakers to shortly introduce 855 01:15:17.450 --> 01:15:25.510 yourselves or rather your perspectives on automatic assessment of spoken language skills. 856 01:15:25.510 --> 01:15:32.670 So what are your perspectives on that? 857 01:15:32.670 --> 01:15:36.720 Who should start? Ladies first? Happy. 858 01:15:36.720 --> 01:15:38.520 I'm happy to go. First. 859 01:15:38.520 --> 01:15:41.980 I'm just to say I'm, I'm really pleased that there's a discussion about this. 860 01:15:41.980 --> 01:15:48.400 And yeah, so I'll I'll bring some of my perspective just as a form of summary 861 01:15:48.400 --> 01:15:52.780 because I think you were interested in talking about challenges and advantages 862 01:15:52.780 --> 01:15:58.220 of automatic automated assessment in speaking in particular. 863 01:15:58.220 --> 01:16:02.740 Of course we rely on technology for speech. Speech is transient. 864 01:16:02.740 --> 01:16:07.180 I mean we need technology even to play back and and record.
865 01:16:07.180 --> 01:16:13.910 Speech and to render it something that is quantifiable, otherwise it it it's ephemeral. 866 01:16:13.910 --> 01:16:20.010 And so we've long been using technology of course in in the context of speaking assessment. 867 01:16:20.010 --> 01:16:26.270 Um, having now assessments that are fully automated, that is machine delivered and machine 868 01:16:26.270 --> 01:16:29.510 scored, for high stakes purposes is a relatively new thing. 869 01:16:29.510 --> 01:16:34.810 I mean this is sort of 21st century that this has come into high stakes assessments 870 01:16:34.810 --> 01:16:37.910 for language proficiency and then used for purposes. 871 01:16:37.910 --> 01:16:41.440 That are quite consequential to test takers' lives, decisions about getting into 872 01:16:41.440 --> 01:16:44.900 university or getting a job they want and so on and so forth. 873 01:16:44.900 --> 01:16:52.360 So there are now in the assessment market as you know tests that are, you know, like 874 01:16:52.360 --> 01:16:58.860 IELTS that are administered by humans and then scored by human raters. 875 01:16:58.860 --> 01:17:01.820 And then on the other end of the continuum you have tests that 876 01:17:01.820 --> 01:17:04.420 are completely, you know, scored by algorithms. 877 01:17:04.420 --> 01:17:08.940 Of course they're they're trained and there is a black box a little bit. 878 01:17:08.940 --> 01:17:14.070 So we don't know exactly always what criteria are being used to score that assessment. 879 01:17:14.070 --> 01:17:16.380 I think also what we should be mindful. 880 01:17:16.380 --> 01:17:19.810 Of course there are a lot of different options now.
881 01:17:19.810 --> 01:17:24.190 And you know, I mean I'll talk about advantages in a second, but I think 882 01:17:24.190 --> 01:17:29.150 that in thinking about speaking in real world contexts, there will always 883 01:17:29.150 --> 01:17:32.840 be a role or there should always be some role for human interlocutors as 884 01:17:32.840 --> 01:17:35.830 the arbiter of what they do and don't understand. 885 01:17:35.830 --> 01:17:37.910 And of course we might be communicating more and more 886 01:17:37.910 --> 01:17:40.080 with robots in the future, you know. 887 01:17:40.080 --> 01:17:43.450 Working dialogue systems and things are ever changing. 888 01:17:43.450 --> 01:17:46.730 But I think that there there should be a place for 889 01:17:46.730 --> 01:17:49.650 all different variety of assessments. 890 01:17:49.650 --> 01:17:53.310 And of course, once you've developed that algorithm that can score the 891 01:17:53.310 --> 01:17:56.490 speech, the assessment can be much, much cheaper, right? 892 01:17:56.490 --> 01:17:58.740 If it's fully done by machine, right. You don't. 893 01:17:58.740 --> 01:18:00.540 It's logistically simpler. 894 01:18:00.540 --> 01:18:02.950 You don't have to have raters, certainly they don't have to. 895 01:18:02.950 --> 01:18:08.890 You don't have to have, you know, an examiner involved in also conducting 896 01:18:08.890 --> 01:18:11.390 the assessment one-on-one, they can speak directly into. 897 01:18:11.390 --> 01:18:14.380 A machine and then have that scored. 898 01:18:14.380 --> 01:18:19.640 I mean another advantage of course is that it's what we call rater effects, right, 899 01:18:19.640 --> 01:18:25.640 That that individual examiners might be subject to some form of bias and that 900 01:18:25.640 --> 01:18:30.680 gets ironed out and kind of averaged out because I mean there there is bias when 901 01:18:30.680 --> 01:18:36.030 you train AI systems, you know these are.
902 01:18:36.030 --> 01:18:42.620 Trained on a particular data set that of course comes from and has ingrained with it biases 903 01:18:42.620 --> 01:18:48.750 within a given community or society, however individual rater idiosyncrasies. 904 01:18:48.750 --> 01:18:53.930 So individual differences in one examiner or two examiners if you have 905 01:18:53.930 --> 01:18:58.100 two people rating a speech sample kind of get averaged out. 906 01:18:58.100 --> 01:19:05.030 If if it's over a much larger sample of listeners or human evaluators, 907 01:19:05.030 --> 01:19:07.110 I mean, there's always the question of. 908 01:19:07.110 --> 01:19:09.140 What should be the appropriate standard? 909 01:19:09.140 --> 01:19:12.550 You know, is a native speaker standard appropriate? 910 01:19:12.550 --> 01:19:18.500 And the conversations about standard language are very relevant also to 911 01:19:18.500 --> 01:19:23.880 machine scoring, you know, I mean, often the algorithms are trained on 912 01:19:23.880 --> 01:19:27.860 a normed sample of native speakers, whatever that means. 913 01:19:27.860 --> 01:19:33.070 And and then and then of course deviances. 914 01:19:33.070 --> 01:19:39.190 Relative to the native speaker norm are what what count as kind of errors and so that. 915 01:19:39.190 --> 01:19:44.550 So I think that these things need to be made transparent by testing bodies but 916 01:19:44.550 --> 01:19:47.520 there also needs to be some thought about the appropriateness of the standards 917 01:19:47.520 --> 01:19:51.410 and the kind of methods of of of training these algorithms. 918 01:19:51.410 --> 01:19:55.270 I mean you have the possibility and and Helmer's research you know 919 01:19:55.270 --> 01:19:59.990 shows this of rapid real time feedback provision. 920 01:19:59.990 --> 01:20:03.570 You know there's also of course the idea of.
921 01:20:03.570 --> 01:20:11.860 Of being able to access, you know, assessments you know very quickly, but also from remote 922 01:20:11.860 --> 01:20:16.750 destinations where you might not have human evaluators available in certain part of the 923 01:20:16.750 --> 01:20:20.260 world who can assess, for example, a particular target language. 924 01:20:20.260 --> 01:20:23.440 So you know there are there are all sorts of advantages, 925 01:20:23.440 --> 01:20:25.540 but there are lots of disadvantages. 926 01:20:25.540 --> 01:20:30.640 So I mean the reductionist approach, thinking about the construct and and matters of 927 01:20:30.640 --> 01:20:35.660 validity I think are really important here when you're only measuring. 928 01:20:35.660 --> 01:20:37.670 Things that the machine is good at. 929 01:20:37.670 --> 01:20:43.860 For example, you want to look at speaker, somebody speaking ability, but in effect the 930 01:20:43.860 --> 01:20:49.410 machine can only handle pronunciation features, specifically, features to do with individual 931 01:20:49.410 --> 01:20:55.090 vowel and consonant sounds, and some fluency features and very limited, maybe vocabulary 932 01:20:55.090 --> 01:21:01.030 that is much narrower construct than one where you might have interaction with an interlocutor 933 01:21:01.030 --> 01:21:04.710 or different sort of interactional patterns. 934 01:21:04.710 --> 01:21:07.400 Appropriateness is something that is very difficult. 935 01:21:07.400 --> 01:21:09.630 Or pragmatic competence. 936 01:21:09.630 --> 01:21:14.710 There are other aspects as well that can be very, very difficult for machines 937 01:21:14.710 --> 01:21:19.170 to be able to score well and even things like intonation. 938 01:21:19.170 --> 01:21:23.660 And so you end up having, you know, a very narrow speaking construct 939 01:21:23.660 --> 01:21:26.710 and you see that with some of the high stakes tests. 
940 01:21:26.710 --> 01:21:30.710 It's kind of like we're going back to audiolingualism and grammar translation 941 01:21:30.710 --> 01:21:34.390 in terms of the way that things are being tested. 942 01:21:34.390 --> 01:21:38.600 You know, it sort of lends itself to washback effects of rote learning. 943 01:21:38.600 --> 01:21:43.010 And decontextualized sort of discrete point items. 944 01:21:43.010 --> 01:21:47.050 And and this is a problem. I mean, of course technology is evolving 945 01:21:47.050 --> 01:21:49.270 and there are improvements being made. 946 01:21:49.270 --> 01:21:51.210 So this is a moving field. 947 01:21:51.210 --> 01:21:56.810 And but still I think that certainly you've got a much narrower speaking construct 948 01:21:56.810 --> 01:22:01.110 than you would have with other speaking assessment types. 949 01:22:01.110 --> 01:22:06.550 Of course, there are other matters as well, but I think that's the major one that 950 01:22:06.550 --> 01:22:09.450 I would want to point out and I've probably talked over time now. 951 01:22:09.450 --> 01:22:12.370 So I'm here curious to hear Helmer's perspective. 952 01:22:12.370 --> 01:22:15.760 Thank you. OK. 953 01:22:15.760 --> 01:22:17.740 Thank you very much. 954 01:22:17.740 --> 01:22:20.280 Well, first of all, thanks for the presentations this afternoon. 955 01:22:20.280 --> 01:22:24.320 I think it's an interesting project and I see many interesting results 956 01:22:24.320 --> 01:22:30.220 already and hopefully it will be used in the future. 957 01:22:30.220 --> 01:22:32.820 OK so well what? 958 01:22:32.820 --> 01:22:38.540 As you know we have four skills, two perceptive and two productive, and it's easier. 959 01:22:38.540 --> 01:22:46.100 Well very generally speaking training and testing productive skills is more complex than perceptive 960 01:22:46.100 --> 01:22:52.080 skills we notice and speech is more speaking is more complex than writing.
961 01:22:52.080 --> 01:22:56.310 And I also teach a CALL course, and each year I ask my students. 962 01:22:56.310 --> 01:23:02.220 To look for CALL systems on speaking, good quality CALL systems. 963 01:23:02.220 --> 01:23:07.920 And they're always disappointment disappointed because there are not many good 964 01:23:07.920 --> 01:23:12.770 CALL systems out there, good quality CALL systems for training. 965 01:23:12.770 --> 01:23:18.230 And of course there are less systems for testing than for training 966 01:23:18.230 --> 01:23:20.410 and they are more difficult to evaluate. 967 01:23:20.410 --> 01:23:25.830 But the ones that I've evaluated myself, they are also quite limited. 968 01:23:25.830 --> 01:23:27.920 And well, Talia already mentioned this. 969 01:23:27.920 --> 01:23:32.350 The problem, of course is what is correct and what is incorrect. 970 01:23:32.350 --> 01:23:37.750 Something like standard language doesn't exist in most countries. 971 01:23:37.750 --> 01:23:41.730 And there's a lot of variation within the, let's say, the native language. 972 01:23:41.730 --> 01:23:45.710 So it's very difficult to define what is correct and incorrect, to find borders. 973 01:23:45.710 --> 01:23:52.610 If you develop software, you need thresholds, but where do you put the thresholds? 974 01:23:52.610 --> 01:23:56.270 And if we look at spontaneous native speech. 975 01:23:56.270 --> 01:24:00.870 Well, spontaneous native speech is very strange. It often isn't grammatical. 976 01:24:00.870 --> 01:24:02.360 It's full of disfluencies. 977 01:24:02.360 --> 01:24:04.790 It's full of restarts, repairs. 978 01:24:04.790 --> 01:24:07.790 So is everything natives do. Is that correct? 979 01:24:07.790 --> 01:24:12.060 Is that the standard? Or that's quite problematic then? 980 01:24:12.060 --> 01:24:16.960 And then of course, spoken language has so many different aspects today. 981 01:24:16.960 --> 01:24:20.840 Most of the presentations were about prosody and fluency I heard.
982 01:24:20.840 --> 01:24:25.220 Of course you have segmental phonemes, but you also have things like intelligibility, 983 01:24:25.220 --> 01:24:29.960 comprehensibility, naturalness, for instance speech synthesis. 984 01:24:29.960 --> 01:24:31.760 This is a big issue. 985 01:24:31.760 --> 01:24:35.010 Speech synthesis can be correct but not natural. 986 01:24:35.010 --> 01:24:38.730 And then we have lexicon and we have formulaic language. 987 01:24:38.730 --> 01:24:43.960 A lot of our native language consists of idioms, multi word expressions and all these 988 01:24:43.960 --> 01:24:49.250 kinds of formulaic language which is very problematic for non natives. 989 01:24:49.250 --> 01:24:52.190 And that's only the speaking part. 990 01:24:52.190 --> 01:24:56.170 But of course speaking should be evaluated. It's communication. 991 01:24:56.170 --> 01:25:03.780 It's about communication and then it has to be unprepared, spontaneous, extemporaneous. 992 01:25:03.780 --> 01:25:08.400 I've seen many international students in my classes who have extremely high 993 01:25:08.400 --> 01:25:12.980 TOEFL scores but have a problem communicating in English. 994 01:25:12.980 --> 01:25:18.300 That could also have some other reasons but high score on the. 995 01:25:18.300 --> 01:25:24.810 TOEFL, for instance, TOEFL score doesn't mean that you can communicate well. 996 01:25:24.810 --> 01:25:27.670 Something that has already been mentioned in the presentations today and 997 01:25:27.670 --> 01:25:30.690 also by Talia, is that we have artificial intelligence. 998 01:25:30.690 --> 01:25:34.410 Performance goes up, but they are black boxes and if you do high 999 01:25:34.410 --> 01:25:38.390 stakes test you should be able to explain why somebody passes, but 1000 01:25:38.390 --> 01:25:40.810 especially why somebody fails a test of course. 1001 01:25:40.810 --> 01:25:45.670 So then you should. Should it be completely automatic?
1002 01:25:45.670 --> 01:25:48.350 I I wonder if that's already possible. 1003 01:25:48.350 --> 01:25:53.590 I think that we can use automatic scores to help humans 1004 01:25:53.590 --> 01:26:02.510 do tests, but complete automatic test? I have doubts about it. 1005 01:26:02.510 --> 01:26:06.230 So that's roughly I think. 1006 01:26:06.230 --> 01:26:12.050 Current technology is improving and yes we can have technology make report with 1007 01:26:12.050 --> 01:26:16.910 a lot of interesting data in there, prosodic fluency related data, pronunciation 1008 01:26:16.910 --> 01:26:22.490 related data and and then I think preferably a human or one or two humans should use 1009 01:26:22.490 --> 01:26:27.030 these data and then come up with the final results. 1010 01:26:27.030 --> 01:26:36.410 The test. That's what I my perspective on this. 1011 01:26:36.410 --> 01:26:41.920 Thank you very much Professors Isaacs and Strik, that was. 1012 01:26:41.920 --> 01:26:43.720 Yeah. 1013 01:26:43.720 --> 01:26:51.850 Ari and Mikko, would you also like to have a short turn here? 1014 01:26:51.850 --> 01:26:54.330 OK. Can you now hear me? 1015 01:26:54.330 --> 01:26:56.150 Yes, very well. Great. 1016 01:26:56.150 --> 01:27:00.000 Great. So we just changed the tables. 1017 01:27:00.000 --> 01:27:06.160 So yeah, my own background is such that the the my involvement in the 1018 01:27:06.160 --> 01:27:11.520 Digitala project is the first time I'm involved in automated speech recognition 1019 01:27:11.520 --> 01:27:15.020 and and automated assessment of speaking. 1020 01:27:15.020 --> 01:27:21.290 I I worked earlier on in my career in in designing rating scales for human 1021 01:27:21.290 --> 01:27:27.200 raters and then I've I've later on I've done some work with automated assessment 1022 01:27:27.200 --> 01:27:31.120 of or actually analysis of of learner writing. 1023 01:27:31.120 --> 01:27:38.040 So. What I would like to start with is that.
1024 01:27:38.040 --> 01:27:41.500 My own background is in more in in the kind of formative type 1025 01:27:41.500 --> 01:27:45.530 of assessment, so there I can see lot of them. 1026 01:27:45.530 --> 01:27:51.950 Applications for for automated analysis of of speaking and and and writing and 1027 01:27:51.950 --> 01:27:57.550 and and but when it comes to the high stakes assessment it's it's much more complex 1028 01:27:57.550 --> 01:28:04.650 and and and difficult and and one of I think the. 1029 01:28:04.650 --> 01:28:09.870 Decision to actually use automated assessment of speaking 1030 01:28:09.870 --> 01:28:13.490 or writing in high stake context. 1031 01:28:13.490 --> 01:28:19.230 Assuming that it's it's not resource issue, which is definitely is is also 1032 01:28:19.230 --> 01:28:23.610 question of of the context context and and and perhaps the kind of impact we want 1033 01:28:23.610 --> 01:28:29.260 to have in in one context here in in our country where we have final school 1034 01:28:29.260 --> 01:28:36.150 Leaving Examination which has with where the language exams have always lacked 1035 01:28:36.150 --> 01:28:41.240 the speaking component, I would think that. 1036 01:28:41.240 --> 01:28:48.880 Even partially functioning automated speaking assessment system that would. 1037 01:28:48.880 --> 01:28:55.780 Not work independently, but but in in in support of of of human raters 1038 01:28:55.780 --> 01:29:03.080 would be potentially useful if the if that means that by using that system 1039 01:29:03.080 --> 01:29:07.900 we would be able to introduce or speaking component in into this this final 1040 01:29:07.900 --> 01:29:10.880 examination which is so far is lacking so. 1041 01:29:10.880 --> 01:29:17.900 So it may be that even with the restricted construct that 1042 01:29:17.900 --> 01:29:19.840 that can be assessed it it may be that. 1043 01:29:19.840 --> 01:29:26.090 Actually, that would still be used useful in in the in that kind of context. 
1044 01:29:26.090 --> 01:29:29.870 Then I'm thinking of another context where it's it's much more difficult to 1045 01:29:29.870 --> 01:29:35.140 say whether it actually makes sense to even even introduce. 1046 01:29:35.140 --> 01:29:39.640 Automated assessment and that is another examination that we have which 1047 01:29:39.640 --> 01:29:45.680 is National Certificates of Language Proficiency which which is examination 1048 01:29:45.680 --> 01:29:50.320 for adult learners and it covers 8 or 9 languages. 1049 01:29:50.320 --> 01:29:53.820 So the question is what would be the? 1050 01:29:53.820 --> 01:30:00.930 Impact and meaningfulness of introducing automated assessment into one or two of the languages 1051 01:30:00.930 --> 01:30:06.550 there, but not do that for for the for the others because I I think that would. 1052 01:30:06.550 --> 01:30:11.150 That would create an interesting interesting situation in terms of 1053 01:30:11.150 --> 01:30:16.790 of of of imbalance of in of of various various sorts. 1054 01:30:16.790 --> 01:30:20.210 So I think of I'm I'm. 1055 01:30:20.210 --> 01:30:26.440 I think that the the context and the intended impact might be an interest important 1056 01:30:26.440 --> 01:30:33.730 factor in designing when and and whether to actually go for automated assessment 1057 01:30:33.730 --> 01:30:37.540 and then another aspect to this construct. 1058 01:30:37.540 --> 01:30:44.290 But the issue is that other tasks, I mean, I mean what what is the, what is the range 1059 01:30:44.290 --> 01:30:49.540 of tasks that that should be used or could be used with automated assessment and and 1060 01:30:49.540 --> 01:30:55.680 what are those, what are the implications of of that to what can be assessed automatically 1061 01:30:55.680 --> 01:30:58.480 versus what what perhaps human raters assess. 1062 01:30:58.480 --> 01:31:06.470 So that's perhaps my two cents about this habit at this moment.
1063 01:31:06.470 --> 01:31:10.490 Yeah, maybe I can say as well as all my name is Mikko Kurimo, professor 1064 01:31:10.490 --> 01:31:13.850 in Speech and Language Processing, Aalto University. 1065 01:31:13.850 --> 01:31:19.010 So my background is in automatic speech recognition and speech processing, language 1066 01:31:19.010 --> 01:31:24.050 modelling, but not so much about in in assessing speaking skills. 1067 01:31:24.050 --> 01:31:27.190 So what I've learned a lot during this project. 1068 01:31:27.190 --> 01:31:32.080 And based on that, I think the biggest challenge. 1069 01:31:32.080 --> 01:31:35.590 Is maybe not in the engineering side. 1070 01:31:35.590 --> 01:31:40.070 So even though the our system works fairly well now, but it's not that robust. 1071 01:31:40.070 --> 01:31:42.820 So there is room for improvement. 1072 01:31:42.820 --> 01:31:46.860 But still I can see that we have advanced quite a lot. 1073 01:31:46.860 --> 01:31:50.080 And if we continue like this, we get more data. 1074 01:31:50.080 --> 01:31:54.520 Maybe we can sort out these bias issues and also some robustness 1075 01:31:54.520 --> 01:31:58.950 issues with just getting more more data, but. 1076 01:31:58.950 --> 01:32:04.990 Then there are these kind of human factors, which I consider the most challenging thing. 1077 01:32:04.990 --> 01:32:10.850 So even after all this training that we give to our raters, they still disagree And 1078 01:32:10.850 --> 01:32:15.390 then our poor machine tries to learn that why this is sometimes good and sometimes 1079 01:32:15.390 --> 01:32:21.520 bad and and then we just take some averages, weighted averages or or unweighted 1080 01:32:21.520 --> 01:32:25.850 averages, and try to learn those averages which is? 1081 01:32:25.850 --> 01:32:29.920 Which certainly gives some noise noise to the system that. 1082 01:32:29.920 --> 01:32:32.110 Is it so? 1083 01:32:32.110 --> 01:32:35.050 Kind of how can we train the reviewers better?
1084 01:32:35.050 --> 01:32:38.390 Or how do we how can we get more reviewers? 1085 01:32:38.390 --> 01:32:44.090 Maybe that helps if we can model it as a distribution instead of individuals, but. 1086 01:32:44.090 --> 01:32:58.160 That's kind of. Challenge that I'm viewing viewing. 1087 01:32:58.160 --> 01:33:02.900 I'll come to this side. Yes, so. 1088 01:33:02.900 --> 01:33:09.460 I think bias is one of the keywords that every one of you used at least once in your turns 1089 01:33:09.460 --> 01:33:16.120 and I would like to ask about this possible rater bias and this human factor. 1090 01:33:16.120 --> 01:33:21.180 What do you think that automatic methods and and artificial 1091 01:33:21.180 --> 01:33:27.540 intelligence could give to to improving the 1092 01:33:27.540 --> 01:33:30.520 assessment reliability within humans in general. 1093 01:33:30.520 --> 01:33:34.870 Do you think that, How do you think think that we could kind of 1094 01:33:34.870 --> 01:33:48.520 train raters or human assessors with the help of artificial intelligence. 1095 01:33:48.520 --> 01:33:52.840 Question, or maybe Helmer, if you want to start, yes. 1096 01:33:52.840 --> 01:33:55.040 So yeah, I wonder if this is the right question. 1097 01:33:55.040 --> 01:33:59.580 So in this presentation, I think this was the last presentation, yeah, human to human 1098 01:33:59.580 --> 01:34:04.460 agreements were low and the machine to average humans were higher. 1099 01:34:04.460 --> 01:34:08.660 So the question is should we train the reviewers or should 1100 01:34:08.660 --> 01:34:16.470 we use scores, ratings on which the humans do agree? 1101 01:34:16.470 --> 01:34:18.270 There are some. 1102 01:34:18.270 --> 01:34:22.130 So for instance, recently we have been looking at in the Netherlands and 1103 01:34:22.130 --> 01:34:27.310 in many countries, in fact, if children learn to read, a test 1104 01:34:27.310 --> 01:34:32.690 they often do is the number of words correct per minute.
1105 01:34:32.690 --> 01:34:38.430 And we asked many teachers to do this and we don't know what they do. 1106 01:34:38.430 --> 01:34:42.410 We have no clue at all, but their agreement within 1107 01:34:42.410 --> 01:34:46.360 them and between them is extremely high. 1108 01:34:46.360 --> 01:34:48.560 Of course this is a simple measure, but I think we should 1109 01:34:48.560 --> 01:34:53.280 first try to look for measures, scores, ratings 1110 01:34:53.280 --> 01:34:55.400 that humans do agree on. 1111 01:34:55.400 --> 01:35:01.860 Because indeed what what we always do is if we train machine classifiers, artificial intelligence, 1112 01:35:01.860 --> 01:35:06.320 we use as benchmark as reference human scores we have, we don't have better. 1113 01:35:06.320 --> 01:35:09.480 Sometimes I wonder if the humans are better than the machines, 1114 01:35:09.480 --> 01:35:11.340 but this is the only thing we can do. 1115 01:35:11.340 --> 01:35:13.900 We can use humans as the as the reference. 1116 01:35:13.900 --> 01:35:23.410 Then I think it's the best thing to start with scores, ratings that humans do agree on. 1117 01:35:23.410 --> 01:35:26.030 Yes, that's definitely, I was not, I was just going to say I was 1118 01:35:26.030 --> 01:35:29.010 nodding in delight at Helmer's response I think. 1119 01:35:29.010 --> 01:35:33.810 But having benchmark samples and choosing those appropriately are are important also. 1120 01:35:33.810 --> 01:35:38.230 I mean I think there are validity questions as well of course about 1121 01:35:38.230 --> 01:35:41.850 what I mean are the raters being trained in the first instance about 1122 01:35:41.850 --> 01:35:44.500 what it is that they should be attending to? 1123 01:35:44.500 --> 01:35:49.360 You know, is there some kind of, I mean, yeah, I mean what what are the training, 1124 01:35:49.360 --> 01:35:52.730 are they given benchmark samples to listen to for example.
1125 01:35:52.730 --> 01:35:56.540 And then do they, do they have some kind of training, I mean I missed those presentations, 1126 01:35:56.540 --> 01:35:59.340 but those are the sorts of questions I would be asking. 1127 01:35:59.340 --> 01:36:04.960 I mean you would expect to have some kind of norming sessions and consensus building on you 1128 01:36:04.960 --> 01:36:09.400 know what the different criteria are that should be evaluated and and the aspects of the performance 1129 01:36:09.400 --> 01:36:12.800 that are relevant to the construct and those that are extraneous. 1130 01:36:12.800 --> 01:36:18.000 But just to bring in that I agree with Helmer's point as well though from earlier about 1131 01:36:18.000 --> 01:36:23.670 the importance of hybrids, you know machine and human scoring, I think that. 1132 01:36:23.670 --> 01:36:27.430 There should be a place for both. 1133 01:36:27.430 --> 01:36:33.920 You know, I, I, I am also quite worried about fully automated assessments for speaking. 1134 01:36:33.920 --> 01:36:40.180 And Larry Davis and Spiros Papageorgiou at ETS have written an article on how 1135 01:36:40.180 --> 01:36:44.380 you know the approach is for TOEFL iBT to to look at these things. 1136 01:36:44.380 --> 01:36:47.590 So yeah, I think that that's just a point I wanted to stick in. 1137 01:36:47.590 --> 01:36:50.890 Thank you. 1138 01:36:50.890 --> 01:36:54.590 I was just thinking that because I'm, I'm sort of so very much interested in in 1139 01:36:54.590 --> 01:36:59.810 sort of collaboration between machine and human rater and and actually 1140 01:36:59.810 --> 01:37:07.360 if we for the moment forget the idea of of of actually the. 1141 01:37:07.360 --> 01:37:12.010 The machine doing kind of an independent scoring in a way that it's somehow 1142 01:37:12.010 --> 01:37:17.340 automatically taken into account or or replacing human ratings.
1143 01:37:17.340 --> 01:37:26.010 But can the automated analysis tool be a useful rater training device? 1144 01:37:26.010 --> 01:37:29.560 I'm I'm just thinking of of that. 1145 01:37:29.560 --> 01:37:38.640 Because the for example, would it be useful for for for a human rater to get a 1146 01:37:38.640 --> 01:37:40.500 machine 1147 01:37:40.500 --> 01:37:46.630 estimation of let's say lexical richness of of of of a piece of writing or or or spoken 1148 01:37:46.630 --> 01:37:54.900 performance or or indices of syntactic complexity or or some some things that are quite 1149 01:37:54.900 --> 01:38:00.610 difficult to for human raters perhaps to to systematically evaluate. 1150 01:38:00.610 --> 01:38:03.130 But but the machine can can, can do. 1151 01:38:03.130 --> 01:38:06.870 I mean certain things that the machine can can do fairly 1152 01:38:06.870 --> 01:38:10.730 sort of reliably and and feed that as as. 1153 01:38:10.730 --> 01:38:16.420 Kind of an information to the human rater as a kind of reminder OK that this performance 1154 01:38:16.420 --> 01:38:21.000 has these characteristics and then it's it's up to the human rater and of course depending 1155 01:38:21.000 --> 01:38:26.250 on the on the system how they are trained to to consider that in a particular way as 1156 01:38:26.250 --> 01:38:30.210 as supporting the human rater's final decision. 1157 01:38:30.210 --> 01:38:36.230 So I'm I'm, I'm, I'm thinking that that might be a useful way of of introducing 1158 01:38:36.230 --> 01:38:43.900 automated analysis in into high-stakes scoring. 1159 01:38:43.900 --> 01:38:47.700 Not not necessarily meaning meaning that the. 1160 01:38:47.700 --> 01:38:53.680 That the that actually the the. Automated.
1161 01:38:53.680 --> 01:38:59.340 Analysis is kind of an automatic part of of of certain decision making 1162 01:38:59.340 --> 01:39:03.680 but but it's it's some something that the human rater should consider because 1163 01:39:03.680 --> 01:39:08.820 that's that's kind of an objective view to the performance and then of course 1164 01:39:08.820 --> 01:39:12.650 it's up to the human rater to make well of course. 1165 01:39:12.650 --> 01:39:20.660 Unavoidably subjective final decision about how how to use that information. 1166 01:39:20.660 --> 01:39:25.760 So are you meaning like something that if you take some nuclear power 1167 01:39:25.760 --> 01:39:30.280 station, there is a controller who has a big screen and there are lots 1168 01:39:30.280 --> 01:39:33.220 of kind of devices that measure different things. 1169 01:39:33.220 --> 01:39:36.600 And then he when he sees that something is going wrong, then 1170 01:39:36.600 --> 01:39:38.800 he sees that, OK, that's that's something. 1171 01:39:38.800 --> 01:39:43.280 Or maybe he makes his decision based on several of these meters, what they are looking. 1172 01:39:43.280 --> 01:39:45.800 Yeah, same time. That's an interesting analogy. 1173 01:39:45.800 --> 01:39:50.420 So maybe the kind of skill is there still in the human, but yeah, it just uses 1174 01:39:50.420 --> 01:39:52.270 these different devices. 1175 01:39:52.270 --> 01:39:55.170 Yeah, but maybe it kind of gets even too too difficult. 1176 01:39:55.170 --> 01:40:00.330 Still if there are lots of devices that yeah, so it must be used or some some key aspects 1177 01:40:00.330 --> 01:40:09.770 of the performance that should or could be considered by the human raters. 1178 01:40:09.770 --> 01:40:13.070 No, I I think I agree with that.
1179 01:40:13.070 --> 01:40:17.380 So for instance the the machine, a lot of the measures that were mentioned 1180 01:40:17.380 --> 01:40:20.660 in the presentations this afternoon, like rate of speech, articulation 1181 01:40:20.660 --> 01:40:23.720 rate, filled pauses, silent pauses. 1182 01:40:23.720 --> 01:40:28.450 This is something that the machine can measure easily and if you have 1183 01:40:28.450 --> 01:40:31.640 a report with these numbers and you have the, let's say the average 1184 01:40:31.640 --> 01:40:34.590 and the standard deviation within the population. 1185 01:40:34.590 --> 01:40:36.630 This can help you evaluate. 1186 01:40:36.630 --> 01:40:39.070 It's not the only thing, but it can help you. 1187 01:40:39.070 --> 01:40:44.450 It's a bit, of course, like we have also found this in our research, that 1188 01:40:44.450 --> 01:40:48.490 rate of speech is a very good predictor of proficiency level. 1189 01:40:48.490 --> 01:40:51.210 It's a bit strange, of course, Yeah, well, it's logical because 1190 01:40:51.210 --> 01:40:53.810 you start speaking faster if you're more proficient. 1191 01:40:53.810 --> 01:40:58.150 But it's strange to use that as the only measure or something like that, but if if 1192 01:40:58.150 --> 01:41:04.860 it's part of a set of measures that a human can use to evaluate. 1193 01:41:04.860 --> 01:41:08.490 Of the. The speech of a learner. 1194 01:41:08.490 --> 01:41:13.210 I think that can be useful. 1195 01:41:13.210 --> 01:41:17.850 Monologic tasks, though obviously if you have a task that's dialogic with an 1196 01:41:17.850 --> 01:41:28.130 interlocutor then that becomes something that you can't use. 1197 01:41:28.130 --> 01:41:34.420 Do we have any questions from the audience or from our online viewers? 1198 01:41:34.420 --> 01:41:41.330 Someone is typing. Anybody from the present audience here? 1199 01:41:41.330 --> 01:41:47.480 Do you have any? In your minds. 1200 01:41:47.480 --> 01:41:51.570 Nothing yet. Well, maybe I'll I will.
1201 01:41:51.570 --> 01:41:58.600 Maybe I will ask the probably of one of the obvious questions, but. 1202 01:41:58.600 --> 01:42:03.630 What do you think are the prospects of automatic speech assessments? 1203 01:42:03.630 --> 01:42:08.870 So where do you see? Automatic. 1204 01:42:08.870 --> 01:42:15.970 Second or foreign language assessment is in well globally, but also 1205 01:42:15.970 --> 01:42:19.730 in in in areas with with smaller language resources. 1206 01:42:19.730 --> 01:42:24.910 Where would it be in, let's say 10 years? The technologies. 1207 01:42:24.910 --> 01:42:26.930 Developing fast but. 1208 01:42:26.930 --> 01:42:33.090 Still we are in way struggling with similar issues and then few years back. 1209 01:42:33.090 --> 01:42:35.690 So where do you see, what do you see? 1210 01:42:35.690 --> 01:42:42.600 The future of automatic assessment would be near future. 1211 01:42:42.600 --> 01:42:46.420 Let's talk. Zoo. 1212 01:42:46.420 --> 01:42:53.810 Discusses. Did you hear my question OK? 1213 01:42:53.810 --> 01:42:55.610 Yeah, yeah, I heard your question. 1214 01:42:55.610 --> 01:43:00.490 So I I think it will gradually will be used more and more, but I 1215 01:43:00.490 --> 01:43:04.990 wonder if it always will be used in in sound way. 1216 01:43:04.990 --> 01:43:08.130 Already I think. 1217 01:43:08.130 --> 01:43:11.390 15 years ago test was introduced in the Netherlands 1218 01:43:11.390 --> 01:43:14.230 for foreigners coming to the Netherlands. 1219 01:43:14.230 --> 01:43:17.030 And it was completely automatic test. 1220 01:43:17.030 --> 01:43:23.780 And it has been used for couple of years after after which so many people complained. 1221 01:43:23.780 --> 01:43:28.350 And Sir, that it was they had to stop it in fact. 1222 01:43:28.350 --> 01:43:33.910 And I I started to test and I I think what the the the problems 1223 01:43:33.910 --> 01:43:39.940 were so people because there are not enough human. 
1224 01:43:39.940 --> 01:43:48.010 Assessors, raters will use technology more and more, but I think we should. 1225 01:43:48.010 --> 01:43:52.040 Experts like us should help them in making the right decisions 1226 01:43:52.040 --> 01:43:55.660 when to use it and how to use it otherwise. 1227 01:43:55.660 --> 01:43:59.950 We have seen this also in other areas that they use improved 1228 01:43:59.950 --> 01:44:10.110 technology in the wrong way, so. That's what I think. 1229 01:44:10.110 --> 01:44:12.030 Talks about test misuse. 1230 01:44:12.030 --> 01:44:16.530 I think that's important and I think that you know the consequences you 1231 01:44:16.530 --> 01:44:20.870 know of using technology, I mean are important to think about it as they 1232 01:44:20.870 --> 01:44:25.190 are for any kind of high stakes assessment. I mean I'm thinking about context. 1233 01:44:25.190 --> 01:44:27.320 I'm not very good at predicting the future. 1234 01:44:27.320 --> 01:44:30.150 I mean we can all think about avatars dialogue systems. 1235 01:44:30.150 --> 01:44:33.750 I know that at present there is a researcher at my university who's 1236 01:44:33.750 --> 01:44:37.650 been looking at truck drivers voice quality and using that as way 1237 01:44:37.650 --> 01:44:39.600 to predict whether they're too fatigued and. 1238 01:44:39.600 --> 01:44:42.240 To pull off the road, I mean there are all sorts of 1239 01:44:42.240 --> 01:44:44.160 really neat things already going on. 1240 01:44:44.160 --> 01:44:48.180 And I think that the uses of fully automated assessments, 1241 01:44:48.180 --> 01:44:50.400 you know, they are being critiqued. 1242 01:44:50.400 --> 01:44:52.460 I think that they're here to stay and I think as Helmer 1243 01:44:52.460 --> 01:44:55.460 says, they're useful, continue inevitably. 1244 01:44:55.460 --> 01:44:57.600 And you know, there are some context. 
1245 01:44:57.600 --> 01:45:01.760 I think Ari alluded to some contexts where there may be possibilities 1246 01:45:01.760 --> 01:45:06.640 of no speaking assessment at all because you know, they're just the 1247 01:45:06.640 --> 01:45:09.920 teachers don't have the, you know, the language. 1248 01:45:09.920 --> 01:45:14.180 Ability in the target language to be able to offer any kind of, you know, speech 1249 01:45:14.180 --> 01:45:18.790 in the classroom, so you know a role for formative assessment and the use of technology 1250 01:45:18.790 --> 01:45:20.690 to at least have some speaking in the classroom. 1251 01:45:20.690 --> 01:45:26.790 Although not ideal, but I'm thinking about a high-stakes context in the UK during 1252 01:45:26.790 --> 01:45:32.970 COVID where we've got an education system in the UK that is completely, you 1253 01:45:32.970 --> 01:45:37.170 know, that is completely assessed by standardised tests and when test centres 1254 01:45:37.170 --> 01:45:40.250 closed because of COVID, they introduced. 1255 01:45:40.250 --> 01:45:44.940 Algorithms to try to take into account the school that the person was from 1256 01:45:44.940 --> 01:45:50.080 and and and and use that to adjust performance indices. 1257 01:45:50.080 --> 01:45:54.920 You know, because there's a wide sort of disparity in terms of advantaged and disadvantaged 1258 01:45:54.920 --> 01:45:59.900 schools and that became completely politically unacceptable. 1259 01:45:59.900 --> 01:46:04.060 It just ended up exacerbating societal inequalities. 1260 01:46:04.060 --> 01:46:08.500 And Boris Johnson, the then UK Prime Minister, accused these 1261 01:46:08.500 --> 01:46:11.030 mutant algorithms of being the problem. 1262 01:46:11.030 --> 01:46:14.480 In other words, the, you know, the government of the day was passing the buck. 1263 01:46:14.480 --> 01:46:16.760 It's not our fault.
1264 01:46:16.760 --> 01:46:19.760 This is the mutant algorithms, these awful machines gone awry 1265 01:46:19.760 --> 01:46:22.120 that are, you know, creating this problem. 1266 01:46:22.120 --> 01:46:26.620 And and that's really problem because the accountability lies with the humans. 1267 01:46:26.620 --> 01:46:29.550 It should be the humans that are taking responsibility for this 1268 01:46:29.550 --> 01:46:31.840 and not levelling the blame on the machines. 1269 01:46:31.840 --> 01:46:33.660 I think that's ludicrous. 1270 01:46:33.660 --> 01:46:35.740 So that was just one case. 1271 01:46:35.740 --> 01:46:39.180 I think that there are questions about ethics with this as well. 1272 01:46:39.180 --> 01:46:43.750 Thank you. 1273 01:46:43.750 --> 01:46:47.510 What what comes to my mind about the future of automated assessment 1274 01:46:47.510 --> 01:46:51.270 in in high stakes context is, is that. 1275 01:46:51.270 --> 01:46:54.430 It's likely to continue to depend on the availability 1276 01:46:54.430 --> 01:47:01.280 of of resources and and so there are. It's question of. 1277 01:47:01.280 --> 01:47:05.680 Of the size of the language and and the size of the economy of of of the 1278 01:47:05.680 --> 01:47:08.740 country who who develops those or or can't can't develop. 1279 01:47:08.740 --> 01:47:13.900 So that's sort of unavoidable but kind of general trend. 1280 01:47:13.900 --> 01:47:19.060 So I, I, I I agree with that that we will be seeing more of of automated 1281 01:47:19.060 --> 01:47:22.640 assessment across different purposes of assessment. 1282 01:47:22.640 --> 01:47:25.980 But as I said I, I, I particularly interested in kind 1283 01:47:25.980 --> 01:47:28.460 of assessments that support learning. 1284 01:47:28.460 --> 01:47:31.520 So what I'd like to see in. 
1285 01:47:31.520 --> 01:47:39.010 More of the high stakes examinations and tests is that they would actually produce 1286 01:47:39.010 --> 01:47:42.910 more than just the, let's say one or a couple of overall scores. 1287 01:47:42.910 --> 01:47:49.150 But they would also provide more more feedback on on various aspects of of learner 1288 01:47:49.150 --> 01:47:55.210 learner performance to the the test takers, examinees, even if that's not their main 1289 01:47:55.210 --> 01:48:00.920 function, but the computerization makes that obviously. 1290 01:48:00.920 --> 01:48:07.240 More possible than than in the in the paper age and and particularly 1291 01:48:07.240 --> 01:48:12.520 if, if and when the automated scoring algorithms develop. 1292 01:48:12.520 --> 01:48:18.540 And because unavoidably they they consider a vast number of different aspects of performance 1293 01:48:18.540 --> 01:48:22.680 to come up with, let's say perhaps only one one overall score. 1294 01:48:22.680 --> 01:48:25.090 So, so actually they. 1295 01:48:25.090 --> 01:48:30.050 A lot of that detail sort of remains unused and then of course 1296 01:48:30.050 --> 01:48:34.690 it's it's also then of course a pedagogical challenge to think 1297 01:48:34.690 --> 01:48:39.800 about what we which part of of of the number of. 1298 01:48:39.800 --> 01:48:43.940 Features that the systems analyse actually can be turned into 1299 01:48:43.940 --> 01:48:47.660 meaningful feedback how to human and and so on. 1300 01:48:47.660 --> 01:48:52.040 But I I think that's the kind of a potential that I see for automated assessment 1301 01:48:52.040 --> 01:48:57.420 in in high stakes context that that's not not done yet. 1302 01:48:57.420 --> 01:49:01.780 Yeah, I think that makes very much sense what what Ari said so.
1303 01:49:01.780 --> 01:49:06.420 The feedback be something that we should improve and maybe getting when we get more data 1304 01:49:06.420 --> 01:49:11.290 we can do it And that is something where where the automatic system can help both the 1305 01:49:11.290 --> 01:49:17.200 teachers and and and test takers so they can give feedback to both And of of course obviously 1306 01:49:17.200 --> 01:49:22.260 for self training kind of simulating the high tech systems and and then training and 1307 01:49:22.260 --> 01:49:27.410 preparing for those but also in the actual high stakes exams to kind of make it more 1308 01:49:27.410 --> 01:49:33.270 transparent and give some feedback. 1309 01:49:33.270 --> 01:49:37.030 I mean we're we're ohh, sorry. Go ahead help. No, go ahead Talia. Go ahead. 1310 01:49:37.030 --> 01:49:37.310 Ohh. 1311 01:49:37.310 --> 01:49:39.520 I I was gonna go on slightly different tangent so if 1312 01:49:39.520 --> 01:49:42.110 yours is relevant to what they were saying. OK, yeah. 1313 01:49:42.110 --> 01:49:44.450 So what what you're just explaining is what we are doing 1314 01:49:44.450 --> 01:49:46.270 in the research project at the moment. 1315 01:49:46.270 --> 01:49:49.990 So it's about children learning to read and we try to 1316 01:49:49.990 --> 01:49:52.970 make report out of their reading aloud. 1317 01:49:52.970 --> 01:49:56.370 And this is then used for personalised learning so that we 1318 01:49:56.370 --> 01:49:59.970 can see from the report what are they doing. 1319 01:49:59.970 --> 01:50:03.920 Well what are their problems and then we can adjust their. 1320 01:50:03.920 --> 01:50:06.750 Well, the task that exercises, So they have to do that. 1321 01:50:06.750 --> 01:50:10.540 Exactly what we do and what they said about data. 1322 01:50:10.540 --> 01:50:15.140 So for instance, we now one of our projects became product, it's reading 1323 01:50:15.140 --> 01:50:20.420 tutor and now more than 1000 children day are using it. 
1324 01:50:20.420 --> 01:50:24.180 So a lot of data is coming in, more than a million exercises a month. 1325 01:50:24.180 --> 01:50:26.060 So that's great. 1326 01:50:26.060 --> 01:50:30.900 However, with the GDPR you do have a problem, you cannot 1327 01:50:30.900 --> 01:50:34.070 simply use all data for all purposes. 1328 01:50:34.070 --> 01:50:39.080 This is really the GDPR that is introduced, I think mainly for Facebook, Google 1329 01:50:39.080 --> 01:50:43.330 and all the other big companies that make wrong use of our data. 1330 01:50:43.330 --> 01:50:48.530 But this is problematic for research and products for language learning. 1331 01:50:48.530 --> 01:50:50.550 But I totally agree with what you said. 1332 01:50:50.550 --> 01:50:55.810 So that's more formative assessment than probably say between. 1333 01:50:55.810 --> 01:51:02.480 Using this to help people to learn better, personalised learning, is 1334 01:51:02.480 --> 01:51:08.510 certainly something promising that we can do, yeah? 1335 01:51:08.510 --> 01:51:10.790 So I was going to go on a slightly different tangent. 1336 01:51:10.790 --> 01:51:15.120 I mean we all know about ChatGPT and that this is going to completely revolutionise, 1337 01:51:15.120 --> 01:51:20.300 I mean not only ChatGPT but the competitive products and products to to follow 1338 01:51:20.300 --> 01:51:22.730 that will revolutionise the jobs of tomorrow. 1339 01:51:22.730 --> 01:51:26.810 I mean if you think about, I mean I I hope that there will be a place 1340 01:51:26.810 --> 01:51:33.040 for humans in all element in all aspects of the test testing process 1341 01:51:33.040 --> 01:51:36.450 in future, but already you're seeing companies. 1342 01:51:36.450 --> 01:51:43.180 Even ones that are, you know, I'm thinking of Duolingo colleagues now who are using. 1343 01:51:43.180 --> 01:51:48.180 You know ChatGPT models to try to generate items for tests.
1344 01:51:48.180 --> 01:51:53.140 So we've got sort of machines involved in test development now you know and 1345 01:51:53.140 --> 01:51:57.640 and these these technologies, I mean this is competitive market there, there 1346 01:51:57.640 --> 01:52:01.390 is push to improve them to make them more accessible. 1347 01:52:01.390 --> 01:52:04.740 I mean they're already publicly accessible, but you could think of multimodal 1348 01:52:04.740 --> 01:52:09.900 kind of thing coming into play very soon where you could play speech sample 1349 01:52:09.900 --> 01:52:13.260 and then have the algorithm spit out elements of the. 1350 01:52:13.260 --> 01:52:15.940 Properties of the speech to describe and you know there 1351 01:52:15.940 --> 01:52:18.010 are all sorts of different applications. 1352 01:52:18.010 --> 01:52:23.890 I just I I I mean I think that certainly these technologies will continue to advance. 1353 01:52:23.890 --> 01:52:28.650 I do hope there is still some place though for human assessment, 1354 01:52:28.650 --> 01:52:31.750 you know be it in some kind of hybrid mode. 1355 01:52:31.750 --> 01:52:36.100 But you know obviously these things are all trained on on humans 1356 01:52:36.100 --> 01:52:38.830 and there was question about bias as well. 1357 01:52:38.830 --> 01:52:43.730 I wanted to talk about bias in the context of the human data. 1358 01:52:43.730 --> 01:52:48.500 That the machine is trained on because we know that implicit associations. 1359 01:52:48.500 --> 01:52:55.660 For example, you know women teacher, nurse men, engineer, scientists. 1360 01:52:55.660 --> 01:52:59.930 You know these things exist in corpora and there may be place you 1361 01:52:59.930 --> 01:53:04.660 know so these societal associations about good and bad and we know 1362 01:53:04.660 --> 01:53:07.780 all about implicit biases and psychology. 1363 01:53:07.780 --> 01:53:11.880 You know these these are embedded in the data that we're training the machines on. 
1364 01:53:11.880 --> 01:53:14.060 So I think that there is some kind of. 1365 01:53:14.060 --> 01:53:21.090 Responsibility elements in terms of sort of how to mitigate these biases, and I mean 1366 01:53:21.090 --> 01:53:25.610 I think I wanna go back also, it's bit of mishmash of points, but to Helmer's 1367 01:53:25.610 --> 01:53:29.910 points about what is the standard, what should be the appropriate cut off in terms 1368 01:53:29.910 --> 01:53:35.860 of speech, even if you think about vowel accuracy? You know what is intelligible? 1369 01:53:35.860 --> 01:53:37.800 What is not intelligible? What you know. 1370 01:53:37.800 --> 01:53:40.980 I remember I was doing demo. I was PhD student. 1371 01:53:40.980 --> 01:53:43.430 I'm I'm Canadian. English is my mother tongue. 1372 01:53:43.430 --> 01:53:49.330 And there was some kind of demo of the use of an AI system. 1373 01:53:49.330 --> 01:53:55.320 You know, it was virtual patient and doctor chat and I went on to try to model how it 1374 01:53:55.320 --> 01:54:00.690 worked and it was quite embarrassing when my Canadian vowel got rejected. 1375 01:54:00.690 --> 01:54:05.390 And it's wrong because the algorithm, you know, had very restricted cut off and I said 1376 01:54:05.390 --> 01:54:09.530 about you or something instead of about or however they pronounce it. 1377 01:54:09.530 --> 01:54:13.850 So, I mean these are real, you know, important things because of course high 1378 01:54:13.850 --> 01:54:17.630 stakes decisions can be made on on the basis of this and at the moment the 1379 01:54:17.630 --> 01:54:20.310 kinds of tasks that are used as as has been raised. 1380 01:54:20.310 --> 01:54:23.880 Before and the the you know what counts is right and wrong. 1381 01:54:23.880 --> 01:54:27.020 I mean these are things human raters don't agree on it. 
1382 01:54:27.020 --> 01:54:31.820 If you go back to the assessing speaking literature to the time of what you know, 1383 01:54:31.820 --> 01:54:38.540 the early 19, you know early 20th century and into Robert Lado in the 60s, there 1384 01:54:38.540 --> 01:54:43.460 were concerns about the subjectivity of human scoring you know of of scoring 1385 01:54:43.460 --> 01:54:48.040 of speech that even if you have a vowel, two people couldn't agree on whether it's 1386 01:54:48.040 --> 01:54:50.750 correct or incorrect and this persists. There are 1387 01:54:50.750 --> 01:54:52.550 vowel intelligibility 1388 01:54:52.550 --> 01:54:57.050 studies that show that there's not 100% agreement across two highly trained human raters, 1389 01:54:57.050 --> 01:55:02.450 even phoneticians won't agree on this 100% of the time necessarily, you know. 1390 01:55:02.450 --> 01:55:08.470 So in a way, having automated systems kind of allays problems of subjectivity, but introduces 1391 01:55:08.470 --> 01:55:12.230 a whole other set of problems which we've been talking about today. 1392 01:55:12.230 --> 01:55:14.130 So it is a bit of a revolution. 1393 01:55:14.130 --> 01:55:20.530 It will be interesting to see what comes next. 1394 01:55:20.530 --> 01:55:23.210 OK. Thank you very much. 1395 01:55:23.210 --> 01:55:25.790 What about interaction? 1396 01:55:25.790 --> 01:55:29.930 That's an important part of of the communicative language construct. 1397 01:55:29.930 --> 01:55:34.480 Do you see that there is any chance in the near future that we could 1398 01:55:34.480 --> 01:55:39.390 be assessing interaction automatically somehow in 1399 01:55:39.390 --> 01:55:47.450 language learners' speech. 1400 01:55:47.450 --> 01:55:51.990 Yeah, I think Talia already mentioned voice bots or avatars, something along that line. 1401 01:55:51.990 --> 01:55:57.680 So we are carrying out a couple of research projects with voice bots. 1402 01:55:57.680 --> 01:56:00.600 Also the technology is gradually improving.
1403 01:56:00.600 --> 01:56:04.360 So I think that in the future we will see more also of that that you 1404 01:56:04.360 --> 01:56:10.720 talk to an avatar on the screen and then you can also well train interaction 1405 01:56:10.720 --> 01:56:13.940 but also of course assess interaction. 1406 01:56:13.940 --> 01:56:17.800 It's then it's not human to human interaction, that's another thing, 1407 01:56:17.800 --> 01:56:32.680 but it's simulated human to avatar interaction. 1408 01:56:32.680 --> 01:56:38.400 Well, I'm reminded of Luke Harding's plenary here about a couple of months ago 1409 01:56:38.400 --> 01:56:48.250 he was visiting us and because of of of of of an event here and and. What? What? 1410 01:56:48.250 --> 01:56:50.050 He. 1411 01:56:50.050 --> 01:56:54.520 So the highlight, the, it was, well this is partly my own interpretation 1412 01:56:54.520 --> 01:56:58.610 was that that OK, we need to define. 1413 01:56:58.610 --> 01:57:05.490 What do we actually mean and and what is involved and what is what gets rated even by humans 1414 01:57:05.490 --> 01:57:10.170 when we when we rate interactional competence, pragmatic competence. 1415 01:57:10.170 --> 01:57:12.170 I mean that's that's a challenge. 1416 01:57:12.170 --> 01:57:17.910 Which is why there are not that many speaking tests where where you you 1417 01:57:17.910 --> 01:57:24.240 actually somehow even, well, human rated are sort of. 1418 01:57:24.240 --> 01:57:29.020 That's obviously where actually you have something like like discourse management 1419 01:57:29.020 --> 01:57:36.370 or or or interactional competence or what whatever you turn taking behaviour because 1420 01:57:36.370 --> 01:57:39.760 this is hard to define in a way that people can agree on. 1421 01:57:39.760 --> 01:57:48.400 So that's that's one challenge then actually Luke was presenting a very interesting. 1422 01:57:48.400 --> 01:57:51.910 Case of of a, of a piece of negotiation.
1423 01:57:51.910 --> 01:57:55.480 Negotiation of it was about. 1424 01:57:55.480 --> 01:58:01.510 Agreement versus disagreement among two persons talking and actually the. 1425 01:58:01.510 --> 01:58:06.720 The way the the interaction unfolded actually took quite a long time and it was 1426 01:58:06.720 --> 01:58:11.300 it was complex it was over a number of different sort of turns and actually he presented 1427 01:58:11.300 --> 01:58:19.300 as as kind of a case where obviously any any automated rating system would 1428 01:58:19.300 --> 01:58:25.120 would would would sort of really struggle to make make some some sense of because I 1429 01:58:25.120 --> 01:58:29.540 mean well humans could could do that when they were paying paying attention to that 1430 01:58:29.540 --> 01:58:31.880 So I I see that as a very. 1431 01:58:31.880 --> 01:58:38.440 Challenging piece piece of approach but it depends on when two persons, be that two 1432 01:58:38.440 --> 01:58:44.080 humans or a human and a machine, actually talk to each other then the question is 1433 01:58:44.080 --> 01:58:49.880 what is it that we are rating in them because obviously probably less of a challenge 1434 01:58:49.880 --> 01:58:55.300 to to figure out that okay speaker one has has this kind of fluency and and and 1435 01:58:55.300 --> 01:59:00.540 pronunciation and and so on and so forth the second speaker but then what what 1436 01:59:00.540 --> 01:59:02.830 are the things that in the actually. 1437 01:59:02.830 --> 01:59:07.640 That are part of the interaction between them that we should should be looking for first 1438 01:59:07.640 --> 01:59:11.240 of all and and then how, how to sort of measure that automatically. 1439 01:59:11.240 --> 01:59:19.750 I think that's quite quite kind of two levels of challenge here. 1440 01:59:19.750 --> 01:59:23.070 Yes, I think people are often. 1441 01:59:23.070 --> 01:59:25.050 No, let me start in another way.
1442 01:59:25.050 --> 01:59:28.850 So I teach courses in English, so it was obligatory for me to do 1443 01:59:28.850 --> 01:59:32.600 Cambridge Proficiency of English Tests some time ago. 1444 01:59:32.600 --> 01:59:35.710 And I did it and it's totally manual. 1445 01:59:35.710 --> 01:59:41.820 And I had to do writing, reading, listening, speaking and conversation. 1446 01:59:41.820 --> 01:59:47.280 And while I did it, of course I was also trying to evaluate the 1447 01:59:47.280 --> 01:59:50.800 the process and the conversation is quite strange. 1448 01:59:50.800 --> 01:59:54.620 So you couple to another person that is tested and somebody and you start 1449 01:59:54.620 --> 01:59:57.140 talking and somebody in background is evaluating you. 1450 01:59:57.140 --> 02:00:00.520 And I think this is very subjective and it depends on the person you 1451 02:00:00.520 --> 02:00:04.810 couple to et cetera, but nobody is questioning that. 1452 02:00:04.810 --> 02:00:09.690 For some reason, and this is something I see quite often, that if you 1453 02:00:09.690 --> 02:00:13.990 try to do something automatically with the with technology, people are 1454 02:00:13.990 --> 02:00:17.050 very critical and they look at all the problems. 1455 02:00:17.050 --> 02:00:21.470 But if it's done by humans, well, they're less critical 1456 02:00:21.470 --> 02:00:24.690 in way and it's got to be critical. 1457 02:00:24.690 --> 02:00:28.990 And I think before you start using technology for high stake tests, you should 1458 02:00:28.990 --> 02:00:32.300 prove that it works and you should try to find out how to use it. 1459 02:00:32.300 --> 02:00:35.420 That's like I said before, but I think we should. 1460 02:00:35.420 --> 02:00:41.060 Also try to look at what are the possible ways we can use technology and. 1461 02:00:41.060 --> 02:00:44.960 People are not perfect. Human teachers are not perfect. 
1462 02:00:44.960 --> 02:00:48.940 They are good, and they help people learn languages, and they're good 1463 02:00:48.940 --> 02:00:53.580 at carrying out tests, but they're also not not perfect. 1464 02:00:53.580 --> 02:00:57.340 So like I said, I think we should look more forwards. 1465 02:00:57.340 --> 02:01:01.020 Where can we use technology to help humans? 1466 02:01:01.020 --> 02:01:05.390 Not replace them, but help humans? 1467 02:01:05.390 --> 02:01:08.550 I I agree with that last point, Helmer, about using technology 1468 02:01:08.550 --> 02:01:11.590 to to harness human assessment. 1469 02:01:11.590 --> 02:01:17.570 I don't fully agree with the lack of criticality about group assessments and peer assessments 1470 02:01:17.570 --> 02:01:21.590 because there has been some published literature in language testing on that. 1471 02:01:21.590 --> 02:01:26.110 It's not as recent though as the technology stuff, which is, but there there used to be some 1472 02:01:26.110 --> 02:01:31.690 heated discussions at conferences a decade plus, over a decade ago, I I would say. 1473 02:01:31.690 --> 02:01:35.930 I think actually there's a special issue on the topic in Language Testing. 1474 02:01:35.930 --> 02:01:38.120 But probably from. 1475 02:01:38.120 --> 02:01:40.530 Yeah, from well over a decade ago I would say. 1476 02:01:40.530 --> 02:01:42.990 So it it it probably 2009. 1477 02:01:42.990 --> 02:01:45.390 But yeah, no, I mean you do raise some points. 1478 02:01:45.390 --> 02:01:48.750 I think with interactional competence there are other things to think about. 1479 02:01:48.750 --> 02:01:51.890 I mean just in terms of the speaking construct being measured, I always 1480 02:01:51.890 --> 02:01:54.910 wonder who should be assessed for these things.
1481 02:01:54.910 --> 02:02:01.700 You know, in a way maybe native speakers probably should be assessed as well if 1482 02:02:01.700 --> 02:02:05.490 they're, you know, if that part of the, if that part of the speaking construct is 1483 02:02:05.490 --> 02:02:08.670 relevant, then you can't really presume that native speakers. 1484 02:02:08.670 --> 02:02:12.630 Illness you know shouldn't should be exempt from that part of the assessment. 1485 02:02:12.630 --> 02:02:16.620 So yeah, I think that there are, there are matters to think about in terms 1486 02:02:16.620 --> 02:02:21.140 of just what what is being measured, what is appropriate. 1487 02:02:21.140 --> 02:02:25.740 You know if you want to sort of tap into different interactional patterns then it's 1488 02:02:25.740 --> 02:02:31.620 good to do that with a broader construct of assessing speaking then always thinking 1489 02:02:31.620 --> 02:02:36.580 about what is the like I I worry a lot about machine driven assessment where you're 1490 02:02:36.580 --> 02:02:39.060 introducing tasks because the machine does it well. 1491 02:02:39.060 --> 02:02:43.930 Solely and you're not thinking about what the real what what do people have to do in the 1492 02:02:43.930 --> 02:02:49.690 real world learners in classrooms to succeed in life you know or do whatever it is they 1493 02:02:49.690 --> 02:02:55.610 need to do to integrate into or get that job that they want or you know in high stakes 1494 02:02:55.610 --> 02:03:02.230 settings right what is the so we we always should not lose sight of the kind of what Lyle 1495 02:03:02.230 --> 02:03:06.670 Bachman has called, and and Bachman and Palmer called, the target language use domain so there 1496 02:03:06.670 --> 02:03:09.520 should still even with the use of technology.
1497 02:03:09.520 --> 02:03:14.190 Be some link between what test takers are being asked to do in the assessment and then 1498 02:03:14.190 --> 02:03:17.930 what they need to do in in terms of the real world language skills. 1499 02:03:17.930 --> 02:03:20.750 So I think we need to just make sure now and in the 1500 02:03:20.750 --> 02:03:22.740 future that we don't lose sight of that. 1501 02:03:22.740 --> 02:03:25.630 And of course, part of the problem with some of the existing fully 1502 02:03:25.630 --> 02:03:27.870 automated tests is that they're very inauthentic. 1503 02:03:27.870 --> 02:03:31.810 They might be used for, you know. 1504 02:03:31.810 --> 02:03:37.060 Academic gatekeeping, but they are not academic in terms of their, you 1505 02:03:37.060 --> 02:03:40.140 know, in terms of the kinds of questioning, in terms of the response 1506 02:03:40.140 --> 02:03:42.580 cases that are expected and so on and so forth. 1507 02:03:42.580 --> 02:03:47.720 So yeah, I think that's just an important consideration in all of this. 1508 02:03:47.720 --> 02:03:52.510 Perhaps one way to address that would be to think that the. 1509 02:03:52.510 --> 02:03:54.310 OK. 1510 02:03:54.310 --> 02:03:57.750 If if even if it's very challenging for both both human 1511 02:03:57.750 --> 02:04:01.930 and and machines to sort of rate interaction. 1512 02:04:01.930 --> 02:04:06.050 Would it be enough that actually that task actually requires that 1513 02:04:06.050 --> 02:04:12.010 that you interact with with somebody even if if the focus of assessment 1514 02:04:12.010 --> 02:04:15.710 is is not not in in in that interaction. 1515 02:04:15.710 --> 02:04:20.790 But actually that there is there is this authenticity in in the fact that. 1516 02:04:20.790 --> 02:04:23.350 That that actually the the. 1517 02:04:23.350 --> 02:04:26.490 Test tasks somehow try to capture. 1518 02:04:26.490 --> 02:04:34.420 In in interesting and important ways of of of communicating in interact and and then. 
1519 02:04:34.420 --> 02:04:36.820 Whatever is possible gets assessed. 1520 02:04:36.820 --> 02:04:44.100 Even if the interaction is is may maybe not sort of captured by whatever rating 1521 02:04:44.100 --> 02:04:55.320 system is is is used sort of as as fully as as would would be ideal. 1522 02:04:55.320 --> 02:05:02.150 I hate to interrupt but we are running out of time, well, we are already over time but. 1523 02:05:02.150 --> 02:05:07.890 Maybe if there are any questions from the audience here or online we. 1524 02:05:07.890 --> 02:05:10.990 Take one more question, then whoever needs to leave. 1525 02:05:10.990 --> 02:05:14.090 Feel free to leave, but if you are, if there are any 1526 02:05:14.090 --> 02:05:20.370 questions or are there any questions? One question. 1527 02:05:20.370 --> 02:05:35.620 OK. Again. 1528 02:05:35.620 --> 02:05:38.170 So. 1529 02:05:38.170 --> 02:05:43.370 Yeah, we have one audience question from online, it goes like this. 1530 02:05:43.370 --> 02:05:49.100 How do you see advancement and use of AI generate generative engines 1531 02:05:49.100 --> 02:05:54.010 such as ChatGPT, Bard, etcetera in spoken assessment? 1532 02:05:54.010 --> 02:05:58.730 Would dialogue with an assessment bot be reliable in the near future? 1533 02:05:58.730 --> 02:06:08.420 What are the main challenges? 1534 02:06:08.420 --> 02:06:10.220 Yeah, maybe I can start. 1535 02:06:10.220 --> 02:06:17.730 So what we have seen already is that given some kind of input, these 1536 02:06:17.730 --> 02:06:24.640 large language models are able to provide feedback or explanations. 1537 02:06:24.640 --> 02:06:30.920 So we could think that maybe this could kind of help us to verbalise the feedback 1538 02:06:30.920 --> 02:06:36.840 into some way that is easy to understand, something that a human really. 1539 02:06:36.840 --> 02:06:38.640 Is sort of good at, right?
1540 02:06:38.640 --> 02:06:41.820 But it it doesn't just doesn't have time for writing, so 1541 02:06:41.820 --> 02:06:48.210 it's kind of a helping tool tool in that. 1542 02:06:48.210 --> 02:06:53.350 Yes, I guess maybe Ari knows more about this the kind of written language situation 1543 02:06:53.350 --> 02:06:57.930 at the moment, because it clearly, ChatGPT can give feedback already 1544 02:06:57.930 --> 02:07:01.330 now about your essays and and your language skills. 1545 02:07:01.330 --> 02:07:07.700 So probably, but I haven't explored the issue here so. 1546 02:07:07.700 --> 02:07:09.860 No, chat, so it's for chatting. 1547 02:07:09.860 --> 02:07:12.460 So it's written, It's indeed it's written. 1548 02:07:12.460 --> 02:07:17.670 So if you can turn it into speech, which is possible, of course you can. 1549 02:07:17.670 --> 02:07:21.730 With text to speech you can verbalise that the written method, so 1550 02:07:21.730 --> 02:07:26.090 I think it can be used to simulate conversation. 1551 02:07:26.090 --> 02:07:30.660 However, the problem then is that the tests are not standardised. 1552 02:07:30.660 --> 02:07:34.920 One person might get more difficult. Sentences. 1553 02:07:34.920 --> 02:07:38.040 Questions. Than the other one. 1554 02:07:38.040 --> 02:07:43.390 And that's something you have to deal with then. 1555 02:07:43.390 --> 02:07:47.930 Yeah, I guess they're the same kinds of things that you deal with in human 1556 02:07:47.930 --> 02:07:50.390 assessments that make them quite inauthentic, right. 1557 02:07:50.390 --> 02:07:54.430 The scripting that is involved in some kind of standardisation across test takers.
1558 02:07:54.430 --> 02:07:58.730 But I guess there probably is potential for some kind of adaptive algorithm as well to 1559 02:07:58.730 --> 02:08:03.970 look at and take into account performance speaking performance on an initial task and then 1560 02:08:03.970 --> 02:08:09.270 to introduce some more complex or more difficult tasks that are tailored very much to the learner's, 1561 02:08:09.270 --> 02:08:12.210 what the machine thinks is the learner's real ability level. 1562 02:08:12.210 --> 02:08:15.520 There are them scoring systems that. Can do that. 1563 02:08:15.520 --> 02:08:18.660 So you could see that there is great potential for that kind of thing. 1564 02:08:18.660 --> 02:08:23.380 I think you know Helmer you were talking about sort of personalised learning. 1565 02:08:23.380 --> 02:08:26.780 I I you know I I hear about personalised medicine quite a lot. 1566 02:08:26.780 --> 02:08:29.990 You know personalization is something that is coming out a lot. 1567 02:08:29.990 --> 02:08:35.080 You know looking at genetic data and trying to come up with sort of you know different regimes 1568 02:08:35.080 --> 02:08:39.820 and predictors of what diseases you'll get in trying to find ways of mitigating that and 1569 02:08:39.820 --> 02:08:44.460 you could see that in learning contexts as well with systems that. 1570 02:08:44.460 --> 02:08:50.970 Process, you know complex data that there might be some kind of way of you know improving 1571 02:08:50.970 --> 02:08:55.810 feedback provision as Ari was talking about from from the learner, but also adopting, 1572 02:08:55.810 --> 02:09:00.850 adapting the kind of tasks but in a very kind of fixed way.
1573 02:09:00.850 --> 02:09:06.930 You might have some kind of stem task, you know an initial task to look at, you know the level 1574 02:09:06.930 --> 02:09:11.650 and then from there the machine would gradually in an adaptive kind of way gauge level 1575 02:09:11.650 --> 02:09:15.310 and and introduce more difficult or easier tasks and in this way. 1576 02:09:15.310 --> 02:09:19.620 They could get you know you could get some kind of metric of of performance 1577 02:09:19.620 --> 02:09:24.060 ability you know from beginner to advanced potentially. 1578 02:09:24.060 --> 02:09:31.160 I mean I think that the the opportunities are limitless it's it will be interesting to think 1579 02:09:31.160 --> 02:09:36.460 about how these things evolve and also how the assessment industry evolves. 1580 02:09:36.460 --> 02:09:40.840 But I think we always need to be kind of critical of having a good critical eye of 1581 02:09:40.840 --> 02:09:45.870 these processes and you know there there will be a lot to talk about. 1582 02:09:45.870 --> 02:09:54.730 And a lot to debate in the years ahead, I'm sure. 1583 02:09:54.730 --> 02:09:58.490 Thank you so very much. I think we need to wrap it up now. 1584 02:09:58.490 --> 02:10:06.330 So I thank you our our guest discussants and and our local speakers 1585 02:10:06.330 --> 02:10:08.590 as well very much for the panel discussion. 1586 02:10:08.590 --> 02:10:13.430 And and on behalf of the Digitala project, I also thank our audience here 1587 02:10:13.430 --> 02:10:18.110 and online that you have participated and given us very. 1588 02:10:18.110 --> 02:10:20.700 Good questions and comments. 1589 02:10:20.700 --> 02:10:28.480 And yeah, if, if there are any closing words from our project leader, no then. 1590 02:10:28.480 --> 02:10:31.770 Then I will close this seminar. Thank you very much.