• Measuring L2 speaking

      Nakatsuhara, Fumiyo; Inoue, Chihiro; Khabbazbashi, Nahal (Routledge, 2019-07-11)
      This chapter on measuring L2 speaking has three main focuses: (a) construct representation, (b) test methods and task design, and (c) scoring and feedback. We will briefly trace the different ways in which speaking constructs have been defined over the years and operationalized using different test methods and task features. We will then discuss the challenges and opportunities that speaking tests present for scoring and providing feedback to learners. We will link these discussions to the current understanding of SLA theories and empirical research, learning-oriented assessment approaches and advances in educational technology.
    • Opening the black box: exploring automated speaking evaluation

      Khabbazbashi, Nahal; Xu, Jing; Galaczi, Evelina D. (Springer, 2021-02-10)
      The rapid advances in speech processing and machine learning technologies have attracted language testers’ strong interest in developing automated speaking assessment, in which candidate responses are scored by computer algorithms rather than trained human examiners. Despite its increasing popularity, automatic evaluation of spoken language is still shrouded in mystery and technical jargon, often resembling an opaque "black box" that transforms candidate speech into scores in a matter of minutes. Our chapter explicitly problematizes this lack of transparency around test score interpretation and use and asks the following questions: What do automatically derived scores actually mean? What are the speaking constructs underlying them? What are some common problems encountered in automated assessment of speaking? And how can test users evaluate the suitability of automated speaking assessment for their proposed test uses? In addressing these questions, the purpose of our chapter is to explore the benefits, problems, and caveats associated with automated speaking assessment, touching on key theoretical discussions on construct representation and score interpretation as well as practical issues such as the infrastructure necessary for capturing high-quality audio and the difficulties associated with acquiring training data. We hope to promote assessment literacy by providing the necessary guidance for users to critically engage with automated speaking assessment, pose the right questions to test developers, and ultimately make informed decisions regarding the fitness for purpose of automated assessment solutions for their specific learning and assessment contexts.
    • Scoring validity of the Aptis Speaking test: investigating fluency across tasks and levels of proficiency

      Tavakoli, Parvaneh; Nakatsuhara, Fumiyo; Hunter, Ann-Marie (British Council, 2017-11-16)
      Second language oral fluency has long been considered an important construct in communicative language ability (e.g. de Jong et al., 2012), and many speaking tests are designed to measure fluency aspect(s) of candidates’ language (e.g. IELTS, TOEFL iBT, PTE Academic). Current research in second language acquisition suggests that a number of measures of speed, breakdown and repair fluency can reliably assess fluency and predict proficiency. However, there is little research evidence to indicate which measures best characterise fluency at each level of proficiency, and which can consistently distinguish one proficiency level from the next. This study is an attempt to help answer these questions. It investigated fluency constructs across four different levels of proficiency (A2–C1) and four different semi-direct speaking test tasks performed by 32 candidates taking the Aptis Speaking test. Using PRAAT (Boersma & Weenink, 2013), we analysed 120 task performances on different aspects of utterance fluency, including speed, breakdown and repair measures, across different tasks and levels of proficiency. The results suggest that speed measures consistently distinguish fluency across different levels of proficiency, and that many of the breakdown measures differentiate between the lower (A2, B1) and higher (B2, C1) levels. The varied use of repair measures at different proficiency levels and tasks suggests that a more complex process is at play. The non-significant differences in most of the fluency measures across the four tasks suggest that fluency is not affected by task type in the Aptis Speaking test. The implications of the findings are discussed in relation to the Aptis Speaking test fluency rating scales and rater training materials.
    • Testing speaking skills: why and how?

      Nakatsuhara, Fumiyo; Inoue, Chihiro; University of Bedfordshire (2013-09-16)
    • Towards a model of multi-dimensional performance of C1 level speakers assessed in the Aptis Speaking Test

      Nakatsuhara, Fumiyo; Tavakoli, Parvaneh; Awwad, Anas; British Council; University of Bedfordshire; University of Reading; Isra University, Jordan (British Council, 2019-09-14)
      This is a peer-reviewed online research report in the British Council Validation Series (https://www.britishcouncil.org/exam/aptis/research/publications/validation). Abstract: The current study draws on the findings of Tavakoli, Nakatsuhara and Hunter’s (2017) quantitative study, which failed to identify any statistically significant differences between various fluency features in speech produced by B2 and C1 level candidates in the Aptis Speaking test. This study set out to examine whether other aspects of the speakers’ performance at these two levels, namely lexical and syntactic complexity, accuracy and use of metadiscourse markers, distinguished the two levels. In order to understand the relationship between fluency and these other aspects of performance, the study employed a mixed-methods approach to analysing the data. The quantitative analysis included descriptive statistics, t-tests and correlational analyses of the various linguistic measures. For the qualitative analysis, we used a discourse analysis approach to examine the speakers’ pausing behaviour in the context in which the pauses occurred in their speech. The results indicated that the two proficiency levels were statistically different on measures of accuracy (weighted clause ratio) and lexical diversity (TTR and D), with C1 level candidates producing more accurate and lexically diverse output. The correlation analyses showed that speed fluency correlated positively with weighted clause ratio and negatively with length of clause. Speed fluency was also positively related to lexical diversity, but negatively linked with lexical errors. As for pauses, frequency of end-clause pauses was positively linked with length of AS-units. Mid-clause pauses also positively correlated with lexical diversity and use of discourse markers. Repair fluency correlated positively with length of clause, and negatively with weighted clause ratio. Repair measures were also negatively linked with number of errors per 100 words and metadiscourse marker type. The qualitative analyses suggested that the pauses mainly occurred (a) to facilitate access and retrieval of lexical and structural units, (b) to reformulate units already produced, and (c) to improve communicative effectiveness. A number of speech excerpts are presented to illustrate these patterns. It is hoped that the findings of this research offer a better understanding of the construct measured at B2 and C1 levels of the Aptis Speaking test, inform possible refinements of the Aptis Speaking rating scales, and enhance its rater training programme for the two highest levels of the test.
    • Towards new avenues for the IELTS Speaking Test: insights from examiners’ voices

      Inoue, Chihiro; Khabbazbashi, Nahal; Lam, Daniel M. K.; Nakatsuhara, Fumiyo (IELTS Partners, 2021-02-19)
      This study investigated the examiners’ views on all aspects of the IELTS Speaking Test, namely, the test tasks, topics, format, interlocutor frame, examiner guidelines, test administration, rating, training and standardisation, and test use. The overall trends in the examiners’ views of these aspects of the test were captured by a large-scale online questionnaire, to which a total of 1203 examiners responded. Based on the questionnaire responses, 36 examiners were carefully selected for subsequent interviews to explore the reasons behind their views in depth. The 36 examiners were representative of a number of different geographical regions and of a range of views and experiences in examining and delivering examiner training. While the questionnaire responses exhibited generally positive views from examiners on the current IELTS Speaking Test, the interview responses uncovered various issues that the examiners had experienced and suggested potentially beneficial modifications. Many of the issues (e.g. potentially unsuitable topics, rigidity of interlocutor frames) were attributable to the huge candidature of the IELTS Speaking Test, which has vastly expanded since the test’s last revision in 2001, perhaps beyond the initial expectations of the IELTS Partners. This study synthesized the voices of examiners and insights from relevant literature, and incorporated the guideline checks we submitted to the IELTS Partners. This report concludes with a number of suggestions for potential changes to the current IELTS Speaking Test, so as to enhance its validity and accessibility in today’s ever-globalising world.
    • Validating speaking test rating scales through microanalysis of fluency using PRAAT

      Tavakoli, Parvaneh; Nakatsuhara, Fumiyo; Hunter, Ann-Marie; University of Reading; University of Bedfordshire; St. Mary’s University (2017-07-06)
    • Video-conferencing speaking tests: do they measure the same construct as face-to-face tests?

      Nakatsuhara, Fumiyo; Inoue, Chihiro; Berry, Vivien; Galaczi, Evelina D.; University of Bedfordshire; British Council; Cambridge Assessment English (Routledge, 2021-08-23)
      This paper investigates the comparability between the video-conferencing and face-to-face modes of the IELTS Speaking Test in terms of scores and language functions generated by test-takers. Data were collected from 10 trained IELTS examiners and 99 test-takers who took two speaking tests under face-to-face and video-conferencing conditions. Many-facet Rasch Model (MFRM) analysis of test scores indicated that the delivery mode did not make any meaningful difference to test-takers’ scores. An examination of language functions revealed that the two modes elicited the same language functions equally, with the exception of asking for clarification: more test-takers made clarification requests in the video-conferencing mode (63.3%) than in the face-to-face mode (26.7%). Drawing on the findings and their practical implications, we extend emerging thinking about video-conferencing speaking assessment and the associated features of this modality in its own right.
    • What counts as ‘responding’? Contingency on previous speaker contribution as a feature of interactional competence

      Lam, Daniel M. K. (Sage, 2018-05-10)
      The ability to interact with others has gained recognition as part of the L2 speaking construct in the assessment literature and in high- and low-stakes speaking assessments. This paper first presents a review of the literature on interactional competence (IC) in L2 learning and assessment. It then discusses a particular feature – producing responses contingent on previous speaker contribution – that emerged as a de facto construct feature of IC oriented to by both candidates and examiners within the school-based group speaking assessment in the Hong Kong Diploma of Secondary Education (HKDSE) English Language Examination. Previous studies have similarly argued for the importance of ‘responding to’ or linking one’s own talk to previous speakers’ contributions as a way of demonstrating comprehension of co-participants’ talk. However, what counts as such a response has yet to be explored systematically. This paper presents a conversation analytic study of the candidate discourse in the assessed group interactions, identifying three conversational actions through which student-candidates construct contingent responses to co-participants. The thick description of the nature of contingent responses lays the groundwork for further empirical investigation into the relevance of this IC feature and its proficiency implications.