• Academic speaking: does the construct exist, and if so, how do we test it?

      Inoue, Chihiro; Nakatsuhara, Fumiyo; Lam, Daniel M. K.; Taylor, Lynda; University of Bedfordshire (2018-03-14)
    • Aspects of fluency across assessed levels of speaking proficiency

      Tavakoli, Parvaneh; Nakatsuhara, Fumiyo; Hunter, Ann-Marie (Wiley, 2020-01-25)
      Recent research in second language acquisition suggests that a number of speed, breakdown, repair and composite measures reliably assess fluency and predict proficiency. However, there is little research evidence to indicate which measures best characterize fluency at each assessed level of proficiency, and which can consistently distinguish one level from the next. This study investigated fluency in 32 speakers’ performances on four tasks of the British Council’s Aptis Speaking test, which were awarded four different levels of proficiency (CEFR A2-C1). Using PRAAT, the performances were analysed for various aspects of utterance fluency across different levels of proficiency. The results suggest that speed and composite measures consistently distinguish fluency from the lowest to upper-intermediate levels (A2-B2), and many breakdown measures differentiate between the lowest level (A2) and the rest of the proficiency groups, with a few differentiating between lower (A2, B1) and higher levels (B2, C1). The varied use of repair measures at different levels suggests that a more complex process is at play. The findings imply that a detailed micro-analysis of fluency offers a more reliable understanding of the construct and its relationship with the assessment of proficiency.
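      For readers unfamiliar with this kind of analysis, the breakdown (pause-based) measures referred to above can be extracted from recordings with Praat. The sketch below uses the praat-parselmouth Python binding; the file name and detection thresholds are illustrative assumptions, not the settings used in the study.

          # Sketch: pause-based (breakdown) fluency measures via Praat/parselmouth.
          # Thresholds are illustrative; tune them to the recording conditions.
          import parselmouth
          from parselmouth.praat import call

          snd = parselmouth.Sound("response.wav")  # hypothetical recording

          # Praat's silence detector: stretches quieter than -25 dB (relative to
          # the peak) lasting at least 0.25 s are labelled "silent" on tier 1.
          tg = call(snd, "To TextGrid (silences)",
                    100,     # minimum pitch (Hz)
                    0.0,     # time step (s); 0 = automatic
                    -25.0,   # silence threshold (dB)
                    0.25,    # minimum silent interval duration (s)
                    0.05,    # minimum sounding interval duration (s)
                    "silent", "sounding")

          pauses, speaking_time = [], 0.0
          for i in range(1, call(tg, "Get number of intervals", 1) + 1):
              start = call(tg, "Get starting point", 1, i)
              end = call(tg, "Get end point", 1, i)
              if call(tg, "Get label of interval", 1, i) == "silent":
                  pauses.append(end - start)
              else:
                  speaking_time += end - start

          total = snd.get_total_duration()
          print(f"silent pauses per minute: {60 * len(pauses) / total:.1f}")
          print(f"mean pause duration (s): {sum(pauses) / max(len(pauses), 1):.2f}")
          print(f"phonation-time ratio: {speaking_time / total:.2f}")

      Speed and composite measures (e.g., speech rate) additionally require a syllable count, which is usually obtained from transcripts or a syllable-nuclei detection script rather than from the silence tier alone.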
    • Assessing speaking

      Nakatsuhara, Fumiyo; Khabbazbashi, Nahal; Inoue, Chihiro; University of Bedfordshire (Routledge, 2021-12-16)
      In this chapter on assessing speaking, the history of speaking assessment is briefly traced in terms of the various ways in which speaking constructs have been defined and diversified over the past century. This is followed by a discussion of elicitation tasks, test delivery modes, rating methods, and scales that offered opportunities and/or presented challenges in operationalising different constructs of speaking and providing feedback. Several methods utilised in researching speaking assessment are then considered. Informed by recent research and advances in technology, the chapter provides recommendations for practice in both high-stakes and low-stakes contexts.
    • Cognitive validity in the testing of speaking

      Field, John; University of Bedfordshire (2013-11-17)
    • The cognitive validity of tests of listening and speaking designed for young learners

      Field, John (Cambridge University Press, 2018-06)
      The notion of cognitive validity becomes considerably more complicated when one extends it to tests designed for Young Learners. It then becomes necessary to take full account of the level of cognitive development of the target population (their ability to handle certain mental operations and not others). It may also be necessary to include some consideration of their level of linguistic development in L1: in particular, the degree of proficiency they may have achieved in each of the four skills. This chapter examines the extent to which awareness of the cognitive development of young learners up to the age of 12 should and does influence the decisions made by those designing tests of second language listening and speaking. The limitations and strengths of young learners of this age range are matched against the various processing demands entailed in second language listening and speaking and are then related to characteristics of the Young Learners tests offered by the Cambridge English examinations.
    • A comparative study of the variables used to measure syntactic complexity and accuracy in task-based research

      Inoue, Chihiro; University of Bedfordshire (Taylor & Francis (Routledge): SSH Titles, 2016-04-12)
      The constructs of complexity, accuracy and fluency (CAF) have been used extensively to investigate learner performance on second language tasks. However, a serious concern is that the variables used to measure these constructs are sometimes used conventionally without any empirical justification. It is crucial for researchers to understand how results might differ depending on which measurements are used and, accordingly, to choose the most appropriate variables for their research aims. The first strand of this article examines the variables conventionally used to measure syntactic complexity in order to identify which may be the best indicators of different proficiency levels, following suggestions by Norris and Ortega. The second strand compares three variables used to measure accuracy in order to identify which is the most valid. The data analysed were spoken performances by 64 Japanese EFL students on two picture-based narrative tasks, which were rated at Common European Framework of Reference for Languages (CEFR) levels A2 to B2 according to Rasch-adjusted ratings by seven human judges. The tasks performed were very similar, but had different degrees of what Loschky and Bley-Vroman term ‘task-essentialness’ for subordinate clauses. It was found that the variables used to measure syntactic complexity yielded results that were not consistent with the suggestions by Norris and Ortega. The variable found to be the most valid for measuring accuracy was errors per 100 words. Analysis of transcripts revealed that results were strongly influenced by the differing degrees of task-essentialness for subordination between the two tasks, as well as the spread of errors across different units of analysis. This implies that the characteristics of test tasks need to be carefully scrutinised, followed by careful piloting, in order to ensure greater validity and reliability in task-based research.
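      The measure identified here as most valid, errors per 100 words, is a global ratio that is not tied to a particular unit of analysis (such as the AS-unit or clause), which is presumably why it was less affected by how errors were spread across units. A minimal sketch of the computation, with a hypothetical transcript and error count:

          def errors_per_100_words(n_errors: int, transcript: str) -> float:
              """Global accuracy measure: errors per 100 words of speech."""
              n_words = len(transcript.split())
              return 100 * n_errors / n_words

          # Hypothetical example: 7 errors identified in a 140-word narrative.
          sample = " ".join(["word"] * 140)  # stand-in for a real transcript
          print(errors_per_100_words(7, sample))  # -> 5.0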
    • Comparing rating modes: analysing live, audio, and video ratings of IELTS Speaking Test performances

      Nakatsuhara, Fumiyo; Inoue, Chihiro; Taylor, Lynda (Taylor & Francis, 2020-08-26)
      This mixed-methods study compared IELTS examiners’ scores when assessing spoken performances under live and two ‘non-live’ testing conditions using audio and video recordings. Six IELTS examiners assessed 36 test-takers’ performances under the live, audio, and video rating conditions. Scores in the three rating modes were calibrated using the many-facet Rasch model (MFRM). For all three modes, examiners provided written justifications for their ratings, and verbal reports were also collected to gain insights into examiner perceptions of performance under the audio and video conditions. Results showed that, for all rating criteria, audio ratings were significantly lower than live and video ratings. Examiners noticed more negative performance features under the two non-live rating conditions than under the live condition. However, the richer information about test-taker performance available in the video mode appeared to lead video raters to rely less on such negative evidence than audio raters when awarding scores. Verbal report data showed that having visual information in the video-rating mode helped examiners to understand what the test-takers were saying, to comprehend better what test-takers were communicating using non-verbal means, and to understand with greater confidence the source of test-takers’ hesitation, pauses, and awkwardness.
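      For reference, the many-facet Rasch model used to calibrate the scores can be written, in Linacre's standard rating-scale formulation (a general statement of the model, not necessarily the exact facet specification of this study), as:

          \[
            \log \frac{P_{nijk}}{P_{nij(k-1)}} = B_n - D_i - C_j - F_k
          \]

      where P_nijk is the probability of test-taker n being awarded score k rather than k-1 on criterion i by rater j; B_n is test-taker ability, D_i criterion difficulty, C_j rater severity, and F_k the difficulty of scale step k. A facet for rating mode (live, audio, video) enters the sum in the same way, which is what allows the modes to be compared on a common scale.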
    • A comparison of holistic, analytic, and part marking models in speaking assessment

      Khabbazbashi, Nahal; Galaczi, Evelina D. (SAGE, 2020-01-24)
      This mixed methods study examined holistic, analytic, and part marking models (MMs) in terms of their measurement properties and impact on candidate CEFR classifications in a semi-direct online speaking test. Speaking performances of 240 candidates were first marked holistically and by part (phase 1). On the basis of phase 1 findings – which suggested stronger measurement properties for the part MM – phase 2 focused on a comparison of part and analytic MMs. Speaking performances of 400 candidates were rated analytically and by part during that phase. Raters provided open comments on their marking experiences. Results suggested a significant impact of MM; approximately 30% and 50% of candidates in phases 1 and 2 respectively were awarded different (adjacent) CEFR levels depending on the choice of MM used to assign scores. There was a trend of higher CEFR levels with the holistic MM and lower CEFR levels with the part MM. While strong correlations were found between all pairings of MMs, further analyses revealed important differences. The part MM was shown to display superior measurement qualities particularly in allowing raters to make finer distinctions between different speaking ability levels. These findings have implications for the scoring validity of speaking tests.
    • The design and validation of an online speaking test for young learners in Uruguay: challenges and innovations

      Khabbazbashi, Nahal; Nakatsuhara, Fumiyo; Inoue, Chihiro; Kaplan, Gabriela; Green, Anthony; University of Bedfordshire; Plan Ceibal (Cranmore Publishing on behalf of the International TESOL Union, 2022-02-10)
      This research presents the development of an online speaking test of English for students at the end of primary and the beginning of secondary school education in state schools in Uruguay. Following the success of Plan Ceibal’s one-computer-tablet-per-child initiative, there was a drive to further utilize technology to improve the language ability of students, particularly in speaking, where the majority of students are at CEFR levels pre-A1 and A1. The national concern over a lack of spoken communicative skills amongst students led to a decision to develop a new speaking test, specifically tailored to local needs. This paper provides an overview of the speaking test development and validation project, designed with the following objectives in mind: to establish, track, and report annually learners’ achievements against the Common European Framework of Reference for Languages (CEFR), targeting levels pre-A1 to A2; to inform teaching and learning; and to promote speaking practice in classrooms. Results of a three-phase mixed-methods study involving small-scale and large-scale trials with learners and examiners, as well as a CEFR-linking exercise with expert panelists, are reported. Different sources of evidence are brought together to build a validity argument for the test. The paper also focuses on some of the challenges involved in assessing young learners and discusses how design decisions, local knowledge and expertise, and technological innovations can be used to address such challenges, with implications for other similar test development projects.
    • Developing tools for learning oriented assessment of interactional competence: bridging theory and practice

      May, Lyn; Nakatsuhara, Fumiyo; Lam, Daniel M. K.; Galaczi, Evelina D. (SAGE Publications, 2019-10-01)
      In this paper we report on a project in which we developed tools to support the classroom assessment of learners’ interactional competence (IC) and provided learning oriented feedback in the context of preparation for a high-stakes face-to-face speaking test.  Six trained examiners provided stimulated verbal reports (n=72) on 12 paired interactions, focusing on interactional features of candidates’ performance. We thematically analyzed the verbal reports to inform a draft checklist and materials, which were then trialled by four experienced teachers. Informed by both data sources, the final product comprised (a) a detailed IC checklist with nine main categories and over 50 sub-categories, accompanying detailed description of each area and feedback to learners, which teachers can adapt to suit their teaching and testing contexts, and (b) a concise IC checklist with four categories and bite-sized feedback for real-time classroom assessment. IC, a key aspect of face-to-face communication, is under-researched and under-explored in second/foreign language teaching, learning, and assessment contexts. This in-depth treatment of it, therefore, stands to contribute to learning contexts through raising teachers’ and learners’ awareness of micro-level features of the construct, and to assessment contexts through developing a more comprehensive understanding of the construct.
    • The discourse of the IELTS Speaking Test: interactional design and practice

      Seedhouse, Paul; Nakatsuhara, Fumiyo (Cambridge University Press, 2018-02-15)
      The volume provides a unique dual perspective on the evaluation of spoken discourse in that it combines a detailed portrayal of the design of a face-to-face speaking test with its actual implementation in interactional terms. Using many empirical extracts of interaction from authentic IELTS Speaking Tests, the book illustrates how the interaction is organised in relation to the institutional aim of ensuring valid assessment. The relationship between individual features of the interaction and grading criteria is examined in detail across a number of different performance levels.
    • The effects of extended planning time on candidates’ performance, processes and strategy use in the lecture listening-into-speaking tasks of the TOEFL iBT Test

      Inoue, Chihiro; Lam, Daniel M. K.; Educational Testing Service (Wiley, 2021-06-21)
      This study investigated the effects of two different planning time conditions (i.e., operational [20 s] and extended length [90 s]) for the lecture listening-into-speaking tasks of the TOEFL iBT® test for candidates at different proficiency levels. Seventy international students based in universities and language schools in the United Kingdom (35 at a lower level; 35 at a higher level) participated in the study. The effects of the different lengths of planning time were examined in terms of (a) the scores given by ETS-certified raters; (b) the quality of the speaking performances, characterized by accurately reproduced idea units and measures of complexity, accuracy, and fluency; and (c) self-reported use of cognitive and metacognitive processes and strategies during listening, planning, and speaking. The results showed neither a statistically significant main effect of the length of planning time nor an interaction between planning time and proficiency, either on the scores or on the quality of the speaking performances. Significantly more engagement in several cognitive and metacognitive processes and strategies was reported under the extended planning time, which suggests enhanced cognitive validity of the task. However, the increased engagement in planning did not lead to any measurable improvement in scores. Therefore, in the interest of practicality, the results of this study provide justification for the operational length of planning time for the lecture listening-into-speaking tasks in the speaking section of the TOEFL iBT test.
    • Establishing test form and individual task comparability: a case study of a semi-direct speaking test

      Weir, Cyril J.; Wu, Jessica R.W.; University of Luton; Language Training and Testing Center, Taiwan (SAGE, 2006-04-01)
      Examination boards are often criticized for their failure to provide evidence of comparability across test forms, and few such studies are publicly available. This study aims to investigate the extent to which three forms of the General English Proficiency Test Intermediate Speaking Test (GEPTS-I) are parallel in terms of two types of validity evidence: parallel-forms reliability and content validity. The three trial test forms, each containing three different task types (read-aloud, answering questions and picture description), were administered to 120 intermediate-level EFL learners in Taiwan. The performance data from the different test forms were analysed using classical procedures and Multi-Faceted Rasch Measurement (MFRM). Various checklists were also employed to compare the tasks in different forms qualitatively in terms of content. The results showed that all three test forms were statistically parallel overall, and that Forms 2 and 3 could also be considered parallel at the individual task level. Moreover, sources of variation accounting for the variable difficulty of tasks in Form 1 were identified by the checklists. Results of the study provide insights for further improvement in the parallel-forms reliability of the GEPTS-I at the task level and offer a set of methodological procedures for other examination boards to consider.
    • Exploring performance across two delivery modes for the IELTS Speaking Test: face-to-face and video-conferencing delivery (Phase 2)

      Nakatsuhara, Fumiyo; Inoue, Chihiro; Berry, Vivien; Galaczi, Evelina D. (IELTS Partners, 2017-10-04)
      Face-to-face speaking assessment is widespread, since it allows the elicitation of interactional skills. However, face-to-face speaking test administration is also logistically complex, resource-intensive and can be difficult to conduct in geographically remote or politically sensitive areas. Recent advances in video-conferencing technology now make it possible to engage in online face-to-face interaction more successfully than was previously the case, thus reducing dependency upon physical proximity. A major study was, therefore, commissioned to investigate how new technologies could be harnessed to deliver the face-to-face version of the IELTS Speaking test.
      Phase 1 of the study, carried out in London in January 2014, presented results and recommendations of a small-scale initial investigation designed to explore what similarities and differences, in scores, linguistic output and test-taker and examiner behaviour, could be discerned between face-to-face and internet-based video-conferencing delivery of the Speaking test (Nakatsuhara, Inoue, Berry and Galaczi, 2016). The results of the analyses suggested that the speaking construct remains essentially the same across both delivery modes.
      This report presents results from Phase 2 of the study, a larger-scale follow-up investigation designed to: (i) analyse test scores using more sophisticated statistical methods than was possible in the Phase 1 study; (ii) investigate the effectiveness of the training for the video-conferencing-delivered test, which was developed based on findings from the Phase 1 study; (iii) gain insights into the issue of sound quality perception and its (perceived) effect; (iv) gain further insights into test-taker and examiner behaviours across the two delivery modes; and (v) confirm the results of the Phase 1 study.
      Phase 2 of the study was carried out in Shanghai, People’s Republic of China, in May 2015. Ninety-nine (99) test-takers each took two speaking tests under face-to-face and internet-based video-conferencing conditions. Performances were rated by 10 trained IELTS examiners. A convergent parallel mixed-methods design was used to allow for collection of an in-depth, comprehensive set of findings derived from multiple sources. The research included an analysis of rating scores under the two delivery conditions and of test-takers’ linguistic output during the tests, as well as short interviews with test-takers following a questionnaire format. Examiners responded to two feedback questionnaires and participated in focus group discussions relating to their behaviour as interlocutors and raters, and to the effectiveness of the examiner training. Trained observers also took field notes during the test sessions and conducted interviews with the test-takers.
      Many-Facet Rasch Model (MFRM) analysis of test scores indicated that, although the video-conferencing mode was slightly more difficult than the face-to-face mode, when the results of all analytic scoring categories were combined, the actual score difference was negligibly small, thus supporting the Phase 1 findings. Examination of language functions elicited from test-takers revealed that, in Part 1 of the test, significantly more test-takers asked questions to clarify what the examiner said in the video-conferencing mode (63.3%) than in the face-to-face mode (26.7%).
      Sound quality was generally positively perceived in this study, being reported as 'Clear' or 'Very clear', although the examiners and observers tended to perceive it more positively than the test-takers. There did not seem to be any relationship between sound quality perceptions and the proficiency level of test-takers. While 71.7% of test-takers preferred the face-to-face mode, slightly more test-takers reported being more nervous in the face-to-face mode (38.4%) than in the video-conferencing mode (34.3%).
      All examiners found the training useful and effective, with the majority of them (80%) reporting that the two modes gave test-takers equal opportunity to demonstrate their level of English proficiency. They also reported that it was equally easy for them to rate test-taker performance in the face-to-face and video-conferencing modes.
      The report concludes with a list of recommendations for further research, including suggestions for further examiner and test-taker training, resolution of technical issues regarding video-conferencing delivery and issues related to rating, before any decisions about deploying a video-conferencing mode of delivery for the IELTS Speaking test are made.
    • Exploring performance across two delivery modes for the same L2 speaking test: face-to-face and video-conferencing delivery: a preliminary comparison of test-taker and examiner behaviour

      Nakatsuhara, Fumiyo; Inoue, Chihiro; Berry, Vivien; Galaczi, Evelina D. (The IELTS Partners: British Council, Cambridge English Language Assessment and IDP: IELTS Australia, 2016-11-10)
      This report presents the results of a preliminary exploration and comparison of test-taker and examiner behaviour across two different delivery modes for an IELTS Speaking test: the standard face-to-face test administration, and test administration using Internet-based video-conferencing technology. The study sought to compare performance features across these two delivery modes with regard to two key areas:
      • an analysis of test-takers’ scores and linguistic output on the two modes and their perceptions of the two modes
      • an analysis of examiners’ test management and rating behaviours across the two modes, including their perceptions of the two conditions for delivering the speaking test.
      Data were collected from 32 test-takers who took two standardised IELTS Speaking tests under face-to-face and internet-based video-conferencing conditions. Four trained examiners also participated in this study. The convergent parallel mixed-methods research design included an analysis of interviews with test-takers, as well as of their linguistic output (especially types of language functions) and the rating scores awarded under the two conditions. Examiners provided written comments justifying the scores they awarded, completed a questionnaire and participated in verbal report sessions to elaborate on their test administration and rating behaviour. Three researchers also observed all test sessions and took field notes.
      While the two modes generated similar test score outcomes, there were some differences in functional output and in examiner interviewing and rating behaviours. The report concludes with a list of recommendations for further research, including examiner and test-taker training and resolution of technical issues, before any decisions about deploying (or not) a video-conferencing delivery mode for the IELTS Speaking test are made.
    • Exploring the potential for assessing interactional and pragmatic competence in semi-direct speaking tests

      Nakatsuhara, Fumiyo; May, Lyn; Inoue, Chihiro; Willcox-Ficzere, Edit; Westbrook, Carolyn; Spiby, Richard; University of Bedfordshire; Queensland University of Technology; Oxford Brookes University; British Council (British Council, 2021-11-11)
      To explore the potential of a semi-direct speaking test to assess a wider range of communicative language ability, the researchers developed four semi-direct speaking tasks – two designed to elicit features of interactional competence (IC) and two designed to elicit features of pragmatic competence (PC). The four tasks, as well as one benchmarking task, were piloted with 48 test-takers in China and Austria whose proficiency ranged from CEFR B1 to C. A post-test feedback survey was administered to all test-takers, after which selected test-takers were interviewed. A total of 184 task performances were analysed to identify interactional moves utilised by test-takers across three proficiency groups (i.e., B1, B2 and C). Data indicated that test-takers at higher levels employed a wider variety of interactional moves. They made use of concurring concessions and counter views when seeking to persuade a (hypothetical) conversational partner to change opinions in the IC tasks, and they projected upcoming requests and made face-related statements in the PC tasks, seemingly to pre-empt a conversational partner’s negative response to the request. The test-takers perceived the tasks to be highly authentic and found the video input useful in understanding the target audience of simulated interactions.
    • Exploring the use of video-conferencing technology in the assessment of spoken language: a mixed-methods study

      Nakatsuhara, Fumiyo; Inoue, Chihiro; Berry, Vivien; Galaczi, Evelina D.; University of Bedfordshire; British Council; Cambridge English Language Assessment (Taylor & Francis, 2017-02-10)
      This research explores how internet-based video-conferencing technology can be used to deliver and conduct a speaking test, and what similarities and differences can be discerned between the standard and computer-mediated face-to-face modes. The context of the study is a high-stakes speaking test, and the motivation for the research is the need for test providers to keep under constant review the extent to which their tests are accessible and fair to a wide constituency of test takers. The study examines test-takers’ scores and linguistic output, and examiners’ test administration and rating behaviors across the two modes. A convergent parallel mixed-methods research design was used, analyzing test-takers’ scores and language functions elicited, examiners’ written comments, feedback questionnaires and verbal reports, as well as observation notes taken by researchers. While the two delivery modes generated similar test score outcomes, some differences were observed in test-takers’ functional output and the behavior of examiners who served as both raters and interlocutors.
    • The IELTS Speaking Test: what can we learn from examiner voices?

      Inoue, Chihiro; Khabbazbashi, Nahal; Lam, Daniel M. K.; Nakatsuhara, Fumiyo; University of Bedfordshire (2018-11-25)
    • Interactional competence with and without extended planning time in a group oral assessment

      Lam, Daniel M. K. (Routledge, Taylor & Francis Group, 2019-05-02)
      Linking one’s contribution to those of others is a salient feature demonstrating interactional competence in paired/group speaking assessments. While such responses are to be constructed spontaneously while engaging in real-time interaction, the amount and nature of pre-task preparation in paired/group speaking assessments may have an influence on how such an ability (or lack thereof) could manifest in learners’ interactional performance. Little previous research has examined the effect of planning time on interactional aspects of paired/group speaking task performance. Within the context of school-based assessment in Hong Kong, this paper analyzes the discourse of two group interactions performed by the same four student-candidates under two conditions: (a) with extended planning time (4–5 hours), and (b) without extended planning time (10 minutes), with the aim of exploring any differences in student-candidates’ performance of interactional competence in this assessment task. The analysis provides qualitative discourse evidence that extended planning time may impede the assessment task’s capacity to discriminate between stronger and weaker candidates’ ability to spontaneously produce responses contingent on previous speaker contribution. Implications for the implementation of preparation time for the group interaction task are discussed.