Abstract
The use of pair work in speaking assessment has frequently been adopted as an authentic way of testing oral proficiency in communicative second-language classrooms; however, findings on whether interlocutor proficiency influences oral assessment outcomes, and on whether rater training produces long-term interrater reliability, have been inconclusive or contradictory. Studies have indicated that when one member of a pair is more proficient than the other, or when the two know each other well, they may collaborate to produce more speech and achieve higher performance in oral assessments (Iwashita, 1996; Norton, 2005; Storch, 2001). However, a higher volume of speech is not always associated with higher overall performance scores (Davis, 2009). Other studies (Galaczi, 2008, 2014) have found that weaker language users may be more reluctant to contribute to oral interactions when paired with more proficient interlocutors. Son (2016) reported that Korean learners of English as a foreign language spoke less when paired with more proficient interlocutors, although their overall oral performance did not necessarily decrease. The outcomes of oral assessments may also be influenced by the reliability of raters' scores. Rater severity can be identified by applying the many-facet Rasch model (MFRM; Eckes, 2009, 2015). Although rater training can theoretically increase rater confidence and consistency (Davis, 2012, 2016; Huang et al., 2016; McNamara, 1996), differences in rater severity often persist after training (Eckes, 2005, 2009, 2015; Knoch, 2011; Sundqvist et al., 2020; Weigle, 1998), and the effects of training are not necessarily long-lasting (Bonk & Ockey, 2003; Chang et al., 2011; Kim, 2011; Lan, 2012; Liao, 2016; Lumley & McNamara, 1995). Because second-language assessment generally involves more than one assessor, on-the-job rater training is necessary to increase interrater reliability in oral assessments. Two questions therefore require exploration: (1) whether training raters in the use of assessment rubrics increases interrater reliability, and (2) whether test takers perform differently when paired with interlocutors of different proficiency levels.

This study investigated oral assessment conducted in the fall semesters of 2020 and 2021 in two General Education Indonesian language classes at a national university in Taiwan. Rasch analysis was used to measure the extent to which interlocutor proficiency (beginning learners of Indonesian vs. speakers of Indonesian as a first language) influenced the students' oral performance, and the extent to which the severity of the Indonesian teaching assistants (TAs) could be identified and controlled for.
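For reference, the MFRM underlying this design can be sketched in its standard three-facet rating scale form (Linacre, 1989; Eckes, 2015). This is an illustrative sketch: the facet labels below mirror this study's three facets (student pair, rater, and rubric criterion), but the exact parameterization used in the analysis may differ in detail.

```latex
% P_{njik} : probability that student pair n receives category k
%            (rather than k-1) from rater j on rubric criterion i
% \theta_n : oral proficiency of student pair n
% \alpha_j : severity of rater (TA) j
% \delta_i : difficulty of rubric criterion i
% \tau_k   : threshold of rating category k relative to category k-1
\ln\!\left(\frac{P_{njik}}{P_{nji(k-1)}}\right)
  = \theta_n - \alpha_j - \delta_i - \tau_k
```

Because all parameters are expressed in logits on one common scale, student pair performance, rater severity, and criterion difficulty can be read off a single variable map.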
The 2020 class comprised 44 students (26 Taiwanese, 10 Chinese Indonesian, and 8 of other nationalities; 10 men and 34 women) and seven Indonesian TAs (four from North Sumatra, two from Java, and one from Sulawesi; two men and five women). The 2021 class comprised 38 students (17 Taiwanese, 14 Chinese Indonesian, 4 Chinese Malaysian, and 3 of other nationalities; 18 men and 20 women) and eight Indonesian TAs (four from North Sumatra and four from Java; four men and four women).

The data comprised six oral assessments administered throughout the semester in each class, scored by the trained TAs according to a rubric with five criteria: content, accuracy, fluency, pronunciation, and interaction. The participants self-assessed their Indonesian proficiency at the beginning of the semester; in general, the Chinese Indonesian and Chinese Malaysian students rated themselves as native speakers of Indonesian and Malay, respectively, whereas the Taiwanese students and those of other nationalities identified themselves as true beginners. The participants selected their oral exam partners from among their classmates.

The data were analyzed using Facets (Linacre, 2022a) to investigate the oral performance of each student pair, the severity of each assessor, and the difficulty of the criteria in the scoring rubric. The scores were transformed onto a logit scale for comparison, and the MFRM analysis yielded the following information for interpretation: logit measures, the information-weighted mean-square fit statistic (infit), the outlier-sensitive mean-square fit statistic (outfit), the separation index, the reliability of the separation index, and chi-square tests for homogeneity. The results were represented in a variable map for each semester, divided into sections for the three facets. Higher logit values in the three facets indicated, respectively, higher student pair performance on the oral exams, more severe rating, and criteria on which high scores were more difficult to obtain.
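To make the fit statistics concrete, the following is a minimal Python sketch of how infit and outfit mean squares are conventionally computed from observed ratings, model-expected ratings, and model variances (Linacre, 2002; Wright et al., 1994). It is not the study's actual computation, which Facets performs internally, and the example arrays are hypothetical.

```python
import numpy as np

def fit_statistics(observed, expected, variance):
    """Conventional Rasch fit statistics for one element (e.g., one rater).

    observed -- ratings actually awarded
    expected -- model-expected ratings under the MFRM
    variance -- model variance of each rating
    Returns (infit_ms, outfit_ms); values near 1.0 indicate that the
    observed variation is close to what the model predicts.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)

    # Squared standardized residuals.
    z_sq = (observed - expected) ** 2 / variance
    # Outfit: unweighted mean of z^2, so it is sensitive to outliers.
    outfit_ms = z_sq.mean()
    # Infit: variance-weighted (information-weighted) mean of z^2.
    infit_ms = ((observed - expected) ** 2).sum() / variance.sum()
    return infit_ms, outfit_ms

# Hypothetical example: six ratings from one TA on a five-point rubric.
infit, outfit = fit_statistics(
    observed=[3, 4, 2, 5, 3, 4],
    expected=[3.2, 3.8, 2.5, 4.4, 3.1, 3.6],
    variance=[0.8, 0.7, 0.9, 0.6, 0.8, 0.7],
)
print(f"infit MS = {infit:.2f}, outfit MS = {outfit:.2f}")
```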
The results indicated that rater consistency remained low even after training. In the 2020 class, the Chinese Indonesian students obtained the highest scores, as expected, whereas performance ranged widely among the Taiwanese students and those of other nationalities. On the midterm oral assessment, five of the seven TAs provided similar ratings, whereas two provided ratings that were either excessively high (logit = -2.42) or excessively low (logit = 1.03). After further training was provided before the final exam, two different TAs provided ratings that were either excessively high (logit = -0.45) or excessively low (logit = 0.97); however, the severity of all seven TAs on the final exam fell within the acceptable range of -1 to 1 logits. The rater variable also interacted with the rating criteria: one TA rated accuracy favorably (t = 2.76) but interaction severely (t = -2.11), and another rated fluency favorably (t = 2.55) but pronunciation severely (t = -4.25). In the 2021 class, although all eight TAs were fully trained to use the rubric consistently, variables beyond our control that influenced rating consistency, especially the interaction between rater and criteria, remained. Therefore, using average scores after removing outliers may be a viable alternative method of grading until a superior solution is identified. Nonetheless, identifying variability in rater severity provided a useful basis for further rater training.

Differences in Indonesian proficiency between assessment partners did not influence individual students' scores on the oral assessments. The students from the 2020 and 2021 classes were categorized into four pair types: LL, LH, HL, and HH (L = true beginner, H = proficient speaker of Indonesian or Malay). Their mean scores were compared using Kruskal-Wallis tests. We first investigated whether beginners paired with proficient speakers (LH) scored higher than beginners paired with other beginners (LL); the scores of these groups did not differ significantly. We then investigated whether proficient speakers paired with beginners (HL) scored lower than those paired with other proficient speakers (HH); these scores did not differ significantly either.

Our results support the findings of Davis (2009) and Son (2016): we found no evidence that interlocutor proficiency affected the students' oral performance either positively or negatively. However, a comprehensive analysis of the students' feedback on the oral examination method indicated that they preferred to select their own partners and to remain in those partnerships throughout the semester. Because they were allowed to prepare scripts and practice before the oral exams, the students developed a sense of solidarity and camaraderie with their partners. The amount of speech they produced did not appear to be influenced by differences in interlocutor proficiency, and the students were patient with and tolerant of their partners' mistakes. Thus, allowing students to choose their own partners and encouraging local students to pair with Chinese Indonesian students may increase their intercultural experiences. The research site had two features that may not be present in other second-language classrooms: team instruction conducted by a linguist and seven to eight TAs, and the presence of a considerable number of proficient speakers of Indonesian or Malay attending class alongside true beginners. Nonetheless, these features make this multiyear case study a valuable source of information.
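As an illustration of the group comparisons reported above, the following minimal Python sketch runs the two Kruskal-Wallis tests (LH vs. LL, and HL vs. HH) with SciPy. The score vectors are invented for demonstration and are not the study's data.

```python
from scipy.stats import kruskal

# Hypothetical mean oral scores by pair type; the first letter is the
# student being scored, the second is that student's partner
# (L = true beginner, H = proficient Indonesian/Malay speaker).
scores = {
    "LL": [72, 68, 75, 70, 66],
    "LH": [74, 69, 71, 77, 70],
    "HL": [88, 85, 90, 86],
    "HH": [89, 91, 86, 90],
}

# Did beginners score higher with proficient partners (LH) than with
# other beginners (LL)?
h1, p1 = kruskal(scores["LH"], scores["LL"])
# Did proficient speakers score lower with beginner partners (HL) than
# with other proficient speakers (HH)?
h2, p2 = kruskal(scores["HL"], scores["HH"])

print(f"LH vs. LL: H = {h1:.2f}, p = {p1:.3f}")
print(f"HL vs. HH: H = {h2:.2f}, p = {p2:.3f}")
# In the study, both comparisons were nonsignificant (p > .05).
```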
References
- 王佳琪(2020)。科學想像力圖形測驗之驗證。教育心理學報,51,341-367。[Wang, C.-C. (2020). Validation of the figural test of scientific imagination. Bulletin of Educational Psychology, 51, 341-367.]
- 何德華(2019)。印尼語 TEAL 創意互動教學測驗與評量。通識教育學刊,24,79-131。[Rau, D. V. (2019). Testing and assessment in creative interactive TEAL instruction of Indonesian. Journal of General Education, 24, 79-131.]
- 林小慧、林世華、吳心楷(2018)。科學能力的建構反應評量之發展與信效度分析:以自然科光學為例。教育科學研究期刊,63(1),173-205。[Lin, H.-H., Lin, S.-H., & Wu, H.-K. (2018). Development, reliability, and validity of a constructed-response assessment of scientific ability: The case of optics. Journal of Research in Education Sciences, 63(1), 173-205.]
- 林怡君、張麗麗、陸怡琮(2013)。Rasch模式建置國小高年級閱讀理解測驗。教育心理學報,45,39-61。[Lin, I.-C., Chang, L., & Lu, I.-C. (2013). Constructing a reading comprehension test for upper-grade elementary school students with the Rasch model. Bulletin of Educational Psychology, 45, 39-61.]
- 姚漢禱(2004)。利用線性 logistic Rasch 模式估計排名賽的成績表現—以 34 屆世界盃棒球賽為例。國立體育學院論叢,15(1),149-158。[Yao, H.-D. (2004). Using the linear logistic Rasch model to estimate performance in ranked competition: The 34th Baseball World Cup as an example. Guoli Tiyu Xueyuan Luncong, 15(1), 149-158.]
- 張新立、吳舜丞(2008)。多層面 Rasch 模式於學術研討會論文評分之應用。測驗學刊,55,105-128。[Chang, H.-L., & Wu, S.-C. (2008). An application of the many-facet Rasch model to the rating of conference papers. Psychological Testing, 55, 105-128.]
- 陳建亨、楊凱琳(2021)。題型對學生數學表現水準之影響—以相似形為例。教育科學研究期刊,66(3),247-277。[Chen, C.-H., & Yang, K.-L. (2021). Effects of item format on students' mathematics performance levels: The case of similar figures. Journal of Research in Education Sciences, 66(3), 247-277.]
- 陳映孜、何曉琪、劉昆夏、林煥祥、鄭英耀(2017)。從教師自編科學成就測驗之 Rasch 分析看教與學。教育科學研究期刊,62(3),1-23。[Chen, Y.-T., Ho, H.-C., Liu, K.-H., Lin, H.-S., & Cheng, Y.-Y. (2017). Viewing teaching and learning through a Rasch analysis of a teacher-made science achievement test. Journal of Research in Education Sciences, 62(3), 1-23.]
- 陸雲鳳(2016)。利用 Rasch 測量分析桌球女單優秀個案比賽技術分析。臺灣體育學術研究,61,139-150。[Lu, Y.-F. (2016). A Rasch measurement analysis of the match techniques of an elite women's singles table tennis player. Taiwan Tiyu Xueshu Yanjiu, 61, 139-150.]
- 曾盟堡(2002)。是誰評判不公。測驗統計年刊,10,121-133。[Tseng, M.-P. (2002). Who judged unfairly? Ceyan Tongji Niankan, 10, 121-133.]
- 謝名娟(2020)。從多層面Rasch模式來檢視不同的評分者等化連結設計對參數估計的影響。教育心理學報,52,415-436。[Hsieh, M.-C. (2020). Examining the effects of different rater equating-link designs on parameter estimation with the many-facet Rasch model. Bulletin of Educational Psychology, 52, 415-436.]
- 謝名娟(2017)。誰是好的演講者?以多層面 Rasch 來分析校長三分鐘即席演講的能力。教育心理學報,48,551-566。[Hsieh, M.-C. (2017). Who is a good speaker? A many-facet Rasch analysis of principals' three-minute impromptu speeches. Bulletin of Educational Psychology, 48, 551-566.]
- 謝如山、謝名娟(2013)。多層面 Rasch 模式在數學實作評量的應用。教育心理學報,45,1-18。[Hsieh, J.-S., & Hsieh, M.-C. (2013). An application of the many-facet Rasch model to mathematics performance assessment. Bulletin of Educational Psychology, 45, 1-18.]
- Berry, V. (2007). Personality differences and oral test performance. Peter Lang.
- Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89-110.
- Brooks, L. (2009). Interacting in pairs in a test of oral proficiency: Co-constructing a better performance. Language Testing, 26(3), 341-366.
- Chapelle, C. A. (Ed.). (2013). The encyclopedia of applied linguistics. Wiley-Blackwell.
- Chuang, E. (2018). Tunghai University.
- Csépes, I. (2009). Measuring oral proficiency through paired-task performance. Peter Lang.
- Davis, L. (2009). The influence of interlocutor proficiency in a paired oral assessment. Language Testing, 26(3), 367-396.
- Davis, L. (2016). The influence of training and experience on rater performance in scoring spoken language. Language Testing, 33(1), 117-135.
- Davis, L. (2012). University of Hawaiʻi at Mānoa.
- Ducasse, A. M., & Brown, A. (2009). Assessing paired orals: Raters' orientation to interaction. Language Testing, 26(3), 423-443.
- East, M. (2015). Coming to terms with innovative high-stakes assessment practice: Teachers' viewpoints on assessment reform. Language Testing, 32(1), 101-120.
- Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments. Peter Lang.
- Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197-221.
- Eckes, T., & Jin, K.-Y. (2021). Measuring rater centrality effects in writing assessment: A Bayesian facets modeling approach. Psychological Test and Assessment Modeling, 63(1), 65-94.
- Együd, G., & Glover, P. (2001). Readers respond. Oral testing in pairs-secondary school perspective. ELT Journal, 55(1), 70-76.
- Engelhard, G., Jr., & Myford, C. M. (2003). Monitoring faculty consultant performance in the Advanced Placement English Literature and Composition program with a many-faceted Rasch model. ETS Research Report Series, 2003(1), i-60.
- Farrokhi, F., & Esfandiari, R. (2011). A many-facet Rasch model to detect halo effect in three types of raters. Theory and Practice in Language Studies, 1(11), 1531-1540.
- Foot, M. C. (1999). Relaxing in pairs. ELT Journal, 53(1), 36-41.
- Galaczi, E. D. (2014). Interactional competence across proficiency levels: How do learners manage interaction in paired speaking tests? Applied Linguistics, 35(5), 553-574.
- Galaczi, E. D. (2008). Peer-peer interaction in a speaking test: The case of the First Certificate in English examination. Language Assessment Quarterly, 5(2), 89-119.
- Galaczi, E. D., ffrench, A., Hubbard, C., & Green, A. (2011). Developing assessment scales for large-scale speaking tests: A multiple-method approach. Assessment in Education: Principles, Policy & Practice, 18(3), 217-237.
- Galaczi, E. D., & Taylor, L. (2018). Interactional competence: Conceptualisations, operationalisations, and outstanding questions. Language Assessment Quarterly, 15(3), 219-236.
- Hsieh, C.-N. (2011). Michigan State University.
- Huang, H.-T. D., Hung, S.-T. A., & Hong, H.-T. V. (2016). Test-taker characteristics and integrated speaking test performance: A path-analytic study. Language Assessment Quarterly, 13(4), 283-301.
- Iwashita, N. (1996). The validity of the paired interview format in oral performance assessment. Melbourne Papers in Language Testing, 5(2), 51-66.
- Jones, L. (2007). The student-centered classroom. Cambridge University Press.
- Kim, H. J. (2011). Teachers College, Columbia University.
- Knoch, U. (2011). Investigating the effectiveness of individualized feedback to rating behavior: A longitudinal study. Language Testing, 28(2), 179-200.
- Lee, Y. J. (2012). Software to facilitate language assessment: Focus on Quest, Facets, and Turnitin. In The Cambridge guide to second language assessment.
- Linacre, J. M. (1989). Many-facet Rasch measurement. MESA Press.
- Linacre, J. M. (2022a). Facets computer program for many-facet Rasch measurement (Version 3.84.0) [Computer software]. Winsteps.com. https://www.winsteps.com/facets.htm
- Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16(2), 878.
- Linacre, J. M. (2022b). A user's guide to WINSTEPS® MINISTEP Rasch-model computer programs (Program Manual 5.2.5).
- Long, M. H., & Crookes, G. (1992). Three approaches to task-based syllabus design. TESOL Quarterly, 26(1), 27-56.
- Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71.
- Luoma, S. (2004). Assessing speaking. Cambridge University Press.
- McNamara, T. F. (1996). Measuring second language performance. Longman.
- Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386-422.
- Nakatsuhara, F. (2011). Effects of test-taker characteristics and the number of participants in group oral tests. Language Testing, 28(4), 483-508.
- Norton, J. (2005). The paired format in the Cambridge speaking tests. ELT Journal, 59(4), 287-297.
- O'Brien, J., & Rothstein, M. G. (2011). Leniency: Hidden threat to large-scale, interview-based selection systems. Military Psychology, 23(6), 601-615.
- O'Neill, R., & Russell, A. M. T. (2019). Stop! Grammar time: University students' perceptions of the automated feedback program Grammarly. Australasian Journal of Educational Technology, 35(1), 42-56.
- Ockey, G. J. (2009). The effects of group members' personalities on a test taker's L2 group oral discussion test scores. Language Testing, 26(2), 161-186.
- Park, T. (2004). An investigation of an ESL placement test of writing using many-facet Rasch measurement. Teachers College, Columbia University Working Papers in TESOL & Applied Linguistics, 4(1).
- Pollitt, A., & Hutchinson, C. (1987). Calibrating graded assessments: Rasch partial credit analysis of performance in writing. Language Testing, 4(1), 72-92.
- Rydell, M. (2019). Negotiating co-participation: Embodied word searching sequences in paired L2 speaking tests. Journal of Pragmatics, 149, 60-77.
- Saville, N., & Hargreaves, P. (1999). Assessing speaking in the revised FCE. ELT Journal, 53(1), 42-51.
- Smith, E. V., Jr., & Smith, R. M. (2017). 羅氏測量:應用與導讀 [Rasch measurement: Applications and a guided introduction] (M. M. C. Mok & Q. Zhang, Eds. & Trans.). Yi Feng Printing Co. [一豐印刷有限公司].
- Son, Y. A. (2016). Interaction in a paired oral assessment: Revisiting the effect of proficiency. Papers in Language Testing and Assessment, 5(2), 43-68.
- Storch, N. (2001). How collaborative is pair work? ESL tertiary students composing in pairs. Language Teaching Research, 5(1), 29-53.
- Storch, N., & Aldosari, A. (2013). Pairing learners in pair work activity. Language Teaching Research, 17(1), 31-48.
- Sundqvist, P., Sandlund, E., Skar, G. B., & Tengberg, M. (2020). Effects of rater training on the assessment of L2 English oral proficiency. Nordic Journal of Modern Language Methodology, 8(1), 3-29.
- Takala, S. (Ed.). (2009). Reference supplement to the manual for relating language examinations to the Common European Framework of Reference for Languages: Learning, teaching, assessment. Council of Europe/Language Policy Division.
- Taylor, L. (2003). The Cambridge approach to speaking assessment. Research Notes, 13, 2-4.
- Tindal, G., & Haladyna, T. M. (Eds.). (2002). Large-scale assessment programs for all students: Validity, technical adequacy, and implementation. Lawrence Erlbaum Associates.
- Wallace, M. J. (1998). Action research for language teachers. Cambridge University Press.
- Wang, J., & Brown, M. S. (2008). Automated essay scoring versus human scoring: A correlational study. Contemporary Issues in Technology & Teacher Education, 8(4), 310-325.
- Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287.
- Weir, C., & Milanovic, M. (Eds.). (2003). Continuity and innovation: Revising the Cambridge Proficiency in English examination 1913-2002. Cambridge University Press.
- Wind, S. A. (2018). Examining the impacts of rater effects in performance assessments. Applied Psychological Measurement, 43(2), 159-171.
- Wright, B. D., Linacre, J. M., Gustafson, J.-E., & Martin-Löf, P. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370.
- 王文中(2004)。Rasch 測量理論與其在教育和心理之應用。教育與心理研究,27,637-694。[Wang, W.-C. (2004). Rasch measurement theory and its applications in education and psychology. Journal of Education & Psychology, 27, 637-694.]
- 何德華、李萍、賴思悅、潘家貝、阿芬達(2019):〈印尼旅蛙來電了〉。YouTube。[Rau, D. V., Pulungan, P. L. S., Lase, A., Panggabean, G. C., & Samosir, A. (2019). Indonesian travel frog called. YouTube. https://www.youtube.com/playlist?list=PLQn99bzkJv9yDZbCZaQE4Sj23guoQ9AVu]
- 余民寧(2013)。口試在國家考試應用之再檢討與改進。國家菁英季刊,9(2),87-107。[Yu, M.-N. (2013). Reexamining and improving the use of oral examinations in national examinations. National Elite Quarterly, 9(2), 87-107.]
- 吳昭容、曾建銘、鄭鈐華、陳柏熹、吳宜玲(2018)。領域特定詞彙知識的測量:三至八年級學生數學詞彙能力。教育研究與發展期刊,14(4),1-40。[Wu, C.-J., Tseng, C.-M., Cheng, C.-H., Chen, P.-H., & Wu, Y.-L. (2018). Measuring domain-specific vocabulary knowledge: The mathematical vocabulary ability of students in grades 3 to 8. Journal of Educational Research and Development, 14(4), 1-40.]
- 張可家、施泰亨、藍珮君(2011)。「華語文口語能力測驗」評分者一致性探討。華語文能力測驗成果發表會,臺北市。[Chang, K.-C., Shih, T.-H., & Lan, P.-J. (2011). Rater consistency in the TOCFL speaking test. Paper presented at the conference on Test of Chinese as a Foreign Language outcomes, Taipei.]
- 莫慕貞(2019,11 月 8-9 日):〈精進教學工作坊:Rasch 可觀測量在大學甄試檔案評量之應用〉(工作坊)。國立中正大學,嘉義。[Mok, M. M. C. (2019, November 8-9). Teaching enhancement workshop: Applying Rasch objective measurement to portfolio assessment in university admissions (Workshop). National Chung Cheng University, Chiayi. https://reurl.cc/edQbyW]
- 廖才儀(2016)。「華語文口語能力測驗」評分者內評分偏誤研究—以入門基礎級為對象。第九屆國際電腦漢語教學研討會(TCLT9),澳門。[Liao, T.-Y. (2016). Intra-rater bias in the TOCFL speaking test: The case of the Novice level. Paper presented at the 9th International Conference on Technology and Chinese Language Teaching (TCLT9), Macau.]
- 藍珮君(2012)。以多面向 Rasch 測量模式分析 TOCFL 口語測驗評分者訓練效果。永續教育發展創新與實踐論文集:2010 年國際學術研討會—測驗及評量論文專輯。[Lan, P.-J. (2012). A many-facet Rasch analysis of rater training effects on the TOCFL speaking test. In Proceedings on sustainable educational development, innovation, and practice: Testing and assessment papers from the 2010 international conference.]