Title

「混合專家與學生實徵表現導向」大型教育評量標準設定之效度評估研究

Parallel Title (English)

Assessing the Validity of Standard-Setting for an English Language Assessment With a Hybrid Expert and Empirical Performance Model

DOI

10.6209/JORIES.202306_68(2).0001

Author

謝進昌(Jin-Chang Hsieh)

Keywords

English comprehension; hybrid of expert and student empirical performance models; Taiwan Assessment of Student Achievement: Longitudinal Study; standards-based large-scale assessment; standard setting

Journal

教育科學研究期刊 (Journal of Research in Education Sciences)

Volume/Issue (Publication Date)

Vol. 68, No. 2 (2023/06/01)

Pages

1-35

Language

Traditional Chinese; English

Chinese Abstract (translated)

To evaluate the effect of implementing the Curriculum Guidelines of 12-Year Basic Education on student performance, the Taiwan Assessment of Student Achievement: Longitudinal Study (TASAL) was launched to track the growth of Taiwanese students' literacy, investigate influencing factors, and provide feedback for the national curriculum guidelines. TASAL is, in essence, a standards-based large-scale educational assessment. Taking the perspective of the overall process of standard development, this study proposed a "hybrid expert and student empirical performance" model that accumulates procedural, internal, external, and consequential (indirectly estimated) validity evidence across multiple dimensions and from multiple sources, in order to address the validity of the standard-setting results for English comprehension at the fourth learning stage. While developing the assessment instruments through a standardized procedure, the researchers progressively incorporated the task elements of standard setting, including constructing theory, developing performance level descriptors, preparing standard-setting materials and test items, and developing the associated large-scale assessment techniques. Evaluation questionnaires completed by 15 expert panelists were collected to examine the reasonableness of the standard-setting process and results; most panelists endorsed their appropriateness. In addition, through the provision of feedback, panel discussion, and reflection, the panelists' item judgments became increasingly consistent across rounds, and most classification errors fell within a reasonable range. Moreover, using the English comprehension performance of students advancing from seventh to eighth grade as an external criterion, the established cut scores effectively distinguished the external-criterion performance of students at different levels. Overall, at the standard-formation (standard-setting) stage, this study obtained sound procedural, internal, external, and consequential (indirectly estimated) validity evidence. Finally, suggestions are provided for future reference.

English Abstract

Background and Purpose. The Taiwan Assessment of Student Achievement: Longitudinal Study (TASAL) was implemented to evaluate the effect of the new 12-year basic education curriculum on student performance in Taiwan. TASAL is a standards-based, large-scale assessment that aims to track the literacy growth of Taiwanese students, explore relevant factors, and collect empirical evidence to assist in the development of future curriculum guidelines. This study assessed the validity of standard-setting with a hybrid model combining expert and student empirical performance. The hybrid model exhibits multidimensional, multisource, and long-term cumulative features. The multidimensional feature provides evidence for procedural, internal, and external validity and for setting appropriate standards (Kane, 1994, 2001; Pant et al., 2009). The multisource feature indicates that the evidence of validity is derived from various sources, such as expert opinions and students' empirical performance. Finally, the long-term cumulative feature represents the process of accumulating evidence over a long period. Presenting every type of evidence in a single study is challenging because of time and resource constraints, and the burden placed on researchers and students must also be considered.

Method. 1. Sampling: TASAL formally began evaluating seventh-grade students in 2019; in 2020, the same cohort was evaluated in the eighth grade. The sampling method was stratified two-stage cluster sampling. Initially, 256 junior high schools were selected to participate in the evaluation; ultimately, 246 schools with a total of 2,793 students were enrolled in the project. For the TASAL English test, 2,793 seventh-grade students took the test in 2019 and 2,893 eighth-grade students took it in 2020; among them, 2,554 students took the English test in both years. 2. 
Materials: The TASAL English core competence assessment was developed through a standardized procedure, including purpose clarification, theory construction, assessment guidelines, performance level descriptor (PLD) development, test item development, test assembly, and data analysis. The assessment examines English reading comprehension according to the corresponding content in the 12-year basic education curriculum. Based on the concept of transforming verb-noun usage into cognitive processes and content knowledge, as proposed by Anderson et al. (2001), a separate set of assessment criteria and test items was developed for the TASAL English core competence assessment to evaluate reading comprehension. Six levels of performance descriptors were initially proposed (Hsieh, 2023); however, no test items corresponded to the sixth (highest) level, because the standard-setting process still focused on the seventh-grade test items. Therefore, this study focused on the first five levels: acquiring linguistic fluency, locating explicitly stated information, literal comprehension, implicit comprehension, and evaluation and reflection beyond text comprehension. Following a literature review, the text types used in the TASAL English core competence assessment are based on the OECD (2019) text types, modified to include descriptive, introductive, transactional, expository, commentary, persuasive, narrative, and literary texts. The seventh-grade assessment contained 182 test items and the eighth-grade assessment contained 196 test items, with 84 common items in both. Response consistency was good: the expected a posteriori (EAP) reliability estimates were 0.85 and 0.91 for the seventh-grade and eighth-grade assessments, respectively. 3. 
Standard-setting: This study employed the extended Angoff method (Hambleton & Plake, 1995) to establish assessment standards. A total of 15 experts from various regions of Taiwan were trained and participated in the standard-setting meeting; 10 were women and 5 were men, with an average of 18.25 years of teaching experience. The meeting was conducted in three rounds, and student abilities and cutoff scores were estimated by weighted likelihood estimation (Warm, 1989). Statistical analyses were performed in R (R Core Team, 2022) with the TAM package (Robitzsch et al., 2020).

Results and Conclusion. Feedback was collected using a standard-setting evaluation questionnaire. Most experts rated the process and outcome of the standard-setting meeting as well above or above average, and they agreed or strongly agreed that the feedback provided and the PLD procedures were helpful in establishing standards. In summary, this study provides satisfactory evidence for the procedural validity of standard-setting. The study also provides evidence for internal validity. In the initial round, the standard errors of the cutoff scores across all experts and levels ranged from 2.03 to 11.58; in subsequent rounds, these errors decreased. In general, most standard errors, taken relative to the measurement error of 34.64, were within the acceptable ratio of 0.33, which is consistent with the results of Kaftandjieva (2010, p. 104). Using the English comprehension performance of eighth-grade students as the external criterion, the cutoff scores set from the seventh-grade assessment significantly distinguished between different levels of achievement, with a partial η² of .506, a large effect size (Cohen, 1988). In conclusion, this study provides evidence for the external validity of standard-setting. 
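The internal- and external-evidence checks described above (aggregating panelists' Angoff-style judgments into a cut score, comparing its standard error with the measurement error against the 0.33 criterion, and computing a partial η² effect size) can be sketched in Python. The ratings and group values below are hypothetical illustrations, not the study's data:

```python
import statistics

def cut_score_summary(ratings, sem):
    """Aggregate one recommended cut score per panelist.

    Returns the panel-mean cut score, its standard error across
    panelists, and the SE/SEM ratio to compare against the 0.33
    criterion (Kaftandjieva, 2010)."""
    cut = statistics.mean(ratings)
    se = statistics.stdev(ratings) / len(ratings) ** 0.5
    return cut, se, se / sem

def partial_eta_squared(groups):
    """One-way ANOVA effect size: SS_between / (SS_between + SS_within)."""
    grand = statistics.mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    return ss_between / (ss_between + ss_within)

# Hypothetical panel of five experts; SEM of 34.64 taken from the abstract
cut, se, ratio = cut_score_summary([250, 255, 248, 252, 245], sem=34.64)
print(cut, round(se, 2), ratio <= 0.33)  # → 250 1.7 True
```

With a larger SEM, a given panel SE more easily satisfies the 0.33 criterion; the ratio is what the criterion bounds, not the raw standard error.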
Finally, several suggestions are offered based on the results. For example, when evaluating changes in student performance, regression toward the mean may be a crucial factor affecting standard-setting results when cutoff scores are vertically articulated across grades. Additionally, continuously collecting evidence to support the validity of standard-setting is crucial for responding to educational policies and curriculum guidelines; the results therefore underscore the importance of building ongoing validity evidence in future research.
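The stratified two-stage cluster design described in the Method section (strata → schools → students) can be illustrated with a minimal Python sketch. The stratum labels, school names, and sample sizes below are hypothetical, not TASAL's actual sampling frame:

```python
import random

def two_stage_cluster_sample(strata, n_schools, n_students, seed=0):
    """Stratified two-stage cluster sampling sketch.

    Stage 1 draws schools (clusters) within each stratum; stage 2 draws
    students within each sampled school. `strata` maps a stratum label
    to a list of (school, [student ids]) pairs."""
    rng = random.Random(seed)  # seeded for a reproducible illustration
    sampled = []
    for schools in strata.values():
        for school, students in rng.sample(schools, min(n_schools, len(schools))):
            sampled.append((school, rng.sample(students, min(n_students, len(students)))))
    return sampled

# Hypothetical frame: two strata, three schools each, five students per school
frame = {
    s: [(f"{s}-school{i}", list(range(i * 10, i * 10 + 5))) for i in range(3)]
    for s in ("urban", "rural")
}
picked = two_stage_cluster_sample(frame, n_schools=2, n_students=3)
print(len(picked))  # 2 strata × 2 schools → 4 sampled schools
```

A production design would additionally use probability-proportional-to-size selection and sampling weights, as large-scale assessments such as TIMSS and PISA do; this sketch shows only the two-stage structure.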

Subject Classification: Social Sciences > Education
References
  1. 侯佩君, P.-C.,杜素豪, S.-H.,廖培珊, P.-S.,洪永泰, Y.-T.,章英華, Y.-H.(2008)。台灣鄉鎮市區類型之研究:「台灣社會變遷基本調查」第五期計畫之抽樣分層效果分析。調查研究─方法與應用,23,7-32。
  2. 張銘秋, M.-C.,黃瓅瑩, L.-Y.,陳佳蓉, C.-J.,陳柏熹, P.-H.,曾芬蘭, F.-L.(2022)。國中教育會考數學科的回沖效應初探。教育科學研究期刊,67(1),227-254。
  3. 曾芬蘭, F.-L.,林奕宏, Y.-H.,邱佳民, J.-M.(2017)。監控評分者效果的Yes/No Angoff標準設定法之效度檢核:以國中教育會考數學科為例。測驗學刊,64(4),403-432。
  4. 謝進昌, J.-C.(2021)。以「補充性表現水平描述輔助自陳式測量構念」之延伸Angoff標準設定研究。教育心理學報,53(2),307-334。
  5. 謝進昌, J.-C.,謝名娟, M.-C.,林世華, S.-H.,林陳涌, C.-Y.,陳清溪, C.-H.,謝佩蓉, P.-J.(2011)。大型資料庫國小四年級自然科學習成就評量標準設定結果之效度評估。教育科學研究期刊,56(1),1-32。
  6. American Educational Research Association,American Psychological Association,National Council on Measurement in Education(2014).Standards for educational and psychological testing.American Educational Research Association.
  7. Anderson, L. W.(Ed.),Krathwohl, D. R.(Ed.),Airasian, P. W.(Ed.),Cruikshank, K. A.(Ed.),Mayer, R. E.(Ed.),Pintrich, P. R.(Ed.),Raths, J.(Ed.),Wittrock, M. C.(Ed.)(2001).A taxonomy for learning, teaching, and assessing: A revision of Bloom’s taxonomy of educational objectives.Longman.
  8. Beaton, A. E.,Allen, N. L.(1992).Interpreting scales through scale anchoring.Journal of Educational Statistics,17(2),191-201.
  9. Cizek, G. J.,Bunch, M. B.(2007).Standard setting: A guide to establishing and evaluating performance standards on tests.Sage.
  10. Cizek, G. J.,Bunch, M. B.,Koons, H.(2004).Setting performance standards: Contemporary methods.Educational Measurement: Issues and Practice,23(4),31-50.
  11. Cohen, J.(1988).Statistical power analysis for the behavioral sciences.Lawrence Erlbaum Associates.
  12. Efron, B.(1979).Bootstrap methods: Another look at the Jackknife.The Annals of Statistics,7(1),1-26.
  13. Ferrara, S.,Johnson, E.,Chen, W. H.(2005).Vertically articulated performance standards: Logic, procedures, and likely classification accuracy.Applied Measurement in Education,18(1),35-59.
  14. Gagne, E. D.,Yekovich, C. W.,Yekovich, F. R.(1993).The cognitive psychology of school learning.Harper Collins.
  15. Groves, R. M.,Fowler, F. J.,Couper, M. P.,Lepkowski, J. M.,Singer, E.,Tourangeau, R.(2011).Survey methodology.John Wiley & Sons.
  16. Hambleton, R. K.(2001).Setting performance standards on educational assessments and criteria for evaluating the process.Standard setting: Concepts, methods, and perspectives
  17. Hambleton, R. K.,Plake, B. S.(1995).Using an extended Angoff procedure to set standards on complex performance assessments.Applied Measurement in Education,8(1),41-55.
  18. Hoover, W. A.,Gough, P. B.(1990).The simple view of reading.Reading and Writing: An Interdisciplinary Journal,2,127-160.
  19. Ingels, S. J.,Pratt, D. J.,Rogers, J. E.,Siegel, P. H.,Stutts, E. S.(2005).,National Center for Education Statistics.
  20. Kaftandjieva, F.(2010).Methods for setting cut scores in criterion-referenced achievement tests: A comparative analysis of six recent methods with an application to tests of reading in EFL.CITO.
  21. Kane, M. T.(1994).Validating the performance standards associated with passing scores.Review of Educational Research,64(3),425-461.
  22. Kane, M. T.(2001).So much remains the same: Conception and status of validation in setting standards.Standard setting: Concepts, methods, and perspectives
  23. Kelly, D. L.(1999).Boston College.
  24. Kolen, M. J.,Brennan, R. L.(2004).Test equating, scaling, and linking: Methods and practices.Springer.
  25. LaRoche, S.,Joncas, M.,Foy, P.(2020).Methods and procedures: TIMSS 2019 technical report (unpublished).
  26. Linacre, J. M.(2005).A user’s guide to Winsteps/Ministeps Rasch model programs.MESA Press.
  27. Linn, R. L.,Herman, J. L.(1997).,the Education Commission of the States.
  28. Martin, M. O.(Ed.),Mullis, I. V. S.(Ed.),Hooper, M.(Ed.)(2016).Methods and procedures in TIMSS 2015.Boston College, TIMSS & PIRLS International Study Center.
  29. Messick, S.(1994).The interplay of evidence and consequences in the validation of performance assessments.Educational Researcher,23(2),13-23.
  30. Mullis, I. V. S.,Martin, M. O.(2015).PIRLS 2016 assessment framework.:TIMSS & PIRLS International Study Center, Lynch School of Education, Boston College.
  31. Mullis, I. V. S.,Prendergast, C. O.(2017).Using scale anchoring to interpret the PIRLS and ePIRLS 2016 achievement scales.Methods and procedures in PIRLS 2016
  32. Nassif, P. M.(1978).Standard setting for criterion referenced teacher licensing tests.The annual meeting of the National Council on Measurement in Education,Toronto, Canada:
  33. Organisation for Economic Co-operation and Development(2020).PISA 2018 technical report.OECD Publishing.
  34. Organisation for Economic Co-operation and Development(2019).PISA 2018 assessment and analytical framework.OECD Publishing.
  35. Pant, H. A.,Rupp, A. A.,Tiffin-Richards, S. P.,Köller, O.(2009).Validity issues in standard-setting studies.Studies in Educational Evaluation,35(2-3),95-101.
  36. Plake, B. S.,Cizek, G. J.(2012).The modified Angoff, extended Angoff, and yes/no standard setting methods.Setting performance standards. Foundations, methods, and innovations
  37. R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
  38. Robitzsch, A., Kiefer, T., & Wu, M. (2020). TAM: Test analysis modules. R package version 3.5-19. https://cran.r-project.org/web/packages/TAM/index.html
  39. Schafer, W. D.(2006).Growth scales as an alternative to vertical scales.Practical Assessment Research & Evaluation,11(4)
  40. Sireci, S. G.,Hauger, J. B.,Wells, C. S.,Shea, C.,Zenisky, A. L.(2009).Evaluation of the standard setting on the 2005 grade 12 national assessment of educational progress mathematics test.Applied Measurement in Education,22(4),339-358.
  41. Thorndike, R. L.(Ed.)(1971).Educational measurement.American Council on Education.
  42. van der Linden, W. J.(Ed.),Hambleton, R. K.(Ed.)(1997).Handbook of modern item response theory.Springer.
  43. Warm, T. A.(1989).Weighted likelihood estimation of ability in item response theory.Psychometrika,54,427-450.
  44. Wixson, K. K.,Valencia, S. W.,Murphy, S.,Phillips, G. W.(2013).,ERIC.
  45. Wyse, A. E.(2017).Five methods for estimating Angoff cut scores with IRT.Educational Measurement: Issues and Practice,36(4),16-27.
  46. Wyse, A. E.(2018).Equating Angoff standard-setting ratings with the Rasch model.Measurement: Interdisciplinary Research and Perspectives,16(3),181-194.
  47. 任宗浩, T.-H.(2018)。,國家教育研究院=National Academy for Educational Research。
  48. 吳正新, J.-S.(2019)。,國家教育研究院=National Academy for Educational Research。
  49. 國家教育研究院(2018)。十二年國民基本教育課程綱要:國民中小學暨普通型高級中等學校:語文領域─英語文。作者。【National Academy for Educational Research. (2018). 12-year basic education curriculum for elementary and high school: English. Author.】
  50. 國家教育研究院(無日期)。首頁。臺灣學生成就長期追蹤評量計畫。2022年3月30日, https://tasal.naer.edu.tw/ 【National Academy for Educational Research. (n.d.). Homepage. Taiwan Assessment of Student Achievement: Longitudinal Study. Retrieved March 30, 2022, from https://tasal.naer.edu.tw/】
  51. 國家教育研究院課程及教學研究中心核心素養工作圈, Research Center for Curriculum and Instruction, National Academy for Educational Research(2015).十二年國民基本教育領域課程綱要─核心素養發展手冊.國家教育研究院=National Academy for Educational Research.
  52. 教育部統計處(2019)。各級學校地理資訊及地區別統計查詢。https://stats.moe.gov.tw/EduGis/ 【Department of Statistics, Ministry of Education. (2019). Statistical query of geographical and regional information for schools at all levels. https://stats.moe.gov.tw/EduGis/】
  53. 黃馨瑩, H.-Y.,謝名娟, M.-C.,謝進昌, J.-C.(2013)。臺灣學生學習成就評量英語科標準設定之效度評估研究。教育與心理研究,36(2),87-112。
  54. 謝進昌, J.-C.(2023).建構英語文素養評量指引:TASAL標準本位大型評量.國家教育研究院=National Academy for Educational Research.