Title

以「補充性表現水平描述輔助自陳式測量構念」之延伸Angoff標準設定研究

Parallel Title

Extended Angoff Method in Setting Standards for Self-Report Measures With Supplementary Performance-Level Descriptors

DOI

10.6251/BEP.202112_53(2).0003

Author

謝進昌(Jin-Chang Hsieh)

Keywords

延伸Angoff法 ; 自陳式測量 ; 補充性表現水平描述 ; 標準設定 ; 臺灣學生成就長期追蹤評量 ; extended Angoff method ; self-report measure ; supplementary performance-level descriptors ; standard setting ; Taiwan Assessment of Student Achievement: Longitudinal Study

Journal

教育心理學報 (Bulletin of Educational Psychology)

Volume/Issue and Publication Date

Vol. 53, No. 2 (2021/12/01)

Pages

307 - 334

Language

Traditional Chinese

Chinese Abstract

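With the rollout of the 12-Year Basic Education Curriculum, whose core emphasis is the holistic development of students' knowledge, attitudes, skills, and strategies, the Taiwan Assessment of Student Achievement: Longitudinal Study (TASAL) was established to respond to the impact of this curriculum reform on student performance. Its purpose is to describe the performance of Taiwanese students and to investigate influencing factors, and it is by nature a standards-based assessment. A review of past standard-setting research, however, shows that most studies focus on standard setting in cognitive subject domains and seldom address affective or strategic dimensions. Accordingly, this study proposes the concept of supplementary performance-level descriptors (S-PLDs) to assist expert teachers in using the extended Angoff standard-setting method to set passing scores for constructs assessed by self-report measures, and it examines procedural, internal, and external validity evidence to support the reasonableness of the results. The results show that most standard-setting panelists agreed that the process was appropriate. Through discussion and reflection, the panelists converged on acceptable and appropriate results, and when the stability and consistency of their ratings across rounds were examined, the errors mostly fell within a reasonable range. Finally, the two cut points set in this study differentiated, to a reasonable degree, the performance of strategy users at different levels on an external criterion (English comprehension). Overall, the results are supported by procedural, internal, and external evidence, and several suggestions are offered at the end of the article for future researchers.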

English Abstract

One vision of the 12-Year Basic Education Curriculum in Taiwan is to promote the comprehensive learning and development of all students. To ensure the quality of this curriculum reform, the Ministry of Education funded a long-term project, the Taiwan Assessment of Student Achievement: Longitudinal Study (TASAL), to evaluate the impact of the curriculum on student performance. The TASAL is a large-scale standards-based assessment, and standard setting is one of its main tasks. A comprehensive literature review indicated that most empirical studies related to standard setting have focused on cognitive domains and that few studies have undertaken expert-oriented standard-setting processes in affective domains because of practical limitations. The present study suggests a new approach, employing supplementary performance-level descriptors (S-PLDs) in an extended Angoff method in setting standards for self-report measures. The purpose of this study was to uncover evidence of the procedural, internal, and external validity of implementing an extended Angoff procedure with S-PLDs in standard setting for English comprehension strategy use among seventh grade students in Taiwan. PLDs are designed to outline the knowledge, skills, and practices that indicate the level of student performance in a target domain. In the present study, the use of comprehension strategies for learning English as a foreign language was examined. S-PLDs provide comparable but unique functions within the standard-setting process. S-PLDs offer supplementary material to subject matter experts to facilitate the formation of profiles of student performance in target domains, especially when ambiguities in conventional PLDs may prevent expert consensus during the standard-setting process. In this study, stratified two-stage cluster sampling was adopted to select representative seventh graders in Taiwan during the 2018-2019 academic year. 
Sampling yielded 7,246 students, of whom only 2,732 (1,417 boys and 1,315 girls) received an English comprehension strategy use questionnaire and an English proficiency test. Student performance on both measurement instruments was the basis for writing PLDs and S-PLDs. The scale measuring English comprehension strategy use was a 4-point discrete visual analogue scale self-report measure developed through standardized procedures and comprised four dimensions: memorization (6 items), cognition (6 items), inference (8 items), and comprehension monitoring (10 items) strategies. The results of four-dimensional confirmatory factor analysis indicated a favorable model-data fit, except for the chi-square value, which was affected by the large sample size. Moreover, the English proficiency test used was a cognitive measure assessing students' listening and reading comprehension abilities through the use of multiple-choice and constructed-response items. A total of 182 items were developed through a standardized procedure and divided into 13 blocks to assemble 26 test booklets. Each booklet, containing 28 items, was randomly delivered to a participating student; each student completed only one booklet. After data cleansing and item calibration with a multidimensional random coefficient multinomial logit model and the Test Analysis Modules (Robitzsch et al., 2020), the information-weighted fit mean-square indices for all test items ranged from 0.79 to 1.37, meeting the criterion proposed by Linacre (2005). An expert-oriented standard-setting meeting was hosted on May 20, 2020, after advance materials, such as the agenda and instructions on the standard-setting method, had been sent to all experts. Eight experts from across Taiwan were invited to join the meeting, and they all had experience with standard-setting meetings for student performance on English proficiency tests. 
The average number of years of teaching experience for these experts was 18.75, and seven had experience in teaching low achievers. Overall, the experts had sufficient prerequisite knowledge and experience with standard-setting processes. On the day of the standard-setting meeting, a series of events, including orientation, training and practice, and three rounds of the extended Angoff standard-setting method with different types of feedback provided between rounds, were undertaken. Feedback questionnaires were developed, and discussions among the experts between the rounds were recorded and analyzed as evidence of procedural and internal validity. Most of the subject matter experts were satisfied with the events during the standard-setting process and agreed that they could set satisfactory cutoff scores for future usage. From the results of feedback questionnaires completed between rounds, the experts nearly unanimously agreed that the materials received in advance; the introductions to PLDs, S-PLDs, and the extended Angoff method; and previous experience in setting standards for English proficiency were beneficial in judging items during the process. Additionally, the experts agreed that the S-PLDs played a key role in facilitating the formation of outlines for student performance in comprehension strategy use across different levels. All of these results indicate procedural validity. For evidence of internal validity, classification error (the ratio of the standard error of the passing score to the measurement error) was computed to indicate the consistency of the item ratings between and within the experts during the three-round process. Between experts and across rounds, the classification error ranged from 0.08 to 0.36 for memorization strategies, 0.14 to 0.49 for cognition strategies, 0.19 to 0.61 for inference strategies, and 0.24 to 0.72 for comprehension monitoring strategies. 
These results indicate that the cognitive levels for the four dimensions affect the consistency of item rating. Strategies with more abstract item content tended to have higher classification error. Furthermore, the lowest classification error values occurred in the second round for memorization and inference strategies and in the third round for cognition and comprehension monitoring strategies. All low values for each dimension were beneath the cutoff of 0.33 proposed by Kaftandjieva (2010), except for the value of 0.37 for comprehension monitoring strategies. Regarding the rating consistency within experts between rounds, the results showed no extreme classification error, and most of the values were beneath 0.33, with the exceptions of 0.35 for cognition strategies and 0.37, 0.42, and 0.61 for comprehension monitoring strategies. Therefore, most experts exhibited rating consistency between the rounds. Additionally, the results of a content analysis of the item rating discussions indicated that three reference sources might affect experts' judgments regarding the items: (1) students' actual performance, (2) PLDs and S-PLDs, and (3) experts' personal expectations. For example, one expert might give a lower score because, in their teaching experience, their students tend to perform poorly on a particular item, whereas another expert might give a higher score because of their personal expectations. To examine external validity, student performance on English proficiency tests was adopted as an external criterion. With two cutoff scores used to divide students into basic, proficient, and advanced users in each dimension, a medium effect size was obtained for memorization strategies, and large effect sizes were obtained for cognition, inference, and comprehension monitoring strategies. 
Furthermore, to compare the final cutoff scores obtained through the study method with those from existing methods, the study adopted the concept from TIMSS and PIRLS for setting standards for affective domains (Martin et al., 2014, p. 308). The classification accuracy indices, which indicate the proportions of students classified identically, were 90.25%, 81.20%, 82.52%, and 87.56% for the four dimensions. To sum up, the present study obtained satisfactory evidence of the procedural, internal, and external validity of using an extended Angoff procedure for setting standards for self-report measures with S-PLDs; additional suggestions are presented herein.
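
The two indices named in the abstract, classification error and classification accuracy, can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: the standard error of the passing score is taken as the standard deviation of the experts' cut scores divided by the square root of the number of experts, and classification accuracy is taken as the share of students placed in the same level (basic, proficient, advanced) by two pairs of cutoff scores. The function names and every number in the usage note are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import stdev


def classification_error(expert_cut_scores, measurement_error):
    """Ratio of the standard error of the passing score to the measurement error.

    Assumption: the standard error of the passing score is the sample standard
    deviation of the experts' cut scores divided by sqrt(number of experts).
    """
    n_experts = len(expert_cut_scores)
    se_passing_score = stdev(expert_cut_scores) / sqrt(n_experts)
    return se_passing_score / measurement_error


def classify(score, cut_low, cut_high):
    """Place one score into one of three levels using two cutoff scores."""
    if score < cut_low:
        return "basic"
    if score < cut_high:
        return "proficient"
    return "advanced"


def classification_agreement(scores, cuts_a, cuts_b):
    """Proportion of scores that land in the same level under both cutoff pairs."""
    same = sum(classify(s, *cuts_a) == classify(s, *cuts_b) for s in scores)
    return same / len(scores)
```

With two hypothetical experts whose cut scores are 10 and 12 and a hypothetical measurement error of 4.0, `classification_error([10, 12], 4.0)` gives a ratio of about 0.25, which would fall beneath the 0.33 cutoff that the abstract attributes to Kaftandjieva (2010).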

Subject Classification: Social Sciences > Psychology
Social Sciences > Education
References
  1. 吳宜芳, Y.-F.,鄒慧英, H.,林娟如, J.-R.(2010)。標準設定效度驗證之探究:以大型數學學習成就評量為例。測驗學刊,57(1),1-27。
  2. 吳毓瑩, Y.-Y.,陳彥名, Y.-M.,張郁雯, Y.-W.,陳淑惠, S.-H. E.,何東憲, T.-H.,林俊吉, J.-J.(2009)。以常態混組模型討論書籤標準設定法對英語聽讀基本能力標準設定有效性之輻合證據。教育心理學報,41(1),69-89。
  3. 林小慧, H.-H.,吳心楷, H.-K.(2019)。科學探究能力評量之標準設定與其效度檢核。教育心理學報,50(3),473-502。
  4. 柯華葳, H.-W.(2020)。臺灣閱讀策略教學政策與執行。教育科學研究期刊,65(1),93-114。
  5. 曾芬蘭, F.-L.,林奕宏, Y.-H.,邱佳民, J.-M.(2017)。監控評分者效果的 Yes/No Angoff 標準設定法之效度檢核:以國中教育會考數學科為例。測驗學刊,64(4),403-432。
  6. 謝名娟, M.-C.(2013)。以多層面 Rasch 分析的角度來評估標準設定之變異性。教育心理學報,44(4),793-811。
  7. 謝進昌, J.-C.,謝名娟, M.-C.,林世華, S.-H.,林陳涌, C.-Y.,陳清溪, C.-H.,謝佩蓉, P.-J.(2011)。大型資料庫國小四年級自然科學習成就評量標準設定結果之效度評估。教育科學研究期刊,56(1),1-32。
  8. 十二年國民基本教育課程綱要總綱(2014 年 11 月)。[Curriculum Guidelines of 12-Year Basic Education: General Guidelines. (2014, November).]
  9. Adams, R. J.,Wilson, M. R.,Wang, W. L.(1997).The multidimensional random coefficients multinomial logit model.Applied Psychological Measurement,21(1),1-24.
  10. American Educational Research Association,American Psychological Association,National Council on Measurement in Education(2014).Standards for educational and psychological testing.American Educational Research Association.
  11. Ardasheva, Y.,Wang, Z.,Adesope, O. O.,Valentine, J. C.(2017).Exploring effectiveness and moderators of language learning strategy instruction on second language and self-regulated learning outcomes.Review of Educational Research,87(3),544-582.
  12. Barjesteh, H.,Mukundan, J.,Vaseghi, R.(2014).Synthesis of language learning strategies: Current issues, problems and claims made in learner strategy research.Advances in Language & Literary Studies,5(6),68-74.
  13. Beaton, A. E.,Allen, N. L.(1992).Interpretation scales through scale anchoring.Journal of Educational Statistics,17(2),191-201.
  14. Beuk, C. H.(1984).A method for reaching a compromise between absolute and relative standards in examinations.Journal of Educational Measurement,21(2),147-152.
  15. Bock, R. D.,Mislevy, R. J.(1982).Adaptive EAP estimation of ability in a microcomputer environment.Applied Psychological Measurement,6(4),431-444.
  16. Bourque, M. L.(2009).Unpublished.
  17. Cizek, G. J.(Ed.)(2001).Standard setting: Concepts, methods, and perspectives.Erlbaum.
  18. Cizek, G. J.,Bunch, M. B.(2007).Standard setting: A guide to establishing and evaluating performance standards on tests.SAGE Publications.
  19. Cizek, G. J.,Bunch, M. B.,Koons, H.(2004).Setting performance standards: Contemporary methods.Educational Measurement: Issues and Practice,23(4),31-50.
  20. Cohen, A. S.,Kane, M. T.,Crooks, T. J.(1999).A generalized examinee-centered method for setting standards on achievement tests.Applied Measurement in Education,12(4),343-366.
  21. Cohen, J.(1988).Statistical power analysis for the behavioral sciences.Lawrence Erlbaum Associates Publishers.
  22. Efron, B.(1981).Nonparametric estimates of standard error: The jackknife, the bootstrapping and other methods.Biometrika,68(3),589-599.
  23. Gambrell, L. B.,Bales, R. J.(1986).Mental imagery and the comprehension-monitoring performance of fourth- and fifth- grade poor readers.Reading Research Quarterly,21(4),454-464.
  24. Griffiths, C.(2007).Language learning strategies: Students’ and teachers’ perceptions.ELT Journal,61(2),91-99.
  25. Hambleton, R. K.,Plake, B. S.(1995).Using an extended Angoff procedure to set standards on complex performance assessments.Applied Measurement in Education,8(1),41-55.
  26. Hu, L.,Bentler, P. M.(1999).Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives.Structural Equation Modeling,6(1),1-55.
  27. Impara, J. C.,Plake, B. S.(1997).Standard setting: An alternative approach.Journal of Educational Measurement,34(4),353-366.
  28. Jaeger, R. M.(1991).Selection of judges for standard‐setting.Educational Measurement: Issues and Practice,10(2),3-14.
  29. Jeon, E. H.,Yamashita, J.(2014).L2 reading comprehension and its correlates: A meta-analysis.Language Learning,64(1),160-212.
  30. Kaftandjieva, F.(2010).Methods for setting cut scores in criterion-referenced achievement tests: A comparative analysis of six recent methods with an application to tests of reading in EFL.EALTA.
  31. Kane, M. T.(1994).Validating the performance standards associated with passing scores.Review of Educational Research,64(3),425-461.
  32. Kelly, D. L.(1999).Boston College.
  33. Kolen, M. J.,Brennan, R. L.(2004).Test equating, scaling, and linking: Methods and practices.Springer Publishing Company.
  34. Linacre, J. M.(2005).A user’s guide to Winsteps/Ministeps: Rasch-Model programs.MESA Press.
  35. McDonald, R. P.,Ho, M.-H. R.(2002).Principles and practice in reporting structural equation analyses.Psychological Methods,7(1),64-82.
  36. Mullis, I. V. S.,Cotter, K. E.,Centurino, V. A. S.,Fishbein, B. G.,Liu, J.(2016).Using scale anchoring to interpret the TIMSS 2015 achievement scales.Methods and procedures in TIMSS 2015
  37. Mullis, I. V. S.,Prendergast, C. O.(2017).Using scale anchoring to interpret the PIRLS and ePIRLS 2016 achievement scales.Methods and procedures in PIRLS 2016
  38. Nassif, P. M.(1978).Standard setting for criterion referenced teacher licensing tests.National Council on Measurement in Education Annual Meeting
  39. Organisation for Economic Co-operation and Development(2020).Unpublished.
  40. Organisation for Economic Co-operation and Development(2019).PISA 2018 assessment and analytical framework.
  41. Padron, N.Y.,Waxman, H. C.(1988).The effect of ESL students’ perception of their cognitive strategies on reading achievement.TESOL Quarterly,22(1),146-150.
  42. Pearson, P. D.(Ed.)(1984).Metacognitive skills and reading.Longman.
  43. Plonsky, L.(2011).The effectiveness of second language strategy instruction: A meta-analysis.Language Learning,61(4),993-1038.
  44. Robitzsch, A., Kiefer, T., & Wu, M. (2020). TAM: Test analysis modules. R package version 3.5–19. The Comprehensive R Archive Network. https://CRAN.R-project.org/package=TAM
  45. Rutkowski, L.(Ed.),von Davier, M.(Ed.),Rutkowski, D.(Ed.)(2014).Handbook of international large-scale assessment.Chapman & Hall/CRC.
  46. Sireci, S. G.,Hauger, J. B.,Wells, C. S.,Shea, C.,Zenisky, A. L.(2009).Evaluation of the standard setting on the 2005 grade 12 National Assessment of Educational Progress mathematics test.Applied Measurement in Education,22(4),339-358.
  47. Thorndike, R. L.(Ed.)(1971).Educational measurement.American Council on Education.
  48. van der Linden, W. J.(Ed.),Hambleton, R. K.(Ed.)(1997).Handbook of modern item response theory.Springer Publishing Company.
  49. Warm, T. A.(1989).Weighted likelihood estimation of ability in item response theory.Psychometrika,54,427-450.
  50. Yao, L.,Schwarz, R. D.(2006).A multidimensional partial credit model with associated item and test statistics: An application to mixed-format test.Applied Psychological Measurement,30(6),469-492.
  51. Zieky, M. J.,Livingston, S. A.(1977).Manual for setting standards on the Basic Skills Assessment tests.Educational Testing Service.
  52. 國家教育研究院(無日期):〈臺灣學生成就長期追蹤評量計畫網站〉。https://tasal.naer.edu.tw/ [National Academy for Educational Research. (n.d.). Taiwan Assessment of Student Achievement: Longitudinal Study website. https://tasal.naer.edu.tw/]
  53. 曾建銘, C.-M.,王暄博, H.-P.(2012)。標準設定之效度評估:以 TASA 國語科為例。教育學刊,39,77-118。
  54. 曾建銘, C.-M.,王暄博, H.-P.(2012)。臺灣學生學習成就評量資料庫標準設定探究:以 2009 年國小六年級社會科為例。教育與心理研究,35(3),115-149。
  55. 黃馨瑩, H.-Y.,謝名娟, M.-C.,謝進昌, J.-C.(2013)。臺灣學生學習成就評量英語科標準設定之效度評估研究。教育與心理研究,36(2),87-112。
  56. 謝進昌(計畫主持人), J.-C.(Principal Investigator)(2021)。國家教育研究院年度研究成果報告,國家教育研究院=National Academy for Educational Research。
  57. 謝進昌(2021b):《「混合專家與學生實徵表現導向」大型教育評量標準設定之效度評估研究》(已投稿),國家教育研究院測驗及評量研究中心。[Hsieh, J.-C. (2021b). Assessing the validity of standard setting of large-scale assessment for English as a foreign language students with a hybrid of expert and empirical performance model (Manuscript submitted for publication). Research Center for Testing and Assessment, National Academy for Educational Research.]
Cited By
  1. 謝進昌(2023)。「混合專家與學生實徵表現導向」大型教育評量標準設定之效度評估研究。教育科學研究期刊,68(2),1-35。