Title

監控評分者效果的Yes/No Angoff標準設定法之效度檢核:以國中教育會考數學科為例

Parallel Title

Validation of the Rater-Effects-Monitored Yes/No Angoff Standard-Setting Method: Using the Taiwan Comprehensive Assessment Program for Junior High School Students Math Exam as an Example

Author(s)

曾芬蘭(Fen-Lan Tseng);林奕宏(Yi-Hung Lin);邱佳民(Jia-Min Chiou)

Keywords

效度 ; 國中教育會考 ; 評分者效果/評分者效應 ; 標準設定法 ; Comprehensive Assessment Program for Junior High School Students ; rater effects ; standard-setting method ; validity

Journal Title

測驗學刊

Volume/Issue (Publication Date)

Vol. 64, No. 4 (2017/12/01)

Pages

403 - 432

Language

Traditional Chinese

Chinese Abstract

This study extends earlier work on the "rater-effects-monitored Yes/No Angoff standard-setting method." Its purpose is to collect validity evidence from multiple sources in order to verify the results produced by that standard-setting method and, on that basis, to examine whether the method is suitable for the Comprehensive Assessment Program for Junior High School Students (CAPJHSS). The earlier study proposed the method by integrating the monitoring of rater effects into the standard-setting procedure, using the partial credit model of item response theory to analyze the item-difficulty judgments that content experts make during the standard-setting process. Following the procedural, internal, and external sources of validity evidence suggested by Kane (1994) and Pitoniak (2003), the present study takes the 2013 CAPJHSS mathematics test data as an example and gathers validity evidence from these different sources to evaluate the standard-setting results. The results show that (1) the experts agreed that the implementation process of the method is reasonable; (2) the results of the method are highly consistent within method, within experts, and between experts; (3) the classification results of this method and of a cluster-analysis-based standard-setting method agree 89.82% of the time; and (4) the correlation between the classifications produced by this method and classifications based on examinees' three-year junior high school grades reaches .75. These findings indicate that both the procedures and the outcomes of this standard-setting method show satisfactory validity, and the method can serve as a reference for testing organizations conducting standard setting.
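For reference, the partial credit model mentioned in the abstract (Masters, 1982, in the reference list) is conventionally written as below. The notation is the standard textbook form, not necessarily the exact parameterization used in the article:

$$
P(X_{ni}=x)=\frac{\exp\!\left(\sum_{k=0}^{x}\bigl(\theta_n-\delta_{ik}\bigr)\right)}{\sum_{h=0}^{m_i}\exp\!\left(\sum_{k=0}^{h}\bigl(\theta_n-\delta_{ik}\bigr)\right)},\qquad x=0,1,\dots,m_i,
$$

where $\theta_n$ is the latent location of person (here, rater) $n$, $\delta_{ik}$ is the $k$-th step difficulty of item $i$, $m_i$ is the highest category of item $i$, and the $k=0$ term is defined as $\sum_{k=0}^{0}(\theta_n-\delta_{ik})\equiv 0$.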

English Abstract

This study extends the newly proposed "rater-effects-monitored Yes/No Angoff standard-setting method." Its purpose is to collect validity evidence from various sources to validate the standard-setting results and to examine whether the method is applicable to the Taiwan Comprehensive Assessment Program for Junior High School Students (CAPJHSS). The method uses the partial credit model of item response theory to analyze the rating data that content experts generate during the standard-setting process, thereby monitoring rater effects. Following the framework of procedural, internal, and external validity evidence suggested by Kane (1994) and Pitoniak (2003), this study collected evidence from these sources using the 2013 CAPJHSS mathematics test data. The results reveal that (a) the content experts agreed that the implementation process of the method is reasonable; (b) the standard-setting results are highly consistent within method, within experts, and between experts; (c) the classification consistency between this method and a cluster-analysis-based standard-setting method is 89.82%; and (d) the correlation between the classifications produced by this method and those based on students' junior high school grades is .75. These results indicate that the standard-setting method possesses adequate validity and can serve as a reference for other testing organizations. Suggestions for future research are provided.
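To make the "rater-effects monitoring" idea concrete, the sketch below computes Rasch infit/outfit mean-square statistics (cf. Linacre, 2002, in the reference list) for one rater's Yes/No Angoff judgments. It is a minimal illustration under simplifying assumptions: the judgment data are simulated, a dichotomous Rasch parameterization stands in for the article's partial credit model, and the 0.5-1.5 flagging range is only a common rule of thumb, not the criterion reported in the study.

```python
import numpy as np

def infit_outfit(observed, expected, variance):
    """Rasch infit/outfit mean-square statistics for one rater.

    observed : 0/1 Yes/No judgments across items
    expected : model-expected probabilities of a "Yes"
    variance : model variances, p * (1 - p) for dichotomous data
    """
    sq_resid = (observed - expected) ** 2
    outfit = np.mean(sq_resid / variance)        # unweighted mean square
    infit = np.sum(sq_resid) / np.sum(variance)  # information-weighted mean square
    return infit, outfit

# --- Hypothetical illustration (simulated data, assumed parameters) ---
rng = np.random.default_rng(2013)
n_items = 40
item_difficulty = rng.normal(0.0, 1.0, n_items)  # assumed Rasch item difficulties
rater_location = 0.3                             # assumed rater severity/leniency

# Probability the rater judges "a borderline examinee would answer correctly".
p_yes = 1.0 / (1.0 + np.exp(-(rater_location - item_difficulty)))
judgments = rng.binomial(1, p_yes)               # simulated Yes/No judgments

infit, outfit = infit_outfit(judgments, p_yes, p_yes * (1.0 - p_yes))
print(f"infit MS = {infit:.2f}, outfit MS = {outfit:.2f}")
# Mean squares far from 1.0 (e.g., outside roughly 0.5-1.5) would flag a rater
# whose judgments misfit the model and should be reviewed before the panel's
# judgments are aggregated into a cut score.
```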

Subject Classification    Social Sciences > Psychology
Social Sciences > Education
References
  1. 宋曜廷、周業太、曾芬蘭(2014)。十二年國民基本教育的入學考試與評量變革。教育科學研究期刊,59(1),1-32。
  2. 陳柏熹、邱佳民、曾芬蘭(2010)。高中職入學制度中在校成績採計校正方式之比較。教育科學研究期刊,55(2),115-139。
  3. 謝進昌、謝名娟、林世華、林陳涌、陳清溪、謝佩蓉(2011)。大型資料庫國小四年級自然科學習成就評量標準設定結果之效度評估。教育科學研究期刊,56(1),1-32。
  4. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
  5. Ang-Aw, H. T., & Goh, C. C. M. (2011). Understanding discrepancies in rater judgment on national-level oral examination tasks. RELC Journal, 42(1), 31-51.
  6. Brandon, P. R. (2004). Conclusions about frequently studied modified Angoff standard-setting topics. Applied Measurement in Education, 17(1), 59-88.
  7. Brennan, R. L. (Ed.) (2006). Educational measurement. Westport, CT: American Council on Education.
  8. Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12(3), 1-15.
  9. Cizek, G. J. (Ed.) (2001). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum Associates.
  10. Cizek, G. J. (Ed.) (2001). Standard setting: Concepts, methods, and perspectives. Mahwah, NJ: Lawrence Erlbaum Associates.
  11. Clauser, J. C. (2013). University of Massachusetts, Amherst, MA.
  12. Cuesta-Albertos, J. A., Gordaliza, A., & Matrán, C. (1997). Trimmed k-means: An attempt to robustify quantizers. The Annals of Statistics, 25(2), 553-576.
  13. Ferdous, A. A., & Plake, B. S. (2005). Understanding the factors that influence decisions of panelists in a standard-setting study. Applied Measurement in Education, 18(3), 257-267.
  14. Hein, S. F., & Skaggs, G. E. (2009). A qualitative investigation of panelists' experiences of standard setting using two variations of the bookmark method. Applied Measurement in Education, 22(3), 207-228.
  15. Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data mining. Data Mining and Knowledge Discovery, 2(3), 1-8.
  16. Huang, Z. (1997). Clustering large data sets with mixed numeric and categorical values. Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining.
  17. Impara, J. C., & Plake, B. S. (1998). Teachers' ability to estimate item difficulty: A test of the assumption in the Angoff standard setting method. Journal of Educational Measurement, 35(1), 69-81.
  18. Kaftandjieva, F. (2010). Methods for setting cut scores in criterion-referenced achievement tests: A comparative analysis of six recent methods with an application to tests of reading in EFL. Arnhem, The Netherlands: Cito/European Association for Language Testing and Assessment.
  19. Kane, M. T. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64(3), 425-461.
  20. Kane, M. T. (1987). On the use of IRT models with judgmental standard setting procedures. Journal of Educational Measurement, 24(4), 333-345.
  21. Khalid, M. N. (2011). Cluster analysis: A standard setting technique in measurement and testing. Journal of Applied Quantitative Method, 6(2), 46-58.
  22. Lin, Y.-H., Tseng, F.-L., & Sung, Y.-T. (2013). The development and application of the rater-effects-monitored Yes/No Angoff standard-setting method: Some preliminary results. Paper presented at the annual meeting of the International Conference on Standard-Based Assessment, Taipei, Taiwan.
  23. Linacre, J. M. (2002). What do infit and outfit, mean-square and standardized mean? Rasch Measurement Transactions, 16(2), 878.
  24. Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, IL: MESA Press.
  25. Lumley, T. (1998). Perceptions of language-trained raters and occupational experts in a test of occupational English language proficiency. English for Specific Purposes, 17(4), 347-367.
  26. Lunz, M. E., & Stahl, J. A. (1990). Judge consistency and severity across grading periods. Evaluation & the Health Professions, 13(4), 425-444.
  27. MacCann, R. G., & Stanley, G. (2006). The use of Rasch modeling to improve standard setting. Practical Assessment, Research & Evaluation, 11(2), 1-17.
  28. Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
  29. Orr, M. (2002). The FCE speaking test: Using rater reports to help interpret test scores. System, 30(2), 143-154.
  30. Pitoniak, M. J. (2003). University of Massachusetts, Amherst, MA.
  31. Plake, B. S., Melican, G. J., & Mills, C. N. (1991). Factors influencing intrajudge consistency during standard-setting. Educational Measurement: Issues and Practice, 10(2), 15-16.
  32. Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the latent structure of job performance ratings. Journal of Applied Psychology, 85, 956-970.
  33. Sireci, S. G., Hauger, J. B., Wells, C. S., Shea, C., & Zenisky, A. L. (2009). Evaluation of the standard setting on the 2005 Grade 12 National Assessment of Educational Progress Mathematics Test. Applied Measurement in Education, 22, 339-358.
  34. Smith, E. V., & Smith, R. M. (Eds.) (2004). Introduction to Rasch measurement: Theory, models and applications. Maple Grove, MN: JAM Press.
  35. Thorndike, R. L. (Ed.) (1971). Educational measurement. Washington, DC: American Council on Education.
  36. Timm, N. H. (2002). Applied multivariate analysis. New York, NY: Springer-Verlag.
  37. Trochim, W., Donnelly, J. P., & Arora, K. (2015). Research methods: The essential knowledge base. Belmont, CA: Wadsworth.
  38. U.S. Department of Education (2009). Evaluation of the National Assessment of Educational Progress: Study report. Washington, DC: Author.
  39. Violato, C., Marini, A., & Lee, C. (2003). A validity study of expert judgment procedures for setting cutoff scores on high-stakes credentialing examinations using cluster analysis. Evaluation & the Health Professions, 26(1), 59-72.
  40. Wang, N., Wiser, R. F., & Newman, L. S. (2001). Use of the Rasch IRT model in standard setting: An item mapping method. Paper presented at the annual meeting of the National Council on Measurement in Education, Seattle, WA.
  41. 吳宜芳、鄒慧英(2010)。試題呈現與回饋模式對Angoff標準設定結果一致性提升效益之比較研究。教育研究與發展,6(4),47-80。
  42. 林奕宏、曾芬蘭、王建雅(2016)。應用群聚分析法檢核國中教育會考標準設定結果之效度:以歷史科及國文科資料為例。慈濟大學教育研究學刊,13,39-60。
  43. 陳柏熹(2011)。心理與教育測驗:測驗編製理論與實務。新北市:精策教育。
  44. 黃馨瑩、謝名娟、謝進昌(2013)。臺灣學生學習成就評量英語科標準設定之效度評估研究。教育與心理研究,36(2),87-112。
  45. 謝進昌(2006)。精熟標準設定方法的歷史演進與詮釋的新概念。國民教育學報,16,157-193。
Cited by
  1. 謝進昌(2021)。以「補充性表現水平描述輔助自陳式測量構念」之延伸Angoff標準設定研究。教育心理學報,53(2),307-334。
  2. 謝進昌(2023)。「混合專家與學生實徵表現導向」大型教育評量標準設定之效度評估研究。教育科學研究期刊,68(2),1-35。
  3. 楊心怡,陳柏熹,吳昭容,吳宜玲(2021)。三至九年級學生數學運算能力等化測量與多向度分析。清華教育學報,38(2),111-150。