Title

四個試題反應模式的整體能力與領域能力估計精確性之比較研究

Parallel Title (English)

Comparing the Accuracy of Examinees' Overall and Domain Ability Estimates of Four IRT Models

Authors

吳俊賢(Chun-Hsien Wu);凃柏原(Bor-Yaun Twu)

Keywords

multidimensional random coefficients multinomial logit model (MRCMLM) ; higher-order item response model (HO-IRT) ; generalized subdimensional model ; domain abilities ; overall abilities

Journal Title

教育學誌

Volume/Issue (Publication Date)

Issue 39 (2018/05/01)

Pages

87 - 151

Language

Traditional Chinese

Chinese Abstract (translated)

This study compares the accuracy of the overall and domain ability estimates obtained from examinees' test performance under various simulation conditions by the generalized subdimensional model (GSM; Brandt, 2012), the multidimensional random coefficients multinomial logit model (MRCMLM; Adams, Wilson, & Wang, 1997), and the higher-order item response model (HO-IRT; de la Torre & Song, 2009). In generating the simulated data, three factors were manipulated: the number of domains (3 or 5), the number of items per domain (20 or 40), and the correlation between domain abilities (0.2, 0.5, or 0.8). Under each of the 12 resulting conditions, 30 replications were simulated, each containing the item responses of 1,000 examinees. These data were analyzed with each of the three models, using MCMC estimation, to obtain examinees' overall ability and their ability in each domain. The true-estimate correlation r_(ζζ̂), the RMSE, and the BIAS were used to evaluate the accuracy of each model's overall and domain ability estimates. In addition, the responses of 1,000 examinees to the five subjects of the first 2009 Basic Competence Test for Junior High School Students (BCTEST) were analyzed to estimate overall and domain abilities, and the differences among the estimated ability parameters were compared. The findings are as follows. (1) The simulation results showed that as the number of domains increased, as the correlations between domain abilities grew larger, and as the number of items per domain increased, all three models estimated overall and domain abilities quite accurately; moreover, the correlations between the estimated domain abilities were consistently higher than the true values set in the simulation. (2) Using the Bayesian DIC index to evaluate how well the three models fit the 2009 BCTEST data, the generalized subdimensional model fit best, followed by the MRCMLM and the HO-IRT model. (3) When the GSM, MRCMLM, and HO-IRT models were applied to the 2009 BCTEST data, the correlations between the estimated domain abilities ranged from .86 to .92, which are high correlations; the average correlations for the three models were 0.88, 0.90, and 0.90, respectively. Together with the results of a DETECT analysis, this suggests that the 2009 BCTEST data are close to essentially unidimensional.
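For reference, the evaluation and model-fit criteria named above are presumably the standard ones; the following is a minimal sketch of their usual definitions, with \zeta_i the true ability of examinee i, \hat{\zeta}_i its estimate, and N = 1000 examinees per replication (this notation is an assumption, not taken from the paper):

r_{\zeta\hat{\zeta}} = \mathrm{corr}(\zeta, \hat{\zeta}), \qquad \mathrm{RMSE} = \sqrt{\tfrac{1}{N}\sum_{i=1}^{N}(\hat{\zeta}_i - \zeta_i)^2}, \qquad \mathrm{BIAS} = \tfrac{1}{N}\sum_{i=1}^{N}(\hat{\zeta}_i - \zeta_i),

\mathrm{DIC} = \bar{D} + p_D, \quad p_D = \bar{D} - D(\bar{\theta}) \quad \text{(smaller DIC indicates better fit)}.

In the HO-IRT framing of de la Torre and Song (2009), the domain abilities are, roughly, linear functions of a higher-order overall ability, \theta_{id} = \lambda_d \theta_i + \varepsilon_{id}, which is what allows overall and domain abilities to be estimated simultaneously.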

English Abstract

The purpose of this study is to compare the accuracy of the overall and domain ability estimates given by four multidimensional item response models, including the Generalized Subdimensional Model (GSM; Brandt, 2012), the MRCMLM (Adams, Wilson, & Wang, 1997), and the Higher-Order IRT Model (HO-IRT; de la Torre & Song, 2009). The number of domain abilities, the number of items in each domain, and the size of the correlation coefficients between domains were manipulated; under each combination of conditions, 1,000 examinees' response vectors were generated to form a multidimensional data set with simple structure. Ability and item parameters were calibrated using the MCMC algorithm. Additionally, the empirical data given by 1,000 examinees' responses on the 2009 Basic Competence Test (BCTEST) were also analyzed. The results are as follows. 1. In the simulation study, the accuracy of the overall ability estimates was better when the number of domain abilities was larger, the correlations among the domain traits were larger, or there were more items in each domain. In addition, the correlation coefficients between the estimated domain abilities were higher than those between the true abilities. 2. The values of the DIC index indicated that the GSM fit the BCTEST data best. 3. The average correlation coefficients between the domain abilities were 0.88, 0.90, and 0.90 for the GSM, MRCMLM, and HO-IRT, respectively. This result suggests that the BCTEST data are nearly essentially unidimensional.
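To make the simulation design concrete, here is a minimal sketch (not the authors' code; the function and variable names, the Rasch response function, and the placeholder scoring are assumptions) of how simple-structure data with correlated domain abilities can be generated and how the correlation, RMSE, and BIAS criteria are computed. In the study itself the GSM, MRCMLM, and HO-IRT models were fitted by MCMC; that step is only stubbed here with standardized domain proportion-correct scores.

import numpy as np

rng = np.random.default_rng(0)

def simulate_responses(n_examinees=1000, n_domains=3, items_per_domain=20, rho=0.5):
    """Simple-structure dichotomous data: each item loads on exactly one domain."""
    # Domain abilities from a multivariate normal with a common correlation rho
    # (the study used rho in {0.2, 0.5, 0.8}, 3 or 5 domains, and 20 or 40 items each).
    cov = np.full((n_domains, n_domains), rho)
    np.fill_diagonal(cov, 1.0)
    theta = rng.multivariate_normal(np.zeros(n_domains), cov, size=n_examinees)

    blocks = []
    for d in range(n_domains):
        b = rng.normal(0.0, 1.0, size=items_per_domain)      # item difficulties
        p = 1.0 / (1.0 + np.exp(-(theta[:, [d]] - b)))        # Rasch probabilities
        blocks.append(rng.binomial(1, p))                     # dichotomous responses
    return theta, np.hstack(blocks)

def accuracy(theta_true, theta_hat):
    """Per-domain true-estimate correlation, RMSE, and BIAS."""
    r = np.array([np.corrcoef(theta_true[:, d], theta_hat[:, d])[0, 1]
                  for d in range(theta_true.shape[1])])
    rmse = np.sqrt(np.mean((theta_hat - theta_true) ** 2, axis=0))
    bias = np.mean(theta_hat - theta_true, axis=0)
    return r, rmse, bias

theta, X = simulate_responses()
# Placeholder for the MCMC-estimated domain abilities from GSM / MRCMLM / HO-IRT:
# standardized domain proportion-correct scores.
n, d = theta.shape
scores = X.reshape(n, d, -1).mean(axis=2)
theta_hat = (scores - scores.mean(axis=0)) / scores.std(axis=0)
r, rmse, bias = accuracy(theta, theta_hat)
print("r:", r.round(3), "RMSE:", rmse.round(3), "BIAS:", bias.round(3))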

Subject Classification: Social Sciences > Education
References
  1. 郭伯臣、謝典佑、吳慧珉、林佳樺(2012)。一因子高層次試題反應理論模式之評估。測驗學刊,59(3),329-348。
  2. (1997). Handbook of modern item response theory. New York, NY: Springer.
  3. Adams, R. J., Wilson, M., & Wang, W. C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21(1), 1-23.
  4. Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561-573.
  5. Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores. Reading, MA: Addison-Wesley Publishing Company.
  6. Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27(6), 395-414.
  7. Brandt, S. (2012). Definition and classification of a generalized subdimension model. Paper presented at the 2012 annual conference of the National Council on Measurement in Education (NCME), Vancouver, BC.
  8. Brandt, S. (2008). Estimation of a Rasch model including subdimensions. IERI monograph series: Issues and methodologies in large-scale assessments.
  9. Brandt, S., & Duckor, B. (2013). Increasing unidimensional measurement precision using a multidimensional item response model approach. Psychological Test and Assessment Modeling, 55(2), 148.
  10. Cao, J., & Stokes, S. L. (2008). Bayesian IRT guessing models for partial guessing behaviors. Psychometrika, 73(2), 209-230.
  11. Childs, R. A., Elgie, S., Gadalla, T., Traub, R., & Jaciw, A. P. (2004). IRT-linked standard errors of weighted composites. Practical Assessment, Research & Evaluation, 9(13).
  12. Christoffersson, A. (1975). Factor analysis of dichotomized variables. Psychometrika, 40(1), 5-32.
  13. de Ayala, R. J. (2013). Theory and practice of item response theory. New York, NY: Guilford Publications.
  14. de la Torre, J., & Hong, Y. (2010). Parameter estimation with small sample size: A higher-order IRT model approach. Applied Psychological Measurement, 34(4), 267-285.
  15. de la Torre, J., & Song, H. (2009). Simultaneous estimation of overall and domain abilities: A higher-order IRT model approach. Applied Psychological Measurement, 33(8), 620-639.
  16. de la Torre, J., Song, H., & Hong, Y. (2011). A comparison of four methods of IRT subscoring. Applied Psychological Measurement, 35(4), 296-316.
  17. Duckor, B. M. (2006). Measuring measuring: An item response theory approach. Paper presented at the International Objective Measurement Workshop, Berkeley, CA.
  18. Duckor, B., Draney, K., & Wilson, M. (2009). Measuring measuring: Toward a theory of proficiency with the constructing measures framework. Journal of Applied Measurement, 10(3), 296-319.
  19. Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
  20. Fischer, G. H. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37(6), 359-374.
  21. Gulliksen, H., & Wilks, S. S. (1950). Regression tests for several samples. Psychometrika, 15, 91-114.
  22. Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer-Nijhoff Publishing.
  23. Kaplan, A. (Ed.). (2004). The Sage handbook of quantitative methodology for the social sciences. Thousand Oaks, CA: Sage Publications.
  24. Kim, H. (1994). Urbana-Champaign, IL: University of Illinois.
  25. Klein Entink, R. H., Fox, J. P., & van der Linden, W. J. (2009). A multivariate multilevel approach to the modeling of accuracy and speed of test takers. Psychometrika, 74(1), 21-48.
  26. Li, Y., Bolt, D. M., & Fu, J. (2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30(1), 3-21.
  27. Linn, R. L. (Ed.). (1989). Educational measurement. New York, NY: Macmillan.
  28. Linn, R. L. (Ed.). (1989). Educational measurement. Phoenix, AZ: The Oryx Press.
  29. Lord, F. M. (1952). A theory of test scores. Psychometric Monograph, 7.
  30. Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley Publishing Company.
  31. Lunn, D. J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS - a Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10(4), 325-337.
  32. Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149-174.
  33. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut.
  34. Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9(4), 401-412.
  35. Reckase, M. D., & McKinley, R. L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement, 15(4), 361-373.
  36. Rijmen, F., & Briggs, D. C. (2004). Multiple person dimensions and latent item predictors. In Explanatory item response models: A generalized linear and nonlinear approach. New York, NY: Springer.
  37. Rudner, L. M. (2001). Informed test component weighting. Educational Measurement: Issues and Practice, 20(1), 16-19.
  38. Sheng, Y., & Wikle, C. K. (2008). Bayesian multidimensional IRT models with a hierarchical structure. Educational and Psychological Measurement, 68, 413-430.
  39. Sinharay, S., Puhan, G., & Haberman, S. J. (2011). An NCME instructional module on subscores. Educational Measurement: Issues and Practice, 30(3), 29-40.
  40. Swygert, K. A., McLeod, L. D., & Thissen, D. (2001). Factor analysis for items or testlets scored in more than two categories. In Test scoring. Mahwah, NJ: Lawrence Erlbaum Associates.
  41. Sympson, J. B. (1978). A model for testing with multidimensional items. In Proceedings of the 1977 Computerized Adaptive Testing Conference. Minneapolis, MN.
  42. van der Linden, W. J., Klein Entink, R. H., & Fox, J. P. (2010). IRT parameter estimation with response times as collateral information. Applied Psychological Measurement, 34(5), 327-347.
  43. Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6(2), 103-118.
  44. Wang, M. W., & Stanley, J. C. (1970). Differential weighting: A review of methods and empirical studies. Review of Educational Research, 40(5), 663-705.
  45. Whitely, S. E. (1980). Multicomponent latent trait models for ability tests. Psychometrika, 45(4), 479-494.
  46. Yao, L. (2010). Reporting valid and reliable overall scores and domain scores. Journal of Educational Measurement, 47(3), 339-360.
  47. Yao, L. (2012). Multidimensional CAT item selection methods for domain scores and composite scores: Theory and applications. Psychometrika, 77(3), 495-523.
  48. Zhang, J. M., & Stout, W. (1999). The theoretical DETECT index of dimensionality and its application to approximate simple structure. Psychometrika, 64(2), 213-249.
  49. 凃柏原、盧思丞(2012)。成就測驗組合分數議題探討。教育研究學報,46(1),119-137。