
Applying the Mantel-Haenszel Method to Exploratory Differential Item Functioning Assessment in Continuous Response Items






陳俊宏(Jyun-Hong Chen);趙秀怡(Hsiu-Yi Chao);施慶麟(Ching-Lin Shih)


Mantel-Haenszel method ; differential item functioning ; continuous response items ; scale purification




Vol. 60, No. 4 (2018/12/01)


pp. 217-231




Differential item functioning (DIF) assessment is critical for ensuring test validity and fairness. Many DIF assessment methods have been proposed over the past several decades, including the Mantel-Haenszel (MH) method. These methods have been applied to discrete response items, but no research has addressed DIF assessment for continuous response items. Given the popularity of the MH method in practice, this study applies its continuous counterpart, the MHC method proposed by Rayner and Best (2012), to assess DIF in continuous response items. A scale purification (SP) procedure is further incorporated to improve the performance of the MHC method. In the simulation results, the MHC method with the SP procedure yielded high power while keeping Type I error rates well controlled. Because the MHC method is easy to implement together with the SP procedure, test practitioners are encouraged to use it for DIF assessment to improve the quality of tests with continuous response items.
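To make the idea of a stratified test combined with scale purification concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not give the MHC formula, so the sketch swaps in a stratified mean-score statistic in the style of Mantel (1963) as a stand-in for the Rayner and Best (2012) statistic. The function names, the five quantile strata, the 0.05 critical value, and the choice to keep the studied item in the matching score are all illustrative assumptions.

```python
import numpy as np

CHI2_CRIT_1DF = 3.841  # 0.05 critical value of chi-square with 1 df (assumed alpha)


def match_strata(scores, n_strata=5):
    """Bin matching scores into quantile-based strata (thick matching)."""
    cuts = np.quantile(scores, np.linspace(0, 1, n_strata + 1)[1:-1])
    return np.digitize(scores, cuts)


def stratified_mean_score_stat(y, group, strata):
    """Mantel (1963)-style statistic for a continuous response.

    Compares the focal group's response sum in each stratum with its
    expectation under random group labels, pooled across strata.
    This is a stand-in, not claimed to equal the MHC statistic.
    """
    diff, var = 0.0, 0.0
    for s in np.unique(strata):
        in_s = strata == s
        ys, gs = y[in_s], group[in_s]
        n, n_f = ys.size, int(gs.sum())
        n_r = n - n_f
        if n < 2 or n_f == 0 or n_r == 0:
            continue  # stratum carries no group comparison
        diff += ys[gs == 1].sum() - n_f * ys.mean()
        var += n_r * n_f * ((ys - ys.mean()) ** 2).sum() / (n * (n - 1))
    return diff ** 2 / var if var > 0 else 0.0


def purify_and_flag(responses, group, n_strata=5, max_iter=10):
    """Iterative scale purification: items flagged in one pass are dropped
    from the matching score in the next, until the flags stop changing."""
    n_items = responses.shape[1]
    flagged = np.zeros(n_items, dtype=bool)
    for _ in range(max_iter):
        new_flagged = np.zeros(n_items, dtype=bool)
        for j in range(n_items):
            keep = ~flagged
            keep[j] = True  # assumption: studied item stays in the matching score
            strata = match_strata(responses[:, keep].sum(axis=1), n_strata)
            stat = stratified_mean_score_stat(responses[:, j], group, strata)
            new_flagged[j] = stat > CHI2_CRIT_1DF
        if np.array_equal(new_flagged, flagged):
            break
        flagged = new_flagged
    return flagged
```

Here `responses` is an examinees-by-items array of continuous scores and `group` is a 0/1 array with 1 marking the focal group. `purify_and_flag(responses, group)` returns one boolean flag per item.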



Subject classification: Social Sciences > Psychology
  1. Almenberg, J.,Dreber, A.(2011).When does the price affect the taste? Results from a wine experiment.Journal of Wine Economics,6,111-121.
  2. Chang, H.-H.,Mazzeo, J.,Roussos, J.(1996).Detecting DIF for polytomously scored items: An adaptation of the SIBTEST procedure.Journal of Educational Measurement,33,333-353.
  3. Chen, J.-H.,Chen, C.-T.,Shih, C.-L.(2014).Improving the control of type I error rate in assessing differential item functioning for hierarchical generalized linear model when impact is presented.Applied Psychological Measurement,38,18-36.
  4. Clauser, B.,Mazor, K.,Hambleton, R. K.(1993).The effects of purification of matching criterion on the identification of DIF using the Mantel-Haenszel procedure.Applied Measurement in Education,6,269-279.
  5. Davis, C. S.(2002).Statistical methods for the analysis of repeated measurements.New York, NY:Springer.
  6. DeMars, C. E.(2009).Modification of the Mantel-Haenszel and logistic regression DIF procedures to incorporate the SIBTEST regression correction.Journal of Educational and Behavioral Statistics,34,149-170.
  7. DeMars, C. E.(2010).Type I error inflation for detecting DIF in the presence of impact.Educational and Psychological Measurement,70,961-972.
  8. Donoghue, J. R.,Allen, N. L.(1993).Thin versus thick matching in the Mantel-Haenszel procedure for detecting DIF.Journal of Educational Statistics,18,131-154.
  9. Dorans, N. J.,Holland, P. W.(1993).DIF detection and description: Mantel-Haenszel and standardization.Differential item functioning,Hillsdale, NJ:
  10. Douglas, J. A.,Roussos, L. A.,Stout, W.(1996).Item-bundle DIF hypothesis testing: Identifying suspect bundles and assessing their differential functioning.Journal of Educational Measurement,33,465-484.
  11. Fariello, J. Y.,Whitmore, K. E.(2013).Clinical evaluation and diagnosis of bladder pain syndrome.Bladder pain syndrome: A guide for clinicians,New York, NY:
  12. Ferrando, P. J.(2003).A kernel density analysis of continuous typical-response scales.Educational and Psychological Measurement,63,809-824.
  13. Ferrando, P. J.(2004).Person reliability in personality measurement: An item response theory analysis.Applied Psychological Measurement,28,126-140.
  14. Ferrando, P. J.(2010).Some statistics for assessing person-fit based on continuous-response models.Applied Psychological Measurement,34,219-237.
  15. Ferrando, P. J.(2002).Theoretical and empirical comparisons between two models for continuous item responses.Multivariate Behavioral Research,37,521-542.
  16. Fidalgo, A. M.(2011).GMHDIF: A computer program for detecting DIF in dichotomous and polytomous items using generalized Mantel-Haenszel statistics.Applied Psychological Measurement,35,247-249.
  17. Fidalgo, A. M.,Ferreres, D.,Muñiz, J.(2004).Utility of the Mantel-Haenszel procedure for detecting differential item functioning in small samples.Educational and Psychological Measurement,64,925-936.
  18. Fidalgo, A. M.,Madeira, J. M.(2008).Generalized Mantel-Haenszel methods for differential item functioning detection.Educational and Psychological Measurement,68,940-958.
  19. French, B. F.,Finch, W. H.(2013).Extensions of Mantel-Haenszel for multilevel DIF detection.Educational and Psychological Measurement,73,648-671.
  20. Freyd, M.(1923).The graphic rating scale.Journal of Educational Psychology,14,83-102.doi:10.1037/h0074329
  21. Holland, P. W.,Thayer, D. T.(1985).An alternative definition of ETS delta scale of item difficulty.Princeton, NJ:Educational Testing Service.
  22. Jodoin, M. G.,Gierl, M. J.(2001).Evaluating type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection.Applied Measurement in Education,14,329-349.
  23. Kamata, A.(2001).Item analysis by the hierarchical generalized linear model.Journal of Educational Measurement,38,79-93.
  24. Kan, A.(2009).Effect of scale response format on psychometric properties in teaching self-efficacy.Eurasian Journal of Educational Research,34,215-218.
  25. Kopf, J.,Zeileis, A.,Strobl, C.(2015).Anchor selection strategies for DIF analysis: Review, assessment, and new approaches.Educational and Psychological Measurement,75,22-56.
  26. Lord, F. M.(Ed.),Novick, M. R.(Ed.)(1968).Statistical theories of mental test scores.Reading, MA:Addison-Wesley.
  27. Magis, D.,De Boeck, P.(2014).Type I error inflation in DIF identification with Mantel-Haenszel: An explanation and a solution.Educational and Psychological Measurement,74,713-728.
  28. Mantel, N.(1963).Chi-square tests with one degree of freedom; Extensions of the Mantel-Haenszel procedure.Journal of the American Statistical Association,58,690-700.
  29. Mantel, N.,Haenszel, W.(1959).Statistical aspects of the analysis of data from retrospective studies of disease.Journal of the National Cancer Institute,22,719-748.
  30. Mazor, K. M.,Clauser, B. E.,Hambleton, R. K.(1994).Identification of nonuniform differential item functioning using a variation of the Mantel-Haenszel procedure.Educational and Psychological Measurement,54,284-291.
  31. Musangu, L. M.,Kekwaletswe, R. M.(2012).Comparison of Likert scale with visual analogue scale for strategic information systems planning measurements: A preliminary study.Proceedings of the IADIS International Conference Information Systems,Berlin, Germany:
  32. Narayanan, P.,Swaminathan, H.(1994).Performance of the Mantel-Haenszel and simultaneous item bias procedures for detecting differential item functioning.Applied Psychological Measurement,18,315-328.
  33. Noel, Y.,Dauvier, B.(2007).A beta item response model for continuous bounded responses.Applied Psychological Measurement,31,47-73.
  34. Parkin, D.,Devlin, N.(2006).Is there a case for using visual analogue scale valuations in cost-utility analysis?.Health Economics,15,653-664.
  35. Penfield, R. D.(2005).DIFAS: Differential item functioning analysis system.Applied Psychological Measurement,29,150-151.
  36. Pine, S. M.(1977).Applications of item characteristic curve theory to the problem of test bias.Applications of computerized adaptive testing: Proceedings of a symposium presented at the 18th Annual Convention of the Military Testing Association,Minneapolis, MN:
  37. Rayner, J. C.,Best, D.(2012).Continuous analogues of Cochran-Mantel-Haenszel statistics.Unpublished manuscript, Centre for Statistical and Survey Methodology, University of Wollongong, New South Wales, Australia.
  38. Rogers, H. J.,Swaminathan, H.(1993).A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning.Applied Psychological Measurement,17,105-116.
  39. Samejima, F.(1969).Estimation of latent ability using a response pattern of graded scores.Richmond, VA:Psychometric Society.
  40. Samejima, F.(1973).Homogeneous case of the continuous response model.Psychometrika,38,203-219.
  41. Shealy, R.,Stout, W.(1993).A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF.Psychometrika,58,159-194.
  42. Shih, C.-L.,Wang, W.-C.(2009).Differential item functioning detection using the multiple indicators, multiple causes method with a pure short anchor.Applied Psychological Measurement,33,184-199.
  43. Shojima, K.(2005).A noniterative item parameter solution in each EM cycle of the continuous response model.Educational Technology Research,28,11-22.
  44. Shojima, K.(2003).Linking tests under the continuous response model.Behaviormetrika,30,155-171.
  45. Swaminathan, H.,Rogers, H. J.(1990).Detecting differential item functioning using logistic regression procedures.Journal of Educational Measurement,27,361-370.
  46. Wainer, H.(Ed.),Braun, H. I.(Ed.)(1988).Test validity.Hillsdale, NJ:Lawrence Erlbaum.
  47. Waller, N. G.(1998).EZDIF: Detection of uniform and nonuniform differential item functioning with the Mantel-Haenszel and logistic regression procedures.Applied Psychological Measurement,22,391.
  48. Wang, T.,Zeng, L.(1998).Item parameter estimation for a continuous response model using an EM algorithm.Applied Psychological Measurement,22,333-344.
  49. Wang, W.-C.(2004).Effects of anchor item methods on the detection of differential item functioning within the family of Rasch models.Journal of Experimental Education,72,221-261.
  50. Wang, W.-C.,Shih, C.-L.(2010).MIMIC methods for assessing differential item functioning in polytomous items.Applied Psychological Measurement,34,166-180.
  51. Wang, W.-C.,Shih, C.-L.,Sun, G.-W.(2012).The DIFfree-then-DIF strategy for the assessment of differential item functioning.Educational and Psychological Measurement,72,687-708.
  52. Wang, W.-C.,Shih, C.-L.,Yang, C.-C.(2009).The MIMIC method with scale purification for detecting differential item functioning.Educational and Psychological Measurement,69,713-731.
  53. Wang, W.-C.,Su, Y.-H.(2004).Effects of average signed area between two item characteristic curves and test purification procedures on the DIF detection via the Mantel-Haenszel method.Applied Measurement in Education,17,113-144.
  54. Zopluoglu, C.(2013).A comparison of two estimation algorithms for Samejima's continuous IRT model.Behavior Research Methods,45,54-64.
  55. Zopluoglu, C.(2012).EstCRM: An R package for Samejima's continuous IRT model.Applied Psychological Measurement,36,149-150.
  56. Zwick, R.,Thayer, D. T.(1996).Evaluating the magnitude of differential item functioning in polytomous items.Journal of Educational and Behavioral Statistics,21,187-201.
  1. 鄧鈞文,陳俊瑋,林仁傑(2019).Gender differential item functioning (DIF) in a mathematics achievement test: Evidence from Taiwan Assessment of Student Achievement data.教育科學期刊,18(1),71-91.