Test Operations

Test Form Construction

Test Composition

Standard 4.12 Test developers should document the extent to which the content domain of a test represents the domain defined in the test specifications.

Standard 4.7 The procedures used to develop, review, and try out items and to select items from the item pool should be documented. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Each test form is created to match specifications regarding the number and type of items (multiple-choice, open-response) within content subareas. Each test form includes a percentage of items designated as non-scorable, so that data can be collected on their psychometric characteristics before they contribute to candidate scores. These items are made scorable on future test forms only after an item analysis of responses is conducted and the items are found to meet acceptable psychometric criteria. Information regarding test composition is available to candidates and others in the Test Objectives. Below is a sample of test composition information.

Test objectives: weighting by number of questions per subarea

Subarea                                        Range of Objectives    Approximate Test Weighting
Multiple-Choice
  Literature and Language                      01–09                  51%
  Rhetoric and Composition                     10–12                  17%
  Reading Theory, Research, and Instruction    13–14                  12%
  Subtotal, Multiple-Choice                                           80%
Open-Response*
  Integration of Knowledge and Understanding   15                     20%

*The open-response items may relate to topics covered in any of the subareas.

Relationships Among Test Forms

One or more forms of each test are administered at a time. New forms are typically constructed after a sufficient number of candidates have responded to the items on a previous test form and the previous test form can be used for equating. Typically, the scorable multiple-choice items in the new form are embedded in the previous form, either as scorable or non-scorable items. The creation of new forms continues until a pool of forms has been created for rotation. In order to establish continuity and consistency across different forms of a test, the following relationships among test forms are maintained as subsequent test forms are constructed:

  • Content relationships. Each form of the test is constructed to be comparable to previous test forms with respect to content coverage. This is accomplished by selecting items for the test according to the proportions provided in the test objectives for the field.
  • Statistical relationships. For fields for which performance data are available (most fields), new test forms are constructed to be comparable to the previous test form in estimated overall test difficulty. The overall test form difficulty is determined by averaging item p-values (percent of candidates answering the item correctly) for the scorable items on the form, obtained from operational administrations when possible, or from pilot test administrations before operational p-values are available. In addition, the test results for each form are statistically equated to those of the previous form to enable comparability of passing decisions across administrations. See Test Equating for further information. (A brief sketch of the difficulty comparison appears after this list.)
  • Relationships among open-response items. Sets of open-response items are designed to be comparable to one another. For example, if a test has two open-response items, the items in the bank for Open-Response Item Type #1 are designed to be comparable in regard to the difficulty level and the performance characteristics measured. Comparability of the open-response items across test forms is established through several activities, including preparation of item specifications for creating multiple items of a type, pilot testing of items, establishing marker responses that exemplify the score points, and the training and calibration of scorers. See Establishing Comparability of Open-Response Items for further information.
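The difficulty comparison described under "Statistical relationships" can be illustrated with a brief sketch. This is a minimal illustration, not the operational procedure; the item p-values and the comparability tolerance shown are hypothetical.

```python
# Minimal sketch: estimate overall form difficulty as the mean item p-value
# (proportion of candidates answering each scorable item correctly).
# The p-values and the 0.03 tolerance below are hypothetical examples.

def mean_p_value(p_values):
    """Average the item p-values for the scorable items on a form."""
    return sum(p_values) / len(p_values)

previous_form = [0.72, 0.65, 0.81, 0.58, 0.77]   # hypothetical operational p-values
new_form      = [0.70, 0.68, 0.79, 0.60, 0.74]   # hypothetical pilot/operational p-values

difference = abs(mean_p_value(new_form) - mean_p_value(previous_form))
print(f"Estimated difficulty difference: {difference:.3f}")
if difference <= 0.03:  # hypothetical tolerance for "comparable" difficulty
    print("New form is comparable to the previous form in estimated overall difficulty.")
```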

Test Administration

Standard 6.1 Test administrators should follow carefully the standardized procedures for administration and scoring specified by the test developer and any instructions from the test user.

Standard 6.3 Changes or disruptions to standardized test administration procedures or scoring should be documented and reported to the test user.

Standard 6.4 The testing environment should furnish reasonable comfort with minimal distractions to avoid construct-irrelevant variance. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

The MTEL tests are administered under standardized, controlled procedures at computer-based and (until July 2015) paper-based test sites. As of fall 2014, 31 of the 40 tests in the MTEL program were administered on computer at Pearson-authorized test centers, with the remaining tests to be administered on computer by fall 2015.

Test administrations are designed to provide a secure, controlled testing environment with minimal distractions so as to minimize the possibility of irrelevant characteristics affecting candidates' scores. Test takers are monitored continuously throughout the test administration. Test sites adhere to guidelines relating to test security, accessibility, lighting, workspace, comfort, and quiet surroundings. Test administrators follow documented, standardized procedures for test administration. Procedures are in place for the documentation, review, and resolution of any deviations from standard administration procedures.

Test Security

Standard 6.6 Reasonable efforts should be made to ensure the integrity of test scores by eliminating opportunities for test takers to attain scores by fraudulent or deceptive means. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Candidate security. In order to eliminate opportunities for test-taking fraud, stringent procedures for identification of candidates are in place, including requiring government-issued photograph-bearing identification in order to be admitted to the test, as well as fingerprinting, palm vein scanning, or other biometric requirements. Before testing, candidates agree to policies regarding prohibited materials and activities, which are strictly enforced at test administrations.

Test form security. In order to minimize the possibility of breaches in test security and to help ensure the integrity of candidate test scores, a number of guidelines, as indicated below, are followed related to the construction and administration of MTEL test forms. These guidelines are followed as allowed by sufficient candidate numbers; exceptions may be made for test fields taken by few candidates.

  • Multiple-choice items are re-ordered within a subarea from one candidate to another when administered by computer so that candidates taking the same test form within a testing period receive the items in a different order. Additionally, a candidate assigned the same test form in different testing periods receives the multiple-choice items in a different order.
  • Multiple test forms are administered within a testing period for most fields. Additionally, for most fields, there are multiple sets of open-response items that are assigned to candidates within a testing period separate from the multiple-choice test form assigned. Therefore, different candidates typically receive different test items within a testing period.
  • Test forms and open-response items for most fields are changed from one testing period to another so that most candidates who retest in an adjacent testing period receive different sets of test items.

Information to Test Takers

Standard 8.2 Test takers should be provided in advance with as much information about the test, the testing process, the intended test use, test scoring criteria, testing policy, availability of accommodations, and confidentiality protection as is consistent with obtaining valid responses and making appropriate interpretations of test scores.

Standard 4.16 The instructions presented to test takers should contain sufficient detail so that test takers can respond to a task in the manner that the test developer intended. When appropriate, sample materials, practice or sample questions, criteria for scoring, and a representative item identified with each item format or major area in the test's classification or domain should be provided to the test takers prior to the administration of the test, or should be included in the testing material as part of the standard administration instructions.

Standard 6.5 Test takers should be provided appropriate instructions, practice, and other support necessary to reduce construct-irrelevant variance. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

The MTEL website provides program and test information to test takers, educator preparation program providers, and the public. It provides information and guidance useful to candidates before, during, and after testing, including the following:

A "Contact Us" page includes email, phone, fax, and mail information for candidates who may have questions.

Testing Accommodations

Standard 3.9 Test developers and/or test users are responsible for developing and providing test accommodations, when appropriate and feasible, to remove construct-irrelevant barriers that otherwise would interfere with examinees' ability to demonstrate their standing on the target constructs.

Standard 6.2 When formal procedures have been established for requesting and receiving accommodations, test takers should be informed of these procedures in advance of testing. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Alternative testing arrangements for the MTEL are available upon request to candidates who provide appropriate documentation of a disability. The alternative arrangements are designed to provide accommodations for the test and/or administration conditions to remove construct-irrelevant barriers and enable the accurate assessment of the knowledge and skills that are being measured. Construct-irrelevant barriers include those obstacles to accessibility (e.g., text size) that impede a candidate from demonstrating his or her ability on the constructs the test is intended to measure. Candidates are accommodated on a case-by-case basis according to the alternative arrangement(s) needed and are not restricted to a pre-determined list.

Accommodations are requested, reviewed, and provided according to standardized procedures, as described on the MTEL website. Candidates who are granted accommodations are notified in writing of the alternative arrangements. Test administrators are notified of the accommodations and provided with instructions regarding any changes to testing procedures.

Scoring

Standard 6.8 Those responsible for test scoring should establish scoring protocols. Test scoring that involves human judgment should include rubrics, procedures, and criteria for scoring. When scoring of complex responses is done by computer, the accuracy of the algorithm and processes should be documented.

Standard 6.9 Those responsible for test scoring should establish and document quality control processes and criteria. Adequate training should be provided. The quality of scoring should be monitored and documented. Any systematic source of scoring errors should be documented and corrected. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Multiple-Choice Item Scoring

Multiple-choice items for the MTEL are scored by computer using the test form answer key, with a single point awarded for each correct response and no points awarded for an incorrect response. The raw score for the multiple-choice section is the total number of multiple-choice items answered correctly. The raw score is transformed to a scale ranging from 100 to 300.

Open-Response and Short-Answer Item Scoring

Standard 4.20 The process for selecting, training, qualifying, and monitoring scorers should be specified by the test developer. The training materials, such as the scoring rubrics and examples of test takers' responses that illustrate the levels on the rubric score scale, and the procedures for training scorers should result in a degree of accuracy and agreement among scorers that allows the scores to be interpreted as originally intended by the test developer. Specifications should also describe processes for assessing scorer consistency and potential drift over time in raters' scoring. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

MTEL open-response and short-answer items are scored under secure conditions, typically in scoring sessions conducted after designated administration periods. Scorers are unaware of the identity of the individuals whose responses they score. Open-response items (i.e., performance assignments) are scored using a focused holistic scoring process. Short-answer items (items eliciting a short response that is worth up to 2 points) are scored using a model focusing on correct/incorrect elements of the response. Candidate responses are typically scored independently by two scorers according to a scoring scale, with additional scoring by a third scorer or a Chief Reader as needed. For some lower-incidence fields, scorers may first independently score a response, then reach consensus as a group on the assigned score.

Focused holistic scoring. In focused holistic scoring, scorers judge the overall effectiveness of each response using a set of performance characteristics that have been defined as important aspects of a quality response (e.g., focus and unity, organization). The score is holistic in that each score is based on the overall effectiveness of these characteristics working together, focusing on the response as a whole. Scorers use an approved, standardized scoring scale (based on the performance characteristics) and approved marker responses exemplifying the points on the scoring scale to assign scores to candidate responses. The performance characteristics and scoring scale are available to candidates and others in the test information guides on the MTEL program website.

Candidate responses are scored on a scale with a low of "1" and a high of "4" (with a separate code for blank or unscorable responses, such as responses that are not written or spoken in the required language, or that are completely off topic). Candidate responses are independently scored by two scorers, and their two scores are summed for a total possible score range of 2 to 8 for each open-response item. Scores for a response that differ by more than 1 point are considered discrepant and are resolved by further readings, typically either by a third scorer or by the Chief Reader.
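The two-scorer rule described above can be sketched as follows. The resolution step shown for discrepant scores (combining the third reading with the closer of the two original scores) is a hypothetical simplification; in practice, resolution is handled by a third scorer or the Chief Reader according to program procedures.

```python
# Minimal sketch of the two-scorer rule described above. Each scorer assigns
# 1-4; the two scores are summed (2-8 per item). A gap of more than 1 point is
# discrepant and routed for resolution. The resolution shown here (keeping the
# original score closer to the third reading) is a hypothetical simplification.

def score_open_response(score_a, score_b, third_reading=None):
    for s in (score_a, score_b):
        if s not in (1, 2, 3, 4):
            raise ValueError("Scores must be on the 1-4 scale (blank/unscorable responses are coded separately).")
    if abs(score_a - score_b) <= 1:
        return score_a + score_b          # total on the 2-8 scale
    if third_reading is None:
        raise ValueError("Discrepant scores require a further reading (third scorer or Chief Reader).")
    closer = min((score_a, score_b), key=lambda s: abs(s - third_reading))
    return closer + third_reading

print(score_open_response(3, 4))      # adjacent scores: total of 7
print(score_open_response(2, 4, 3))   # discrepant scores resolved with a third reading
```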

Short-answer scoring. In short-answer scoring, scorers use a standardized, approved scoring scale focusing on correct/incorrect elements of the response. This method is used for scoring MTEL items intended to elicit a short response to a specific question or prompt, including language structure items in some language tests, and sentence-correction items in the Communication and Literacy Skills test. Candidate responses are independently scored by two scorers, and item scores that differ are resolved by further readings, typically either by a third scorer or by the Chief Reader. Short-answer items are worth either 1 or 2 points and are included in scores of the multiple-choice section of the test forms.

Scorer selection criteria. Scorers for the open-response items are selected based on established qualification criteria for the MTEL program. While scorers' qualifications may vary depending on the types of items they score, typically scorers have the following qualifications:

  1. a current Massachusetts educator license or endorsement appropriate to the subject area for which they are scoring; AND
  2. a current district contract to teach in Massachusetts public schools, or a district contract to teach in Massachusetts public schools within the last two years; OR
  3. a position as a college or university educator who is directly responsible for preparing students in the subject area for which they are scoring.

Individuals are eligible to continue serving as scorers if they have participated successfully as a scorer at a previous scoring session and continue to participate in professional development, including scoring activities.

Scorer training, calibration, and monitoring. Before being allowed to score, scorers must successfully complete training and calibration activities. Scorers receive a scoring manual and are oriented to the task, and they practice scoring training responses to which scores have already been assigned. Scorers must demonstrate accuracy on scoring the responses before proceeding to score operational responses. Scorer performance is monitored throughout scoring sessions through the use of scorer performance reports, which are provided to scorers and supervisory personnel. At points in the scoring process, scorers are recalibrated to the scoring scale, typically through discussions of specific responses. Analyses of scores given by scorers to open-response items are generated and reviewed, including comparisons to previous administrations, in order to monitor the accuracy and consistency of scoring over time.
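Scorer monitoring of the kind described above is commonly summarized with exact- and adjacent-agreement rates between the two independent scorers. The sketch below illustrates such a summary; the paired scores are hypothetical, and the operational scorer performance reports may use additional or different statistics.

```python
# Illustrative sketch: summarize agreement between two independent scorers.
# The paired scores below are hypothetical examples on the 1-4 scale.

def agreement_rates(pairs):
    n = len(pairs)
    exact = sum(1 for a, b in pairs if a == b)
    adjacent = sum(1 for a, b in pairs if abs(a - b) == 1)
    discrepant = sum(1 for a, b in pairs if abs(a - b) > 1)
    return exact / n, adjacent / n, discrepant / n

pairs = [(3, 3), (2, 3), (4, 4), (1, 3), (2, 2)]   # (scorer 1, scorer 2)
exact, adjacent, discrepant = agreement_rates(pairs)
print(f"exact {exact:.0%}, adjacent {adjacent:.0%}, discrepant {discrepant:.0%}")
```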

Test Equating

Standard 5.13 When claims of form-to-form equivalence are based on equating procedures, detailed technical information should be provided on the method by which equating functions were established and on the accuracy of the equating functions.

Standard 5.12 A clear rationale and supporting evidence should be provided for any claim that scale scores earned on alternate forms of a test may be used interchangeably. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Purpose of equating. The central purpose of statistically equating the MTEL tests is to compensate statistically for possible variability from one test form to another that may affect candidates' scores (e.g., differences in the overall difficulty of a new test form compared to a previous test form). Each form of a test is constructed to be as comparable as possible to a previous test form in estimated overall test difficulty. See Relationships Among Test Forms for further information about the construction of test forms. Statistical equating is conducted as an additional step to enable comparability of passing decisions across administrations.

Statistical equating methods adjust a candidate's scaled score for the relative difficulty of the particular test form that was taken. Thus, differences in scores across test forms can be attributed to differences in knowledge or skills, and not differences in the tests.

Equating design. A single-group equating design is utilized for the computer-administered MTEL tests. In a single-group design, one group of candidates takes two alternative forms of the test and these forms are then statistically equated. Typically, a new form is created by selecting the set of scorable multiple-choice items from within the sets of scorable and non-scorable items on the previous form. Because the new scorable set of items is embedded within the previous test form, statistical equating can compare candidate performance on the previous form with what their performance would have been on the new test form (i.e., the new set of scorable items), and an equated passing score can be determined. This pre-equating methodology allows the passing score to be determined before administration of a new test form, eliminating the need to gather performance data on the new form from a sufficient number of candidates before their scores can be released.

Equating method. A linear pre-equating method is used within a classical test theory framework. In linear equating, two scores are equivalent if they are the same number of standard deviation units above or below the mean for some group of candidates (Angoff, 1984). A linear equation is used to equate the cutscores on the two forms by setting their standard deviation scores, or z-scores, to be equal (Kolen & Brennan, 2004).

Only multiple-choice items are included in the statistical equating of the MTEL tests. All of the linking items appear on a previous form as either scorable or non-scorable items and as scorable items on the new form. Response data solely from the group of test takers who took the previous form are used to compute both the means and standard deviations of the scorable items on the previous form and the scorable items on the new form. The z-scores for the two sets of scorable items are set to be equal. The raw score on the new scorable set of items that corresponds to a particular raw score (the cutscore) on the previous set of scorable items is calculated to establish the raw cutscore on the new form. See Formula for Equating for further information.
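The z-score relationship described above yields a simple closed form for the raw cutscore on the new set of scorable items. The sketch below illustrates it with hypothetical means, standard deviations, and previous-form cutscore; the operational computation is documented in Formula for Equating.

```python
# Minimal sketch of linear pre-equating: set the z-score of the new-form
# cutscore equal to the z-score of the previous-form cutscore,
#   (cut_new - mean_new) / sd_new = (cut_old - mean_old) / sd_old.
# All values below are hypothetical.

def equated_cutscore(cut_old, mean_old, sd_old, mean_new, sd_new):
    """Raw cutscore on the new set of scorable items."""
    z_old = (cut_old - mean_old) / sd_old
    return mean_new + z_old * sd_new

# Both sets of statistics come from the single group of test takers who took
# the previous form, in which the new scorable items were embedded.
cut_new = equated_cutscore(cut_old=70, mean_old=72.4, sd_old=9.8,
                           mean_new=71.1, sd_new=10.3)
print(f"Equated raw cutscore on the new form: {cut_new:.2f}")
```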

Establishing comparability of open-response items. The open-response items are "calibrated" prior to administration. The following methods are typically used for MTEL tests to establish the comparability of open-response items from test form to test form:

  • Scoring scales. For each type of open-response item, an approved, standardized scoring scale (with an associated set of performance characteristics) is used to assign scores to candidate responses. The scoring scale provides a written, standardized description of the "typical" response at each score point. The same scoring scale is used to score responses to the open-response items of a particular type across test administrations and across different test forms. The use of a standardized scoring scale helps ensure the comparability of scores assigned to different individual open-response items within each item type.
  • Marker responses. Based on the score-point descriptions in the scoring scale, a set of responses is established for each open-response item to serve as exemplars of each point on the scale. These marker responses are used in the training and calibration of scorers to help ensure that the standardized meaning of the approved scoring scale is applied accurately and consistently to candidate responses.
  • Statistical data review. Where pilot test data exist for 25 or more candidates, the comparability of the open-response items is established through statistical analysis of item performance. Item mean scores are compared, and only items whose scores are not statistically significantly different (as determined by a post hoc Tukey HSD analysis) are eligible for operational administration. Thus, the items that make up the open-response section of the test are considered to be interchangeable from a statistical point of view and are not a component of the equating process. (A brief illustration of this comparison appears after this list.)
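The statistical data review described in the last bullet can be illustrated with the sketch below. The pilot-test scores are hypothetical, and scipy.stats.tukey_hsd is used only as one convenient implementation of the post hoc Tukey HSD comparison.

```python
# Illustrative sketch of the post hoc comparison of open-response item means.
# Each list holds hypothetical pilot-test total scores (2-8 scale) from 25
# candidates for one item of the same item type.
from scipy.stats import tukey_hsd

item_a = [5, 6, 7, 5, 6, 8, 6, 5, 7, 6, 5, 6, 7, 6, 5, 6, 7, 8, 5, 6, 6, 7, 5, 6, 6]
item_b = [6, 5, 6, 7, 6, 6, 5, 7, 6, 6, 5, 6, 7, 5, 6, 6, 7, 6, 5, 6, 7, 6, 6, 5, 6]
item_c = [5, 6, 6, 5, 7, 6, 6, 5, 6, 7, 6, 5, 6, 6, 7, 5, 6, 6, 5, 7, 6, 6, 5, 6, 6]

result = tukey_hsd(item_a, item_b, item_c)
# Items whose pairwise comparisons are all non-significant (p >= .05) would be
# considered statistically comparable and eligible for operational use.
print(result.pvalue)
```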

Scaling Scores

Standard 5.2 The procedures for constructing scales used for reporting scores and the rationale for these procedures should be described clearly.

Standard 4.23 When a test score is derived from the differential weighting of items or subscores, the test developer should document the rationale and process used to develop, review, and assign item weights. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

The scores that are reported on the MTEL are "scaled" scores. Candidate scores are converted mathematically to a scale with a lower limit of 100, a passing score of 240, and an upper limit of 300. In this process, raw candidate scores as well as raw passing scores for the tests are converted to scaled scores. The use of scaled scores supports the communication of MTEL program results in the following ways:

  • Candidates, educator preparation institutions, and other stakeholders are able to interpret scores from different tests in a similar manner, regardless of test taken.
  • The meaning of the scaled passing scores is consistent over time, making it possible to compare performance from one administration (or one year) to the next.

Computation of scaled scores. Most MTEL tests consist of two sections: a multiple-choice item section (including any short-answer items scored correct/incorrect) and an open-response item section. For tests with two sections, scaled scores are computed separately for each section and then combined to determine the total test scaled score, according to the weights specified for each section (e.g., 80% for the multiple-choice section, 20% for the open-response section).

With this method, a candidate who answers all questions correctly on the multiple-choice section or receives all possible points on the open-response section receives a scaled score of 300 for that section. A candidate who answers correctly the number of multiple-choice items equal to the just-acceptable multiple-choice score or receives the just-acceptable open-response score will receive a scaled score of 240 for that section. See Formula for Determining Section Scores for further information.
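The two anchor points described above (the just-acceptable raw score mapping to 240 and the maximum raw score mapping to 300) can be illustrated with a short sketch. The treatment of raw scores below the just-acceptable score, shown here as a second linear segment anchored at 100, is a hypothetical simplification; the operational rule is given in Formula for Determining Section Scores.

```python
# Sketch of the section scaling anchors described above: the just-acceptable
# raw score maps to 240 and the maximum raw score maps to 300. The lower
# segment (0 maps to 100) is a hypothetical simplification.

def section_scaled_score(raw, just_acceptable, max_raw):
    if raw >= just_acceptable:
        # Linear segment from (just_acceptable, 240) to (max_raw, 300).
        return 240 + (raw - just_acceptable) * 60 / (max_raw - just_acceptable)
    # Hypothetical lower segment from (0, 100) to (just_acceptable, 240).
    return 100 + raw * 140 / just_acceptable

print(section_scaled_score(raw=100, just_acceptable=70, max_raw=100))  # 300.0
print(section_scaled_score(raw=70,  just_acceptable=70, max_raw=100))  # 240.0
print(section_scaled_score(raw=55,  just_acceptable=70, max_raw=100))  # below 240
```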

Combining scaled scores of test sections. For tests with two sections, the candidate's scaled section scores are combined based on the section weights that are approved for the test and communicated to candidates in the test objectives posted on the MTEL program website. For example, if a test has a weight of 80% for the multiple-choice section and 20% for the open-response section, a candidate's scaled scores for the two sections will be weighted accordingly and combined to determine a total test scaled score. See Formula for Combining Scaled Scores of Test Sections for further information. A candidate passes a test if the rounded total test scaled score is equal to or greater than 240.
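A brief sketch of the weighting and pass decision described above, using the example weights of 80% and 20%; the section scaled scores shown are hypothetical.

```python
# Sketch of combining section scaled scores with the approved weights
# (e.g., 80% multiple-choice, 20% open-response). A rounded total of 240 or
# higher is passing. The section scores below are hypothetical.

def total_scaled_score(mc_scaled, or_scaled, mc_weight=0.80, or_weight=0.20):
    return round(mc_weight * mc_scaled + or_weight * or_scaled)

total = total_scaled_score(mc_scaled=247.5, or_scaled=225.0)
print(total, "PASS" if total >= 240 else "NOT PASS")
```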

Score Reporting

Standard 6.10 When test score information is released, those responsible for testing programs should provide interpretations appropriate to the audience. The interpretations should describe in simple language what the test covers, what scores represent, the precision/reliability of the scores, and how scores are intended to be used.

Standard 6.16 Transmission of individually identified test scores to authorized individuals or institutions should be done in a manner that protects the confidential nature of the scores and pertinent ancillary information. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Candidate test results are released to the candidate, the Massachusetts Department of Elementary and Secondary Education (the Department), and educator preparation institutions for which explicit permission has been given by the candidate. Results are provided in accordance with predetermined data formats, security procedures, and schedules. Policies regarding the use of candidate information, including abbreviated social security numbers, and candidate privacy measures are communicated on the MTEL program website and are acknowledged by candidates during registration. Interpretive information is provided to each audience with each transmission. A Score Report Explanation is provided to candidates and is also made available on the MTEL website. It includes the following information:

  • General Information
  • Interpreting Your Total Test Score
  • Interpreting Subarea Information
  • Performance on Subareas with Multiple-Choice Items
  • Performance on Open-Response Items

Test Quality Reviews

Test quality reviews are conducted on a regular basis for the MTEL to monitor the psychometric properties of the tests and their items. These include statistical analyses of test items and test forms conducted on a periodic basis and DIF analyses conducted annually, beginning in fall 2014.

Item Analysis

Standard 4.10 When a test developer evaluates the psychometric properties of items, the model used for that purpose (e.g., classical test theory, item response theory, or another model) should be documented. The sample used for estimating item properties should be described and should be of adequate size and diversity for the procedure. The process by which items are screened and the data used for screening, such as item difficulty, item discrimination, or differential item functioning (DIF) for major examinee groups, should also be documented. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

On a periodic basis, an item analysis of responses to multiple-choice items is conducted as a quality assurance measure to assess the performance of the items. The item analysis identifies items for review based on the following item statistics (an illustrative implementation of these flagging rules follows the list):

  • The percent of the candidates who answered the item correctly is less than 30 (i.e., fewer than 30 percent of candidates selected the response keyed as the correct response) (N ≥ 5)
  • Nonmodal correct response (i.e., the response chosen by the greatest number of candidates is not the response keyed as the correct response) (N ≥ 5)
  • Item-to-test point-biserial correlation coefficient is less than 0.10 (if the percent of candidates who selected the correct response is less than 50) (N ≥ 25); or
  • The percent of candidates who answered the item correctly for the most recent period decreased at least 20 points from the percent of candidates who answered the item correctly for all administrations of the item (N ≥ 25 for the most recent period, N ≥ 50 for all administrations)
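Below is an illustrative implementation of the flagging rules listed above. The record layout and example values are hypothetical; the thresholds follow the list.

```python
# Illustrative implementation of the item-analysis flagging rules above.
# Field names and the example item record are hypothetical.

def item_flags(item):
    flags = []
    if item["n_current"] >= 5:
        if item["pct_correct_current"] < 30:
            flags.append("percent correct below 30")
        if item["modal_response"] != item["keyed_response"]:
            flags.append("nonmodal correct response")
    if (item["n_current"] >= 25 and item["pct_correct_current"] < 50
            and item["point_biserial"] < 0.10):
        flags.append("point-biserial below 0.10")
    if (item["n_current"] >= 25 and item["n_all"] >= 50
            and item["pct_correct_all"] - item["pct_correct_current"] >= 20):
        flags.append("percent correct dropped 20 points or more")
    return flags

example = {
    "n_current": 120, "n_all": 640,
    "pct_correct_current": 28, "pct_correct_all": 55,
    "point_biserial": 0.07,
    "modal_response": "B", "keyed_response": "C",
}
print(item_flags(example))   # this hypothetical item triggers all four flags
```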

Should the item analysis indicate an item warrants review, the review may include

  • confirmation that the wording of the item on the test form is the same as the wording of the item as approved by the CAC,
  • a check of content and correct answer with documentary sources, and/or
  • review by a content expert.

Based on the results of the review, items may be deleted, revised, or retained.

Test Quality Assurance Review

On an annual basis, a test quality assurance review of test form and test item statistics is conducted by Evaluation Systems psychometric and program staff for the purpose of monitoring the psychometric properties of the tests. Statistical analyses are generated for each test field regarding the following:

  • Pass rates for the most recent 3 years
  • Test reliability (KR20) (a computational sketch follows this list)
  • Percent of items with p-values ≥ 95
  • Percent of items with p-values less than 30
  • Percent of items with item-to-test point-biserial correlation less than 0.10
  • Percent of items with no response ≥ 5 percent
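As referenced in the list above, test reliability is reported as KR20. The sketch below shows the standard KR20 computation on a hypothetical matrix of scored responses (rows are candidates, columns are items, 1 = correct); it is an illustration, not the operational reporting code.

```python
# Sketch of the KR20 reliability computation: (k/(k-1)) * (1 - sum(p*q) / var_total).
# The response matrix below is hypothetical.

def kr20(responses):
    n_candidates = len(responses)
    k = len(responses[0])                                   # number of items
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_candidates
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_candidates
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_candidates  # item p-value
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

responses = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
]
print(f"KR20 = {kr20(responses):.3f}")
```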

The statistical analyses are examined to detect possible test quality issues related to, for example, test difficulty, reliability, or speededness. Any issues raised by the test quality assurance review are followed up on (e.g., by removing items from the item bank) or forwarded for further review (e.g., by content specialists).

Additionally, two Test Statistics Reports—the Test Form Statistics Report and the Open-Response Statistics Report—are produced and published annually for the MTEL. These reports are designed to provide information about the statistical properties of the MTEL, including the reliability of the tests.

DIF Analysis

Standard 4.10 When a test developer evaluates the psychometric properties of items, the model used for that purpose (e.g., classical test theory, item response theory, or another model) should be documented. The sample used for estimating item properties should be described and should be of adequate size and diversity for the procedure. The process by which items are screened and the data used for screening, such as item difficulty, item discrimination, or differential item functioning (DIF) for major examinee groups, should also be documented. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Beginning with the 2013–2014 program year, the Mantel-Haenszel approach was used for conducting an annual analysis of Differential Item Functioning (DIF) of multiple-choice items as a component of bias prevention. DIF occurs when individuals of the same ability level but with different characteristics (e.g., ethnicity/gender) have different likelihoods of answering an item correctly. DIF analyses are conducted on multiple-choice items to assess whether they perform differently depending on candidates' ethnic (White/Black and White/Hispanic) or gender (male/female) group membership. Items are assessed for DIF in regard to ethnicity and gender of test takers for items with sufficient numbers of candidates in both the focal (protected) group (e.g., Hispanic) and the reference group (e.g., White).

As of 2015, a Rasch approach to DIF analysis was adopted in place of the traditional Mantel-Haenszel procedure. The Rasch approach assesses each item for DIF across multiple administrations, as long as item overlap exists within a year or across contiguous years and the number of test takers is sufficient. An item is indicated for further review if it is identified more than 50% of the time as displaying DIF, regardless of the direction of the effect (i.e., favoring the reference or the focal group).

The Rasch approach allows for an expansion of the number of test fields that can be evaluated for potential DIF. The Rasch DIF analysis as implemented in 2015 differs from the traditional Mantel-Haenszel procedure by replacing the form-by-form data structure with an incomplete data matrix (IDM) structure and replacing the frequency table DIF approach with the Rasch DIF approach. The Rasch approach is not test-form based; instead, it “pools” data from multiple forms for each item, resulting in a greater amount of test-taker data being available for comparison. The IDM structure of the Rasch DIF analysis methodology can accommodate analysis with data sets from a single year, two contiguous years, or a number of years, as long as item overlap exists among years. As a result, the need to collapse data across many program years is lessened.

In the Rasch DIF analysis, the Rasch measurement model is fitted to the IDM data with Joint Maximum Likelihood Estimation (JMLE) to calibrate item difficulty parameters onto a common scale. JMLE provides the flexibility required to handle an IDM with missing data (Linacre, 2002; Linacre & Wright, 1989). Items for MTEL test fields are calibrated using the WINSTEPS software (Linacre, 2010). Fields with total n-counts of at least 300 test takers in a year qualify for the process. (Note: Fields with 100 to 200 test takers within a year benefit from an IDM developed by pooling two or more contiguous years of administrations. Fields with fewer than 100 test takers within a year are not processed for DIF because n-counts for the focal groups in those areas would remain small despite the pooling efforts.)

In the Rasch measurement model, DIF analysis is carried out on the difference between the Rasch-estimated b-parameters (i.e., the DIF contrast) for the focal and reference groups. Guidelines from the literature (Paek & Wilson, 2011) suggest a Rasch DIF C flagging rule comparable to the classical test theory DIF C flagging rule: an item is flagged as DIF Type C when |DIF contrast| ≥ 0.638 logits and the hypothesis H0: DIF contrast = 0 is rejected at the 0.05 level. Results of the DIF analysis, including the number of items per field identified more than 50% of the time as displaying DIF (regardless of the direction of the effect, i.e., favoring the reference or the focal group), are reported to the Department. Items identified as differentially functioning based on ethnicity are reviewed by the Department and the Bias Review Committee (BRC). The BRC reviews the identified items in light of the item statistics, using review criteria consistent with those applied during test development activities. Based on committee recommendations and final dispositions by the Department, reviewed items are either retained in the test item banks or deleted.
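A minimal sketch of the Rasch DIF C flagging rule described above, assuming that a calibrated b-parameter and its standard error are available for each group. The use of a normal approximation for the significance test, and the example estimates, are simplifying assumptions.

```python
# Illustrative sketch of the Rasch DIF C flagging rule: flag an item when
# |DIF contrast| >= 0.638 logits and H0: DIF contrast = 0 is rejected at .05.
# The normal approximation and the example values below are assumptions.
import math

def flag_rasch_dif_c(b_focal, se_focal, b_reference, se_reference, alpha=0.05):
    contrast = b_focal - b_reference                      # DIF contrast in logits
    se_contrast = math.sqrt(se_focal ** 2 + se_reference ** 2)
    z = contrast / se_contrast
    p_value = math.erfc(abs(z) / math.sqrt(2))            # two-sided normal p-value
    return abs(contrast) >= 0.638 and p_value < alpha

# Hypothetical item that is 0.80 logits harder for the focal group.
print(flag_rasch_dif_c(b_focal=0.55, se_focal=0.12, b_reference=-0.25, se_reference=0.09))
```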

For further information, see:
2015–2020 DIF Analysis Outcomes PDF
2014–2019 DIF Analysis Outcomes PDF
2013–2018 DIF Analysis Outcomes PDF
2012–2017 DIF Analysis Outcomes PDF
2011–2016 DIF Analysis Outcomes PDF
2010–2015 DIF Analysis Outcomes PDF

Archived DIF Analysis Information

Standard 4.10 When a test developer evaluates the psychometric properties of items, the model used for that purpose (e.g., classical test theory, item response theory, or another model) should be documented. The sample used for estimating item properties should be described and should be of adequate size and diversity for the procedure. The process by which items are screened and the data used for screening, such as item difficulty, item discrimination, or differential item functioning (DIF) for major examinee groups, should also be documented. Standards for Educational and Psychological Testing (AERA, APA & NCME, 2014)

Beginning with the 2013–2014 program year, an annual analysis of Differential Item Functioning (DIF) of multiple-choice items was implemented as a component of bias prevention. DIF occurs when individuals of the same ability level but with different characteristics (e.g., ethnicity/gender) have different likelihoods of answering an item correctly. DIF analyses are conducted on multiple-choice items to assess whether they perform differently depending on candidates' ethnic (White/Black and White/Hispanic) or gender (male/female) group membership. Items are assessed for DIF in regard to ethnicity and gender of test takers for items with sufficient numbers of candidates in both the focal (protected) group (e.g., Hispanic) and the reference group (e.g., White). DIF analyses are generated for items on MTEL test forms that are taken by at least 100 candidates in both the focal and reference groups. Each item is assessed for DIF across all administrations of the item on each test form on which it appeared. An item is indicated for further review if it is identified more than 50% of the time as displaying DIF, regardless of the direction of the effect (i.e., favoring the reference or the focal group). The following DIF-detection procedures were used in 2014:

  • Uniform DIF. The Mantel-Haenszel DIF procedure, designed to detect uniform DIF, was one of two DIF-detection methodologies used. Uniform DIF occurs when the probability of members from one group answering an item correctly is consistently (i.e., uniformly across all ability levels) higher than the probability of members from another group answering the same item correctly. In this procedure, candidates are sorted into focal and reference groups and matched on ability, as determined by total test score. Items were identified as differentially functioning if they met the following criterion designated by Longford, Holland, and Thayer (1993): the categorization of the magnitude of DIF, represented by |Δ|, is at least 1.5 and significantly greater than 1 (.05 significance level). Items meeting this criterion, indicating that one group performed significantly better than the group to which it was compared, were designated for further review. (A sketch of the underlying Mantel-Haenszel statistic appears after this list.)
  • Non-uniform DIF. A second DIF-detection method, described by Swaminathan and Rogers (1990), was used to detect non-uniform DIF. Non-uniform DIF occurs when individuals of different groups perform similarly on an item if they are at one end of the ability scale and differently on the same item if they are at the other end of the ability scale. For example, high-ability Whites and Hispanics may perform similarly on an item, while low-ability Whites and Hispanics may perform differently. Using this method, items were identified as differentially functioning if they were flagged at the .05 significance level. These items were also designated for further review.
  • Review of items displaying DIF. Any items that flagged for DIF were designated for further review to judge if the items contained any bias. Items identified as differentially functioning based on gender were reviewed by the Department, and items identified as differentially functioning based on ethnicity were reviewed by the Department and the BRC. The BRC reviewed the identified items in light of the item statistics, using review criteria that were consistent with those applied during test development activities. Based on committee recommendations and final dispositions by the Department, reviewed items were either retained in the banks or deleted. See 2009–2014 DIF Analysis Outcomes PDF for further information.
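As noted in the first bullet above, the uniform DIF check rests on the Mantel-Haenszel statistic. The sketch below shows the underlying common odds ratio and its conversion to the ETS delta metric on hypothetical matched-group counts; the significance test on delta is omitted for brevity.

```python
# Illustrative sketch of the Mantel-Haenszel statistic underlying the uniform
# DIF check described above. Counts are hypothetical.
import math

def mh_delta(strata):
    """strata: 2x2 counts per matched score level:
    (ref_correct, ref_incorrect, focal_correct, focal_incorrect)."""
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    alpha_mh = num / den            # common odds ratio across matched levels
    return -2.35 * math.log(alpha_mh)   # ETS delta metric

strata = [
    (40, 10, 30, 20),   # one matched ability level (hypothetical counts)
    (55, 15, 45, 25),
    (60, 20, 50, 30),
]
delta = mh_delta(strata)
print(f"MH delta = {delta:.2f}  (|delta| >= 1.5 would warrant the significance check)")
```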
