
GradeSignifier User's Guide
Version 1.0

Do your examinees' scores differ significantly?

Copyright © 2007 John C. Gunther

Contents

  1. Introduction: Testing the Test
  2. Installing and Running GradeSignifier
  3. Tutorial: An Example with Simulated Data
  4. An Example with Real (Coleman study) Data
  5. GradeSignifier's Statistical Model

  1. Introduction: Testing the Test
    As a test designer, your goal is to come up with a test that separates the A students from the A-minus students, determines who passes and who fails a qualifying exam, determines if children who ate breakfast score better than those who did not, etc.

    But there will always be a certain amount of random score variation that has nothing to do with the underlying ability, knowledge, or caloric intake of the examinees. If you are making decisions only on the basis of such variability, your test is a failure.

    GradeSignifier provides everything you need to determine if the average scores of two groups of examinees (As vs. A-minuses, passed vs. failed, breakfast vs. no-breakfast, etc.) differ significantly. Are you wondering if you could still reliably distinguish those who passed from those who failed if your test had only half the number of questions? Or how many students must be in the passed or failed groups for the differences to be statistically significant? Or how a lack of "test item independence" might impact your test's precision? GradeSignifier can help you answer such questions.

    Significance testing, though very helpful in test assessment, isn't a panacea. If you accidentally give a well designed verbal test to assess math ability, you won't get what you expect. Yet such a test will separate low and high scoring students in a statistically significant manner, so statistical tests alone won't reveal your mistake. On the other hand, a one-question test will never discriminate between passing and failing groups of examinees in a statistically significant manner (there are not enough degrees of freedom). But it might still be a good basis for making the pass/fail decision if that one question were carefully chosen by an expert.

    GradeSignifier's simple, non-parametric, Monte-Carlo-simulation-based, statistical model has a number of advantages:

    1. It requires an absolute minimum of statistical knowledge to be fully understood.

    2. Unlike parametric methods, its validity does not require that your data set fit a particular equation, such as a logistic curve.

    3. The method automatically handles any combination of single-right-answer test items, partial credit test items and continuously scored items. And tests with very small numbers of questions and/or examinees, or questions that everyone gets right or wrong, are never a problem.

    4. Any required assumptions about the statistical independence of test items are user configurable and explicit, rather than hard-coded and tacitly assumed.

    With its tabbed-forms user interface, GradeSignifier organizes your data analysis into a series of easy steps.

    The Tutorial section of this document describes these steps in more detail, using an easy-to-understand, simulated-data-based example. Or, if you prefer, take a look at an example using real data.

  2. Installing and Running GradeSignifier

    To install GradeSignifier:

    1. First, verify that you have Java 1.5 (a.k.a. 5.0) or higher installed on your computer. You can both check the version of Java currently installed and download the latest version at java.com.

      Another way to check the version: From your system's command prompt ("Start, All Programs, Accessories, Command Prompt" will bring up a command prompt on Windows XP) you can enter the command:

      java -version
      

      On my computer, which has Java 1.5 installed, this command produced the following output:

      java version "1.5.0_06"
      Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_06-b05)
      Java HotSpot(TM) Client VM (build 1.5.0_06-b05, mixed mode, sharing)
      

    2. Unzip the contents of the gradesignifier1.0.zip file into the folder of your choice (we use C:\gradesignifier ).

      One of these files is gradesignifier.jar, an executable jar file that contains everything (except Java 1.5) that you need to run GradeSignifier. Also included: the Java source code, documentation, and the example models (*.gsm files) and data sets (*.txt files) discussed in this User's Guide.

      Note: if you do not have an unzip utility, I recommend the reliable, easy-to-use, open-source 7-Zip.

    3. Start up the GradeSignifier application by either:

      1. From your graphical interface: Double click on the icon associated with the gradesignifier.jar file you unzipped in the previous step.

      2. From your command prompt: Alternatively, issue the following command to start up GradeSignifier:
        java -jar c:\gradesignifier\gradesignifier.jar
        
        The above assumes you placed the GradeSignifier files into a folder called c:\gradesignifier, on Microsoft Windows. You will need to specify a different full file pathname if you placed these files into a different folder or are using a different operating system.

  3. Tutorial: An Example with Simulated Data
    1. A Simple Model and Associated Simulated Data Set
      In this tutorial, we will be working with a simulated data set produced by a very simple model.

      In this model, there are just two ability levels of students: BRIGHT and STUPID. There are also just two kinds of questions: GOOD and POOR. GOOD questions perfectly separate the BRIGHT from the STUPID students, namely, BRIGHT students always get GOOD questions right, and STUPID students always get them wrong. On the other hand, both BRIGHT and STUPID students have a 50 percent chance of getting a POOR question right.

      Note: I'd never use the term stupid to refer to a real student. However, it seems fair to label simulated students that, by definition, always get questions wrong and can never improve, as STUPID.

      Following these rules, we have produced a simulated item response data set in the file GradeSignifierTutorialData.txt (see GradeSignifierTutorialData.java, which generated this file, for more info). The simulated data set contains an answer key record and records corresponding to the simulated responses of 5 BRIGHT students and 5 STUPID students. Each record contains simulated responses (or the correct answers) for 5 GOOD questions and 5 POOR questions. This GradeSignifierTutorialData.txt file is shown below:

      BRIGHTABCDEABCDE              
      BRIGHTABCDEBBDDA              
      BRIGHTABCDEACDDA
      BRIGHTABCDEBBCEE
      BRIGHTABCDEBBDEE
      BRIGHTABCDEABDDA
      STUPIDBCDEAABCDA                 
      STUPIDBCDEABBDEA
      STUPIDBCDEAACDDE
      STUPIDBCDEAACDDE
      STUPIDBCDEABBCEA
      

      By default, GradeSignifier uses the first record as an answer-key record, which, in our example, contains the correct responses to each test item. Each of the remaining records provides information related to a single (simulated) student.

      The first 6 columns contain a label (BRIGHT or STUPID) that identifies the ability level of each student. Because this is a simulated data set, we have the advantage of knowing the exact relationship between these ability levels and scores in advance. However, if this were a real data set, this column might be any possibly-test-performance-related student attribute known before the test begins (e.g. if the student had a good breakfast before taking the test, or an award-winning teacher, etc.).

      The next 5 columns (column numbers 7 to 11) contain the responses of the students to the 5 GOOD questions. Note that the BRIGHT students always respond with the same answer as shown in the first, answer key record, whereas the STUPID students are always fooled by these questions and select the "decoy" response.

      The last 5 columns (column numbers 12 to 16) contain the responses of the students to the 5 POOR questions. Note that for these questions the correct (same as answer key) response is chosen about 50% of the time by both BRIGHT and STUPID students alike.

      In what follows, we will use this data set, and GradeSignifier, to quantify how using different proportions of GOOD and POOR questions in a single test impacts the ability of that test to discriminate between BRIGHT and STUPID students.
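
      For readers who prefer code to prose, here is a minimal, hypothetical Java sketch of a generator that follows the rules above. It is not the GradeSignifierTutorialData.java program shipped with GradeSignifier (whose details may differ): the decoy rule below simply reproduces the BCDEA pattern visible in the STUPID records, and the choice of wrong answers on POOR questions is illustrative.

      import java.util.Random;

      // Hypothetical generator for a data set like GradeSignifierTutorialData.txt.
      // Follows the GOOD/POOR and BRIGHT/STUPID rules described above; NOT the shipped generator.
      public class TutorialDataSketch {
          static final char[] CHOICES = {'A', 'B', 'C', 'D', 'E'};
          static final Random RNG = new Random();

          public static void main(String[] args) {
              String answerKey = "ABCDEABCDE";   // correct answers: 5 GOOD then 5 POOR questions

              // First record: the answer key (the label cells just hold "BRIGHT" as filler).
              System.out.println("BRIGHT" + answerKey);

              for (int i = 0; i < 5; i++) System.out.println(record("BRIGHT", answerKey, true));
              for (int i = 0; i < 5; i++) System.out.println(record("STUPID", answerKey, false));
          }

          // Builds one simulated student's record: 6-character label, 5 GOOD responses, 5 POOR responses.
          static String record(String label, String answerKey, boolean bright) {
              StringBuilder sb = new StringBuilder(label);
              for (int q = 0; q < 5; q++) {
                  // GOOD questions: BRIGHT students always answer correctly; STUPID students
                  // are always fooled and pick the decoy.
                  char correct = answerKey.charAt(q);
                  sb.append(bright ? correct : decoy(correct));
              }
              for (int q = 5; q < 10; q++) {
                  // POOR questions: everyone has a 50% chance of answering correctly.
                  char correct = answerKey.charAt(q);
                  sb.append(RNG.nextBoolean() ? correct : decoy(correct));
              }
              return sb.toString();
          }

          // The "decoy" response: the choice just after the correct one (wrapping around),
          // which reproduces the BCDEA pattern visible in the STUPID records above.
          static char decoy(char correct) {
              int i = new String(CHOICES).indexOf(correct);
              return CHOICES[(i + 1) % CHOICES.length];
          }
      }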

      The GradeSignifierTutorial.gsm Model file

      The file GradeSignifierTutorial.gsm (one of the files you unzipped during installation) contains the final GradeSignifier model that you could build on your own by following the steps below.

      You may prefer to take a short-cut by opening this completed model and referring to it as you read the tutorial, instead of building your own model "from scratch".

    2. Defining the Data file and Answer key record
      The data file contains responses and an answer key. The first thing we need to do is to define the columnar layout of the data, and to tell GradeSignifier how to convert each column's data into scores (what is the right answer, etc.).

      Open up the GradeSignifier application (e.g. by double clicking on gradesignifier.jar, see the Installation section for details). GradeSignifier always starts up with a new, empty, model.

      On the Data Importer tab, click Browse... and navigate to the GradeSignifierTutorialData.txt file (one of the files you unzipped during installation) that contains the data set described above.

      After you click Open, the Filename field should contain the name of the file you selected, including its path (e.g. C:\gradesignifier\GradeSignifierTutorialData.txt).

      Note:

        If you are following along using the pre-built GradeSignifierTutorial.gsm, you will see "GS_FOLDER\GradeSignifierTutorialData.txt" in the Filename field. GradeSignifier will dynamically replace the GS_FOLDER keyword with whatever folder the currently opened model is stored in. The GS_FOLDER keyword is convenient if you want to move the model and the data file as a single unit, since you do not have to keep changing the explicit folder reference to the data file whenever both files are relocated.

      The "First 100 Rows:" field should also show the first 100 rows of this data file (since our data file contains only 9 rows, it will actually show the entire file).

      Note that the "Use first record as answer key" checkbox is, by default, checked. You may optionally enter your own answer key in the Answer key field. If you do, the first record will be used as data. More complex scoring, such as partial credit responses, require editing each column's Response/score pairs field. Because all of our questions are of the "one right answer" variety, we can instead just use the simpler, default, "use answer key" approach.

    3. Defining where each column's data is in the record
      In our example, we have a fixed-column-position record layout, so the default selection, "None", of the Cell delimiter field is appropriate. If your data is white space, tab, or comma delimited, simply select "White space", "Tab" or "Comma" from this field's dropdown list.

      We will define each column in the left-to-right order in which it appears in the record.

      To define the first column, click Add Column. Replace the default Column name of "Col0" with the name "Ability". This column contains the "BRIGHT" or "STUPID" label that defines the "underlying ability" of each student.

      For the Description field, enter "BRIGHT=high ability; STUPID=low ability". Whenever this column name is selected in one of GradeSignifier's dropdown lists, the description you enter here will be displayed as a tooltip. In general, it's best to keep column, query, and model names relatively short, and place more detailed descriptive information into the associated Description fields.

      For the Column type, select "Examinee" from the drop-down list. Examinee type columns define information about each Examinee known before they take your test (e.g. their name, age, school district, etc.). In the computations, such data is treated very differently from data corresponding to actual responses to test questions (or scores derived from such responses) so it's very important to get this column type field right.

      Using the drop-down list, select "1" for the First cell in col field and "6" for the Last cell in col field.

      Note how GradeSignifier displays a red cursor just above both the Answer key field and the First 100 lines field so that you can see which parts of the answer key and data records correspond to your selection. In this case, the label defining the ability level (BRIGHT or STUPID) in the first 6 character positions of each record is selected.

      Next, click Add Column again. Note that the default values of all fields (except the name, which must be unique) are copied from the current column.

      Enter "G1" for the Column name, "Good Question #1" for the Description.

      Select "Response" as the Column type. This column contains the students' responses to the first "GOOD" test question.

      Using the drop-down lists, select "7" for the First cell in col field and "7" for the Last cell in col field. Note that the red cursor highlights the 7th character of the record.

      Repeat the above steps 4 times, to define similar columns G2, G3, G4, and G5 corresponding to the GOOD question data stored in column positions "8", "9", "10", and "11".

      Similarly, define columns P1, P2, P3, P4 and P5 corresponding to the 5 POOR question responses, stored in positions 12, 13, 14, 15, and 16.
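
      If it helps to see the layout in code, the hypothetical snippet below extracts these same columns from one record. The dialog's cell positions are one-based, while Java's String.substring is zero-based, so cells 1-6 correspond to substring(0, 6). This is only an illustration of the record layout, not GradeSignifier's own parsing code.

      // Illustration of the fixed-column layout defined above. Not GradeSignifier's own code.
      public class RecordLayoutDemo {
          public static void main(String[] args) {
              String record = "BRIGHTABCDEBBDDA";        // the second line of the tutorial data file

              String ability = record.substring(0, 6);   // cells 1-6  : "BRIGHT" or "STUPID"
              String good    = record.substring(6, 11);  // cells 7-11 : responses to G1..G5
              String poor    = record.substring(11, 16); // cells 12-16: responses to P1..P5

              System.out.println(ability + " | " + good + " | " + poor);   // BRIGHT | ABCDE | BBDDA
          }
      }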

      Either during, or at the end of this process, you will find it useful to use the Next and Prev buttons to move through the fields in the order of their appearance in the record. When you are finished, you should be able to start from the first, "Ability", field/column, and by repeatedly clicking Next you should see the red cursor move from left-to-right across the record positions, while your field names go through the sequence:

      Ability, G1, G2, G3, G4, G5, P1, P2, P3, P4, P5
      

      Use the various fields discussed above or, as needed, the Add Column and Delete Col buttons to correct any mistakes you may have made.

    4. Saving your work
      We've done a good deal of work to define these columns. Use File, Save to save your work. GradeSignifier will allow you to select a folder and enter a file name. Select whatever folder you prefer, and then enter "myTutorial" as the file name; GradeSignifier creates a file in the folder you select called "myTutorial.gsm" (the gsm stands for GradeSignifier Model). Note that GradeSignifier changes the title bar of the application to display this name as the currently opened filename.

      Be sure to save your work periodically as the tutorial progresses.

    5. Viewing Data and Scores
      Click the Data Viewer tab to see exactly how GradeSignifier breaks up each line and converts it into scores.

      Switch the View mode field from the default "Data View", which simply extracts and displays each column's raw data from each record, to "Score View", which shows the calculated score for each test question.

      Unless you made a mistake, on the GOOD questions (G1 through G5), BRIGHT students always score 100% and STUPID students always score 0%. By contrast, on the POOR questions (P1 through P5) the 100% and 0% scores are distributed randomly, without regard to student ability level.

    6. Defining composite test scores as averages of item scores
      Click again on the Data Importer tab to define the weighted averages that determine overall test scores.

      We will define several different tests, with varying proportions of GOOD and POOR questions:

      exam_0: 0 GOOD questions and 5 POOR questions (0% GOOD)
      exam_1: 1 GOOD question and 4 POOR questions (20% GOOD)
      exam_2: 2 GOOD questions and 3 POOR questions (40% GOOD)
      exam_3: 3 GOOD questions and 2 POOR questions (60% GOOD)
      exam_4: 4 GOOD questions and 1 POOR question (80% GOOD)
      exam_5: 5 GOOD questions and 0 POOR questions (100% GOOD)
      

      First, click Add Column, set Column name to "exam_0". In the Description field, enter "An exam with 0 GOOD and 5 POOR questions".

      Next select "Average" from the Column type dropdown list, since this column's score will not be directly defined by the data, but rather will be a weighted average of other column scores.

      Next, click the Edit... button to the right of the Weighted avg formula field. In the form that appears, enter a 1 for the weight of each of the POOR questions (P1, P2, ..., P5). Allow the default weight of 0 to remain for the GOOD questions (G1 through G5). Then click OK. Note that the Weighted avg formula field now displays the weighted average formula associated with the weights you entered:

      (1*P1+1*P2+1*P3+1*P4+1*P5)/5
      

      Note how GradeSignifier has automatically standardized the weights (this assures that they sum up to 1) by dividing by 5, the sum of the weights you entered.
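
      If you would like to verify the arithmetic, the hypothetical snippet below computes a weighted average exactly the way the displayed formula does: multiply each column score by its weight, sum, and divide by the sum of the weights. It is just an illustration of the formula, not GradeSignifier's internal code.

      // The standardized weighted average, e.g. (1*P1+1*P2+1*P3+1*P4+1*P5)/5. Illustration only.
      public class WeightedAverageDemo {
          static double weightedAverage(double[] scores, double[] weights) {
              double weightedSum = 0.0, weightTotal = 0.0;
              for (int i = 0; i < scores.length; i++) {
                  weightedSum += weights[i] * scores[i];
                  weightTotal += weights[i];
              }
              return weightedSum / weightTotal;   // dividing by the weight total standardizes the weights
          }

          public static void main(String[] args) {
              double[] p = {100, 0, 100, 100, 0};        // item scores for P1..P5 (percent)
              double[] w = {1, 1, 1, 1, 1};              // the exam_0 weights entered above
              System.out.println(weightedAverage(p, w)); // prints 60.0
          }
      }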

      Again click Add Column, set Column name to "exam_1", Description to "An exam with 1 GOOD and 4 POOR questions".

      Again click Edit.... Note that the weights for exam_0 are now the defaults, so all you need to do is change the weight on P1 from "1" to "0" and to change the weight on G1 from "0" to "1" to produce an exam with 1 GOOD and 4 POOR questions. Note that "exam_0" itself appears on the weights table (with a weight of "0"). This is because GradeSignifier also allows you to use "averages of averages" (e.g. a final grade that is an average of final exam and mid-term test scores).

      Continuing in this manner, add Average type columns called exam_2, exam_3, exam_4, and exam_5. When you finish, the weights on all exams should be as shown on the table below:

              G1 G2 G3 G4 G5 P1 P2 P3 P4 P5
      exam_0:  0  0  0  0  0  1  1  1  1  1
      exam_1:  1  0  0  0  0  0  1  1  1  1
      exam_2:  1  1  0  0  0  0  0  1  1  1
      exam_3:  1  1  1  0  0  0  0  0  1  1
      exam_4:  1  1  1  1  0  0  0  0  0  1
      exam_5:  1  1  1  1  1  0  0  0  0  0
      

      GradeSignifier always orders your columns in the left-to-right order in which they appear in the record. This order is used by the column editor, data viewer, model editor, and in various drop-down lists of column names.

      To define the place within this ordering of Average type columns, simply define their First cell in col and Last cell in col fields just as you would for any other field.

      Exploit this feature by setting First cell in col to "2" for columns exam_0, exam_1, exam_2, exam_3, exam_4 and exam_5. This assures that the average scores for each exam are displayed right after the "Ability" column on the Data Viewer tab, where it is easier to see them.

      If you have done everything right, you should see the weighted average columns you've just defined in columns 2 through 7 on the Data Viewer tab. Note that exam_0 scores don't depend on student ability level. With all questions random (POOR), this is to be expected. But with exam_5, which contains only GOOD questions, BRIGHT students always get 100% and STUPID students always get 0%. In general, as the exams get more GOOD questions, scores show an increasingly clear separation between BRIGHT and STUPID students.

    7. Do BRIGHT students, on average, score significantly higher than STUPID students?
      Before proceeding to show how you can use GradeSignifier to answer this question, I recommend you first read A Quick Explanation of GradeSignifier's Statistical Model, so that you will have a basic idea of what the statistics we will be calculating in this section mean.

      For exam_5, where BRIGHT students always get 100% and STUPID students always get 0%, we can easily guess the answer to the question above. But for exam_2, which involves a combination of GOOD and POOR questions, the answer is more ambiguous.

      To use GradeSignifier to answer this question, first click on the Data Queries tab, and then click New to create a new query table row.

      Enter "BrightStudents" into the Name column of the new query row.

      Next, select "Info" from the Type column's drop-down list. Info type queries can only involve columns that contain information about the examinees known before the test is taken (e.g. their age, sex, race, school district, etc.).

      Next, use the dropdown lists to select, or directly enter, in left-to-right order, the five "query chunks" shown below:

      (  Ability == "BRIGHT" );
      

      This defines a query that matches the 5 rows that contain data from BRIGHT students.

      Next, click the checkbox in front of the query-row you just created, click the Insert... button, and then click the Insert Copies of Selected Rows button to make a copy of the row.

      Change the Name column of the newly created query-row to "StupidStudents". Edit the right-hand-side of the query to change "BRIGHT" into "STUPID". The resulting query should look as shown below:

      (  Ability == "STUPID" );
      

      Finally, switch back to the Data Viewer tab and use the Subset dropdown list to select the BrightStudents (and then the StupidStudents) queries you just created. If you did everything right, the rows matching the "BrightStudents" and "StupidStudents" queries should contain only the 5 BRIGHT students and the 5 STUPID students, respectively.

      Now that the compared groups have been defined, select the Stat Model tab to define a statistical model that compares the averages of these two groups.

      Click New on the Stat Model tab to create a new statistical model row.

      Select "Info" from the Type dropdown list in the model row, since we will be comparing groups defined by information known before the test begins (each student's BRIGHT or STUPID ability label).

      Enter "BrightVsStupid_5" into the model row Name column.

      Select "exam_5" from the Score dropdown list in the model row.

      Select "ALL_ROWS" from the Common dropdown list in the model row.

      Select "BrightStudents" from the Group1 dropdown list in the model row.

      Select "StupidStudents" from the Group2 dropdown list in the model row.

      Enter 100 in the Reps (replicates) column; this tells GradeSignifier to generate 100 simulated data points. Statistics become more precise as you increase the number of replicates, though computing times also increase.

      Leave all other model columns at their default settings.

      Next, click the checkbox in front of the model-row you just created, click the Insert... button, and then click the Insert Copies of Selected Rows button to make a copy of the row.

      Perform the above step a total of 5 times to create (counting the original row) 6 model rows in all.

      Set the names of these 5 copies to BrightVsStupid_0, BrightVsStupid_1, BrightVsStupid_2, BrightVsStupid_3, and BrightVsStupid_4. Similarly, set each model's score column to exam_0, exam_1, exam_2, exam_3, and exam_4.

      When you finish, you should have six models that differ only in which of the six defined exam scores they use as the comparison criterion. Note that the index number in each model's name reflects the number of GOOD questions (0, 1, 2, 3, 4, or 5) in the 5-question test whose average score the model compares across the BRIGHT and STUPID groups.

      Finally, select all 6 of the model rows by clicking Select... and then Select All. Then click Calculate. Note that the Conf. Level column for each model now displays the calculated confidence levels instead of the initial "???". By convention, confidence levels greater than 95% are usually considered to be statistically significant.

      In my run, these computed confidence levels were:

      BrightVsStupid_0 - 34.884%
      BrightVsStupid_1 - 95.354%
      BrightVsStupid_2 - 99.41%
      BrightVsStupid_3 - 99.998%
      BrightVsStupid_4 - 100.0%
      BrightVsStupid_5 - 100.0%
      

      Note that, due to the effect of random sampling, these percentages will vary each time you compute the model, but the general trend should be about the same. You can copy a single row several times and re-calculate it to get an estimate of how large this sampling related variation is. For example, here are the results when I recalculated all of the above models a second time:

      BrightVsStupid_0 - 38.99%
      BrightVsStupid_1 - 95.065%
      BrightVsStupid_2 - 99.537%
      BrightVsStupid_3 - 99.997%
      BrightVsStupid_4 - 100.0%
      BrightVsStupid_5 - 100.0%
      

      The results are somewhat surprising: even a 5-question test with just a single GOOD question will still produce composite scores that yield statistically significant (at the 95% confidence level) differences in group averages when Group1 contains 5 BRIGHT students and Group2 contains 5 STUPID students.

      You might expect that significance levels would decrease if the number of students in the BRIGHT and STUPID groups decreased. As an additional exercise, you could verify this expectation by creating a query, "Rows4To7", that selects just 2 BRIGHT and 2 STUPID students, and then making that query the Common subset of the models we have just evaluated. The Rows4To7 query should be defined as follows:

      ( ROW% >= "4" ) AND ( ROW% <= "7" );
      

      Note the required double quotes around the row numbers. Without the quotes, the ROW% built-in column interprets 4 and 7 as percentages, rather than as row numbers.

    8. Graphically Viewing the Results
      Next, click on the Stat Charts tab in order to see each model's results in more detail.

      Select BrightVsStupid_0 from the Viewed Model dropdown list.

      A cumulative probability chart associated with the background variability of the "Group1 vs. Group2 contrast" (for this model, this contrast is defined to be the average score of BRIGHT students on exam_0 minus the average score of STUPID students on exam_0) will be displayed.

      Next, use your down arrow key to cycle through all of the models. Note that as the number of GOOD questions in the score used as the basis for the comparison increases, the real data (green line) appears further and further to the right of the simulated data, and the confidence levels increase, just as you would expect.

      The blue curve shows the 100 simulated average differences GradeSignifier generated. The percentage of simulated points with group-to-group average differences (contrasts) as small or smaller than the real data contrast represents the "Reference Distribution" based confidence level shown on the chart. This chart, called a cumulative probability chart, shows this percentage/confidence level on the y-axis for each value of the contrast shown on the x-axis.

      Since we only have 100 simulated data points (we set Reps to 100 in our models), the "Reference Distribution" based confidence level, due to its discrete nature, can never be accurate to better than about 1%.

      The Student's t statistic provides a smooth fit to this discrete reference distribution curve, and is shown in red. The Student's t approximation is especially helpful for estimating confidence levels more accurately without undue computational effort, since computing the large number of replicates that otherwise would be required could take too much time, especially with very large data sets. However, the Student's t statistic requires that the typical "normal distribution approximation is OK assumption" be valid.

      Usually, but not always, both methods will return approximately the same results. For example, a test data set with a very small number of questions, or a very small number of examinees, could produce very different confidence levels with the Reference Distribution and Student's t methods. In such cases, the discrepancy is likely because the extra assumptions behind Student's t approach are not correct, and thus the Reference Distribution approach, which does not require these extra assumptions, is preferred.

      In other words, the Student's t approach is usually more precise, but the reference distribution approach can, in special cases, be more accurate.

    9. Do Passing Students Have Significantly Higher Scores than Failing Students?
      The last example showed how to compare two groups of students selected based on a property known before the test results were collected.

      However, suppose you would like to compare groups that are selected by the scores themselves. For example, let's assume the 6 exam scores we have defined (exam_0 through exam_5) are being considered for use in determining who will get a certain certification. Those scoring 50% or higher get certified, those scoring less than 50% do not.

      A reasonable requirement for such a categorization is that the difference between the average score of the passing group and the average score of the failing group be statistically significant. Otherwise, we may have merely categorized examinees by random chance, rather than by their abilities.

      We can re-use the columns and 5-question exams defined earlier in this tutorial, so we begin by defining queries that return those that passed and those that failed each of these 6 exams (exam_0 through exam_5).

      Select the Data Queries tab, and then click New, then enter the Name "Passed_0". This query will return those students that passed exam_0.

      Select "Score" as the type of this new query. Score type queries define a single, contiguous, interval of Score type columns. For example, you could use a Score type query to define the conventional 94% to 100% range associated with an "A", etc.

      Using the query editor, enter the query shown below:

      ( exam_0 >= 50 );
      

      So, this query defines passing exam_0 as getting 50% or higher on exam_0.

      Select this query via its selection checkbox, click Insert..., then, Insert Copies of Selected Rows. Change the name of the new copy to "Failed_0", and modify the query so that it looks as shown below:

      ( exam_0 < 50 );
      

      So, this query defines failing exam_0 as getting less than 50% on exam_0.

      Next, repeat the above steps (or copy the Passed_0 and Failed_0 rows, via Insert..., and then modify the copies) to create analogous queries called Passed_1, Passed_2, etc. and Failed_1, Failed_2, etc. that return the examinees who passed or failed exam_1, exam_2, etc.

      Next, click on the Stat Model tab, then click New, enter, "PassedVsFailed_0" as the new model's Name, select "Score" as its Model Type, "exam_0" as its Score, "ALL_ROWS" as its Common subset, "Passed_0" as its Group1, "Failed_0" as its Group2, and enter "100" as its Reps (replicates).

      Click the checkbox in front of this new "PassedVsFailed_0" row, and then use Insert... repeatedly to copy this row 5 times. Then enter names "PassedVsFailed_1", "PassedVsFailed_2", etc. and change the Score column to, respectively, "exam_1", "exam_2", etc. and the Group1 columns to Passed_1, Passed_2, etc. and the Group2 columns to Failed_1, Failed_2, etc.

      Select all of these rows (by checking the checkbox in front of each row) and then click Calculate.

      Note that, as expected, confidence levels increase as the number of GOOD questions in the compared test increases (that is, as you go from PassedVsFailed_0 to PassedVsFailed_5). The confidence levels I obtained (your results will vary somewhat due to the random sampling involved) are tabulated below:

      PassedVsFailed_0 - 34.311%
      PassedVsFailed_1 - 40.653%
      PassedVsFailed_2 - 71.468%
      PassedVsFailed_3 - 99.935%
      PassedVsFailed_4 - 100.0%
      PassedVsFailed_5 - 100.0%
      

      As before, use the Stat Charts tab to review the simulated vs. real differences in more detail for each model. Note that this statistic requires more GOOD questions to produce a significant difference (Confidence level > 95%) than the comparison based directly on the ability level (BRIGHT vs. STUPID) we used in the previous section.

      Also note that, simply by the act of selecting those scoring above 50% vs. those scoring less than 50%, we have generated a more than 30% difference between passing and failing Groups--even when the only source of score differences is random variation (c.f. the PassedVsFailed_0 chart on the Stat Chart tab). By contrast, with the BRIGHT vs. STUPID groups, where group membership doesn't depend on a score interval and thus lacks this "naturally segregating, ordered selection, effect", the cumulative probability chart displayed for the model is centered about zero (c.f. the BrightVsStupid_0 chart on the Stat Chart tab).

  4. An Example with Real (Coleman Study) Data
    Simulated data examples are ideal for tutorials, since they make it easy to illustrate the various features, but you still might prefer to look at an example that uses real data.

    So that you can explore such an example on your own, I have included the GradeSignifier Model file ColemanStudyExample.gsm and the associated data file coleman_sixth_grade_batch_h01.txt in the GradeSignifier zip file.

    This example uses data from the landmark 1966 study by James S. Coleman, "Equality of Educational Opportunity (EEOS)", also known as the "Coleman study". The study, mandated by the Civil Rights Act of 1964, played an important role in subsequent debates and policy decisions concerning school desegregation.

    The data set contains the 6th grade math item response data from a single "batch" (probably a single school). The GradeSignifier model compares the average math scores (based on a 25 question standardized test) of white vs. black students, male vs. female students, older vs. younger students, A (those scoring 90% to 100%) vs. B (those scoring 80% to 90%) students, etc.

    As you explore the model, try to construct other comparisons that may interest you. For example, you might try using the special built-in-function-column RAND% to create a query that selects a random sub-sample of the data. By using this query, instead of ALL_ROWS, in the Common field of the various statistical models you can explore how confidence levels might vary if you had only had access to a smaller subset of this data.

    If you would like more information about this data set (including copies of the actual tests given to the students) you can download the complete 1966 Coleman study data set, which contains hundreds of thousands of data records.

  5. GradeSignifier's Statistical Model
    1. A Quick Explanation of GradeSignifier's Statistical Model
      GradeSignifier's statistical model always compares the average score of one group of examinees (Group1) to the average score of another group (Group2), to see if the first is significantly larger.

      Central to the model is a debate between the following two competing perspectives:

      1. The null hypothesis: The data provides no evidence that, apart from random fluctuations, Group1 average scores are higher than Group2 average scores.

      2. The alternate hypothesis: The Group1 average score is so much larger than the Group2 average, that it is highly implausible that the difference is just due to random fluctuations.

      The key phrase above is random fluctuations. We need a simple, sensible, way to estimate the size of this "background variation" before we can decide if the difference between the Group1 average and the Group2 average that we actually got blends naturally into this background, or sticks out like a sore thumb.

      GradeSignifier assumes that a reasonable model of such random fluctuations must satisfy two key requirements:

      1. Empirical consistency: For every test item, the probability of a simulated examinee getting each possible score is simply the fraction of real examinees who actually got that score, on that item.

      2. Indistinguishable (a.k.a. uniform ability) examinees: All simulated examinees have the same probability of getting each possible test item score.

      To generate simulated data consistent with these requirements, we sample randomly from the real-data scores associated with examinees in either group, thus generating a collection of simulated scores. These simulated scores are assigned to simulated Group1 and/or Group2 categories so that the number of simulated examinees in Group1, Group2, and in both groups is the same as in the real data set. Using these categorized scores, we then generate a single simulated difference (equal to the average test score of simulated examinees in Group1 minus the average test score of simulated examinees in Group2).

      Note: GradeSignifier allows each statistical model to define a common subset, specified by the model's Common field. The compared groups are actually those members of the Group1 subset that are also in the common subset, and those members of the Group2 subset that are also in the common subset. For simplicity, in our discussion we just refer to Group1 and Group2.

      This entire process is then repeated, thus generating a distribution of such simulated differences that is used to characterize the background variation.

      The real difference between the Group1 average and the Group2 average is then checked against these simulated differences to see if it is unusually large or "just of ordinary size". Specifically, the displayed confidence levels are estimates of the percentage of these background variation differences that are less than the real-data difference.

      By convention, if 95% or more of these simulated differences are less than the real data difference, the real data difference is said to be "statistically significant at the 95% confidence level". The implication in this case is that "something else is going on" other than random fluctuations, e.g. that Group1 examinees, on average, rather than being indistinguishable from Group2 examinees, actually have higher test-solving ability, etc.

      On the other hand, if less than 95% of simulated differences are less than the real data difference, our simple "all examinees have the same ability, and differences are just due to random variation" model is deemed adequate to explain the difference we actually got, so the difference is said to be "not statistically significant at the 95% confidence level".
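
      The whole procedure can be summarized in a few dozen lines of code. The hypothetical Java sketch below follows the description above for two disjoint groups, samples with replacement, and, for simplicity, works directly with each examinee's overall test score (which corresponds to the maximum dod setting discussed in the next section). It is an illustration of the idea only, not GradeSignifier's actual implementation.

      import java.util.Random;

      // Simplified illustration of the background-variation model described above.
      // A sketch only, NOT GradeSignifier's actual implementation.
      public class NullModelSketch {
          static final Random RNG = new Random();

          // group1, group2: the real overall test scores of the examinees in each (disjoint) group.
          // reps: the number of simulated differences to generate (the Reps model parameter).
          // Returns the percentage of simulated differences that fall below the real difference.
          static double confidenceLevel(double[] group1, double[] group2, int reps) {
              double realDiff = mean(group1) - mean(group2);

              // Under the null hypothesis, group membership is just a label, so pool all real scores.
              double[] pool = new double[group1.length + group2.length];
              System.arraycopy(group1, 0, pool, 0, group1.length);
              System.arraycopy(group2, 0, pool, group1.length, group2.length);

              int below = 0;
              for (int r = 0; r < reps; r++) {
                  // Draw simulated examinees at random (with replacement) from the pooled scores,
                  // keeping the simulated group sizes equal to the real group sizes.
                  double sum1 = 0, sum2 = 0;
                  for (int i = 0; i < group1.length; i++) sum1 += pool[RNG.nextInt(pool.length)];
                  for (int i = 0; i < group2.length; i++) sum2 += pool[RNG.nextInt(pool.length)];
                  double simDiff = sum1 / group1.length - sum2 / group2.length;
                  if (simDiff < realDiff) below++;
              }
              return 100.0 * below / reps;
          }

          static double mean(double[] x) {
              double sum = 0;
              for (double v : x) sum += v;
              return sum / x.length;
          }

          public static void main(String[] args) {
              double[] bright = {100, 80, 80, 60, 80};   // hypothetical Group1 scores (percent)
              double[] stupid = {40, 60, 40, 20, 40};    // hypothetical Group2 scores (percent)
              System.out.println(confidenceLevel(bright, stupid, 100) + "%");
          }
      }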

      The above discussion is a summary sufficient for understanding the tutorial; we fill in the details below. If you reached this point by following the link in the middle of the tutorial you can return to that point in the Tutorial now.

    2. Degrees of Dependence (dod): Simulating Statistically Dependent Items
      Consider the following very different ways of generating a single simulated examinee's score on each test question:

      Note: In this document, the terms "test item" and "test question" have the same meaning.

      1. Choose an examinee at random, and use that examinee's score on the first question to define the simulated examinee's score on the first question. Similarly, choose a second, third, etc. examinee at random to define the simulated examinee's score on the second, third, etc. questions.

      2. Choose a single real examinee at random, and simply use all of that examinee's test item scores as the simulated examinee's test item scores.

      Note that, though quite different, both methods are consistent with the two key requirements of a reasonable background variation model described above.

      GradeSignifier supports both of these methods through its degrees of dependence (dod) parameter (which is configurable via the dod field on the Stat Model tab). Specifically, GradeSignifier uses the first method when dod equals 0 (its minimum and default value), and the second method when dod equals the number of test questions minus one (its maximum allowed value).

      Before describing how to use this dod parameter, we have to understand the concept of test item independence. To illustrate the idea consider the following simple two question math test:

        
        1) The sum of one and one equals:
        
        a) 1
        b) 3
        c) 2
        d) it cannot be determined from the data
        
        2) What does 1 + 1 equal?
        
        a) 2
        b) 3
        c) 1
        d) an imaginary number
        
        

      These two questions are not statistically independent because, once you know that a person got the first question right you can be pretty sure that they will also get the second question right. The probability of getting the second question right isn't independent of the probability of getting the first question right. Instead, the two test items share a single degree of freedom between them, and are said to be statistically dependent.

      This is not a good test, and this kind of statistical dependency is a sign of a poorly designed test.

      The appropriate degrees of dependence (dod) setting for this test is 1, because the score on one of the questions depends on the score on the other. In general, dod is the number of such inter-dependencies between test questions that exist in an exam. A dod of 0 corresponds to the best-case scenario in which every test item is statistically independent of the others; this is what a good test designer strives for, and it is GradeSignifier's default setting. It also corresponds to the tacit assumption behind many traditional significance tests.

      Note that, in this example, there is nothing in the data alone that allows us to determine the appropriate dod setting, rather, it represents an assumption about the quality of the test (lower dods mean higher quality). Note also that in the above example, we could just as well have thrown out the redundant second question; the dod setting is most useful when you are not sure of the exact kind of dependency, but you suspect there may be some such flaws in the test, and want to estimate their potential impact on the statistics.

      On the other extreme, setting dod to N-1, where N is the number of test questions, means that all of the test questions are statistically dependent on each other (as in our example). Contrary to what this example suggests, this setting is actually quite useful in generating statistics that ignore the item structure of the data set, and instead use only the final scores in the computation. For example, a comparison of boys vs. girls that made no assumptions about the statistical independence of test items, but relied only on the final test score, could be produced by using this setting.

      With an arbitrary dod setting, GradeSignifier randomly selects a dependency structure for use in generating the data used to compute each simulated "Group1 average minus Group2 average" statistic as follows:

      1. Randomly select dod "dependent questions" from the N test questions
      2. The remaining N-dod questions are each used to form a separate "independent block".
      3. Randomly distribute the dod dependent questions among the N-dod independent blocks.

      When finished, the N questions will be distributed across N-dod "independent blocks", each of which contains at least 1, and at most dod+1, questions.

      To generate a single simulated examinee's data using this dependency structure, we randomly select a single examinee, and define the simulated examinee's scores for all the questions in the first independent block as this examinee's scores, randomly select a second examinee to define the simulated examinee's scores for all the questions in the second independent block, and so on, until the questions in all N-dod independent blocks (and hence all N questions) have been defined.
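
      A hypothetical sketch of this block-building and copying procedure is shown below. It follows the three steps above and the per-block copying just described, but it is only an illustration, not the code GradeSignifier itself uses.

      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.List;
      import java.util.Random;

      // Illustration of the dod-based dependency structure and of generating one simulated
      // examinee's item scores from it. A sketch only, NOT GradeSignifier's own implementation.
      public class DodSketch {
          static final Random RNG = new Random();

          // Builds a random dependency structure: N questions distributed across N - dod blocks
          // (dod must be between 0 and N - 1, as described above).
          static List<List<Integer>> buildBlocks(int nQuestions, int dod) {
              List<Integer> questions = new ArrayList<>();
              for (int q = 0; q < nQuestions; q++) questions.add(q);
              Collections.shuffle(questions, RNG);

              // The first dod shuffled questions are the "dependent" ones; each of the remaining
              // N - dod questions seeds its own independent block.
              List<List<Integer>> blocks = new ArrayList<>();
              for (int i = dod; i < nQuestions; i++) {
                  List<Integer> block = new ArrayList<>();
                  block.add(questions.get(i));
                  blocks.add(block);
              }
              // Randomly distribute the dod dependent questions among the N - dod blocks.
              for (int i = 0; i < dod; i++) blocks.get(RNG.nextInt(blocks.size())).add(questions.get(i));
              return blocks;
          }

          // Generates one simulated examinee: for each block, copy that block's item scores
          // from a single randomly chosen real examinee (realScores[examinee][question]).
          static double[] simulateExaminee(double[][] realScores, List<List<Integer>> blocks) {
              double[] sim = new double[realScores[0].length];
              for (List<Integer> block : blocks) {
                  int examinee = RNG.nextInt(realScores.length);   // sampling with replacement
                  for (int q : block) sim[q] = realScores[examinee][q];
              }
              return sim;
          }
      }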

      GradeSignifier repeatedly uses the same dependency structure to create a sufficient amount of data to compute a single simulated difference between the Group1 average and the Group2 average. However, each new simulated difference is based on a fresh, randomly selected, dependency structure.

      The astute reader may have noticed that this isn't the only way to simulate test item dependency. However, this approach is very simple, is unbiased, and moreover satisfies an essential empirical consistency requirement related to inter-item dependency: that the kind of dependency generated be entirely consistent with the kind appearing in the real data. These advantages arise quite naturally from the simple fact that the simulated scores associated with each independent block are selected at random and copied directly from the real data set.

      The astute reader may have also speculated that the number of degrees of freedom equals N-dod, where N is the number of test questions. It turns out that this is only correct with certain special data sets, such as the one shown below:

      Note: A GradeSignifier model, "dodDemo.gsm", that employs the two data sets shown below to illustrate the points made in this section is included within the GradeSignifier zip file.

                   Question1 Question2 Question3
        Student1      0%        0%         0%  
        Student2      0%        0%         0%  
        Student3      0%        0%         0%  
        Student4      0%        0%         0%  
        Student5     100%      100%       100% 
        Student6     100%      100%       100% 
        Student7     100%      100%       100% 
        Student8     100%      100%       100% 
        

      Observe that, with the above data set, once someone tells you that a student got 100% on question 1, you automatically know their scores on questions 2 and 3. Two possible, and very different, explanations for this relationship are: 1) that the students scoring 100% had very high ability and those scoring 0% had very low ability or 2) that the three test questions were all the same. A dod setting of 0 would be consistent with the first explanation, and a dod setting of 2 with the second. And if you thought that only 2 of the questions were the same (and you were not sure which two) a dod setting of 1 would be appropriate.

      In each of these cases, the number of degrees of freedom associated with the model would be N-dod.

      However, for general data sets, all that you can say is that the number of degrees of freedom either decreases, or stays the same, as the dod setting is increased.

      An example of a data set where the number of degrees of freedom is a constant (N) regardless of the dod setting is shown below:

                   Question1 Question2 Question3
        Student1      0%        0%        0% 
        Student2      0%        0%       100% 
        Student3      0%       100%       0%  
        Student4      0%       100%      100%
        Student5     100%       0%        0% 
        Student6     100%       0%       100%
        Student7     100%      100%       0% 
        Student8     100%      100%      100%
        

      Observe that even if you know that Question1 has a score of 100%, the probability of getting 100% on question Question2 or Question3 is exactly the same as if you did not know this, due to the "balanced" nature of the data set's structure (which is a so-called "2^3 factorial design").

      In general, as dod moves from 0 to N-1, the degrees of freedom associated with the model goes from the largest to the smallest value that is consistent with the dependency structure within the real data set.

      Thus, whenever your conclusions are the same regardless of the dod settings, it means that, due to the independent structure of your data set, your conclusions don't depend on unverifiable assumptions about the statistical independence of individual test items. Similarly, if the conclusions are the same for all but the largest dod setting, it means that only the weakest assumptions about item independence (e.g. that there are just two distinct questions, with all other questions copies of these two) are required to make your inferences valid.

      On the other hand, if your results are only statistically significant when dod is zero, it means that your conclusions are only valid if every question's results really are statistically independent (e.g. one copied question would invalidate your conclusions).

      Another way to look at this is that, when differences are statistically significant, there are often two possible explanations. The first, more optimistic, view is that the test genuinely discriminates between students of high and low ability. The second, more pessimistic, explanation is that the test designer simply copied the same question repeatedly until they created a test of the required length (or that there is some similar test design flaw).

      This ability to quantify the degree to which your statistical inferences rely on additional assumptions about item independence that cannot be verified from the data set alone is one of the unique features of GradeSignifier compared to traditional approaches, which tend to use a single, one-size-fits-all assumption about test item independence.

    3. Sampling With and Without Replacement
      In the previous discussion we spoke of "randomly selecting an examinee from the real data set" in order to generate the simulated data set.

      GradeSignifier provides two different ways of doing this selection. When the wor model parameter is checked, the selection is done without replacement. When unchecked, it is done with replacement.

      Imagine a deck of cards, each of which represents an examinee. In selection without replacement, the deck is well shuffled, and then the cards are dealt out, one by one, until the deck is exhausted. If more random examinees are needed, the cards are collected, shuffled, and dealt out again.

      In selection with replacement, we begin with the same shuffled deck, but after each selection the selected card is immediately put back into the deck, and the deck reshuffled, before the next card (the next randomly selected examinee) is selected.
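
      In code, the card-deck analogy looks roughly like the hypothetical sketch below (an illustration only; GradeSignifier's own sampler may differ in its details).

      import java.util.ArrayList;
      import java.util.Collections;
      import java.util.List;
      import java.util.Random;

      // The two sampling schemes, using the card-deck analogy above. A sketch, not GradeSignifier's code.
      public class SamplingSketch {
          static final Random RNG = new Random();

          // Without replacement: shuffle the "deck" of examinee indices and deal them out in order;
          // reshuffle only when the deck is exhausted.
          static int[] sampleWithoutReplacement(int nExaminees, int nDraws) {
              List<Integer> deck = new ArrayList<>();
              int[] draws = new int[nDraws];
              for (int i = 0; i < nDraws; i++) {
                  if (deck.isEmpty()) {
                      for (int e = 0; e < nExaminees; e++) deck.add(e);
                      Collections.shuffle(deck, RNG);
                  }
                  draws[i] = deck.remove(deck.size() - 1);
              }
              return draws;
          }

          // With replacement: every draw picks uniformly from the full deck.
          static int[] sampleWithReplacement(int nExaminees, int nDraws) {
              int[] draws = new int[nDraws];
              for (int i = 0; i < nDraws; i++) draws[i] = RNG.nextInt(nExaminees);
              return draws;
          }
      }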

      In the vast majority of cases that arise in practice, both methods provide very nearly the same results. However, in certain special cases, typically those involving very few examinees, very few questions, or very high dod settings, the results can be very different.

      The two sampling methods actually involve somewhat different statistical models. Sampling with replacement makes the assumption that the empirical data set represents a reasonably good estimate of the underlying score distribution that represents the population from which the original data set was sampled.

      On the other hand, sampling without replacement defines the population from which we are sampling so that it is restricted to only those collections of examinees that have exactly the distribution of scores (in all possible orders of occurrence) that we actually got.

      Because sampling without replacement doesn't require the additional assumption that the real data set represents the underlying population's distribution, it is more plausible and thus more likely to be a better model.

      On the other hand, sampling with replacement runs faster. In addition, since both sampling approaches are widely used in Monte-Carlo simulations, it seemed best to give you both.

      If you are a stickler for using the best available model, you will check the wor box. If you'd rather make it run faster (or you have some other reason for preferring sampling with replacement) you can stick with GradeSignifier's default, unchecked, setting. However, if you have good reason to believe that the real data scores are not a good estimate of the underlying distribution of scores (e.g. when there is a very small amount of data) play it safe and check the wor box.

    4. How GradeSignifier Determines the Group Membership of Simulated Examinees
      In previous sections, we described how individual simulated test item scores were generated.

      GradeSignifier generates each simulated examinee's overall test score by simply averaging together that examinee's simulated test item scores, just as it does for real examinees. It generates a number of simulated examinees (and their associated simulated test scores) equal to the total number of real examinees in either Group1 or Group2.

      In order to compute the difference between the simulated Group1 average, and the simulated Group2 average, GradeSignifier needs to decide which of these simulated examinees are members of Group1, Group2, or both.

      How GradeSignifier makes this decision depends upon whether the model is of examinee information type (abbreviated "Info" in the Model type drop-down list), or score interval type (abbreviated "Score"). Each of these methods of deciding group membership is described in a corresponding section below.

      1. Determining Group Membership in Examinee Information ("Info") Type Models
        In examinee information type models, a real examinee's membership in Group1, Group2, or both groups is defined by queries that only involve examinee information known before the test data is collected, such as school district, teacher, age, sex, race, annual family income, and similar examinee attributes.

        Note that there are actually three possible group membership labels: Group1Only, Group2Only, and BothGroups. (Although most models will likely employ disjoint groups, GradeSignifier can also handle overlapping groups.)

        GradeSignifier simply counts the number of real examinees in each of these three categories, and then randomly assigns the same number of simulated examinees to each category.

        With simulated examinees so categorized, the test scores of examinees in Group1Only or BothGroups are averaged together to form the simulated Group1 average, and the test scores of examinees in Group2Only and BothGroups are averaged together to form the simulated Group2 average.

        With this kind of model, each of GradeSignifier's simulated group-to-group average differences can be shown to be a random sample from a so-called randomization reference distribution (c.f. Box, Hunter and Hunter, "Statistics for Experimenters", p. 95). In this kind of distribution, under the null hypothesis, group membership is just a meaningless label with no impact on scores. Under this hypothesis, any of the myriad possible ways of assigning these labels to the examinees is equally likely. So, conceptually, we generate all of them by assigning the labels in all possible ways to the real data set and computing the group-to-group average difference for each such combination, thus generating a distribution that characterizes the expected background variability.

      2. Determining Group Membership in Score Interval ("Score") Type Models
        In score interval type models, Group1 and Group2 membership is defined by two corresponding contiguous ranges of the same, single, test score. For example, Group1 might be those who scored 90 to 100%, and Group2 those who scored 80 to 90%, on the final exam.

        Note: The test score used to define group membership is usually (but not required to be) also the averaged/compared test score. Also, the two group membership determining score ranges are typically (but are not required to be) disjoint. The unusual cases are handled correctly by GradeSignifier, but for simplicity the discussion below ignores them.

        GradeSignifier combines the test scores of all of the examinees that fall into either score interval, and uses these scores to compute the rank-range associated with each group-membership-defining score interval.

        For example, let Group1 be those scoring above 50% and Group2 those scoring below 50% on the final exam. Suppose there just happened to be 10 examinees in Group1 and 20 in Group2; then ascending-sorted ordinal positions (a.k.a. ranks) 0 to 19 would be associated with Group2 (because the 20 lowest scoring examinees had failed) and positions 20 to 29 with Group1 (since the 10 highest scoring examinees had passed).

        GradeSignifier then generates the same number of simulated test scores as were in either Group1 or Group2 in the real data set. It then averages together those simulated scores whose ranks match the rank-range of the real Group1 data set to determine the simulated Group1 average, and determines the simulated Group2 average analogously.

        Continuing with the previous example, 30 simulated test scores would be generated, and the 20 lowest scoring simulated examinees would be averaged together to form the simulated Group2 average, and the 10 highest to form the simulated Group1 average.

        You might wonder why we do not just directly apply the original score intervals to each simulated score to determine Group1 and Group2 membership. In fact, I had originally planned to do just that, until I realized that it was impossible to assure that there would be even a single simulated score that fell into either score interval! By translating the original score intervals into their associated rank-ranges, this problem was avoided: Regardless of the actual numerical values of the simulated scores, as long as we generate the same number of simulated scores as are in the real data set, there will always be a one-to-one mapping between the ascending sort-ordered ranks of the simulated and real test scores.
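
        A hypothetical sketch of this rank-range mapping follows; it reproduces the 20-failed/10-passed example above, and is only an illustration of the idea, not GradeSignifier's own code.

        import java.util.Arrays;

        // Illustration of rank-range based group assignment for Score type models, using the
        // passed/failed example above (20 in Group2, 10 in Group1). A sketch, not GradeSignifier's code.
        public class RankRangeSketch {
            // simulatedScores must contain exactly as many scores as there were real examinees
            // in Group1 or Group2 (here 30). Returns {group1Average, group2Average}.
            static double[] groupAverages(double[] simulatedScores, int nGroup2) {
                double[] sorted = simulatedScores.clone();
                Arrays.sort(sorted);                     // ascending order: ranks 0 .. n-1

                double sum2 = 0, sum1 = 0;
                for (int rank = 0; rank < nGroup2; rank++)
                    sum2 += sorted[rank];                // ranks 0..19  -> simulated Group2 (failed)
                for (int rank = nGroup2; rank < sorted.length; rank++)
                    sum1 += sorted[rank];                // ranks 20..29 -> simulated Group1 (passed)
                return new double[] { sum1 / (sorted.length - nGroup2), sum2 / nGroup2 };
            }
        }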

    5. Reference Distribution vs. Student's t: Estimating Confidence Levels
      GradeSignifier's Stat Chart tab displays two different confidence level estimates, and their associated cumulative probability distribution curves: the Reference Distribution estimate and the Student's t estimate.

      The practical difference between the two estimates/curves can be summarized as follows: the Reference Distribution method is more accurate, but the Student's t method is more precise.

      In all but a few special situations, mainly those involving small numbers of examinees and/or small numbers of statistically independent test questions, the lack of accuracy of the Student's t method will be negligible. On the other hand, unless you are willing to wait a very long time while GradeSignifier computes a very large number of replicates (or have a very fast computer) the Reference Distribution method's lack of precision usually will NOT be negligible.

      Thus, in most cases, you will want to use the Student's t estimate/curves, and this is the value displayed on the Stat Model tab.

      However, the Reference Distribution method has the advantage of extreme simplicity: it simply computes the percentage of simulated differences between Group1 averages and Group2 averages that are less than the corresponding real data difference, and displays that percentage as the confidence level. If you need to determine if something is unusually large compared to its background variability, you cannot get more basic than that.

      This method's lack of precision arises because, if you only have 100 simulated differences, any real difference that falls between the 99th and 100th largest simulated differences will be at the 99% confidence level and (what is even worse) any point greater than the 100th simulated point will be at the 100% confidence level. Of course, you could always improve things by using, say, 10,000 simulated differences, but unless you have a very fast computer, this strategy could keep you waiting longer than you'd like.

      The Student's t distribution represents a continuous approximation of the discrete distribution that is the basis of the Reference Distribution approach. Beginning with the statistician W. S. Gosset (who published under the pseudonym "Student") and continuing to the present day, Student's t has been generally accepted by experts to be appropriate in almost every case in which statistically independent individual data points are used to form the averages that are in turn used to compute the t statistic (for more about this, Google "central limit theorem"). Because the individual simulated test scores that go into our simulated Group1 and Group2 averages are produced by a random number generator, their statistical independence is assured, and thus the Student's t model fits the discrete distribution quite well in (almost) all cases that arise in practice. You can check this assertion by inspecting the discrete vs. Student's t curves (on the Stat Charts tab) for any of the example model files shipped with the application.
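
      For comparison, the hypothetical sketch below shows one standard way of turning two groups of scores into a Student's-t-based confidence level: a pooled-variance two-sample t statistic, with the t cumulative probability supplied by the Apache Commons Math library (commons-math3 must be on the classpath). GradeSignifier's own t curve is described above as a smooth fit to the reference distribution, so its computation may differ in detail; this is only an illustration of the t-based alternative.

      import org.apache.commons.math3.distribution.TDistribution;

      // One standard Student's-t-based confidence level for a difference of two group means
      // (pooled-variance two-sample t). An illustration only; GradeSignifier's own t computation
      // may differ in detail.
      public class StudentsTSketch {
          static double tConfidenceLevel(double[] g1, double[] g2) {
              int n1 = g1.length, n2 = g2.length;
              double m1 = mean(g1), m2 = mean(g2);
              double pooledVar = (sumSq(g1, m1) + sumSq(g2, m2)) / (n1 + n2 - 2);
              double t = (m1 - m2) / Math.sqrt(pooledVar * (1.0 / n1 + 1.0 / n2));

              // P(T <= t) with n1 + n2 - 2 degrees of freedom, expressed as a percentage,
              // analogous to the "percentage of simulated differences below the real difference".
              return 100.0 * new TDistribution(n1 + n2 - 2).cumulativeProbability(t);
          }

          static double mean(double[] x) {
              double s = 0;
              for (double v : x) s += v;
              return s / x.length;
          }

          static double sumSq(double[] x, double m) {
              double s = 0;
              for (double v : x) s += (v - m) * (v - m);
              return s;
          }
      }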
