This section will guide you through the process of evaluating student learning, starting with designing assignments and exams, then moving to grading and giving feedback, and finally closing with learning from assessment. Within these sections, we’ll address the variety of methods that can be used to assess student learning, including exams, written assignments, performances, and informal activities.
University instructors assess students’ work in order to assign a grade—to make a summative judgment of an individual student’s level of success or failure. However, assigning grades is only one of several reasons to assess student learning. An equally important reason is to provide formative feedback, both to the students as to how well each of them is learning and to the instructor as to how well the class as a whole is doing.
Assessment is also central to the learning process itself. One of the most important aspects of a successful learning experience is the opportunity for learners to play back to teachers their understanding of the information or processes they are learning. Through this opportunity, they can articulate their growing knowledge and receive correction, if needed, from the teacher. At the same time, teachers can learn how effective they have been in facilitating learning for their students and can use this information to revise their instructional practices.
It’s never too early to think about assessing student learning. As you are planning your course, you will begin by thinking about the learning goals you have for your students. What do you want them to learn in the course? What knowledge will they gain during the course? What skills will they acquire? How will their thinking be different? What will they be able to do?
Once you have articulated those goals, the next question you have to ask yourself is, How will I know if they have achieved those goals? How will they know? This is where assessment comes in.
Educators often stress “authentic” assessment, by which they mean continuously monitoring student performance by seeing what students know or can do while they are learning. These kinds of assessments often involve process skills and are informal, designed to provide ready feedback to both student and teacher. Examples are contained in Angelo and Cross’s Classroom Assessment Techniques (1993). Students, for example, might be invited to apply a category system to a set of data to see if they understand how to do this; they might be asked to list the main reasons why a certain problem could not be solved using a given procedure to see if they realize the limitations; or they may be asked to keep a portfolio of their written work and comment on their progress periodically. All of these procedures are designed to be natural tests of the learning goals for the purpose of improvement.
Unfortunately, assessment is more often used only to justify the assignment of letter grades than to serve as a diagnostic tool. As Svinicki (1976) points out, there are at least two kinds of occasions when assessment for diagnosis is important. One is at the beginning of a course or a given segment of a course, when it is appropriate to assess what the learners already know about what is to be learned. At these times, a pretest can help the instructor know the strengths and weaknesses of the learners and can suggest ways to modify learning activities accordingly. Another use of diagnostic assessment is the administration of frequent short self-tests to enable students to judge their performance while they are learning. If constructed in such a way that the tests force students to become more aware of the thinking process they use, diagnostic tests can help students develop their skills. These tests can also provide the kind of rapid and frequent feedback that is so important to learning.
Methods of evaluation and grading are closely tied to an instructor’s own personal philosophy regarding teaching. Consistent with this, it may be useful, in advance, to consider factors that will influence instructors’ evaluation of students. For example, some instructors make use of the threat of unannounced quizzes to motivate students, while others intentionally do not. Some instructors weigh content more heavily than style. It has been suggested that lower (or higher) grades should be used as a tool to motivate students. Other instructors may use tests diagnostically, administering them during the quarter without grades and using them to plan future class activities. Extra credit options are sometimes offered when requested by students or deemed appropriate by instructors. Some instructors negotiate with students about the methods of evaluation, while others do not. Class participation may be valued more highly in some classes than in others. These and other issues directly affect the instructor’s evaluation of student performance. As personal preference is so much a part of the grading and evaluating of students, a thoughtful examination of one’s own personal philosophy concerning these issues will be very useful. Once you have clarified for yourself what your philosophy is, it’s also important to make your philosophy and methods of evaluation and grading explicit to your students.
Regardless of which purpose is intended, assessment should be tied to your desired learning outcomes for students. In some cases these outcomes or objectives will be provided for an instructor. If a course is part of a curricular sequence, or if it is the prerequisite for another class, many of the items students must learn will be determined in the curriculum planning process. The syllabus for courses and sections taught by teaching associates is also often (although not always) provided by the department or supervising professor rather than determined by each TA.
Whether the learning objectives are developed by the instructor or provided by another, it is important that the instructor be very clear what these outcomes are. It is very difficult to judge performance if one cannot describe success; it is likewise difficult for students to achieve success if they do not know the target.
Once the desired outcomes are clear, effective assessment tools can be developed to determine student achievement. Different kinds of assessments are appropriate in different settings and for different purposes. Performance assessment is very important where the learning goals involve the acquisition of skills that can be demonstrated through action. In areas such as music, theater, art, dance, medicine, and physical education, much of the learning will be demonstrated through assessment of actual performance. Papers and other assignments are also methods for assessing student achievement. Examinations can come in many formats: essay or multiple choice; paper and pencil or online; take-home or in-class; etc.
In their book Effective Grading, Walvoord and Anderson offer six guidelines for creating assignments worth grading that are useful to keep in mind as you design your own assessments:
There are many ways to assess student learning throughout the course. You don’t want the first time you or the students realize they haven’t understood the material to be a midterm exam worth half of their grade.
Many forms of in-class assessment are informal and come quite naturally to teachers. When we ask students questions in class or have them work problems or discuss a concept in their own words, we are assessing their learning. Some more systematic methods of in-class assessment are useful as well.
Classroom Assessment Techniques (CATs) are short, ungraded, in-class activities that not only give you feedback about your students’ learning, but also help them self-assess and can even guide them in their learning. The book Classroom Assessment Techniques by Thomas Angelo and K. Patricia Cross (available in the FTAD library) lists many examples of CATs in several categories.
These useful links also provide a brief description of how to use CATs and some examples you may want to adapt for your class.
“Classroom Assessment Techniques” from National Teaching and Learning Forum
Classroom Assessment Technique Examples from The University of Hawaii
When we think of assessment, many of us immediately think of written tests administered at given intervals throughout the course. Such examinations usually follow either an open-response (essay) or limited-response (multiple choice) format. The following sections focus on traditional test formats. However, there are a variety of creative ways with which instructors can approach testing. For example, an extensive collection of samples of instructor-made tests in the sciences is available in Tobias and Raphael (1997).
In areas where written tests are used, some general advice for instructors includes the following:
Compose test items throughout the quarter. Instructors can compose test items as they progress through the term, rather than all in one sitting. Doing so will help avoid fatigue later on, and will result in items that are presented closer to the way in which the information was discussed in class and in a more even distribution.
Mix question types. It is often advantageous to mix types of items (multiple choice, essay) on a written exam or to mix types of exams (a performance component with a written component). Mixing formats minimizes the effect of weaknesses in any one kind of item or component, or in students’ test-taking skills.
Test early to demonstrate testing style. It is helpful for instructors to test early in the term and consider discounting the first test if results are poor. Students often need a practice test to understand the format each instructor uses and anticipate the best way to prepare for and take particular tests.
Test often to keep students on task. Frequent testing helps students avoid getting behind, provides instructors with multiple sources of information to use in computing the final course grade (thus minimizing the effect of “bad days”), and gives students regular feedback.
Test what you really want students to learn. It is important to test various topics in proportion to the emphasis they have been given in class. Students will expect this practice and will study with this expectation.
Do not let errors in the test create errors in student responses. On written exams, it is important to proofread exams carefully and, when possible, have another person proofread them. Tiny mistakes, such as misnumbering the responses, can cause big problems later. Collation should also be checked carefully, since missing pages can cause a great deal of trouble.
Check borrowed items carefully. Instructors should be cautious about using tests written by others. Often, items developed by a previous instructor, a textbook publisher, etc., can save a lot of time, but they should be checked for accuracy and appropriateness in the given course.
Create a test bank. If enough test items are developed and kept out of circulation between tests, it is possible to develop a test item bank from which known effective items can be reused on multiple versions or offerings of a test.
Avoid items that depend on correctly answering other items. Generally, it is wise to avoid having separate items depend upon answers required in previous items. A student’s initial mistake will be perpetuated over the course of succeeding items, penalizing the student repeatedly for one error.
Start easy to build confidence. Placing less difficult items or tasks at the beginning of an exam can help students who experience test anxiety reduce their preliminary tension and thus provide a more accurate demonstration of their progress.
Get feedback on items. A good way to detect test errors in advance is by pilot testing the exam. Instructors can take the test themselves or ask colleagues and/or former students to critique it.
Make appropriate accommodations. It is important to anticipate special accommodations that students with physical or learning disabilities or nonnative speakers may need. The instructor should anticipate special needs in advance and decide whether or not students will be allowed the use of dictionaries, extra time, separate testing sites, or other special conditions. Students with disabilities registered with the Office for Disability Services are entitled by law to “appropriate accommodations” if they request them. ODS will help determine what is appropriate and how to implement these accommodations.
Bring enough copies. Having too few copies of a written exam can be a disaster. Instructors can avoid problems by bringing more copies of the exam than they think they will need.
Minimize distractions. Instructors can minimize interruptions during the exam by writing on the board any instructions or corrections that need to be made after the exam has begun and calling students’ attention to them. Before the exam, students can be informed that they should check the board periodically for instructions or corrections.
A good test reflects the goals of the course. It is congruent with the cognitive or psychomotor skills that the instructor wants the students to develop and with the content emphasis that has occurred during the instruction. That is to say, it tests student achievement of the central learning objectives of the course. If, for example, the instructor has been mainly concerned with having students memorize a body of factual material, the test should ask for recall of this material. If the instructor has been trying to develop analytic abilities in the students, a test that asks for recall is inappropriate and will cause the students to conclude that memorization is the instructor’s true goal. Similarly, if the instructor has focused on the War of 1812 in the majority of the class sessions and activities, this emphasis should be reflected in the test. A test that covers a much broader period will be regarded as unfair by the students, even if the instructor has told them that they are responsible for material that has not been discussed in class. Students go by instructors’ implicit values more than their stated ones.
To plan a test which is consistent with their goals and their content emphasis, many instructors use a test matrix or blueprint such as the one illustrated below. Arrayed down one column are the learning objectives that the test is to assess and arrayed across the columns are the concepts or content elements to be covered on the test. The instructor uses the matrix by checking those points of intersection that reflect the cognitive and content goals of the course or instructional unit and writing items or tasks accordingly. The matrix is sometimes used after the initial draft of a test has been written or composed to determine if it is unbalanced in its emphasis and needs to be revised.
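The balance check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the objectives ("recall", "analysis"), the content areas, and the planned item counts are hypothetical and do not come from this chapter. The sketch counts drafted items per cell of the matrix and reports the cells that are over or under the plan.

```python
from collections import Counter

# Hypothetical plan: (learning objective, content area) -> number of items wanted.
planned = {
    ("recall", "causes of the war"): 4,
    ("recall", "major battles"): 2,
    ("analysis", "causes of the war"): 3,
    ("analysis", "treaty and aftermath"): 1,
}

# Hypothetical draft: each item is tagged with the matrix cell it is meant to test.
draft_items = [
    ("recall", "causes of the war"),
    ("recall", "causes of the war"),
    ("recall", "major battles"),
    ("recall", "major battles"),
    ("recall", "major battles"),
    ("analysis", "causes of the war"),
]


def blueprint_gaps(planned, draft_items):
    """Return {cell: drafted - planned} for every cell that is off plan."""
    drafted = Counter(draft_items)
    gaps = {}
    for cell in set(planned) | set(drafted):
        diff = drafted.get(cell, 0) - planned.get(cell, 0)
        if diff != 0:
            gaps[cell] = diff
    return gaps


gaps = blueprint_gaps(planned, draft_items)
for (objective, content), diff in sorted(gaps.items()):
    status = "over" if diff > 0 else "under"
    print(f"{objective} / {content}: {status} by {abs(diff)}")
```

Cells reported as "under" point to objectives or content that the draft neglects; cells reported as "over" show where items could be cut or rewritten.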

Limited-Choice vs. Open-Ended Items—Instructors often ask, “Are essay tests better than multiple-choice tests?” While there is, in some disciplines, a feeling that essay tests are morally superior to “multiple-guess,” the answer, of course, depends on the circumstances and on the goals of the test. The advantages and disadvantages of two main types of items are discussed below in terms of the various issues that will often be considered when a test is being developed.
The term “limited-choice” will be used here to describe test questions that require students to choose one or more given alternatives (multiple choice, true/false, matching columns), and “open-ended” will be used to refer to questions that require students to formulate their own answers (sentence completion, short answer, essay). This avoids implying that one type of question is automatically “objective” and the other necessarily “subjective”—a faulty assumption, since bias can occur with either type of test. Following are some comparisons of the two types of items.
Exam construction and grading time—The most obvious difference between open-ended and limited-choice items is the amount of time the instructor spends on them. While it takes time to construct open-ended items well, it generally is much more time consuming to construct limited-choice items both because many more items are needed for the average exam and because it is extremely difficult to write good items. Experienced test constructors report producing as few as one to three “good” limited-choice items per hour.
While it is easier to generate open-ended items, it is much more time consuming to grade them than limited-choice items. One exam consisting of only open-ended items may take as long to grade as an entire set of exams made up of limited-choice items. If the limited-choice exams are mechanically scored, the differences are even more extreme.
Level of learning objectives—In principle, both limited-choice and open-ended items can be used to test a wide range of learning objectives. In practice, most people find it easier to construct limited-choice items to test recall and comprehension and open-ended items to test higher-level learning objectives, but other possibilities exist. Limited-choice items that require students to do such things as apply concepts to new situations or analyze text or pictures using a theory go beyond rote learning. It is also true that overly focused essay questions can easily stay at the recall level.
Content coverage—Since more limited-choice than open-ended items can be used in exams of the same length, it is possible to sample more broadly over a body of subject matter with limited-choice items, while well-constructed, open-ended items can allow students to demonstrate their understanding in depth. A small number of open-ended items that are broad in scope and call for the inclusion of many specifics can also test comprehensively.
Practice and reward of writing and reading skills—A long-term goal of many learning tasks in higher education is the cultivation of students’ reading and writing skills. Limited-choice items give virtually no practice in writing, while open-ended exams, particularly short-answer and essay, provide opportunities to improve writing. Open-ended exams, therefore, give students with good writing skills an advantage over those who do not have these skills, and limited-choice exams do not favor students who write well. They do, however, favor students who read well, since these students have the skills to attend to key words, recognize logical qualifications and cues, and discriminate among close choices.
Practice and reward of creativity and divergent thinking—Open-ended items, especially essay questions, can provide far more opportunity for creative or divergent thinking than limited-choice items. However, this depends on how the item is written since an essay question can call for convergent thinking, such as reaching a set solution to a problem situation. An argument often made about limited-choice exams is that they not only fail to foster, but actually penalize, divergent thinking.
Feedback to teacher and student—Limited-choice exams allow faster feedback than open-ended exams. Open-ended exams, however, are usually more revealing to the teacher about specific student strengths and weaknesses in processes such as comprehension and reasoning and can occasion more dialogue if teacher and student use this possibility.
Length of exam—Many limited-choice items can be answered in the amount of time it takes to answer one open-ended item, particularly essay questions. Limited-choice items or the briefer type of open-ended items (sentence completions, short answers) thus are more appropriate for short quizzes and short exams than essay questions.
Size of class—Unless multiple graders are available, it is very difficult to give frequent open-ended exams and provide timely feedback in a high-enrollment course. Exams that consist mainly of limited-choice items are usually more practical under these circumstances.
Reliability in grading—Open-ended exams are much harder to grade reliably (consistently) than limited-choice exams. However, to enhance reliability, one can use such methods as establishing model answers, holistic scoring, primary trait analysis, and grade-norming to work toward inter-rater agreement with multiple graders (see the section on grading later in this chapter).
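As a rough illustration of what checking inter-rater agreement might look like, the sketch below compares two graders' scores on the same set of essays. The scores, the 1 to 5 scale, and the one-point tolerance are hypothetical assumptions made for the example; simple percent agreement is only one of several possible measures and is not prescribed by this chapter.

```python
def agreement_rates(scores_a, scores_b, tolerance=1):
    """Return (exact, adjacent) agreement between two graders.

    exact    -- share of papers given identical scores
    adjacent -- share of papers whose scores differ by at most `tolerance`
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both graders must score the same non-empty set of papers")
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= tolerance for a, b in pairs) / len(pairs)
    return exact, adjacent


# Hypothetical scores from two graders on the same eight essays (1-5 scale).
grader_1 = [4, 3, 5, 2, 4, 3, 1, 5]
grader_2 = [4, 2, 5, 2, 3, 3, 3, 5]

exact, adjacent = agreement_rates(grader_1, grader_2)
print(f"exact agreement: {exact:.0%}, within one point: {adjacent:.0%}")
```

Papers on which the graders differ by more than the tolerance are natural candidates for discussion in a grade-norming session.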
Reusability of exam—Some instructors have serious concerns about students having access to past exams. It is usually safe to assume, regardless of the precautions you take, that some students will have access to your old exams. In general, exams consisting of a large number of limited-choice items are easier to reuse than those consisting of only a few essay questions, since it is harder for students to remember a large number of items and transmit them to others who will take the exam after them (if the printed exam does not get into circulation). If a large item bank is built and different exams can be randomly generated from the same pool of questions, limited-choice items are highly reusable. On the other hand, open-response exams can be altered in ways that require each student to do his or her own thinking, without requiring the instructor to start from scratch in designing the question.
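Random generation of exam forms from an item bank can be sketched as follows. The bank of 40 item identifiers, the form length, and the seeds are hypothetical; a real bank would likely also track topic and difficulty so that forms stay comparable, which this minimal sketch does not attempt.

```python
import random


def build_exam(item_bank, n_items, seed):
    """Draw n_items distinct questions from the bank, reproducibly for a given seed."""
    if n_items > len(item_bank):
        raise ValueError("item bank is smaller than the requested exam length")
    rng = random.Random(seed)  # separate generator so each form can be regenerated
    return rng.sample(item_bank, n_items)


# Hypothetical bank of 40 item identifiers.
item_bank = [f"Q{i:02d}" for i in range(1, 41)]

form_a = build_exam(item_bank, n_items=10, seed=1)
form_b = build_exam(item_bank, n_items=10, seed=2)
print(form_a)
print(form_b)
```

Recording the seed with each form makes it possible to reproduce exactly which items a given group of students saw.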
Prevention of cheating—Limited-choice exams provide easier conditions for cheating than open-ended exams, since single letters or numbers are far easier to see or hear than extensive text. Cheating can be minimized in several ways, however, such as by using alternate test forms and controlling seating.
Writing test items—In the discussion of limited-choice items below, the term stem is used to refer to the part of the item that asks the question. The terms responses, choices, options, and alternatives are used to refer to the parts of the item that will be used to answer the question. For example:
Stem: Who is the author of Jane Eyre?
Responses:
A) Emily Brontë
B) Charlotte Brontë
C) Thomas Hardy
D) None of the above
Multiple-choice items are considered to be among the most versatile of all item types. They can be used to test factual recall as well as level of understanding and ability to apply learning. Multiple-choice items can also provide an excellent basis for post-test discussion, especially if the discussion addresses why the incorrect responses were wrong as well as why the correct responses were right. Unfortunately, they are difficult and time consuming to construct well. They may also appear too discriminating (picky) to students, especially when the alternatives are not well constructed, and are open to misinterpretation by students who read more into questions than is there.
Suggestions for constructing multiple-choice items include:
Suggestions for constructing multiple-choice items that measure higher-level objectives include:
The following websites provide additional guidelines and examples:
True/false items are relatively easy to prepare since each item comes rather directly from the content. They offer the instructor the opportunity to write questions that cover more content than most other item types since students can respond to so many in the time allotted. They are easy to score accurately and quickly. True/false items, however, may not give a true estimate of the students’ knowledge since half can be answered correctly simply by chance. They are very poor for diagnosing student strengths and weaknesses, as they are, in effect, multiple-choice items with only two choices. They are also often considered to be “tricky” by students since in order to make them difficult enough, one must often make them obscure.
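The effect of chance on true/false scores can be estimated with a short calculation. The sketch below assumes a student who guesses every item independently with a 50% chance of being right, which is a simplifying assumption rather than a claim from this chapter, and the 70% threshold is chosen only for illustration.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of at least k correct out of n items when each item is
    guessed independently with success probability p (binomial model)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# Chance of reaching 70% correct by pure guessing on tests of different lengths.
for n, k in ((10, 7), (20, 14), (50, 35)):
    print(f"{n} items: P(at least {k} correct) = {prob_at_least(k, n):.4f}")
```

Under this model, a 70% score is much easier to reach by guessing on a short true/false quiz than on a longer one, which is one reason to avoid drawing conclusions from a handful of such items.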
Since true/false questions tend to be either extremely easy or extremely difficult, they do not discriminate between students of varying ability as well as other types of questions do. If you still want to use true/false items, suggestions for constructing true/false items include:
Matching items are generally quite brief and uninvolved and are especially suitable for who, what, when, and where questions. They can, however, be used to have students discriminate among and apply concepts. They permit efficient use of space when there are a number of similar types of information to be tested. They are easy to score accurately and quickly. Among the drawbacks of matching items are that they are difficult to use to measure learning beyond recognition of basic factual knowledge, they are usually poor for diagnosing student strengths and weaknesses, they are appropriate in only a limited number of situations, and they are difficult to construct since parallel information is required. Students will also often use process of elimination to guess answers to items they do not really know.
Suggestions for constructing matching items include:
Completion items are especially useful in assessing mastery of factual information when a specific word or phrase is important to know. They preclude the kind of guessing that is possible on limited-choice items since they require a definite response rather than simple recognition of the correct answer. Because only a short answer is required, their use on a test can enable a wide sampling of content. Completion items, however, tend to test only rote, repetitive responses and may encourage a fragmented study style since memorization of bits and pieces will result in higher scores. They are more difficult to score than forced-choice items and scoring often must be done by the test writer since more than one answer may have to be considered correct. On the whole, they have little advantage over other item types unless the need for specific recall is essential; however, they can be effective at collecting wrong answers that students will choose for future use as distractors in multiple-choice items.
Suggestions for constructing completion items include:
The main advantages of essay and short answer items are that they encourage students to strive toward understanding a concept as an integrated whole; permit students to demonstrate achievement of higher level objectives such as analyzing given conditions and critical thinking; allow expression of both breadth and depth of learning; and encourage originality, creativity, and divergent thinking. Written items offer students the opportunity to use their own judgment, writing styles, and vocabularies. They are less time consuming to prepare than any other item type. Unfortunately, tests consisting only of written items permit only a limited sampling of content learning due to the time required for students to respond. Essay items are not efficient for assessing knowledge of basic facts and poorly constructed questions provide students more opportunity for bluffing, rambling, and “snowing” than limited-choice items. They favor students who possess good writing skills and neatness and are pitfalls for students who tend to go off on tangents or misunderstand the main point of the question. The main disadvantage, however, is that essay items are very difficult and time consuming to score and potentially subject to biased and unreliable scoring.
Suggestions for constructing essay questions include:
In some cases the best way to assess student learning is through written assignments, whether they be short reflective exercises, collaborative projects, or traditional research papers. Written assignments can be an excellent way to assess higher-order critical thinking skills, such as evaluation and synthesis. Writing is not only an important skill for students to learn, but it is also a great method for them to learn the course material more fully and deeply.
The Center for the Study and Teaching of Writing has many valuable resources to support writing across the curriculum at Ohio State. Some of their on-line guides that can help you design writing assignments for your students include:
Deciding What Type of Writing Assignments to Use
Creating and Implementing Effective Writing Assignments
Creating Writing Assignments that Encourage Fair Use and Citation Ethics
In many fields (such as dance, studio art, allied medical professions, and sometimes laboratory sciences), student performance is the most appropriate way to judge student progress. Different kinds of measures will be appropriate for different fields, but some general guidelines are listed below:
It is important to base the assessment on the specific skills or competencies that the course is promoting. A course in family therapy, for example, might include performance tests on various aspects that are covered in the course, such as recording client data, conducting an opening interview, and conducting a therapy session. Developing a performance assessment involves isolating particular, demonstrable skills that have been taught and establishing ways in which the level of skill can be assessed for each student. One might, for example, decide that the best way in which a student can demonstrate counseling skills such as active listening would be to have the student play the role of therapist in a simulated session.
It is best to define the task as clearly as possible. Rather than simply alerting the students to the fact that their performance will be observed or rated, it is helpful to give more precise instructions on how the test will be structured, including how long they will have, the conditions under which they will perform the task, and other factors that will allow them to anticipate and prepare for the test, as well as the criteria by which the performance will be assessed. If possible, it is best in setting up a new assessment situation to ask a student or colleague to do a trial run before using the test with students so that unanticipated problems can be detected and eliminated.
Good performance assessments identify criteria on which successful performance will be judged and specify these in advance. For curriculum areas in which it is possible to clearly define mastery, such as, “the student will be able to tread water for five minutes,” it is desirable to do so. In most areas, however, effective performance is a complex blend of art and skill, and particular components are very subtle and hard to isolate. In these cases, it is often useful to try to highlight some observable characteristics and to define what would constitute adequate performance. In a test of teaching, for example, students might be expected to demonstrate clarity, organization, discussion skills, reinforcement of student responses, and the like. Operational definitions for specific components to be evaluated may be phrased like the following excerpt from a teaching observation checklist: “Praises student contributions—The teacher acknowledges that he or she values student contributions by making some agreeable verbal response to the contributions. The teacher may say, ‘That’s a good point,’ ‘Yes, thank you,’ ‘Thanks for raising that,’ ‘Right, well done,’ or the like.” Such information is helpful to the student as well as to the instructors who will be rating the performance. See the section below on Primary Trait Analysis for an example of one method for doing this.
It is important to give the same test or kind of test to each student. When possible, it is best to arrange uniform conditions surrounding a performance testing situation. Students can be given the same materials to work with, or the same task. Often, however, particularly in professional practice situations, it is hard to control the context of a performance testing situation. One nursing student may be evaluated while dealing with an especially troublesome patient while another will be working with a helpful patient. In these situations, documenting and allowing for the contextual influences on the performance are extremely important parts of the evaluation, but the evaluator will be called upon to exercise informed, professional judgment.
In summary, the effectiveness of a given performance assessment is directly related to how appropriate it is, given the course objectives; how clearly the tasks are defined; how well the criteria for successful performance have been identified and conveyed; and how uniform the testing is for all students involved. The section on grading later in this chapter contains a discussion on grading students in a performance situation.
Providing feedback to students and collecting feedback from them are important and informative elements of teaching well. Because of the difficulties and emotions involved, most instructors are anxious about providing honest feedback, particularly when it is negative. Yet there are many mutual benefits connected with feedback. Students benefit through getting feedback on how well they are learning. Teachers benefit from course feedback through learning how well their actions are facilitating learning and what changes or additional approaches they might use to further student understanding. Institutions benefit from feedback through obtaining information on how well overall goals for students are being achieved.
Assessment activities that result in feedback range from simple classroom assessments to institutional efforts to measure change across particular program areas. In this chapter, the focus is on efforts by the instructor. The results of two main kinds of assessment activities are formative feedback, which is designed to provide diagnostic information to learners and their teachers, and summative feedback, which is used to determine final grades or other summary information. While many assessment activities provide both, it is important to be clear with students in advance about the purpose of an activity.
Students often complain that the basis for their evaluation is unclear to them. Students’ ability to “guess” what topics will be presented as a part of their evaluation and in what form is hardly indicative of their mastery of course content. Additionally, questions employed for evaluative purposes should be of the same nature and scope as day-to-day class activities and assignments. This is not to say that the evaluation must be a “regurgitation” of classwork and readings but rather that it should be within the same general framework. Not-for-grade trial tests, given early in the quarter, can be useful tools both to alert the instructor as to the students’ abilities and to provide the students with an understanding of the method of evaluation that will be used.
Formative feedback can be an important part of the ongoing teaching process. Since feedback is essential to learning, frequent diagnostic assessment enhances the learning that will take place in a given course. Many instructors build frequent opportunities for self-checks into their teaching: they punctuate lectures with questions that call upon students to demonstrate their understanding of the topic at hand, or they ask students to solve problems after watching a demonstration or to show how they are interpreting the information they are receiving. Instructors may ask students to keep journals, to demonstrate a particular technique, or to relay their understanding to a fellow student. In most of these cases, the assessment activity is not graded and serves solely to help learners understand how well they are doing and to help the teacher know how to proceed.
Bergquist and Phillips (1975) have identified characteristics of feedback for formative purposes:
Constructive feedback communicates caring and honesty. It avoids false praise and global blame.
The two most common ways of evaluating student writing are analytic and holistic scoring.
The analytic approach to grading considers writing to be made up of various features, such as creativity, grammar, succinct expression of concepts, and punctuation, each of which is to be scored separately. An analytic writing score is made up of a sum of the separate scores and is often a weighted sum developed after multiplying each score by numbers representing the relative importance of the features the instructor wishes to emphasize. A recent development of this approach is called primary trait analysis. This method is similar to other types of analytic evaluation in that it scores the important traits of the work separately, but it then aggregates the data to assess the learning of the entire group as well as each individual student.
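For instructors who keep scores in a spreadsheet or a short script, the weighted sum described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 1-4 scale, the weights, and the scores for the sample essay are all hypothetical and are not taken from any particular rubric.

```python
def analytic_score(scores, weights):
    """Return the weighted sum of per-feature scores.

    Both arguments are dicts keyed by feature name. The feature names,
    scale, and weights used below are hypothetical.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score recorded for: {sorted(missing)}")
    return sum(weights[feature] * scores[feature] for feature in weights)


# Hypothetical weights: this instructor values succinct expression most.
weights = {"creativity": 2, "grammar": 1, "succinct expression": 3, "punctuation": 1}
# Hypothetical scores for one essay, each on a 1-4 scale.
essay = {"creativity": 3, "grammar": 4, "succinct expression": 2, "punctuation": 4}

total = analytic_score(essay, weights)
maximum = analytic_score({feature: 4 for feature in weights}, weights)
print(f"{total} out of {maximum}")
```

Because the weights are explicit, students can see in advance which features count most, and the instructor can change the emphasis of an assignment without changing the scoring scale.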
Holistic scores are arrived at by comparing individual student essays to model essays or descriptive rubrics representing good, fair, and poor responses to the assignment.
A third variation is a type of global scoring, which assumes that writing is the sum of various features, but assigns the final score without the use of a scale. This method, which is most frequently used in casual approaches to grading writing, tends to result in less precise evaluation and less concrete feedback for the student.
Analytic scoring is the traditional approach to grading writing. Instructors who use analytic scoring view writing as a demonstration of many isolated skills which when graded separately and added together will come up with an appropriate evaluation of the piece. Many instructors choose to use analytic scoring because of its strengths, some of which are as follows:
Some of the weaknesses of analytic scoring are as follows:
The following guidelines may be useful to maximize the effectiveness of analytic scoring:

One method of analytic scoring is Primary Trait Analysis (PTA). The primary traits of a piece of student work are the instructor-defined criteria which are necessary for the successful completion of the task. Unlike holistic scoring, which lumps these together, looks at them simultaneously, and then assigns one grade, PTA scores each trait separately. This allows an instructor to systematically compare student performance for each trait as well as to assign grades to individual students. Instructors can then identify strengths and problems in their preparation of students for their task, in test or activity design, and in student performance. PTA is a way to take what we already do—record grades—and translate that process into both a very effective strategy for responding to student work and an assessment device (Walvoord & McCarthy, 1990).
Advantages of PTA for assessment include (1) using information that is already available, (2) bringing to consciousness the mostly subconscious processes that go into recording grades, and (3) looking at performance strengths and weaknesses in individual pieces of an assignment, course, or curriculum. This section illustrates how PTA works.
Each teaching professor has a view of what he or she wants students to accomplish. The view, even if it is an unconscious one, pictures ideal student achievements at the end of a particular class, a unit of instruction, or an entire curriculum. At the end of an assignment or course, students who achieve the goals and “look like” the ideal tend to get A’s; those who look a bit less like the ideal get B’s, and so on. Because students (and professors) are not perfect, achievement of goals is usually uneven. Students may excel in one area and be merely adequate in another. Nevertheless, most instructors record a single, holistic grade that tends to sum the student’s performance and provide an overall judgment of merit. Primary Trait Analysis (PTA) does not yield a single, holistic grade. Instead, it reveals parts.
The example below outlines the process of scoring a paper using PTA.
Course objective: Students are required to “understand” scientific conclusion and process. The scientific paper is a visible representation of scientific understanding.
Primary traits: The components of the assignment are recognized as primary traits (essential or central components of the discipline) to be learned by the student. These could include:
The instructor constructs rubrics representing the level of achievement for each primary trait. For example:
Introduction--history:
Introduction--hypothesis:
Materials and Methods--procedures:
In the example, the instructor reads Papers #1, 2, 3,… but assigns point values to various parts according to rubrics. By adding points horizontally, the instructor arrives at point values for the two primary traits found in the Introduction (18 and 11), the two primary traits found in the Materials and Methods section (16 and 20), and so on as shown in the right hand column labeled Assessment.
The crucial point is that if the instructor compares grades only, he or she would be unlikely to uncover the following penetrating insight: regardless of their grades, students are having difficulty learning how to phrase or interpret a scientific hypothesis (Intro. IB). By comparing assessments of primary traits, instructors have integrated assessment information to make their curriculum visible.
In addition, PTA can be used to objectively compare multiple sections of the same class.
Primary trait | Paper #1 | Paper #2 | Paper #3 | Paper #4 | Paper #5 | Paper #6 | Assessment |
Intro. IA | 3 | 4 | 3 | 3 | 3 | 2 | 18 |
Intro. IB | 1 | 3 | 2 | 2 | 2 | 1 | 11 |
M&M IIA | 2 | 3 | 2 | 3 | 3 | 3 | 16 |
M&M IIB | 3 | 4 | 3 | 4 | 3 | 3 | 20 |
Etc. | 3 | 3 | 3 | 4 | 3 | 2 | 18 |
Grade | 12 | 17 | 13 | 16 | 14 | 11 | |
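The two directions of addition in the PTA example can be made concrete with a short Python sketch. The rubric points are copied from the table in this section; the variable names are illustrative only, and a spreadsheet would serve equally well.

```python
# Rubric points from the PTA example: one row per primary trait,
# one column per paper (Papers #1 through #6).
points = {
    "Intro. IA": [3, 4, 3, 3, 3, 2],
    "Intro. IB": [1, 3, 2, 2, 2, 1],
    "M&M IIA": [2, 3, 2, 3, 3, 3],
    "M&M IIB": [3, 4, 3, 4, 3, 3],
    "Etc.": [3, 3, 3, 4, 3, 2],
}

# Adding across a row assesses how the whole group did on one trait.
assessment = {trait: sum(row) for trait, row in points.items()}

# Adding down a column gives each individual paper's grade.
grades = [sum(column) for column in zip(*points.values())]

# The trait with the lowest group total is where the class struggled most.
weakest = min(assessment, key=assessment.get)

print(assessment)
print(grades)
print(weakest)
```

The row totals reproduce the Assessment column (18, 11, 16, 20, 18), the column totals reproduce the Grade row, and the lowest row total points to Intro. IB, the hypothesis trait discussed above.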
A few notes about Primary Trait Analysis
Holistic evaluation of writing has become more common than any form of analytic scoring in some disciplines. This form of evaluation requires that the instructor develop a set of criteria that describe the whole essay at various levels of success. These criteria are based on the desired outcomes and should be couched in language appropriate to the discipline. The example below demonstrates how criteria developed for one field can be adapted for another.
English Composition |
Evolutionary Biology |
A: Perhaps the most noticeable characteristic of the “A” level paper is rich content—the quality of the information imparted leaves us feeling that we have learned or experienced something of true value. Its compelling content is accompanied by technical expertise, as well. The “A” paper displays evidence of sophisticated critical thinking and offers special insight into the topic discussed. It is also marked by stylistic finesse: the opening section is engaging, the language is precise, fresh, and highly descriptive; the thesis is clear and exciting; the sentence structure is varied; and the tone has a presence that enhances the purpose of the paper. It leaves the reader satisfied, intrigued, and eager to reread it, for a second reading promises to provide new insight and depth. |
A: An “A” level paper is truly outstanding. It stands out from B papers by its sophisticated treatment of the topic and clear presentation. The “A” paper displays extensive evidence of critical thinking and offers special insight into the topic discussed. In addition to careful and thorough data analysis, the “A” paper synthesizes and applies the information and concepts presented with background material. When references are cited, they are integrated into the paper in a meaningful way which enhances understanding of the subject. The opening section of the “A” paper is engaging and relevant; the language is highly descriptive yet concise; the ideas are clear and exciting. It leaves the reader satisfied and with a better understanding of the material. |
B: A “B” level paper is significantly above average and is noteworthy not only for its being almost free of surface errors, but for its competence in delivering substantial information and a strong perspective. Its content goes beyond the obvious; it is logically ordered, well developed, and unified around a clear organizing principle that is apparent early in the paper. Transitions are for the most part smooth, and sentence structures are pleasingly varied. The diction of the “B” paper makes a memorable and pleasant reading experience, offering high quality substance and few distractions. |
B: A “B” level paper is significantly above average. Its content goes beyond the obvious; it is logically ordered, well developed, and clearly presented. Data is analyzed and used to support conclusions. The writing of the “B” paper is typically more concise than that in the “C” paper; it does not contain irrelevant material. The wording is the author’s own and goes beyond a rearrangement of the words in the lab manual or text. On the whole, a “B” paper demonstrates an understanding and application of the major concepts of the lab exercise. |
C: A “C” paper is generally competent and meets the assignment adequately. It has few mechanical errors and is reasonably well organized. Development can be thin at places, or the information and perspectives it offers can be superficial or commonplace. Often, readers are left with this impression because ideas are vague or general. Stylistically, a “C” paper has other shortcomings as well: the opening can be flat, doing little to draw the reader into the piece; the conclusion may be vague or inconclusive; the sentences, besides being a bit choppy, tend to fall into predictable (and hence monotonous) and redundant structures. Tone can be marred by occasional repetitions and imprecisions. The “C” paper gets the job done, but it lacks both imagination and intellectual rigor, and hence does not invite a rereading. |
C: A “C” paper is generally competent and meets the assignment adequately. It has few mechanical errors and is reasonably well organized. Often, readers are left with the impression that ideas are vague or general. The discussion and analysis may contain flaws. The conclusion may be vague or inconclusive and may not follow logically from the data. The paper may not hold together as a coherent whole. The organization of the paper may not facilitate understanding. The “C” paper gets the job done, but it lacks both depth of understanding and intellectual rigor. |
D: This paper’s treatment of the subject is rudimentary, or it reveals confusion about the issues at hand. While organization is present, it is neither clear nor effective. Sentences are frequently awkward, ambiguous, and marred by serious mechanical errors. Evidence of careful proofreading is lacking. The whole piece may give the impression of having been conceived and written in haste. |
D: This paper’s treatment of the subject is rudimentary, or it reveals confusion about the issues at hand. While organization is present, it is neither clear nor effective. Major sections of the assignment are missing. The whole piece may give the impression of having been conceived and written in haste. |
E: This score applies to a paper that is completely off track or has few redeeming qualities. Its theme lacks discernible organization; its prose is garbled or stylistically primitive. Mechanical errors are frequent. Ideas, organization and style all fall far below what is acceptable college writing. |
E: This score applies to a paper that is completely off track or has few redeeming qualities. It lacks discernible organization. Mechanical errors are frequent and a substantial part of the required work is missing. Ideas, organization and style all fall far below what is acceptable college writing. |
From Berkelhamer and Miller, University of California at Irvine (1993).
It is often useful to share these criteria with students as part of the assignment. Students find it much easier to “hit a target they can see.”
In addition to allowing an individual instructor to score essays as a whole, rather than piece by piece, holistic grading can allow groups of evaluators to grade large numbers of essays. Model student essays that exhibit high, average, or low achievement are used as the standards by which several graders evaluate a group of essays. Each evaluator reads the student paper quickly and determines whether it is stronger or weaker than its closest equivalent among model essays. This process of multi-reader holistic assessment is used with the English Test, the placement exam that determines which first writing class Ohio State students take.
As with analytic scoring, it is important that students are made aware of the method of evaluation and criteria in advance of their writing.
Holistic scoring has two advantages over other scoring methods:
The Center for the Study and Teaching of Writing has additional resources on holistic grading.
Peer review. Peers can provide useful suggestions on their classmates’ papers before they turn in the final draft. However, it is important to help students learn what to look for. Old essays (with authors’ names deleted) on which common problems have been noted can be provided as examples of representative grading practice.
Multiple drafts. More than one draft of a single paper may be useful for learning. Requiring students to resubmit encourages them to work through problems before submitting the final draft.
Instructor comments. How instructors comment can be as important as what they comment on. Writing specialists prefer comments on content problems phrased as questions, e.g., rather than writing “Confusing” in the margins, one might say, “I was with you until you began discussing ‘active learning.’ What do you mean by ‘active learning’? Why is ‘active learning’ an important point here?” It is best not to use editor’s shorthand when commenting on student papers, e.g., “Awk” for “awkward.” While convenient for the instructor, this type of comment lacks explanatory power for the student. If a passage is awkward or a word choice incorrect, it is more informative to let the student know why. The Center for the Study and Teaching of Writing offers additional guidelines for commenting on student writing.
Error identification. Instructors need not feel as though they must find every error in a student paper. One tool that some writing specialists recommend is putting some mark in the margins next to a line containing a misspelling or other minor error. This places the burden back on the student to discover the error. If the instructor identifies every error, then rather than students learning how to edit their own work, they only learn how to use the services of an editor. (It is important, however, that the students have the technical knowledge to identify the marked errors, or they will become frustrated.)
Completion grade. Not all writing assignments need to be evaluated. For example, instructors who assign journals often “grade” only a small percentage of the journal entries that students have been assigned to write. The rest of the entries are simply marked as completed to make sure that students are keeping up with their work.
In settings such as laboratories, studios, or the field, feedback involves information describing students’ performance. It is a key step in the acquisition of skills, yet feedback is often omitted or handled casually. The importance of feedback in the acquisition of skills follows from the nature of the method. These skills are more easily demonstrated than described. Feedback occurs when students are offered insight into what they actually did, as well as the results of their actions. Insights gained through feedback highlight the difference between the intended result and the actual result, thereby providing motivation for change.
There are many explanations for the problems associated with providing feedback in performance situations. The first and most obvious explanation is the failure to make firsthand observations of students’ performances. Observations are the currency of feedback and without them the process becomes “feedback” in name only. Even if the data are at hand, other factors can confound the feedback process.
Central to many concerns about feedback is that it may have negative effects beyond its positive intent. The capacity of evaluation to elicit an emotional reaction is self-evident. Experiences with feedback that was handled poorly may inhibit giving or receiving feedback in the future. The teacher may be concerned that the student will be hurt by negative feedback, that it will damage the student-teacher relationship or the teacher’s popularity, and that it will result in more harm than good. The student may view feedback as a statement about his or her personal worth or potential. Students may ostensibly want information about their performance but only insofar as it confirms their self-concept.
Such concerns and misconceptions often result in what is called in the field of personnel management “vanishing feedback.” Anxious about the impact of the information on the student, but committed nonetheless to the need for feedback, the well-intentioned teacher may talk around the problem or use such indirect statements as to obfuscate the message entirely. The student, fearing a negative evaluation, supports and reinforces the teacher’s avoidance. The result is that despite the best of intentions, nothing of any real value gets transmitted or received. Even worse, concerns about the impact of feedback may lead to little or no feedback during the course, precisely when the students have the opportunity to improve their performance.
A way to approach the problem of giving feedback when performance is at issue is to construct rubrics for evaluating tasks. Rubrics are checklists or grading charts that can be developed collaboratively so that students are involved in distinguishing criteria that characterize good work and in assigning weights to the criteria. An example of a rubric to assess problem solving is provided by Woods (1994). The items on his checklist identify 12 criteria that he sets forth as characteristic of effective problem solving. Next to each criterion is a narrative description of possible student performance. The rater lists (+) and (-) signs by each item to provide feedback to the student. The first three items on this checklist are shown below:
Attribute | Description | Assessment |
Awareness | + can describe process, can distinguish “exercise solving” from “problem solving” | |
Variety of problem-solving skills | + can apply a variety of methods and hints | |
Emphasis on accuracy | + checks, double checks, rechecks; concern for accuracy | |
Items from Woods’ (1994) Checklist for Assessing Problem Solving
Although creating effective feedback methods for performance areas and for such process skills as critical thinking, oral communication, and the like requires creativity and careful thinking, the act of devising feedback methods forces the instructor and student alike to focus on characteristics of success, which aids the student’s self-awareness of what is needed and helps the instructor be clear about what is desirable.
At given intervals during a course and at the completion of a course, instructors often decide or are required to assign grades to students. Many instructors indicate that grading is the most difficult and anxiety-producing part of teaching. Although many construct systems designed to ensure fairness, grades are inevitably subject to value decisions and the relative framework within which knowledge is generated and assessed. Milton, Pollio, and Eison (1986) provide a frank discussion of the difficulties entailed in assigning grades. Despite the limitations of grades, students, instructors, and prospective employers and educators tend to use them to make decisions. McKeachie (1994) identifies some purposes for which grades are used by these interested groups: students want to be able to use grades to assist them in making decisions about possible majors and careers; instructors advising students use grades to judge whether their students have the motivation, skill, knowledge, and ability to do well in advanced courses; prospective employers and educators want to use grades to tell whether a student is qualified for employment or further education and how well the student will do in his or her future work.
Many different types of summative evaluation may be effective depending on the design of specific course materials and goals. However, good grading methods are characterized by the following attributes.
It is of paramount importance that the method of evaluation employed be able to accurately measure the skill or knowledge that it seeks to measure, that it be valid. It is also important that evaluations exhibit what is known as face validity. Face validity means that elements of the evaluation appear to be related to stated course objectives. It is a common student complaint that they could not perceive the connection between the evaluation and course objectives. It is therefore necessary not only that the instructor be able to make a connection between the evaluation and the course, but that the student be able to do so as well.
In addition to face validity, evaluations must have content validity. The format of an evaluation must conform closely to the course objectives that it seeks to evaluate. If a course objective states that students will be able to apply theories of practice to case studies, then an evaluation should provide them with appropriate cases to demonstrate this ability.
Finally, effective methods of evaluation have certain predictive characteristics. A student who performs well on an evaluation concerning a certain skill might be expected to perform well on similar evaluations on related skills. Additionally, that student might be expected to score consistently when evaluated in the future.
The concept of reliability is closely related to (and often confused with) validity. A reliable method of evaluation will produce similar results (within certain limitations) for the same student across time and circumstances. While it is understood that performances will vary, the goal is to eliminate as many sources of error as possible. Svinicki (n.d.) notes three major sources of error in reliably evaluating students:
There are many methods of grading. They are all based on human judgment, although it is easy to forget this, especially when the method relies on numbers. Numeric methods are not necessarily more “objective” than those that rely on written comments or holistic approaches. Instructors find that thinking through their grading philosophy and purposes before developing a scheme is a very important step. Before selecting a grading method, it is also advisable to check if there are any relevant course or departmental policies.
There are three basic types of grading systems: criterion-referenced or absolute systems, norm-referenced or relative systems, and hybrid (combination of criterion- and norm-referenced) systems. Simply stated, norm-referenced systems (often referred to as “grading on a curve”) evaluate students’ performance in relation to one another and rest on the underlying assumption that relative levels of student ability do not vary much from quarter to quarter, and that student achievement is evenly distributed. When using norm-referenced systems, however, there is a danger that the instructor will inappropriately use the grading curve to compensate for poorly constructed tests.
Criterion-referenced systems, on the other hand, apply an absolute scale against which individual student performances are measured. The setting up of such a grading scale ideally requires some knowledge of the levels of student ability likely to be present in the class. With the criterion-referenced system, it is theoretically possible for all students to receive an “A” or for everyone to fail the course. Hybrid systems, probably the most common grading schemes used at Ohio State, contain aspects of both systems. A few examples of each system and their implications follow:
The Simple Curve. In this system the instructor determines beforehand that a certain percentage of students will receive A’s and a similar percentage will receive E’s. The same holds for B’s and D’s. The remainder receive C’s. Cut-offs are based on the number of students in the class and are figured by counting down the distribution of grades until that number is reached. Since this system involves nothing more sophisticated than counting and division, it is easy to use. However, when students know that only a fixed percentage of them can achieve A’s, they often feel a sense of competition with each other. If you intend to do any sort of collaborative or cooperative group work, this form of grading can undermine your ability to get students to work together.
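The counting procedure behind the simple curve can be sketched in Python as follows. The shares (10% A's and E's, 20% B's and D's), the student labels, and the scores are hypothetical; the sketch also assumes that tied scores are separated by the order in which they were entered, which an instructor would want to handle more carefully in practice.

```python
def simple_curve(scores, top_share=0.10, next_share=0.20):
    """Assign letter grades by rank in the class.

    top_share of the class receives A and the same share receives E;
    next_share receives B and the same share receives D; the remainder
    receive C. The default shares are hypothetical.
    """
    n = len(scores)
    n_top = round(n * top_share)
    n_next = round(n * next_share)
    # Rank students from highest to lowest score (ties keep entry order).
    ranked = sorted(scores, key=scores.get, reverse=True)
    grades = {}
    for position, student in enumerate(ranked):
        if position < n_top:
            grades[student] = "A"
        elif position < n_top + n_next:
            grades[student] = "B"
        elif position >= n - n_top:
            grades[student] = "E"
        elif position >= n - n_top - n_next:
            grades[student] = "D"
        else:
            grades[student] = "C"
    return grades


# Hypothetical class of ten students.
scores = {"s01": 95, "s02": 91, "s03": 88, "s04": 84, "s05": 80,
          "s06": 77, "s07": 73, "s08": 70, "s09": 64, "s10": 55}
grades = simple_curve(scores)
print(grades)
```

Note that the student with 70 points receives a D here only because of rank: the grade depends on how classmates performed, not on the score itself, which is the source of the competition described above.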
The Normalized Curve. This is a more complex system in which the actual score a student earns is converted into what is called a standard score based on the class average and the distribution of the scores. Then, using standard tables, the instructor converts these standard scores into percentiles based on a normal curve.
The Office of the Registrar’s Test Administration and Scanning Services provides this information on all machine scored tests scanned by their office. The student’s score is reported as being in the 90th percentile or the 50th, with some predetermined percentiles representing each of the letter grades. Percentile scores have some real advantages when it comes to comparing grades from a wide range of activities, but their computation and interpretation can be confusing. This method also has the same issues with competitiveness that the simple curve does.
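For readers curious about the arithmetic behind a normalized curve, the conversion from raw score to standard score to percentile can be sketched with Python's standard library. This is an illustration only, not a description of how any scanning service computes its reports. It assumes the class itself is the reference group (so the population standard deviation is used), and the scores are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def normal_percentiles(scores):
    """Convert raw scores to standard (z) scores, then to percentiles
    under a normal curve."""
    average = mean(scores)
    spread = pstdev(scores)  # population standard deviation of the class
    if spread == 0:
        raise ValueError("all scores are identical; standard scores are undefined")
    standard = [(score - average) / spread for score in scores]
    # NormalDist().cdf plays the role of the printed standard table.
    return [100 * NormalDist().cdf(z) for z in standard]


# Hypothetical raw scores.
raw = [60, 70, 80, 90, 100]
percentiles = normal_percentiles(raw)
for score, percentile in zip(raw, percentiles):
    print(score, round(percentile, 1))
```

A score equal to the class average always lands at the 50th percentile, whatever the raw value; the instructor still has to decide which percentiles correspond to which letter grades.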
Percentage of Total Points Possible. In this system, there are a fixed number of points available to be earned. Earning 90% (or some arbitrary percent) of those points will result in an A, while 80% will result in a B and so on. Students are evaluated against a preset criterion, hence the name, and not against their peers. It does not matter how many students reach a given level. Everyone can earn an A or an E. This avoids the issue of placing students in competition with each other, but requires that the instructor have a very clear idea of the level of achievement students are likely to reach in advance, so as to be able to set appropriate grade levels.
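A percentage-of-total-points scheme reduces to a single lookup, sketched below in Python. The 90% and 80% cutoffs come from the description above; the 70% and 60% cutoffs, and the 500-point total, are assumptions added to complete the example.

```python
def percent_grade(points_earned, points_possible, cutoffs=None):
    """Return a letter grade from a fixed percentage scale."""
    if cutoffs is None:
        # 90 and 80 follow the text; 70 and 60 are assumed for illustration.
        cutoffs = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]
    percent = 100 * points_earned / points_possible
    for minimum, letter in cutoffs:
        if percent >= minimum:
            return letter
    return "E"


# Hypothetical course with 500 points available.
print(percent_grade(450, 500))
print(percent_grade(449, 500))
print(percent_grade(299, 500))
```

Each student's grade depends only on that student's own points, so every call is independent of the rest of the class.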
Mastery, or Satisfactory/Unsatisfactory. In this case, there is only one preset level of achievement, usually based on a set of specific objectives that must be passed. If these are passed, the student moves on; if not, the student must repeat the evaluation or fail the course. Sometimes the specific requirements for the assessment of mastery refer to a given percent of the total number of skills rather than to the achievement of all given skills.
The mastery approach assigns a basic satisfactory/unsatisfactory grade to students based on their achievement of specified goals. In a mastery system, students are ordinarily allowed to take different amounts of time to accomplish a goal and to repeat tests or assignments without penalty until they achieve the desired outcome.
The advantages of this system are that the grades are meaningfully tied to the performance level, that students may achieve goals faster when they know what they are, that the focus is on success rather than failure, that student performance anxiety may be lowered, and that the system supports cooperation and may raise morale.
Disadvantages include its time consuming nature, the limit of freedom placed on teachers, and the possibility of strict prescription of means to achieve mastery. It may also discourage students from setting and meeting their own goals, and if used in a program where the whole faculty sets up performance criteria, it has the disadvantages inherent in committees.
Contract System. A contract system of grading involves the development of a written contract between the student and the instructor that specifies precisely what will be required to achieve any given grade. The course syllabus is a good place to communicate this possibility.
Advantages of grading contracts include reducing anxieties since the student knows what is expected, minimizing the role of personal judgment in grading, and encouraging student-set goals.
The disadvantages of this system are the potential for overemphasis on quantity, possible difficulty in measuring diverse student activity, and that ambiguity may exist in qualitative distinctions between grades.
Percent of Maximum Obtained. This system uses a predetermined set of cut-off percentages for each grade as in a criterion-referenced system, but bases the actual grades on the highest score achieved by a student in the class. This latter characteristic makes the grades somewhat comparative as in a norm-referenced system. The class performance plays a role in determining what is needed for each grade, but the number of students who can earn each grade is not restricted as in the norm-referenced systems. Except on the broadest level the students are not in competition with one another. This system gives neither absolute nor relative performance information, but it is easy to compute and easy for students to understand.
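The percent-of-maximum calculation can be sketched in Python as follows. The cutoff percentages, student names, and scores are hypothetical; the only feature taken from the description above is that each score is divided by the highest score in the class rather than by the points possible.

```python
def percent_of_max_grades(scores, cutoffs=((90, "A"), (80, "B"), (70, "C"), (60, "D"))):
    """Grade each student on a fixed percentage scale measured against
    the highest score actually earned in the class."""
    best = max(scores.values())
    if best <= 0:
        raise ValueError("the highest score must be positive")
    grades = {}
    for student, score in scores.items():
        percent = 100 * score / best
        # Take the first cutoff the student meets; otherwise assign E.
        grades[student] = next(
            (letter for minimum, letter in cutoffs if percent >= minimum), "E"
        )
    return grades


# Hypothetical scores on a 200-point course; the top score is 172.
scores = {"Ana": 172, "Ben": 160, "Cy": 139, "Di": 120}
grades = percent_of_max_grades(scores)
print(grades)
```

In this example Ben earns an A with 160 points because 160 is about 93% of the top score of 172, although it is only 80% of the 200 points possible.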
Gap System. This could be labeled the “interocular” system since it involves laying out the score distribution and looking for gaps in the distribution. Sometimes, the distribution of student scores clusters in such a way that obvious breaks show where the cutoff scores for the various grades should be.
One advantage of this system is that the instructor has a practical reason for setting the grade cut-offs where they are. The idea is to identify real differences in performance that will then be reflected in the grades. Under this system, “A” performance really appears to be different from “B” performance because the two groups of students have a gap separating them. All other systems are based on more or less arbitrary cut-offs, even though they may have a sound statistical basis. Like norm-referenced systems, the gap system gives us relative but not absolute performance information. It is also easy to compute and explain.
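Although the gap system is usually applied by eye, the same idea can be sketched in code for a larger class. The example below is a rough illustration under stated assumptions: the function names are hypothetical, the number of grade bands is chosen by the instructor, and the widest gaps in the sorted scores are simply taken as the breaks. Real score distributions may not show gaps this clean.

```python
def gap_cutoffs(scores, n_cuts):
    """Return the scores sitting just above the n_cuts widest gaps, highest first."""
    ordered = sorted(set(scores), reverse=True)
    if n_cuts >= len(ordered):
        raise ValueError("not enough distinct scores for that many cut-offs")
    # Pair each gap width with the score immediately above the gap.
    gaps = [(ordered[i] - ordered[i + 1], ordered[i])
            for i in range(len(ordered) - 1)]
    widest = sorted(gaps, reverse=True)[:n_cuts]
    return sorted((score for _, score in widest), reverse=True)


def assign_by_gaps(scores, letters=("A", "B", "C")):
    """Grade each score by counting how many cut-offs it falls below."""
    cutoffs = gap_cutoffs(scores, len(letters) - 1)
    return [letters[sum(score < cut for cut in cutoffs)] for score in scores]


# Hypothetical scores that cluster into three visible groups.
scores = [95, 93, 92, 84, 83, 81, 80, 70, 68, 67]
print(gap_cutoffs(scores, 2))   # lowest score of each upper cluster
print(assign_by_gaps(scores))
```

In this made-up distribution the two widest gaps fall between 92 and 84 and between 80 and 70, so the cut-offs land at the bottom of each upper cluster.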
Instructors can use student self-evaluation to determine part or all of the course grade. A variety of formats can be used. The significant difference in this form of grading is that the source of the evaluation is the student.
Self-evaluation can be a learning experience for the student, one that encourages them to take responsibility for their own learning. When properly coached in how to do self-assessment, students are usually fair, objective, and demanding of themselves. However, this method can be taken less seriously as the novelty wears off and is subject to abuse if students are not taught to be introspective, or if they are under extreme pressure for grades.
Publish, explain, and use explicit criteria in advance. Students are much more likely to accept grades based on criteria that they are made aware of in advance and that they understand.
Document decisions as carefully as possible and be consistent. Keeping gradebooks secure and retaining them for some time after the course is over are also required by some departments. Instructors are advised to check with their departments for specific schedules concerning the maintenance of these records. Some instructors also protect themselves by keeping lines of communication open and taking the opportunity to prevent cheating when possible by making it hard to copy answers during exams, making it difficult to change corrections on returned papers, being careful to check off completed assignments, etc.
How can instructors handle cheating when they think someone else wrote a student’s paper?
First of all, be proactive. Include an academic honesty policy in your syllabus and explain it to students. Also, design assignments that do not invite cheating. Ask for initial parts of the assignment (e.g., topic proposal, outline, annotated bibliography, drafts) as students work. This requires that students do their own work. If the same term paper has been assigned in Psychology 100 for five years in a row, some “oldies” with new names are likely to surface. Also, if a topic is very broad (a paper on anything in the field), it is easy to find something to submit that may not be original or intended for that course. Change and focus topics to avoid such misconduct.
If instructors do not have any out-and-out proof of plagiarism or ghostwriting, they are quite limited as to what they can do legally. They can talk with the student in question and ask how he or she decided on the topic or found the references to determine if the suspicions are warranted, but unless this provokes a confession or demonstrates a patent lack of understanding of the topic, it is hard to take further action. Simply letting the student know that the instructor pays close attention, however, may encourage the submission of original work in the future. Some instructors attempt to avoid this situation through careful attention to assignment design and regular change of topics. Instructors with specific questions may contact the Committee on Academic Misconduct.
There are a variety of resources on the World Wide Web that let you track down suspected text (such as http://www.findsame.com or http://www.plagiarism.org). Ohio State has also purchased a license for Turnitin.com.
How can instructors grade on attitude, attendance, or participation?
In all cases, instructors should specify in advance if they will be considering these factors in the final grade. It is essential to make it clear which behaviors are being targeted and what the expectations of the instructor are. Attendance is not difficult to measure, though it may be hard to document in large classes. Some instructors of such classes send around attendance sheets, while others assign very short response papers. Instructors should define “excused” versus “unexcused” absences and the maximum number of absences allowed. Instructors who choose to grade on attitude or effort will be pressed to justify decisions, so it might help to have specific criteria or tasks that will be related to the grade. Activities that produce the “artifacts” of participation, such as quizzes or in-class assignments based on required readings, may be used to motivate and document student preparation and attendance.
To grade oral participation, some instructors keep a running record of contributions during discussion sections or ask a student to do so. This puts shy or inarticulate students at a disadvantage; in order to avoid this, an instructor might ask for written comments or questions to be submitted, use electronic mail or web discussion groups, or offer other alternative modes for such participation. The two reasons students most often give for not participating in class are that they do not believe that they know the right information or that they fear that they will embarrass themselves by saying it incorrectly. Brief exercises in class can allow them to overcome both issues. If they talk or write about a concept before being asked to speak, they will be more likely to have something tested and prepared to say.
After a test has been administered and graded, a good way to judge how well it differentiates among students, particularly in the case of a limited-choice test, is to perform an item analysis. It is especially important to do this when test items will be reused or when there is sufficient doubt about the test results to consider dropping some items as invalid when computing the final grade. If machine-scannable test forms are used and processed by the Registrar’s Office, the instructor will receive a printout with item analysis results already computed. See below for instructions on how to interpret these statistics or call Faculty & TA Development (292-3644) for a consultation.
If the instructor is scoring the test, standard statistical software packages (such as SPSS or DataDesk) are available for doing item analysis. It is possible to perform an item analysis without a computer, however, especially if the test is short and the class size is small. The information below describes how to compute the two most common item analysis statistics and describes the principles of these as well.
The Difficulty Index of an item tells you the percentage of students who got the item correct.
The Discrimination Index tells how well this item correlates with the entire test: Did the students who did well on the test in general do well on this question?
Follow these steps to compute Difficulty and Discrimination Indexes:

The Discrimination Index for the same item was obtained by first calculating the percent correct for both the upper and lower groups—20% and 90% respectively—then subtracting the percentage for the lower group from that of the upper group (.20 - .90 = -.70). This negative Discrimination Index indicates that the item is probably flawed. Note that the students who scored poorly on the exam as a whole did well on this item and the students who got the top total scores on the exam did poorly—the reverse of what one would expect. A mistake in the answer key or some error in the question that only the more astute students would catch might be the cause. Questions that are inappropriately difficult and those that fail to discriminate effectively should be revised.
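For instructors scoring a test themselves, the two statistics can be computed with a short script. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the upper and lower groups are assumed to hold ten students each (only the 20% and 90% figures come from the example above), and the Difficulty Index is taken over the two groups combined.

```python
def item_analysis(upper_correct, upper_size, lower_correct, lower_size):
    """Return (difficulty, discrimination) for a single test item.

    difficulty:     proportion of students in both groups who answered correctly
    discrimination: proportion correct in the upper group minus the
                    proportion correct in the lower group
    """
    p_upper = upper_correct / upper_size
    p_lower = lower_correct / lower_size
    difficulty = (upper_correct + lower_correct) / (upper_size + lower_size)
    discrimination = p_upper - p_lower
    return difficulty, discrimination


# Assumed group sizes of 10: 2 of the top scorers and 9 of the bottom
# scorers answered the item correctly (20% and 90%).
difficulty, discrimination = item_analysis(2, 10, 9, 10)
print(f"Difficulty Index:     {difficulty:.2f}")
print(f"Discrimination Index: {discrimination:.2f}")
if discrimination < 0:
    print("Negative discrimination: check the answer key and the item wording.")
```

A negative result like this one flags the item for review; it does not by itself say whether the key or the question is at fault.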
Primary Trait Analysis (PTA) is the evaluation method most explicitly structured to offer both an evaluation of individual students’ achievement and a picture of the aggregate performance of the class. However, regardless of the system of evaluation, careful review should be made to determine which learning outcomes are being met by most students and which are not met by more than a few. Such analysis can be very useful in focusing effort to revise those parts of a course that are not leading to the level of student achievement that is desired.
No matter how student achievement is evaluated, it is important for instructors to use this process to determine both how well individual students have learned and how they should modify their courses to better support student learning in the future.