Measuring the Effect of Formative Assessment Techniques in Physics
Robert A. Cohen
Physics
East Stroudsburg University of Pennsylvania
Context
Three techniques, all designed to improve formative assessment, were implemented in selected sections of a physics course (Fundamental Physics I) at East Stroudsburg University (ESU). The course is offered every semester and is part of a two-semester sequence taken by life science majors (e.g., biology, pre-pharmacy, etc.), usually in their junior year.
In the study, eight offerings of the course were examined. These will be referred to as either “test” sections (indicated as T1, T2 and T3) or “control” sections (indicated as C1, C2, C3, C4 and C5). Sections T1, T2 and T3 refer to the sections in which the following three formative assessment techniques were “tested” (all three assessments were used in each of the three test sections):
1. Students were required to buy infrared response pads (eInstruction Corporation) that were used during class to assess student understanding and guide instruction. To build consensus, students were encouraged to talk to their peers and explain the rationale for their answers, particularly for those questions where an initial polling did not reveal a consensus. Additional time was then spent on those areas that produced a lack of consensus or low success. This activity was designed to do the following:
a. Allow students to get a sense of where they stood relative to the rest of the class,
b. Provide the instructor with a sense of how well the students were meeting the lesson objectives, and thus allow the instructor to adapt instruction to student needs, and
c. Encourage students to put into words the reasons for their choices, thus forcing students to examine the extent of their understanding (vs. regurgitation of facts and figures).
2. Students were required to email the instructor before each class with questions about the readings. The lessons were then designed to address those questions. This activity was designed to do the following:
a. Force students to focus on the extent of their understanding while reading the material, and
b. Provide the instructor with insight into the areas in which students were experiencing difficulty, so that the instructor could address those areas during class time. This aspect is similar to the Just-in-Time Teaching (JiTT) method discussed by Novak, Patterson, Gavrin, and Christian (1999).
3. An innovative textbook (Cohen, 2004) was developed that encouraged formative assessment by embedding rhetorical questions and homework questions within the text. In addition, the textbook reordered the sequence of topics to clarify how each new concept addressed weaknesses in previous concepts. Developed in-house, the textbook was printed in black and white (no color pictures) with few examples and no pictures (only line drawings) or ancillary materials. The three test sections used different versions of the textbook, as it was revised after each offering to correct errors and clarify concepts (compare Cohen, 2002, 2003, 2004).
The three test sections were taught each fall by the author of the study. The author holds a tenured position in the physics department, is familiar with formative assessment techniques, and holds Pennsylvania teaching certification in physics. Other than the three assessment techniques discussed above, the instructional techniques utilized were mainly lecture, demonstrations and discussion/questioning. Sections T1 and T2 occasionally used group activities also.
For comparison purposes, “control” sections C1, C2 and C3 were taught by a one-year temporary instructor, using none of the three formative assessment techniques. Rather, instruction was limited to the “traditional” methods of lecture and demonstrations, with an emphasis on engineering applications, and no collaborative or group activities. The instructor had a master’s degree in engineering and Pennsylvania teaching certification in physics, but little prior experience teaching a college-level physics class. A popular commercially-available textbook (Cutnell & Johnson, 2004) was used as the required text.
To examine the influence of the instructor, an additional section, C4, was taught by a tenured instructor who was recognized as an exemplary physics instructor by both students and faculty (as evidenced by peer evaluations and student evaluations). This section was also limited to the “traditional” methods of lecture, demonstrations and questioning, with no collaborative or group activities. Based on peer evaluations, the main instructional difference between section C4 and sections C1-C3 was that the C4 instructor utilized questioning more effectively and incorporated explanations that were more easily understood by the students. Section C4 also used a different textbook (Urone, 2001), a commercially-available textbook that is somewhat less popular than the one by Cutnell and Johnson (2004).
One question raised by the study was whether the innovative textbook, since it lacked many of the fancy features of commercially-available textbooks, might inhibit rather than enhance learning. To test this notion, an additional control section, C5, was taught with the innovative textbook but neither of the other two formative assessment techniques. The instructor of section C4 also taught section C5, using the same instructional techniques utilized in section C4. The eight sections are summarized in Table 1.
Table 1
| Section | Semester | Methodology | Textbook | Instructor |
| C1 | Fall-2003 | Traditional | Cutnell and Johnson | one-year temporary |
| C2 | Spr-2004 | Traditional | Cutnell and Johnson | one-year temporary |
| C3 | Sum-2004 | Traditional | Cutnell and Johnson | one-year temporary |
| C4 | Fall-2002 | Traditional | Urone | tenured, highly regarded |
| C5 | Fall-2004 | Traditional | Cohen | tenured, highly regarded |
| T1 | Fall-2002 | Form. Assess. | Cohen | tenured |
| T2 | Fall-2003 | Form. Assess. | Cohen | tenured |
| T3 | Fall-2004 | Form. Assess. | Cohen | tenured |
Measurement of Student Learning
Student performance was measured by a 17-question multiple-choice survey. Eight questions on the survey were taken from the Force Concept Inventory (FCI) (Hestenes, Wells, & Swackhamer, 1992; Halloun, Hake, Mosca, & Hestenes, 1995), a well-tested instrument for measuring conceptual understanding of forces. Six of the questions were taken directly from the FCI, and two other FCI questions were slightly modified. These eight questions were supplemented by nine of our own questions on graphing, kinematics and vectors. The questions selected cover basic ideas of physics that students are expected to understand but often do not. The survey was also constructed with an eye toward making it as short as possible to simplify survey administration. Each class received the instrument twice, at the beginning of the semester and at the end.
While all sections gave the instrument as a non-graded activity the first day of class, the implementation of the instrument at the end of the course varied from section to section. In most sections, the questions were split up and administered as part of the exams. In some sections, however, the survey was again given as a non-graded activity, albeit near the last day of class. To determine if the method of implementation affected the results, both techniques were used in one of the sections. It was found that while scores on individual questions varied, the overall scores, and the conclusions reached by examining the overall scores, were not significantly dependent on how the instrument was administered.
The actual survey had twenty questions. However, only the first 17 were used for this study. Question 18 was designed to assess whether students understood the difference between laws and theories. Since the distinction is not typically discussed in this type of physics class, it was not used in the comparisons. Questions 19 and 20 diagnose hypothetico-deductive thinking and were selected from the Classroom Test of Scientific Reasoning by Anton E. Lawson at Arizona State University (Lawson, 1978). They were included only to identify whether performance on the survey or in the class was related to performance on this pair of questions (no relationship was observed).
Results
The results of the study are shown in Table 2 and Figure 1:
Table 2
| Section | Semester | N | Average pre-test score | Average post-test score | Change |
| C1 | Fall-2003 | 8 | 31% | 33% | +2% |
| C2 | Spr-2004 | 16 | 32% | 34% | +2% |
| C3 | Sum-2004 | 5 | 26% | 28% | +2% |
| C4 | Fall-2002 | 20 | 29% | 53% | +24% |
| C5 | Fall-2004 | 39 | 28% | 50% | +22% |
| T1 | Fall-2002 | 27 | 33% | 54% | +21% |
| T2 | Fall-2003 | 17 | 29% | 56% | +27% |
| T3 | Fall-2004 | 26 | 31% | 68% | +37% |
Figure 1.

Scores in sections C1, C2 and C3 (taught by the one-year temporary instructor using traditional techniques) remained relatively the same (+2%) for all three sections (pre vs. post), reflecting minimal growth in concept understanding.
In comparison, section C4, taught by the exemplary physics instructor (as evidenced by peer evaluations and student evaluations), experienced modest improvement (24%) in concept understanding (as measured by the instrument), implying that the instructor can make a significant difference in student learning. Section C5, taught by the same exemplary instructor with the same techniques but with the novel textbook (Cohen, 2004) instead of the commercially-available one (Urone, 2001), showed the same modest improvement (22%).
The first two test sections, T1 and T2, experienced about the same modest growth (21% and 27%, respectively) as sections C4 and C5. However, the third test section, T3, outperformed the students in every other section. One test section was taught each fall (see Table 1), and so it is possible that improvements made from year to year contributed to the differences in scores. This is discussed further in the next section.
Because the student population may differ among the sections, it can be difficult to compare pre-post differences among sections. Consequently, an additional parameter was examined. This parameter, called the “normalized gain” or “g” (Hake, 1998), is obtained by dividing the gain (i.e., the difference between the pre- and post-scores) by the maximum possible gain (i.e., the difference between the pre-score and the maximum possible score). This parameter is also known as the “effectiveness index” (Hovland, Lumsdaine, & Sheffield, 1949, pp. 284-289; Lazarsfeld & Rosenberg, 1955, pp. 77-82) and the “gap closing parameter” (Ghery, 1972). Hake (1998) has shown that the normalized gain is a better measure of how effective an instructional methodology is because, unlike the gain, the normalized gain is essentially uncorrelated with the pre-test score. In other words, students who initially perform poorly can show a larger gain simply because a lower initial score leaves more room for improvement. Dividing the gain by the maximum possible gain scales the scores so that the normalized gain tends to be independent of initial performance.
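Written as a formula, the definition above takes the following form. The symbols S_pre and S_post (the average pre- and post-test scores, in percent) are labels introduced here for illustration and do not appear in the original sources.

```latex
g = \frac{S_{\text{post}} - S_{\text{pre}}}{100\% - S_{\text{pre}}}
```

As a purely hypothetical illustration, a class that moves from 30% to 65% has a gain of 35 percentage points out of a possible 70, so g = 35/70 = 0.50.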
An additional challenge was to account for the change in student population between the pre- and post-offerings of the survey. As the semester progresses, some students drop the class or may not be present for the initial or final offerings of the survey. Furthermore, some students may inadvertently neglect to answer a question or two. To account for such changes, the average pre- and post-scores for each question were calculated using only those students who answered that particular item on both the pre- and post-offerings. The normalized gain for each question was then calculated from the average pre- and post-scores for that question, and the average normalized gain for the entire survey (for a particular section) was calculated by summing the normalized gains for the individual questions and dividing by seventeen (the number of items).
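The matching procedure described above might be sketched as follows. This is a minimal illustration, not the code used in the study: the data layout, the function names, and the tiny data set are all hypothetical. The text does not say how a question with a 100% matched pre-score (where the ratio is undefined) was handled, so the sketch simply skips such items, which is an assumption.

```python
from typing import Optional, Sequence

# Hypothetical layout: one row per student, one entry per question.
# True = correct, False = incorrect, None = not answered / survey missed.
Responses = Sequence[Sequence[Optional[bool]]]


def question_gain(pre: Responses, post: Responses, q: int) -> Optional[float]:
    """Normalized gain for question q, using only students who answered
    that question on both the pre- and post-offering."""
    matched = [
        (s_pre[q], s_post[q])
        for s_pre, s_post in zip(pre, post)
        if s_pre[q] is not None and s_post[q] is not None
    ]
    if not matched:
        return None  # nobody answered this item both times
    pre_avg = sum(a for a, _ in matched) / len(matched)
    post_avg = sum(b for _, b in matched) / len(matched)
    if pre_avg == 1.0:
        return None  # assumption: no room to gain, so the item is skipped
    return (post_avg - pre_avg) / (1.0 - pre_avg)


def section_gain(pre: Responses, post: Responses, n_questions: int) -> float:
    """Average of the per-question normalized gains for one section."""
    gains = [question_gain(pre, post, q) for q in range(n_questions)]
    defined = [g for g in gains if g is not None]
    return sum(defined) / len(defined)


# Made-up responses for three students and two questions.
pre = [[True, False], [False, False], [False, None]]
post = [[True, True], [True, False], [None, True]]

g_q0 = question_gain(pre, post, 0)   # student 3 excluded (missed post item)
g_q1 = question_gain(pre, post, 1)   # student 3 excluded (missed pre item)
g_section = section_gain(pre, post, 2)
```

In this made-up data set the third student is dropped from both items because each has a missing response, so every question average rests on the same students before and after.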
Figure 2.

Using this technique, questions that are initially answered correctly (high pre-test score) have a greater influence on the normalized gain than questions that are initially answered incorrectly. For example, suppose the survey has two questions with initial scores of 25% and 75% for an average score of 50%. If the post-test scores are 0% and 100%, the average score is still 50% but the first item experienced a normalized gain of -33% (a decrease of one-third the maximum possible gain on that question) while the second item experienced a normalized gain of +100% (an increase of the entire maximum possible gain) for an average normalized gain of +33%.
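The two-question example above can be checked with a few lines of Python. The function name `normalized_gain` is introduced here for illustration only. The sketch also computes the normalized gain of the averaged scores, to show that averaging per-question gains and taking the gain of the averages need not agree.

```python
def normalized_gain(pre: float, post: float) -> float:
    """Gain divided by the maximum possible gain; scores in percent."""
    return (post - pre) / (100.0 - pre)


pre_scores = [25.0, 75.0]    # initial scores for the two questions
post_scores = [0.0, 100.0]   # final scores for the two questions

# Per-question normalized gains, then their average.
item_gains = [normalized_gain(a, b) for a, b in zip(pre_scores, post_scores)]
average_of_gains = sum(item_gains) / len(item_gains)

# Normalized gain computed from the averaged scores instead.
pre_mean = sum(pre_scores) / len(pre_scores)
post_mean = sum(post_scores) / len(post_scores)
gain_of_averages = normalized_gain(pre_mean, post_mean)

print(item_gains, average_of_gains, gain_of_averages)
```

Because the average score stays at 50%, the gain of the averages is zero, while the average of the per-question gains is about +33%, driven by the question with the high initial score.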
The average normalized gains for each section are shown in Figure 2. The pattern observed previously remains. Control sections C1, C2 and C3, taught by the one-year temporary instructor, continue to show little or no improvement (g = 2%, -16% and 1%), while the test sections T1, T2 and T3 show steady improvement (g = 27%, 35% and 52%), reaching levels beyond those exhibited by control sections C4 and C5 (g = 28% and 29%, respectively), taught by the exemplary instructor (as evidenced by student and peer evaluations). For comparison, in Hake’s (1998) study using the Force Concept Inventory for the pre- and post-test, he found that the normalized gains of “traditional” courses tended to be around 23% (±4% sd) whereas courses utilizing “interactive engagement” had normalized gains around 48% (±14% sd), where he defines “interactive engagement” as “heads-on (always) and hands-on (usually) activities which yield immediate feedback through discussion with peers and/or instructors.”
Discussion
The high performance of section T3 could be a fluke, or it could be due to the implementation of the formative assessment techniques that were used, particularly as the instructor gained proficiency in using them. This second interpretation is supported by the growth from section T1 to section T2, which continued in section T3. The three techniques were not implemented exactly the same way each time. As the use of all three techniques was revised and refined each year, the scores on the evaluation instrument likewise improved.
This interpretation is also supported by the student evaluations of the instructor. Student evaluation scores increased markedly from year to year within the three test sections (T1, T2 and T3). For example, on the 5-point Likert scale used for student evaluations at ESU, student responses to the item, “Overall, I rate this instructor a good teacher,” went from 2.35 to 3.42 to 4.05 during the 3-year study. An in-depth analysis of student reaction to changes in instruction is an issue in itself and is not addressed here. The purpose of mentioning it is to support the contention that within sections T1, T2 and T3, the administration of the techniques improved for each implementation. Overall, student evaluations for section T3 were still lower than those received by the exemplary instructor of section C4, possibly because students in general become frustrated when they receive non-traditional instruction, even when such approaches result in improved conceptual understanding (Meltzer & Manivannan, 1996; Crouch & Mazur, 2001).
Simply switching from a more traditional textbook to the innovative textbook resulted in no change in scores (compare sections C4 and C5, which differed only in the textbook used). On the other hand, upon closer examination of the results, it was found that for those areas stressed by the Cohen textbook, student gains in C5 (with the Cohen textbook) were higher, whereas in areas not stressed by the Cohen textbook or poorly handled by the Cohen textbook, student gains in C5 were lower, leading to no change overall. The Cohen textbook is still under revision and may demonstrate a positive influence in future sections as its presentation in certain areas is strengthened. Conversely, it may be the case that a different approach is necessary to take advantage of the textbook benefits (e.g., an approach that forces students to make better use of the textbook). Regardless, considering that the textbook lacked the color pictures, examples and ancillary materials common to traditional textbooks, it may be significant that it had no detrimental effect on student understanding.
As mentioned previously in the section, Measurement of Student Learning, while many of the questions on the survey were selected from the Force Concept Inventory (Hestenes et al., 1992; Halloun et al., 1995), several others were developed in-house. Due to the in-house development, the reliability and accuracy of the instrument are unknown. Analysis of the results, for example, revealed that two questions utilized misleading language. In one case, ambiguity in wording allowed for two answers to be interpreted as being correct. Since accepting both answers did not change the relative performances of each section, it is felt that the overall conclusions are still valid. However, in the other case, control sections C1-C3 outperformed all other sections. It is unclear whether the ambiguity in wording alone was responsible for the inverted scores.
Conclusions
Based upon this work, it seems we can make the following inferences:
· Traditional physics instruction does not necessarily produce changes in student conceptual understanding (see for example, sections C1-C3 in Table 2).
· Although traditional instruction may limit the extent of student understanding, the extent is influenced by the instructor (compare, for example, sections C1-C3 with section C4 in Table 2).
· Implementing formative assessment can lead to improved student understanding, but it may require several years to master the techniques, a time during which student evaluations may suffer.
References
Cohen, R. A. (2002). The Fundamentals of College Physics: Vol. I (Version 3.0). Unpublished textbook, East Stroudsburg University of PA.
Cohen, R. A. (2003). The Fundamentals of College Physics: Vol. I (Version 4.0). Unpublished textbook, East Stroudsburg University of PA.
Cohen, R. A. (2004). The Fundamentals of College Physics: Vol. I (Version 5.0). Unpublished textbook, East Stroudsburg University of PA.
Crouch, C. H., & Mazur, E. (2001). Peer instruction: Ten years of experience and results. American Journal of Physics, 69, 970-977.
Cutnell, J. D., & Johnson, K. W. (2004). Physics (6th ed.). Hoboken, NJ: John Wiley & Sons, Inc.
Ghery, F. W. (1972). Research papers in economic education. In A. Welch (Ed.), Does mathematics matter? (pp. 142-157). New York: Joint Council on Economic Education.
Hake, R. R. (1998). Interactive-engagement vs. traditional methods: A six-thousand-student survey of mechanics test data for introductory physics courses. American Journal of Physics, 66(1), 64-74.
Halloun, I., Hake, R. R., Mosca, E. P., & Hestenes, D. (1995). Force concept inventory. Retrieved October 3, 2001 from Arizona State University Modeling Instruction Program Web site: http://modeling.la.asu.edu/R&E/Research.html
Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The Physics Teacher, 30, 141-158.
Hovland, C. I., Lumsdaine, A. A., & Sheffield, F. D. (1949). Experiments on mass communication. Princeton, NJ: Princeton University Press.
Lawson, A. E. (1978). Development and validation of the classroom test of formal reasoning. Journal of Research in Science Teaching, 15(1), 11-24.
Lazarsfeld, P. F., & Rosenberg, M. (Eds.). (1955). The language of social research: A reader in the methodology of social research. New York: Free Press.
Meltzer, D. E., & Manivannan, K. (1996). Promoting interactivity in physics lecture classes. The Physics Teacher, 34, 72-76.
Novak, G. M., Patterson, E. T., Gavrin, A. D., & Christian, W. (1999). Just-in-time teaching. Upper Saddle River, NJ: Prentice-Hall.
Urone, P. P. (2001). College physics (2nd ed.). Pacific Grove, CA: Brooks/Cole.
This material is based upon work supported by the National Science Foundation under Grant No. 9986753 and the Pennsylvania State System of Higher Education. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the State System.