The following post summarizes and comments on information contained in the following article:
Butler, A. C. Multiple-Choice Testing in Education: Are the Best Practices for Assessment Also Good for Learning? Journal of Applied Research in Memory and Cognition (2018), https://doi.org/10.1016/j.jarmac.2018.07.002
Multiple-choice (MC) questions are ubiquitous in education. Whether it be a formative assessment to begin a lesson or an end of year exam, MC questions are commonplace. Just last week, my wife and I met with my son’s 1st grade teacher and she discussed how his class would be looking at MC questions and how to properly read, decode, and answer them. From 1st grade through college tests, MC assessments are routine and help students and teachers assess learning in the classroom.
The article points out much research has been devoted to how to best construct MC questions for assessment purposes; “how to improve the reliability and validity of MC tests.” In addition to using MC questions for assessment, these questions can also ‘cause learning’:
“Retrieving information in response to a test question strengthens memory, leading to better retention of that information over time; it can also change the representation of the information in memory, thereby producing deeper understanding. With respect to the MC format in particular, numerous laboratory studies have shown that taking a MC test is beneficial for learning…MC testing has been found to improve retention and transfer on subsequent unit and final exams in middle school, high school, and college courses. In addition, MC testing can enhance the learning of non-tested conceptually related information and restore access to previously acquired knowledge that has become inaccessible.”
As a teacher, I appreciate this delineation. It’s quite easy to see MC questioning only through a lens of assessment and grading, as a means to an end. However, assessment (retrieval practice) should be an integral part of the learning process. This focus on the learning aspect of MC questioning really drew me to the article. As a teacher who relies on MC questions for some daily formative assessment and summative assessment, I was quite intrigued by the chance to use what this research communicates about constructing better questions.
Below are five evidence-informed best practices for the classroom teacher to create more reliable and valid MC questions for both assessment and learning.
Best Practice #1: Avoid Using Complex Item Types or Answering Procedures
Complex multiple-choice (CMC) questions are becoming more popular in education because they appear to more deeply assess learning and require more thought. A typical MC question has one stem and three to five primary responses. A CMC has a stem and primary responses, but then also includes secondary choices that may include options of none (or all) of the above, A and B, B and D, A and B, but not C, etc.
Again, the prevailing thought here is that students will have to engage more with the material to come up with the correct answer and avoid more complex distractors. However, from the viewpoint of MC questions as assessment, CMC should be avoided for these reasons:
- Clueing – CMC items may inadvertently allow students to ‘strategically guess’. If a test-taker can eliminate primary responses, they can then begin to eliminate secondary responses which may also include other primary responses. Due to clueing, CMC items can produce artificially greater performance and reduce reliability when compared to more traditional MC questions.
- Research suggests CMC items are not better at measuring higher-order thinking, which is a main reason these types of items are used.
- CMC items are difficult to write and are usually dropped from the pool of potential assessment questions during the verification phase.
CMC is also not effective as a tool for learning. The most studied complex answering procedure is known as the ‘answer-until-correct’ method. This, as you probably guessed, allows students to continue answering an MC question until they choose the correct answer. The idea here is that the student can continue to engage with the material, even after choosing the wrong answer. In theory, this sounds great, but research has shown the answer-until-correct procedure is no better than traditional MC with immediate feedback. This may be due to simple guessing. If a student first chooses incorrectly, they may lack the knowledge to reason through and intelligently choose a second or third choice as the correct answer.
Best Practice #2: Create Items that Require the Engagement of Specific Cognitive Processes
“When creating a test for the purpose of assessment, each item should tap particular content and engage specific cognitive processes in order to provide broad coverage of the learning objectives across items while minimizing overlap among items.”
There are two methods discussed to think about engagement of a specific cognitive process for assessment:
- A collection of question ‘shells’ or templates that isolate cognitive processes. For example, notice the difference in the cognitive processes required for these two MC question shells:
  - Which best defines X?
  - Which distinguishes X from Y?
  Teachers can then choose among many different shells to find one that best matches their learning objectives.
- Use the verbs of Bloom’s taxonomy (recall, explain, compare, synthesize, etc.) to engage specific cognitive processes which can apply to a learning objective.
From the aspect of using MC questions for learning, it is important to isolate and deliberately create questions which engage specific cognitive processes.
“A question could require a student to retrieve a fact (Which of the following buildings is the tallest in the world?), contrast two concepts (Which of the following is a way in which hawks differ from eagles?), or analyze a set of conditions to make a decision (Given that the patient shows symptoms of X, Y, and Z, which of the following diagnoses is most likely?)”
Careful consideration of MC questions enables students to interact with information using many different cognitive processes. The processing induced by each question should mimic the type of cognitive processing required for future performance (either another assessment or a more practical application).
A common criticism of MC questions is that they only test basic factual information. This is just not true. With proper construction, MC questions can certainly test higher-order thinking. Research has shown these questions can produce learning that improves future performance, and performance on MC and constructed-response tests tends to be highly correlated.
Best Practice #3: Avoid Using “None-of-the-Above” and “All-of-the-Above” as Response Options
Adding all-of-the-above (AOTA) and/or none-of-the-above (NOTA) options to simple MC questions essentially transforms them into CMC questions (best practice #1). As discussed before, CMC items are to be avoided. For assessment, NOTA and AOTA reduce discriminability among items and, specifically for AOTA, increase the likelihood of clueing. Another potential problem with NOTA and AOTA is their ability to create bias when they are only sporadically included as MC question responses. Students tend to choose AOTA and NOTA at higher rates than other responses when they are occasionally presented.
Although there has not been much research on the use of NOTA and/or AOTA for learning, what has been done appears to recommend they be avoided. NOTA specifically appears to be harmful to learning when it is the correct answer and has a negligible effect as an incorrect answer. Ideally, when presented with an MC question, students should retrieve the correct answer. However, it is quite possible when NOTA is the correct answer that students do not have to retrieve the right answer and only have to rule out all of the incorrect answers. If this is the case, the student has not strengthened their learning of the correct answer via retrieval of that answer. When NOTA is an incorrect answer, the chance for learning increases because students will hopefully retrieve the correct answer from memory. Nevertheless, the mere presence of NOTA may create the response biases mentioned above, which may decrease its effectiveness for learning.
AOTA appears to be beneficial for learning when it is the correct answer response. The student must identify and retrieve all of the correct answers, which should be a goal of learning. When AOTA is the incorrect answer, the effects are less certain. For the same reasons mentioned above with NOTA, depending on the frequency of AOTA as a response, students may be biased to select AOTA when it is incorrect and believe several false alternatives are correct.
Best Practice #4: Use Three Plausible Response Options
In a meta-analysis, Rodriguez (2005) concluded that using two incorrect responses and one correct response provides the “best balance between psychometric quality and efficiency of administration.” Essentially, students are able to complete more three-response questions than four-response questions in the same amount of time. Another line of research points to the plausibility of the incorrect answers as playing an important role in the number of responses used. Using only one plausible incorrect response is better than using two low-quality responses. Some ways to create high-quality incorrect responses include using common errors or misconceptions and using true statements that do not correctly answer the given stem.
For the purpose of learning, research also suggests using relatively few alternatives on MC questions, but for a different reason. If students are met with many incorrect answers, they run the risk of incorrectly learning the information. Processing a distractor as the correct answer decreases the chance students will choose the correct response on another MC test. This is called the negative suggestion effect. Selecting a distractor as the correct response can interfere with a student’s ability to retrieve the correct information in the future. Studies have also shown that including more responses decreases the positive effects and increases the negative effects on learning; students retain less correct information and run the risk of acquiring more incorrect information.
Processing distractors can also be beneficial for learning. If a student can read a plausible incorrect response and rule it out, they are more likely “to discriminate the correct information from related incorrect information in memory, thereby improving accessibility and reducing the potential for interference during subsequent retrieval.”
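One rough way to check whether a distractor is doing any work is to tally how often students actually pick each option after a quiz. The sketch below is my own illustration, not something taken from Butler’s article: the function names, the sample data, and the 5% cutoff for flagging a rarely chosen distractor are all assumptions you would want to adjust for your own class.

```python
from collections import Counter
from typing import Dict, List, Sequence


def option_shares(choices: Sequence[str], options: Sequence[str]) -> Dict[str, float]:
    """Share of students who picked each option on a single item."""
    counts = Counter(choices)
    total = len(choices)
    return {opt: counts.get(opt, 0) / total for opt in options}


def weak_distractors(
    choices: Sequence[str],
    options: Sequence[str],
    key: str,
    min_share: float = 0.05,  # assumed cutoff, not a value from the article
) -> List[str]:
    """Incorrect options chosen by fewer than `min_share` of students."""
    shares = option_shares(choices, options)
    return [opt for opt in options if opt != key and shares[opt] < min_share]


# Hypothetical responses from 20 students to one three-option item (key = "A").
picks = ["A"] * 12 + ["B"] * 8
print(option_shares(picks, "ABC"))
print(weak_distractors(picks, "ABC", key="A"))
```

In this made-up example nobody chooses option C, so it gets flagged. That does not prove C is a bad distractor, but it is a prompt to ask whether it is plausible enough to keep.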
Best Practice #5: Create Multiple-Choice Tests That Are Challenging, But Not Too Difficult
In order to measure learning successfully, MC items should discriminate between students who have and have not acquired the knowledge. When this is achieved, the resulting tests are moderately difficult, and students are challenged but able to succeed. Very easy MC questions can undermine the positive effects on learning because students can succeed without thinking about the information or processing it in a deliberate manner. “The key to producing desirable difficulty is that the MC item should engage the test-taker in the type of cognitive processing needed to learn the intended knowledge or skill.” This can be produced by using common errors or misconceptions as distractor choices. Doing so could create a desirable difficulty if the student learns to discriminate between the correct answer choice and the somewhat related incorrect choice.
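For teachers who want to put numbers on “challenging, but not too difficult,” classical item analysis offers two simple measures: item facility (the proportion of students answering correctly) and an upper-lower discrimination index (facility among the top scorers minus facility among the bottom scorers). The sketch below is my own illustration rather than a procedure from the article. The function names and sample scores are hypothetical, and the 27% group size is a common convention that I am assuming here.

```python
from typing import List

Scores = List[List[int]]  # one row per student, one column per item; 1 = correct


def item_facility(scores: Scores, item: int) -> float:
    """Proportion of students who answered the item correctly."""
    return sum(row[item] for row in scores) / len(scores)


def discrimination_index(scores: Scores, item: int, fraction: float = 0.27) -> float:
    """Facility in the top-scoring group minus facility in the bottom-scoring group.

    Students are ranked by total test score. `fraction` sets the size of each
    group (assumed convention). For simplicity the total includes the item itself.
    """
    ranked = sorted(scores, key=sum, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:k], ranked[-k:]
    return item_facility(upper, item) - item_facility(lower, item)


# Hypothetical results: 6 students, 3 items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
]

for i in range(3):
    print(i, round(item_facility(scores, i), 2), discrimination_index(scores, i))
```

An item with facility near 1.0 is one almost everyone gets right, and an index near zero means strong and weak students do about equally well on it. Either is a hint that the item may not be producing much desirable difficulty.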
Bonus Best Practice: Provide Feedback
Feedback after formative assessment is very effective for assessment and learning. It increases the positive effects on learning and greatly reduces the negative effects. Feedback allows students to correct incorrect beliefs, so they avoid learning false information. Feedback is effective whether given immediately or after a delay.
*It is important to note the author comments several times that more research is needed in the area of creating MC questions. While there isn’t a great deal of research on certain aspects of best practices for constructing MC questions, using these five tips can create a more valid and reliable test for assessment and learning.
I’m a bit confused on what to do with distractor answers. Should they be excluded as they may cause students to learn incorrectly? Should they be included, but only minimally so as to have the students discriminate between incorrect and correct responses? I.e., three choices: 1 correct, 1 distractor (common misconception), 1 blatantly incorrect.
Yeah…it’s not crystal clear. My takeaway is that 3 is best, as long as you’ve got 2 plausible distractors. It would be better to only have 1 plausible distractor than include multiple erroneous distractors.
It depends on your level of student and how familiar they should be with the material, i.e., should they recognize and avoid many different distractors, and different types of distractors?
National Medical Board questions are not considered solid unless every single distractor is chosen (at some point) by SOMEONE in the test population, so a distractor (or foil, in other cultures) is only as useful as how often it actually does its job.
There’s so much depth that one can enter into on MCQ writing — this article is a solid beginning for anyone looking to learn how to use the MCQ for good, and to promote learning.
First off, this article is a useful, readable summary of a big kind-of-technical psychology literature. Thank you!
Now two questions:
* What do you think of the answer “Not enough information to answer this question” ? On the one hand, a lot of the criticisms above of NOTA seem to apply. On the other hand, I teach math/stats and I want my students to be able to recognize situations when they cannot actually get an answer.
* Do you have a rule of thumb for difficulty? 90% item facility seems too easy and 10% is clearly too hard, but what’s an appropriate range?
Thank you again!!
Again, having replied to an early comment: “item difficulty” can be thought of collectively, i.e., what is the composition of the TEST? Are 20% of the items in the 0.4-0.6 range (40%-60% get it right) and x in the 0.6-0.8 range? It depends on how all of this stuff is used to inform the summative assessment. The questions can all be cripplingly difficult if the purpose of the MCQ “exam” is formative and meant to stress the need for more study. Linking testing, test grades (and the performance metrics from the test itself), and all of that to a particular student outcome measure is dicey, given that new questions are so often dropped in at the last minute or have not been run past colleagues for some sort of vetting. Great question. There is no single correct answer to it, but there are lots of potentially good ones.
I totally agree that good questions for a formative assessment might be quite different from good questions for a summative assessment. I’ve actually been spending a lot of time over the last year writing and refining questions for concept inventories that are neither formative nor summative assessments. The primary purpose is to give an aggregate measure of what a class has learned over a period of time. They are common in a lot of STEM disciplines in higher ed (e.g., Physics has the Force Concept Inventory) and they enable faculty to try new things in their classes and see if they actually improve learning. Of course no concept inventory captures everything you’re teaching, but it’s got to be better than relying on a subjective feel for how much students have learned.
Another concept I heard described in the Teaching Section of the American Physiological Society meeting this past April was a modified Likert scale that asked about the likelihood of certain things happening as a way to gauge student conceptual framing and understanding of critical concepts. This was an interesting idea that gets at misconceptions rather quickly, and I’ve adopted it in formative assessments.