CAT Isn’t Magic: Why Item Bank Targeting Matters More Than the Algorithm
Even the best selection strategy fails when the bank is skewed. This simulation compares linear, random, and 54321 adaptive forms to show where each succeeds, where they break down, and why content coverage drives measurement quality.
CAT
IRT
Author
Alex Wainwright
Published
November 24, 2025
Developing an item bank is slow, expensive, and failure-prone. You need enough items. You need calibration data. And the parameters you estimate have to be stable enough to trust. Items will fail. Samples are hard to get. Small banks spike exposure rates because candidates keep seeing the same content. Over time, items drift and get easier, especially in fixed-form tests where everyone sees the same material.
This post compares three delivery strategies: linear selection, random selection, and adaptive selection. The question is simple: when the bank is limited or mismatched to the population, does CAT still offer value?
Simulation Setup
Person abilities: Skewed high.
Item bank: 40 items, skewed low. This creates a mismatch: high-ability people, low-difficulty items.
Model: Rasch; no discrimination parameters.
Initial θ estimate: −1, so adaptive testing starts with easier items.
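The post doesn't reproduce the data-generation code in this section, so here is a minimal NumPy sketch of the setup rather than the actual simulation code. The exact distributions, the sample size of 1,000 examinees (implied by the per-item exposure of ~250 reported below), and the seed are assumptions.
import numpy as np

rng = np.random.default_rng(42)

# 1,000 examinees with abilities shifted toward the high end
# (the post says "skewed high"; a shifted normal stands in for the actual shape)
n_examinees = 1_000
abilities = rng.normal(loc=1.0, scale=1.0, size=n_examinees)

# 40 Rasch item difficulties skewed low, so the bank is mistargeted
n_items = 40
difficulties = rng.normal(loc=-1.0, scale=1.0, size=n_items)

def rasch_prob(theta, b):
    """P(correct) under the Rasch model: 1 / (1 + exp(-(theta - b)))."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))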
We test three selectors:
Linear Selection:
The same 10 items for everyone, spaced evenly across the difficulty range.
Random Selection:
Each examinee gets a random set of items. This increases the chance of getting mostly easy items.
54321 Selection:
At each step k of the test, the algorithm selects an item at random from a shrinking “top-informative” bin: the K − k most informative items for the current θ estimate, where K is the test length (here the same 10-item test as the other forms, drawn from the 40-item bank). Early in the test, the bin is large and includes many of the most informative items; as the test progresses it narrows until only the single most informative item remains. This creates a smooth transition from semi-random to highly targeted selection.
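To make the rule concrete, here is a standalone sketch of one 54321 selection step. This is plain NumPy, not the simulator's own implementation; the function name and signature are illustrative, and the bin size follows the description above.
import numpy as np

def item_information(theta, b):
    """Rasch item information: I(theta) = P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

def select_54321(theta_hat, difficulties, administered, step, test_length, rng):
    """Pick one item at random from the (test_length - step) most informative unused items."""
    available = np.array([i for i in range(len(difficulties)) if i not in administered])
    info = item_information(theta_hat, difficulties[available])
    ranked = available[np.argsort(info)[::-1]]    # most informative first
    bin_size = max(test_length - step, 1)         # shrinks toward 1 as the test progresses
    return int(rng.choice(ranked[:bin_size]))
At step 0 of a 10-item test the selector draws from the ten most informative items at the current θ estimate; by the final item the bin has shrunk to a single item.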
The Linear Form
For the linear form, the estimated abilities miss the true values by roughly 0.8 logits. That miss is the cost of giving the same set of mistargeted items to every examinee: with high-ability people and low-difficulty items, there simply isn’t enough information to recover θ accurately.
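For reference, the headline error figures quoted in this post can be computed directly from the true and estimated abilities. The array names below are placeholders for the simulation's outputs, not attributes of the simulator.
import numpy as np

# true_thetas: the simulated abilities; est_thetas: the final theta estimates (illustrative names)
errors = est_thetas - true_thetas
bias = errors.mean()
rmse = np.sqrt((errors ** 2).mean())
print(f"Bias: {bias:.3f}, RMSE: {rmse:.3f}")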
The Random Form
The random form has an overlap rate of about 25%. This is expected: under near-uniform random selection, the expected overlap between two examinees is roughly the test length divided by the bank size, 10/40 = 25%, so the bank spreads out far more than it does under a fixed form.
# Inspecting the Random Form
print(f"Overlap Rate: {random_simulation.overlap_rate * 100}%")
Overlap Rate: 25.06852%
Each item was seen about 250 times, with a standard deviation of ~13. The distribution is tight, so randomisation does a good job keeping item usage balanced.
# Count how many times each item was administered across all examinees
random_item_count = Counter([
    item
    for administration in random_simulation.administered_items
    for item in administration
])
random_item_count = pd.DataFrame(
    random_item_count.values(),
    index=random_item_count.keys(),
    columns=["item_count"],
)
random_item_count["item_count"].agg({"mean": "mean", "sd": "std"})
mean 250.000000
sd 13.254897
Name: item_count, dtype: float64
In terms of accuracy, the random form improves slightly over the linear form. Randomness means some examinees get a better-targeted mix of items purely by chance, and that small advantage shows up in the error metrics.
The error remains high, though. The underlying issue is the same: the bank doesn’t contain enough difficult items to measure high abilities accurately. Randomising item selection can’t solve mistargeting; it can only soften the worst effects of a fixed, poorly matched form.
The 54321 Form
The 54321 form has an overlap rate of 25%, almost identical to the random form.
# Inspecting the 54321 Form
print(f"Overlap Rate: {the_54321_simulation.overlap_rate * 100}%")
Overlap Rate: 25.05064%
Each item was seen by roughly 250 examinees, with a standard deviation of ~11. Exposure is slightly tighter than random selection, which makes sense given the shrinking informative bin: early steps allow variation, but later steps converge toward the same high-information items.
# Same exposure count, this time for the 54321 form
the_54321_item_count = Counter([
    item
    for administration in the_54321_simulation.administered_items
    for item in administration
])
the_54321_item_count = pd.DataFrame(
    the_54321_item_count.values(),
    index=the_54321_item_count.keys(),
    columns=["item_count"],
)
the_54321_item_count["item_count"].agg({"mean": "mean", "sd": "std"})
mean 250.000000
sd 11.395006
Name: item_count, dtype: float64
In terms of accuracy, though, the 54321 form ends up slightly worse than the random form overall. This is counter-intuitive if you expect adaptive behaviour to outperform randomisation. But the test starts in a region of the bank that provides little information for high-ability examinees: the initial θ = −1 pushes the selector toward easy items early on, and the shrinking bin makes it harder for the algorithm to escape that trajectory later in the test.
In short: under mistargeting, adaptivity can lock in bad early decisions.
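A quick back-of-the-envelope calculation shows why those early easy items are so costly. Under the Rasch model an item's information is I(θ) = P(1 − P), which peaks at 0.25 when the item's difficulty matches the examinee's ability and collapses as the gap grows; the values below are purely illustrative.
import numpy as np

# Information provided by an easy item (b = -1) for a matched vs. a high-ability examinee
b = -1.0
for theta in (-1.0, 2.0):
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    print(f"theta = {theta:+.1f}: P(correct) = {p:.3f}, information = {p * (1 - p):.3f}")
A θ = 2 examinee answering a b = −1 item contributes roughly 0.045 units of information, about a fifth of what a well-matched item would provide, so a run of early easy selections barely moves the estimate.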
Comparing Across Forms
A single RMSE value is a useful headline number, but it hides where each method succeeds or fails. Splitting examinees into ability bands reveals the structure of the error.
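A sketch of that banding with pandas follows; true_thetas and est_thetas are the same illustrative arrays as above, and the band edges are assumptions chosen to cover the intervals discussed below.
import pandas as pd

results = pd.DataFrame({"theta": true_thetas, "theta_hat": est_thetas})
# Bin examinees by true ability: (-4, -3], (-3, -2], ..., (3, 4]
results["band"] = pd.cut(results["theta"], bins=[-4, -3, -2, -1, 0, 1, 2, 3, 4])
rmse_by_band = (
    results.assign(sq_err=(results["theta_hat"] - results["theta"]) ** 2)
           .groupby("band", observed=True)["sq_err"]
           .mean()
           .pow(0.5)
)
print(rmse_by_band)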
Across (−2, 2], the random form performs best. Randomisation gives some examinees a better-targeted mix of items by chance, and this is enough to reduce error.
The linear form performs consistently worse than random but stays broadly stable across the central ability range.
The 54321 form performs similarly in the middle but is slightly less stable because of the strong influence of early, low-difficulty selections.
At the upper end ((2, 4]), the pattern shifts.
All forms degrade sharply here — the bank simply does not contain enough difficult items to measure high abilities. But within this limitation:
54321 is the most precise in (2, 3]
All forms become very inaccurate in (3, 4], with errors of ~2 logits or more
Linear remains the weakest at the extremes
Random does not outperform 54321 at high abilities because randomness alone cannot generate difficult items that don’t exist
When the bank is mistargeted, all three strategies struggle, and no amount of adaptivity or randomisation can substitute for appropriate item difficulty coverage.
Conclusion
This simulation shows that no selection method can compensate for a poorly targeted item bank.
The linear form performs worst because everyone receives the same low-information items.
Random selection improves accuracy by injecting variety, which gives some examinees a better-matched set of items by chance.
The 54321 selector behaves like a soft adaptive system, but when the bank is mismatched, its adaptivity locks it into easy items early and can’t recover. That makes it slightly worse than random overall, although it stabilises somewhat at higher abilities.
Across all conditions, the limiting factor is the item bank.
With too few difficult items, every method collapses at the upper end of the ability scale. The RMSE jumps above 1.0 for (2, 3] and past 2.0 for (3, 4]. It’s not a failure of CAT, but a failure of content coverage.
The takeaway is practical:
CAT helps when the bank contains items that span the full ability range.
Random selection can outperform a fixed form when targeting is poor.
Adaptive algorithms aren’t magic; they amplify whatever information the bank contains.
If the bank is skewed, your estimates will be skewed.