CAT Isn’t Magic: Why Item Bank Targeting Matters More Than the Algorithm
Even the best selection strategy fails when the bank is skewed. This simulation compares linear, random, and 54321 adaptive forms to show where each succeeds, where they break down, and why content coverage drives measurement quality.
CAT
IRT
Author
Alex Wainwright
Published
November 24, 2025
Developing an item bank is slow, expensive, and failure-prone. You need enough items. You need calibration data. And the parameters you estimate have to be stable enough to trust. Items will fail. Samples are hard to get. Small banks spike exposure rates because candidates keep seeing the same content. Over time, items drift and get easier, especially in fixed-form tests where everyone sees the same material.
This post compares three delivery strategies: linear selection, random selection, and adaptive selection. The question is simple: when the bank is limited or mismatched to the population, does CAT still offer value?
Simulation Setup
Person abilities: Skewed high.
Item bank: 40 items, skewed low. This creates a mismatch: high-ability people, low-difficulty items.
Model: Rasch; no discrimination parameters.
Initial θ estimate: −1, so adaptive testing starts with easier items.
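The post doesn't reproduce the data-generation code in this section, so here is a minimal NumPy sketch of the setup rather than the actual simulation code. The exact distributions, the sample size of 1,000 examinees (implied by the per-item exposure of ~250 reported below), and the seed are assumptions.
import numpy as np

rng = np.random.default_rng(42)

# 1,000 examinees with abilities shifted toward the high end
# (the post says "skewed high"; a shifted normal stands in for the actual shape)
n_examinees = 1_000
abilities = rng.normal(loc=1.0, scale=1.0, size=n_examinees)

# 40 Rasch item difficulties skewed low, so the bank is mistargeted
n_items = 40
difficulties = rng.normal(loc=-1.0, scale=1.0, size=n_items)

def rasch_prob(theta, b):
    """P(correct) under the Rasch model: 1 / (1 + exp(-(theta - b)))."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))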
We test three selectors:
Linear Selection:
The same 10 items for everyone, spaced evenly across the difficulty range.
Random Selection:
Each examinee gets a random set of items. This increases the chance of getting mostly easy items.
54321 Selection:
At each step k of the test, the algorithm selects an item at random from a shrinking “top-informative” bin: the K − k most informative items for the current θ estimate, where K is the test length (here the same 10-item test as the other forms, drawn from the 40-item bank). Early in the test, the bin is large and includes many of the most informative items; as the test progresses it narrows until only the single most informative item remains. This creates a smooth transition from semi-random to highly targeted selection.
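To make the rule concrete, here is a standalone sketch of one 54321 selection step. This is plain NumPy, not the simulator's own implementation; the function name and signature are illustrative, and the bin size follows the description above.
import numpy as np

def item_information(theta, b):
    """Rasch item information: I(theta) = P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return p * (1.0 - p)

def select_54321(theta_hat, difficulties, administered, step, test_length, rng):
    """Pick one item at random from the (test_length - step) most informative unused items."""
    available = np.array([i for i in range(len(difficulties)) if i not in administered])
    info = item_information(theta_hat, difficulties[available])
    ranked = available[np.argsort(info)[::-1]]    # most informative first
    bin_size = max(test_length - step, 1)         # shrinks toward 1 as the test progresses
    return int(rng.choice(ranked[:bin_size]))
At step 0 of a 10-item test the selector draws from the ten most informative items at the current θ estimate; by the final item the bin has shrunk to a single item.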
The Linear Form
For the linear form, the estimated abilities miss the true values by roughly 0.8 logits. That miss is the cost of giving the same set of mistargeted items to every examinee: with high-ability people and low-difficulty items, there simply isn’t enough information to recover θ accurately.
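For reference, the headline error figures quoted in this post can be computed directly from the true and estimated abilities. The array names below are placeholders for the simulation's outputs, not attributes of the simulator.
import numpy as np

# true_thetas: the simulated abilities; est_thetas: the final theta estimates (illustrative names)
errors = est_thetas - true_thetas
bias = errors.mean()
rmse = np.sqrt((errors ** 2).mean())
print(f"Bias: {bias:.3f}, RMSE: {rmse:.3f}")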
The Random Form
The random form has an overlap rate of about 25%. This is expected: under near-uniform random selection, the expected overlap between two examinees is roughly the test length divided by the bank size, 10/40 = 25%, so the bank spreads out far more than it does under a fixed form.
# Inspecting the Random Form
print(f"Overlap Rate: {random_simulation.overlap_rate * 100}%")
Overlap Rate: 25.06852%
Each item was seen about 250 times, with a standard deviation of ~13. The distribution is tight, so randomisation does a good job keeping item usage balanced.
# Count how many times each item was administered across all examinees
random_item_count = Counter([
    item
    for administration in random_simulation.administered_items
    for item in administration
])
random_item_count = pd.DataFrame(
    random_item_count.values(),
    index=random_item_count.keys(),
    columns=["item_count"],
)
random_item_count["item_count"].agg({"mean": "mean", "sd": "std"})
mean 250.000000
sd 13.254897
Name: item_count, dtype: float64
In terms of accuracy, the random form improves slightly over the linear form. Randomness means some examinees get a better-targeted mix of items purely by chance, and that small advantage shows up in the error metrics.
The error remains high, though. The underlying issue is the same: the bank doesn’t contain enough difficult items to measure high abilities accurately. Randomising item selection can’t solve mistargeting; it can only soften the worst effects of a fixed, poorly matched form.
The 54321 Form
The 54321 form has an overlap rate of 25%, almost identical to the random form.
# Inspecting the 54321 Form
print(f"Overlap Rate: {the_54321_simulation.overlap_rate * 100}%")
Overlap Rate: 25.05064%
Each item was seen by roughly 250 examinees, with a standard deviation of ~11. Exposure is slightly tighter than random selection, which makes sense given the shrinking informative bin: early steps allow variation, but later steps converge toward the same high-information items.
# Same exposure count, this time for the 54321 form
the_54321_item_count = Counter([
    item
    for administration in the_54321_simulation.administered_items
    for item in administration
])
the_54321_item_count = pd.DataFrame(
    the_54321_item_count.values(),
    index=the_54321_item_count.keys(),
    columns=["item_count"],
)
the_54321_item_count["item_count"].agg({"mean": "mean", "sd": "std"})
mean 250.000000
sd 11.395006
Name: item_count, dtype: float64
In terms of accuracy, though, the 54321 form ends up slightly worse than the random form overall. This is counter-intuitive if you expect adaptive behaviour to outperform randomisation. But the test starts in a region of the bank that provides little information for high-ability examinees: the initial θ = −1 pushes the selector toward easy items early on, and the shrinking bin makes it harder for the algorithm to escape that trajectory later in the test.
In short: under mistargeting, adaptivity can lock in bad early decisions.
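A quick back-of-the-envelope calculation shows why those early easy items are so costly. Under the Rasch model an item's information is I(θ) = P(1 − P), which peaks at 0.25 when the item's difficulty matches the examinee's ability and collapses as the gap grows; the values below are purely illustrative.
import numpy as np

# Information provided by an easy item (b = -1) for a matched vs. a high-ability examinee
b = -1.0
for theta in (-1.0, 2.0):
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    print(f"theta = {theta:+.1f}: P(correct) = {p:.3f}, information = {p * (1 - p):.3f}")
A θ = 2 examinee answering a b = −1 item contributes roughly 0.045 units of information, about a fifth of what a well-matched item would provide, so a run of early easy selections barely moves the estimate.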
Comparing Across Forms
A single RMSE value is a useful headline number, but it hides where each method succeeds or fails. Splitting examinees into ability bands reveals the structure of the error.
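A sketch of that banding with pandas follows; true_thetas and est_thetas are the same illustrative arrays as above, and the band edges are assumptions chosen to cover the intervals discussed below.
import pandas as pd

results = pd.DataFrame({"theta": true_thetas, "theta_hat": est_thetas})
# Bin examinees by true ability: (-4, -3], (-3, -2], ..., (3, 4]
results["band"] = pd.cut(results["theta"], bins=[-4, -3, -2, -1, 0, 1, 2, 3, 4])
rmse_by_band = (
    results.assign(sq_err=(results["theta_hat"] - results["theta"]) ** 2)
           .groupby("band", observed=True)["sq_err"]
           .mean()
           .pow(0.5)
)
print(rmse_by_band)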
Across (−2, 2], the random form performs best. Randomisation gives some examinees a better-targeted mix of items by chance, and this is enough to reduce error.
The linear form performs consistently worse than random but stays broadly stable across the central ability range.
The 54321 form performs similarly in the middle but is slightly less stable because of the strong influence of early, low-difficulty selections.
At the upper end ((2, 4]), the pattern shifts.
All forms degrade sharply here — the bank simply does not contain enough difficult items to measure high abilities. But within this limitation:
54321 is the most precise in (2, 3]
All forms become very inaccurate in (3, 4], with errors of ~2 logits or more
Linear remains the weakest at the extremes
Random does not outperform 54321 at high abilities because randomness alone cannot generate difficult items that don’t exist
When the bank is mistargeted, all three strategies struggle, and no amount of adaptivity or randomisation can substitute for appropriate item difficulty coverage.
Conclusion
This simulation shows that no selection method can compensate for a poorly targeted item bank.
The linear form performs worst because everyone receives the same low-information items.
Random selection improves accuracy by injecting variety, which gives some examinees a better-matched set of items by chance.
The 54321 selector behaves like a soft adaptive system, but when the bank is mismatched, its adaptivity locks it into easy items early and can’t recover. That makes it slightly worse than random overall, although it stabilises somewhat at higher abilities.
Across all conditions, the limiting factor is the item bank.
With too few difficult items, every method collapses at the upper end of the ability scale. The RMSE jumps above 1.0 for (2, 3] and past 2.0 for (3, 4]. It’s not a failure of CAT, but a failure of content coverage.
The takeaway is practical:
CAT helps when the bank contains items that span the full ability range.
Random selection can outperform a fixed form when targeting is poor.
Adaptive algorithms aren’t magic; they amplify whatever information the bank contains.
If the bank is skewed, your estimates will be skewed.