Time-on-Task Is Not Self-Regulation: Measurement Validity in Learning Analytics

Examining how trace-based SRL measures conflate measurement scale with psychological construct, using a multi-text writing study to illustrate broader problems in the field.
Measurement
SRL
Learning Analytics
Author

Alex Wainwright

Published

December 15, 2025

Self-regulated learning (SRL) has increasingly been used in Learning Analytics (LA) as a post-hoc explanatory device. The field has moved beyond simple event counts (e.g., student X clicked PDF Y times) toward trace-based measures assumed to reflect regulatory processes. In practice, however, these traces are typically reduced to time-on-task metrics combined with an SRL mapping. From a measurement perspective, this is problematic. The mappings are weakly justified, often derived from small and unrepresentative samples, and operationalised through arbitrary heuristics. This post examines a recent paper to show how these issues undermine the validity of the resulting claims.

The Study

A total of 253 students participated; 163 were retained after exclusions, including 81 participants dropped for insufficient or missing responses.

The study comprised three components:

  • Pre-task measurement. Participants completed a questionnaire intended to measure metacognitive strategy use, followed by a 15-item prior-knowledge test.

  • Multi-text writing session. Over 120 minutes, participants studied 16 texts covering three topics and were instructed to write a 200–400 word essay integrating those topics.

  • Post-task measurement. The prior-knowledge test was administered again.

Traces from the learning environment were mapped onto an SRL framework. Each SRL process was operationalised as the total duration spent in that activity over the session. Time was treated as a proxy for both the quality and quantity of SRL processes.
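To make this concrete, the sketch below shows what the operationalisation amounts to: summing logged durations per mapped activity. The trace structure, the column names, and the action-to-SRL mapping are hypothetical stand-ins for the kind of mapping the paper relies on, not its actual pipeline.

```python
import pandas as pd

# Hypothetical trace log: one row per logged action, with start/end timestamps.
trace = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 2],
    "action":     ["open_text", "write_essay", "reread_text",
                   "open_text", "write_essay"],
    "start":      pd.to_datetime(["2025-01-01 10:00", "2025-01-01 10:20",
                                  "2025-01-01 10:50", "2025-01-01 10:00",
                                  "2025-01-01 10:30"]),
    "end":        pd.to_datetime(["2025-01-01 10:20", "2025-01-01 10:50",
                                  "2025-01-01 11:05", "2025-01-01 10:30",
                                  "2025-01-01 11:10"]),
})

# Assumed action-to-SRL mapping; such mappings are exactly the weak link
# discussed in this post.
srl_map = {
    "open_text":   "first_reading",
    "reread_text": "re_reading",
    "write_essay": "elaboration_organisation",
}

trace["srl_process"] = trace["action"].map(srl_map)
trace["duration_min"] = (trace["end"] - trace["start"]).dt.total_seconds() / 60

# Each SRL "process" becomes nothing more than total minutes in mapped actions.
time_on_task = (
    trace.groupby(["student_id", "srl_process"])["duration_min"]
    .sum()
    .unstack(fill_value=0)
)
print(time_on_task)
```

Whatever labels the mapping assigns, the resulting features are still just minutes.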

For each essay, the following variables were computed:

  • Coverage of Reading Topics: Semantic overlap between essay sentences and source texts, aggregated across the three topics (range 0–3).

  • Essay Cohesion: Average semantic similarity across sentences in the essay.

  • Word Count: Total number of words, normalised to the 200–400 range.

The outcome variable, Achievement, was derived from two human raters scoring essays on Coverage of Reading Topics, Essay Cohesion, and Word Count. Scores ranged from 0 to 21 and were subsequently dichotomised (0: <10.5; 1: ≥10.5).
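For concreteness, the sketch below shows one way the essay variables and the dichotomised outcome could be computed. The paper's actual semantic-similarity method is not described here, so the TF-IDF/cosine approach, the word-count normalisation, and the aggregation choices are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def essay_features(essay_sentences, topic_texts):
    """Illustrative versions of the three essay variables (assumed method)."""
    tfidf = TfidfVectorizer().fit(essay_sentences + topic_texts)
    essay_vecs = tfidf.transform(essay_sentences)
    topic_vecs = tfidf.transform(topic_texts)

    # Coverage of Reading Topics: per topic, the best sentence-to-topic
    # similarity, summed over topics (range roughly 0 to len(topic_texts)).
    sims = cosine_similarity(essay_vecs, topic_vecs)
    coverage = sims.max(axis=0).sum()

    # Essay Cohesion: mean pairwise similarity between essay sentences.
    pairwise = cosine_similarity(essay_vecs)
    upper = pairwise[np.triu_indices_from(pairwise, k=1)]
    cohesion = upper.mean() if upper.size else 0.0

    # Word Count: clipped to the instructed 200-400 range and rescaled to 0-1
    # (one possible reading of "normalised to the 200-400 range").
    n_words = sum(len(s.split()) for s in essay_sentences)
    word_count = (np.clip(n_words, 200, 400) - 200) / 200

    return coverage, cohesion, word_count

# Achievement: a 0-21 rater score dichotomised at the midpoint (10.5).
def dichotomise(score, cutoff=10.5):
    return int(score >= cutoff)
```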

When modelling the data, word count was excluded as a predictor due to its high correlation with the SRL measure labelled Elaboration/Organisation.

Problems with the Study

Word Count and Elaboration/Organisation

Word count was removed from the model because it correlated highly with Elaboration/Organisation. Conceptually, however, these variables reflect the same underlying construct. Elaboration/Organisation refers to the process of producing and organising text; producing text necessarily involves generating words. The distinction between the two variables arises solely from their unit of measurement: word count captures how much text was produced, whereas elaboration captures how long text production took. Their imperfect overlap reflects typing speed, pauses, and off-task time—not distinct regulatory processes.
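A toy simulation illustrates how closely the two measures track each other by construction. Every parameter below (composition rate, pause distribution) is invented for illustration; only the sample size mirrors the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 163  # analysis sample size reported in the study

words = rng.integers(200, 401, size=n)              # essay length in words
rate = rng.normal(8, 0.8, size=n).clip(5, None)      # composition rate, words/min (assumed)
pauses = rng.exponential(3, size=n)                  # off-task/pause minutes (assumed)

# "Elaboration/Organisation" as time-on-task: composing time plus pauses.
elaboration_minutes = words / rate + pauses

r = np.corrcoef(words, elaboration_minutes)[0, 1]
print(f"correlation(word count, elaboration time) = {r:.2f}")
# Under these invented parameters the correlation comes out high; the residual
# gap is composition speed and pauses, not a distinct regulatory process.
```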

Treating elaboration as an independent SRL process therefore confuses measurement scale with construct. The variance associated with writing volume is retained, but reinterpreted as evidence of self-regulation.

Overlap Between Predictors and Outcome

The two most important predictors—Coverage of Reading Topics and Elaboration/Organisation—correspond directly to criteria used in human scoring. Achievement is partially composed of the same constructs operationalised by these predictors. This introduces criterion contamination: the model is, in part, learning to reproduce the scoring rubric rather than independently predicting performance.

From this perspective, the results demonstrate the feasibility of approximating human scoring using automated features, not that SRL processes have been meaningfully modelled.
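A small synthetic example makes the circularity visible. The sub-scores, distributions, and model below are entirely invented; the only point is that a predictor which is itself a rubric component will "predict" the dichotomised total well regardless of any learning process.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 163  # analysis sample size reported in the study

# Synthetic rubric sub-scores (all distributions invented for illustration).
coverage = rng.uniform(0, 7, n)   # rater points for topic coverage
cohesion = rng.uniform(0, 7, n)   # rater points for cohesion
length   = rng.uniform(0, 7, n)   # rater points for word count

total = coverage + cohesion + length          # 0-21 rubric total
achievement = (total >= 10.5).astype(int)     # dichotomised outcome

# "Predict" the outcome using a feature that is itself a rubric component.
X = coverage.reshape(-1, 1)
model = LogisticRegression().fit(X, achievement)
auc = roc_auc_score(achievement, model.predict_proba(X)[:, 1])
print(f"AUC from a single rubric component: {auc:.2f}")
# The AUC is well above chance even though nothing about learning was modelled:
# the predictor is partly the outcome.
```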

Essay cohesion did not emerge as an important predictor. Descriptive statistics show high average semantic similarity (M ≈ 78%, SD ≈ 2%), indicating limited variability. Given the short length of the essays and shared source material, large semantic differences would not be expected. This further limits the interpretability of cohesion as a discriminating feature.

Predictive accuracy alone does not establish that the constructs used by the model correspond to meaningful learning processes, particularly when the predictors operationalise the same criteria as the outcome. Other forms of validity evidence (e.g., face, content, and criterion validity) would need to be established before such claims could be made.

The achievement score was also dichotomised at 50%, despite being derived from a 21-point scale. Dichotomisation discards information, inflates apparent classification performance, and obscures whether the model distinguishes genuinely different levels of performance or merely separates extreme cases. For a study framed around formative feedback, this design choice is particularly limiting.
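The information loss is easy to see: after splitting at 10.5, essays one rubric point apart can land in opposite classes while essays ten points apart share a class. A minimal illustration:

```python
import numpy as np

scores = np.array([3, 10, 11, 12, 18, 21])   # illustrative 0-21 rater scores
labels = (scores >= 10.5).astype(int)        # the study's 50% split

for s, y in zip(scores, labels):
    print(f"score {s:2d} -> class {y}")
# Scores 10 and 11 differ by one rubric point yet land in opposite classes;
# scores 11 and 21 differ by ten points yet share a class.
```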

Conclusion

The issues raised here are not unique to this study. They reflect a broader pattern in learning analytics, where trace-based features are treated as psychologically meaningful constructs without sufficient validation, and where predictive success is taken as evidence of explanatory power. Without clearer separation between product, process, and outcome, such models risk reproducing scoring practices rather than illuminating learning.