from dotenv import load_dotenv
from itertools import product
import json
from openai import OpenAI
import os
from pathlib import Path
import random
import re
from time import sleep


def generate_prompt(question: str, prompt_path: Path) -> dict:
    """Generate the prompt used to score an essay.

    Args:
        question (str): Question the candidate is answering
        prompt_path (Path): Path to the scoring prompt template

    Returns:
        dict: The prompt type and the populated prompt text
    """
    with open(prompt_path, "r") as f:
        prompt = f.read()
    prompt = prompt.replace("{question}", question)
    prompt_details = {
        "prompt_type": prompt_path.name.replace(".txt", ""),
        "prompt": prompt
    }
    return prompt_details


if __name__ == "__main__":
    # Create scoring prompts
    candidate_question = "Describe a time when you overcame a challenge at work."
    scoring_prompts = Path("data/essay_scoring/prompts").glob("*.txt")
    scoring_prompts = [
        generate_prompt(candidate_question, file) for file in scoring_prompts
    ]

    # Read essays
    candidate_essay_fps = Path("data/essay_scoring/essays").glob("*.txt")
    candidate_essays = []
    for essay_fp in candidate_essay_fps:
        with open(essay_fp, "r") as f:
            essay = f.read()
        essay_details = {
            "essay_type": essay_fp.name.replace(".txt", ""),
            "essay": essay
        }
        candidate_essays.append(essay_details)

    # Setup DeepSeek client
    load_dotenv()

    # Initialize client
    client = OpenAI(
        api_key=os.environ.get("API_KEY"),
        base_url="https://api.deepseek.com"
    )

    temperatures = [0, .2, .5, .7, 1]

    essay_scoring_records = []
    for prompt, essay, temperature, run_idx in product(scoring_prompts, candidate_essays, temperatures, range(50)):
        # Pause briefly between requests to avoid hammering the API
        sleep(random.uniform(1, 2))
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": prompt.get("prompt")},
                {"role": "user", "content": essay.get("essay")}
            ],
            temperature=temperature,
            stream=False
        )
        # Strip the Markdown code fence from the model's reply before parsing the JSON
        essay_score = json.loads(re.sub("^```json|```$", "", response.choices[0].message.content))
        essay_scoring = {
            "prompt": prompt,
            "essay": essay,
            "temperature": temperature,
            "essay_score": essay_score,
            "run_idx": run_idx
        }
        essay_scoring_records.append(essay_scoring)
        print(essay_score)

    with open("data/essay_scoring/scores/ai_essay_scores.json", "w") as f:
        json.dump(essay_scoring_records, f)
Overview
AI scoring has been positioned as an alternative mechanism for sifting candidates within the recruitment process. If the scoring uses an LLM (Large Language Model) wrapper with a prompt, then there is a need to consider how certain design decisions could impact the scores returned. Here, we consider the effects of both the prompt used and the temperature of the model. To examine these effects, I designed a simulation study.
The Setup
The study works as follows. We have the question posed to candidates: ‘Describe a time when you overcame a challenge at work’. We also have three prompt types: Basic, Interim, and Advanced. They are differentiated by the level of detail contained within the prompt. The Basic prompt asks the model to provide a score out of 40 based on the scoring criteria. The Interim prompt expands on this by providing a score breakdown for each criterion. The Advanced prompt breaks this down further, providing sub-criteria and associated scoring bands. Each prompt requires the essay to be exactly 40 words long; otherwise, a score of 0 should be returned. Finally, we have three essay types: weak, medium, and strong. These categories represent the transition from surface-level responses with minimal detail to comprehensive answers that demonstrate deep reflection and specific examples. In addition, the temperature of the model is varied across the following values: [0, .2, .5, .7, 1]. This introduces an increasing degree of randomness into the model's output. Each combination of prompt, essay, and temperature was run 50 times, which results in 2,250 data points.
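The full script builds this grid with itertools.product. As a quick sanity check on the arithmetic, here is a minimal sketch; the list entries are illustrative stand-ins that mirror the file names used by the scoring script:

```python
from itertools import product

# Illustrative stand-ins for the three prompts and three essays used in the study
scoring_prompts = ["basic_scoring_prompt", "interim_scoring_prompt", "advanced_scoring_prompt"]
candidate_essays = ["weak_essay", "medium_essay", "strong_essay"]
temperatures = [0, .2, .5, .7, 1]
n_runs = 50

# Every prompt x essay x temperature cell is repeated 50 times
design = list(product(scoring_prompts, candidate_essays, temperatures, range(n_runs)))
print(len(design))  # 3 * 3 * 5 * 50 = 2250
```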
Prompt Type | Prompt Details |
---|---|
Basic | You are a recruiter who is scoring responses to the following question: “{question}”. You will provide a score out of 40, which will cover clarity, relevance, professional tone, and communication quality. Make sure the response is exactly 40 words in length — if not, assign a score of 0. Return your response in this JSON format: {{ “word_count”: |
Interim | You are a recruiter assessing candidate responses to the question: “{question}”. You must assign a score out of 40 using the following categories: Clarity (10 points) Relevance (10 points) Professional tone (10 points) Communication quality (10 points) If the answer is not exactly 40 words, assign a score of 0 for each category. Respond in this JSON format: {{ “word_count”: |
Advanced | You are a professional recruiter evaluating candidate responses to the question: “{question}”. The answer must be exactly 40 words. If it is not, assign 0 for all categories. Otherwise, score the response using the rubric below: Clarity (10 points) ‣ 0–3: Unclear situation or resolution ‣ 4–7: Mostly clear ‣ 8–10: Very clear Relevance (10 points) ‣ 0–3: Not related or vague ‣ 4–7: Work-related but shallow ‣ 8–10: Relevant and specific Professional tone (10 points) ‣ 0–3: Informal/inappropriate ‣ 4–7: Mostly professional ‣ 8–10: Fully professional Communication quality (10 points) ‣ 0–3: Disjointed or poorly written ‣ 4–7: Some issues, mostly fine ‣ 8–10: Well-structured and polished Return your score in the following JSON format: {{ “word_count”: |
The Python script used to run this study is shown at the top of this post.
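Each prompt asks the model to reply in JSON, and the chat model typically wraps that JSON in a Markdown code fence, which the script strips with a regular expression before parsing. As a sketch of that step, with a hypothetical payload whose field names match those requested by the prompts and read by the analysis code below:

```python
import json
import re

# Hypothetical raw model output: the JSON payload wrapped in a Markdown code fence
raw_content = (
    "```json\n"
    '{"word_count": 32, "clarity": 6, "relevance": 7, '
    '"professional_tone": 5, "communication_quality": 6, "total_score": 24}\n'
    "```"
)

# Remove the opening and closing fence markers before parsing, as the script does
cleaned = re.sub("^```json|```$", "", raw_content).strip()
essay_score = json.loads(cleaned)
print(essay_score["total_score"])  # 24
```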
Results
We could end the post here, as each essay fell short of the 40-word requirement. Therefore, in no instance should the LLM have given an essay a non-zero score. That's a concern.
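One way to avoid relying on the model to count words is to enforce the rule in code before the API call is ever made. A minimal sketch, assuming essays are plain text like those read by the script above (the function name and example essay are purely illustrative):

```python
def within_word_limit(essay: str, required_words: int = 40) -> bool:
    """Return True only if the essay contains exactly the required number of words."""
    return len(essay.split()) == required_words


# Hypothetical essay text, purely for illustration
essay = "I faced a tight deadline at work and overcame it by planning carefully."

if within_word_limit(essay):
    print("Send the essay to the scoring model")
else:
    # Mirrors the prompts' rule: anything other than exactly 40 words scores 0
    print(f"Score 0: the essay has {len(essay.split())} words, not 40")
```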
library(data.table)
library(flextable)
library(jsonlite)

ai_essay_scores <- read_json("ai_essay_scores.json")

# Flatten the nested JSON records into a single data.table
full_essay_scores <- lapply(ai_essay_scores, function (x) {
  essay_scoring <- data.table(
    prompt_type = x$prompt$prompt_type,
    essay_type = x$essay$essay_type,
    essay_word_count = x$essay_score$word_count,
    temperature = x$temperature,
    essay_clarity = x$essay_score$clarity,
    essay_relevance = x$essay_score$relevance,
    essay_professional_tone = x$essay_score$professional_tone,
    essay_communication_quality = x$essay_score$communication_quality,
    essay_total_score = x$essay_score$total_score
  )
  return(essay_scoring)
}) |>
  rbindlist()
For the time being, we’ll disregard this problem.
Here’s what we observe:
- Temperature effects are minimal but present: Most scores remain constant across temperature settings, but there are subtle variations. This suggests temperature has a small but measurable impact on scoring consistency.
- Prompt complexity affects discrimination: The Advanced prompt shows the most variation in the Medium essay category, while Basic and Interim are more rigid. This suggests the detailed rubric in Advanced allows for more nuanced scoring.
- Ceiling and floor effects: Strong essays consistently hit the maximum score across all conditions, while Weak essays cluster at the bottom. The Medium essays show the most variation between prompt types.
- Scoring patterns differ by prompt type:
  - Basic: Very consistent, almost mechanical
  - Interim: Slight variations emerge at higher temperatures
  - Advanced: Most variation in the middle range, suggesting the detailed rubric enables more granular assessment
As would be expected, the scores increase as the essay strength goes up. Prompt changes do not appear to affect the scoring of the strong essay. However, for weak and medium essays, the prompt does affect the score.
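To put a number on claims like "minimal but present", one can look at the spread of total scores within each prompt, essay, and temperature cell rather than only the averages. The analysis in this post uses R and data.table; the sketch below does an equivalent check in Python with pandas, assuming pandas is installed and the JSON structure written by the scoring script:

```python
import json

import pandas as pd

with open("data/essay_scoring/scores/ai_essay_scores.json", "r") as f:
    records = json.load(f)

# Flatten the nested records, keeping only the fields needed for this check
scores = pd.DataFrame(
    [
        {
            "prompt_type": r["prompt"]["prompt_type"],
            "essay_type": r["essay"]["essay_type"],
            "temperature": r["temperature"],
            "total_score": r["essay_score"]["total_score"],
        }
        for r in records
    ]
)

# Standard deviation of the total score within each design cell;
# values near 0 mean the 50 repeat runs agreed almost exactly
spread = (
    scores.groupby(["prompt_type", "essay_type", "temperature"])["total_score"]
    .std()
    .reset_index(name="score_sd")
)
print(spread.sort_values("score_sd", ascending=False).head(10))
```

The R code that follows builds the summary table of average scores per cell shown below.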
# Average each scoring criterion by prompt, essay, and temperature
essay_score_summary <-
  full_essay_scores[, .(
    scoring_criteria = names(.SD),
    average_score = apply(.SD, 2, function (x) mean(x, na.rm = T))
  ), by = .(prompt_type, essay_type, temperature), .SDcols = c(
    "essay_clarity",
    "essay_relevance",
    "essay_professional_tone",
    "essay_communication_quality",
    "essay_total_score"
  )] |>
  dcast(prompt_type + essay_type + temperature ~ scoring_criteria,
        value.var = "average_score")

essay_score_summary[, prompt_type := factor(
  prompt_type,
  labels = c(
    "Basic",
    "Interim",
    "Advanced"
  ),
  levels = c(
    "basic_scoring_prompt",
    "interim_scoring_prompt",
    "advanced_scoring_prompt"
  )
)]

essay_score_summary[, essay_type := factor(
  essay_type,
  labels = c(
    "Weak",
    "Medium",
    "Strong"
  ),
  levels = c(
    "weak_essay",
    "medium_essay",
    "strong_essay"
  )
)]

setorder(essay_score_summary, prompt_type, essay_type)

# Render the summary as a flextable with merged prompt and essay cells
essay_score_summary |>
  as_flextable(
    max_row = 45,
    do_autofit = T,
    show_coltype = F
  ) |>
  set_header_labels(
    values = c(
      "Prompt",
      "Essay",
      "Temperature",
      "Clarity",
      "Communication Quality",
      "Professional Tone",
      "Relevance",
      "Total Score"
    )
  ) |>
  align_nottext_col(align = "center") |>
  merge_at(i = 1:15, j = 1) |>
  merge_at(i = 1:5, j = 2) |>
  merge_at(i = 6:10, j = 2) |>
  merge_at(i = 11:15, j = 2) |>
  hline(i = 15, j = 1:8, border = fp_border_default()) |>
  merge_at(i = 16:30, j = 1) |>
  merge_at(i = 16:20, j = 2) |>
  merge_at(i = 21:25, j = 2) |>
  merge_at(i = 26:30, j = 2) |>
  hline(i = 30, j = 1:8, border = fp_border_default()) |>
  merge_at(i = 31:45, j = 1) |>
  merge_at(i = 31:35, j = 2) |>
  merge_at(i = 36:40, j = 2) |>
  merge_at(i = 41:45, j = 2)
| Prompt | Essay | Temperature | Clarity | Communication Quality | Professional Tone | Relevance | Total Score |
|---|---|---|---|---|---|---|---|
| Basic | Weak | 0.0 | 6.0 | 6.0 | 5.0 | 7.0 | 24.0 |
| | | 0.2 | 6.0 | 6.0 | 5.0 | 7.0 | 24.0 |
| | | 0.5 | 6.0 | 6.0 | 5.0 | 7.0 | 24.0 |
| | | 0.7 | 6.0 | 6.0 | 5.0 | 7.0 | 24.0 |
| | | 1.0 | 6.0 | 6.0 | 5.0 | 7.0 | 24.1 |
| | Medium | 0.0 | 9.0 | 9.0 | 9.0 | 10.0 | 37.0 |
| | | 0.2 | 9.0 | 9.0 | 9.0 | 10.0 | 37.0 |
| | | 0.5 | 9.0 | 9.0 | 9.0 | 10.0 | 37.0 |
| | | 0.7 | 9.0 | 9.0 | 9.0 | 10.0 | 37.0 |
| | | 1.0 | 9.0 | 9.0 | 9.0 | 10.0 | 37.1 |
| | Strong | 0.0 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.2 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.5 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.7 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 1.0 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| Interim | Weak | 0.0 | 6.0 | 6.0 | 5.0 | 7.0 | 24.0 |
| | | 0.2 | 6.0 | 6.0 | 5.0 | 7.0 | 24.0 |
| | | 0.5 | 6.0 | 6.0 | 5.0 | 7.1 | 24.1 |
| | | 0.7 | 6.0 | 6.0 | 5.0 | 7.1 | 24.1 |
| | | 1.0 | 6.0 | 6.0 | 5.0 | 7.0 | 23.9 |
| | Medium | 0.0 | 8.0 | 8.0 | 8.0 | 9.0 | 33.0 |
| | | 0.2 | 8.0 | 8.0 | 8.0 | 9.0 | 33.0 |
| | | 0.5 | 8.0 | 8.0 | 8.0 | 9.0 | 33.0 |
| | | 0.7 | 8.0 | 8.0 | 8.0 | 9.0 | 33.0 |
| | | 1.0 | 8.0 | 8.0 | 8.1 | 9.0 | 33.3 |
| | Strong | 0.0 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.2 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.5 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.7 | 10.0 | 10.0 | 10.0 | 10.0 | 39.9 |
| | | 1.0 | 9.9 | 9.9 | 10.0 | 10.0 | 39.8 |
| Advanced | Weak | 0.0 | 5.0 | 5.0 | 5.0 | 6.0 | 21.0 |
| | | 0.2 | 5.0 | 5.0 | 5.0 | 6.0 | 21.0 |
| | | 0.5 | 5.0 | 5.0 | 5.0 | 6.0 | 21.0 |
| | | 0.7 | 5.0 | 5.0 | 5.0 | 6.0 | 21.0 |
| | | 1.0 | 5.0 | 5.0 | 5.0 | 6.0 | 21.0 |
| | Medium | 0.0 | 8.0 | 8.0 | 8.6 | 9.0 | 33.6 |
| | | 0.2 | 8.0 | 8.0 | 8.4 | 9.0 | 33.4 |
| | | 0.5 | 8.0 | 8.1 | 8.4 | 9.0 | 33.5 |
| | | 0.7 | 8.0 | 8.0 | 8.6 | 9.0 | 33.6 |
| | | 1.0 | 8.0 | 8.1 | 8.5 | 9.0 | 33.6 |
| | Strong | 0.0 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.2 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.5 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.7 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 1.0 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |

n: 45
Takeaway
This data reveals some important implications for AI scoring in recruitment:
- Word counts are unreliable: The use of word counts within the prompt, based on current LLMs, is dubious. Given the way such models work with tokens, specifying an exact number of words is problematic. There could be instances where a candidate's response is penalised simply because the model could not determine the correct word count.
- Reliability concerns: The fact that identical essays can receive different scores based on temperature settings and prompt design suggests AI scoring isn't as consistent as traditional methods. Even small variations (0.1-0.3 points) could affect candidate rankings when margins are tight.
- Prompt engineering becomes critical: The results show that how the scoring prompt is designed significantly impacts outcomes. The Advanced prompt's ability to better discriminate between medium-quality responses suggests that detailed rubrics improve assessment quality, but they also introduce more variability.
- The “good enough” problem: Strong essays consistently max out scores regardless of conditions, while weak essays cluster at the bottom. This suggests AI scoring works well for obvious cases but struggles with nuanced distinctions, precisely where human judgment is most valuable.
- Practical trade-offs: Organizations using AI scoring face a choice between consistency (Basic prompts) and discrimination (Advanced prompts). The Basic approach is more reliable but potentially less insightful, while Advanced prompts offer better assessment but with more variability.
- Audit and calibration needs: Since temperature and prompt design affect outcomes, organizations would need to carefully test and standardize their AI scoring systems. What works for one type of role or question might not work for another. A minimal sketch of such a repeatability check follows this list.
- Human oversight remains essential: The variability suggests AI scoring works best as a screening tool rather than a definitive assessment method. Human review becomes crucial for borderline cases where small score differences matter.
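As a starting point for that kind of audit, the saved simulation data can be run through a simple repeatability check: for every prompt, essay, and temperature combination, flag any cell where the 50 repeat runs did not agree within a chosen tolerance. A minimal sketch in Python, assuming the JSON written by the scoring script; the tolerance value is illustrative:

```python
import json
from collections import defaultdict

TOLERANCE = 1.0  # illustrative: maximum acceptable spread in total score per cell

with open("data/essay_scoring/scores/ai_essay_scores.json", "r") as f:
    records = json.load(f)

# Collect the total scores observed for each prompt/essay/temperature cell
cells = defaultdict(list)
for r in records:
    key = (r["prompt"]["prompt_type"], r["essay"]["essay_type"], r["temperature"])
    cells[key].append(r["essay_score"]["total_score"])

# Flag cells where repeat runs of the same essay disagree by more than the tolerance
for (prompt_type, essay_type, temperature), scores in sorted(cells.items()):
    spread = max(scores) - min(scores)
    if spread > TOLERANCE:
        print(f"Check {prompt_type} / {essay_type} @ temperature {temperature}: "
              f"scores ranged over {spread:.1f} points across {len(scores)} runs")
```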