from dotenv import load_dotenv
from itertools import product
import json
from openai import OpenAI
import os
from pathlib import Path
import random
import re
from time import sleep


def generate_prompt(question: str, prompt_path: Path) -> dict:
    """Generate the prompt used to score an essay.

    Args:
        question (str): Question the candidate is answering
        prompt_path (Path): Path to the scoring prompt template

    Returns:
        dict: The prompt type and the populated prompt text
    """
    with open(prompt_path, "r") as f:
        prompt = f.read()
    prompt = prompt.replace("{question}", question)
    prompt_details = {
        "prompt_type": prompt_path.name.replace(".txt", ""),
        "prompt": prompt
    }
    return prompt_details


if __name__ == "__main__":
    # Create scoring prompts
    candidate_question = "Describe a time when you overcame a challenge at work."
    scoring_prompts = Path("data/essay_scoring/prompts").glob("*.txt")
    scoring_prompts = [
        generate_prompt(candidate_question, file) for file in scoring_prompts
    ]

    # Read essays
    candidate_essay_fps = Path("data/essay_scoring/essays").glob("*.txt")
    candidate_essays = []
    for essay_fp in candidate_essay_fps:
        with open(essay_fp, "r") as f:
            essay = f.read()
        essay_details = {
            "essay_type": essay_fp.name.replace(".txt", ""),
            "essay": essay
        }
        candidate_essays.append(essay_details)

    # Setup DeepSeek client
    load_dotenv()

    # Initialize client
    client = OpenAI(
        api_key=os.environ.get("API_KEY"),
        base_url="https://api.deepseek.com"
    )

    temperatures = [0, .2, .5, .7, 1]

    essay_scoring_records = []
    for prompt, essay, temperature, run_idx in product(scoring_prompts, candidate_essays, temperatures, range(50)):
        # Pause briefly between requests to avoid hammering the API
        sleep(random.uniform(1, 2))
        response = client.chat.completions.create(
            model="deepseek-chat",
            messages=[
                {"role": "system", "content": prompt.get("prompt")},
                {"role": "user", "content": essay.get("essay")}
            ],
            temperature=temperature,
            stream=False
        )
        # Strip the Markdown code fence from the model's reply before parsing the JSON
        essay_score = json.loads(re.sub("^```json|```$", "", response.choices[0].message.content))
        essay_scoring = {
            "prompt": prompt,
            "essay": essay,
            "temperature": temperature,
            "essay_score": essay_score,
            "run_idx": run_idx
        }
        essay_scoring_records.append(essay_scoring)
        print(essay_score)

    with open("data/essay_scoring/scores/ai_essay_scores.json", "w") as f:
        json.dump(essay_scoring_records, f)
Overview
AI scoring has been positioned as an alternative mechanism for sifting candidates within the recruitment process. If the scoring uses an LLM (Large Language Model) wrapper with a prompt, then there is a need to consider how certain design decisions could impact the scores returned. Here, we consider the effects of both the prompt used and the temperature of the model. To examine these effects, I designed a simulation study.
The Setup
The study works as follows. We have the question posed to candidates: ‘Describe a time when you overcame a challenge at work’. We also have three prompt types: Basic, Interim, and Advanced. They are differentiated by the level of detail contained within the prompt. The Basic prompt asks the model to provide a score out of 40 based on the scoring criteria. The Interim prompt expands on this by providing a score breakdown for each criterion. The Advanced prompt breaks this down further, providing sub-criteria and associated scoring bands. Each prompt requires the essay to be exactly 40 words long; otherwise, a score of 0 should be returned. Finally, we have three essay types: weak, medium, and strong. These categories represent the transition from surface-level responses with minimal detail to comprehensive answers that demonstrate deep reflection and specific examples. In addition, the temperature of the model is varied across the following values: [0, .2, .5, .7, 1]. This introduces an increasing degree of randomness into the model's output. Each combination of prompt, essay, and temperature was run 50 times, which results in 2,250 data points.
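The full script builds this grid with itertools.product. As a quick sanity check on the arithmetic, here is a minimal sketch; the list entries are illustrative stand-ins that mirror the file names used by the scoring script:

```python
from itertools import product

# Illustrative stand-ins for the three prompts and three essays used in the study
scoring_prompts = ["basic_scoring_prompt", "interim_scoring_prompt", "advanced_scoring_prompt"]
candidate_essays = ["weak_essay", "medium_essay", "strong_essay"]
temperatures = [0, .2, .5, .7, 1]
n_runs = 50

# Every prompt x essay x temperature cell is repeated 50 times
design = list(product(scoring_prompts, candidate_essays, temperatures, range(n_runs)))
print(len(design))  # 3 * 3 * 5 * 50 = 2250
```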
Prompt Type | Prompt Details |
---|---|
Basic | You are a recruiter who is scoring responses to the following question: “{question}”. You will provide a score out of 40, which will cover clarity, relevance, professional tone, and communication quality. Make sure the response is exactly 40 words in length — if not, assign a score of 0. Return your response in this JSON format: {{ “word_count”: |
Interim | You are a recruiter assessing candidate responses to the question: “{question}”. You must assign a score out of 40 using the following categories: Clarity (10 points) Relevance (10 points) Professional tone (10 points) Communication quality (10 points) If the answer is not exactly 40 words, assign a score of 0 for each category. Respond in this JSON format: {{ “word_count”: |
Advanced | You are a professional recruiter evaluating candidate responses to the question: “{question}”. The answer must be exactly 40 words. If it is not, assign 0 for all categories. Otherwise, score the response using the rubric below: Clarity (10 points) ‣ 0–3: Unclear situation or resolution ‣ 4–7: Mostly clear ‣ 8–10: Very clear Relevance (10 points) ‣ 0–3: Not related or vague ‣ 4–7: Work-related but shallow ‣ 8–10: Relevant and specific Professional tone (10 points) ‣ 0–3: Informal/inappropriate ‣ 4–7: Mostly professional ‣ 8–10: Fully professional Communication quality (10 points) ‣ 0–3: Disjointed or poorly written ‣ 4–7: Some issues, mostly fine ‣ 8–10: Well-structured and polished Return your score in the following JSON format: {{ “word_count”: |
The Python script used to run this study is shown at the top of this post.
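Each prompt asks the model to reply in JSON, and the chat model typically wraps that JSON in a Markdown code fence, which the script strips with a regular expression before parsing. As a sketch of that step, with a hypothetical payload whose field names match those requested by the prompts and read by the analysis code below:

```python
import json
import re

# Hypothetical raw model output: the JSON payload wrapped in a Markdown code fence
raw_content = (
    "```json\n"
    '{"word_count": 32, "clarity": 6, "relevance": 7, '
    '"professional_tone": 5, "communication_quality": 6, "total_score": 24}\n'
    "```"
)

# Remove the opening and closing fence markers before parsing, as the script does
cleaned = re.sub("^```json|```$", "", raw_content).strip()
essay_score = json.loads(cleaned)
print(essay_score["total_score"])  # 24
```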
Results
We could end the post here, as each essay fell short of the 40-word requirement. Therefore, in no instance should the LLM have given an essay a non-zero score. That's a concern.
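One way to avoid relying on the model to count words is to enforce the rule in code before the API call is ever made. A minimal sketch, assuming essays are plain text like those read by the script above (the function name and example essay are purely illustrative):

```python
def within_word_limit(essay: str, required_words: int = 40) -> bool:
    """Return True only if the essay contains exactly the required number of words."""
    return len(essay.split()) == required_words


# Hypothetical essay text, purely for illustration
essay = "I faced a tight deadline at work and overcame it by planning carefully."

if within_word_limit(essay):
    print("Send the essay to the scoring model")
else:
    # Mirrors the prompts' rule: anything other than exactly 40 words scores 0
    print(f"Score 0: the essay has {len(essay.split())} words, not 40")
```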
library(data.table)
library(flextable)
library(jsonlite)

ai_essay_scores <- read_json("ai_essay_scores.json")

# Flatten the nested JSON records into a single data.table
full_essay_scores <- lapply(ai_essay_scores, function (x) {
  essay_scoring <- data.table(
    prompt_type = x$prompt$prompt_type,
    essay_type = x$essay$essay_type,
    essay_word_count = x$essay_score$word_count,
    temperature = x$temperature,
    essay_clarity = x$essay_score$clarity,
    essay_relevance = x$essay_score$relevance,
    essay_professional_tone = x$essay_score$professional_tone,
    essay_communication_quality = x$essay_score$communication_quality,
    essay_total_score = x$essay_score$total_score
  )
  return(essay_scoring)
}) |>
  rbindlist()
For the time being, we’ll disregard this problem.
Here’s what we observe:
- Temperature effects are minimal but present: Most scores remain constant across temperature settings, but there are subtle variations. This suggests temperature has a small but measurable impact on scoring consistency.
- Prompt complexity affects discrimination: The Advanced prompt shows the most variation in the Medium essay category, while Basic and Interim are more rigid. This suggests the detailed rubric in Advanced allows for more nuanced scoring.
- Ceiling and floor effects: Strong essays consistently hit the maximum score across all conditions, while Weak essays cluster at the bottom. The Medium essays show the most variation between prompt types.
- Scoring patterns differ by prompt type:
  - Basic: Very consistent, almost mechanical
  - Interim: Slight variations emerge at higher temperatures
  - Advanced: Most variation in the middle range, suggesting the detailed rubric enables more granular assessment
As would be expected, the scores increase as the essay strength goes up. Prompt changes do not appear to affect the scoring of the strong essay. However, for weak and medium essays, the prompt does affect the score.
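To put a number on claims like "minimal but present", one can look at the spread of total scores within each prompt, essay, and temperature cell rather than only the averages. The analysis in this post uses R and data.table; the sketch below does an equivalent check in Python with pandas, assuming pandas is installed and the JSON structure written by the scoring script:

```python
import json

import pandas as pd

with open("data/essay_scoring/scores/ai_essay_scores.json", "r") as f:
    records = json.load(f)

# Flatten the nested records, keeping only the fields needed for this check
scores = pd.DataFrame(
    [
        {
            "prompt_type": r["prompt"]["prompt_type"],
            "essay_type": r["essay"]["essay_type"],
            "temperature": r["temperature"],
            "total_score": r["essay_score"]["total_score"],
        }
        for r in records
    ]
)

# Standard deviation of the total score within each design cell;
# values near 0 mean the 50 repeat runs agreed almost exactly
spread = (
    scores.groupby(["prompt_type", "essay_type", "temperature"])["total_score"]
    .std()
    .reset_index(name="score_sd")
)
print(spread.sort_values("score_sd", ascending=False).head(10))
```

The R code that follows builds the summary table of average scores per cell shown below.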
# Average each scoring criterion by prompt, essay, and temperature
essay_score_summary <-
  full_essay_scores[, .(
    scoring_criteria = names(.SD),
    average_score = apply(.SD, 2, function (x) mean(x, na.rm = T))
  ), by = .(prompt_type, essay_type, temperature), .SDcols = c(
    "essay_clarity",
    "essay_relevance",
    "essay_professional_tone",
    "essay_communication_quality",
    "essay_total_score"
  )] |>
  dcast(prompt_type + essay_type + temperature ~ scoring_criteria,
        value.var = "average_score")

essay_score_summary[, prompt_type := factor(
  prompt_type,
  labels = c(
    "Basic",
    "Interim",
    "Advanced"
  ),
  levels = c(
    "basic_scoring_prompt",
    "interim_scoring_prompt",
    "advanced_scoring_prompt"
  )
)]

essay_score_summary[, essay_type := factor(
  essay_type,
  labels = c(
    "Weak",
    "Medium",
    "Strong"
  ),
  levels = c(
    "weak_essay",
    "medium_essay",
    "strong_essay"
  )
)]

setorder(essay_score_summary, prompt_type, essay_type)

# Render the summary as a flextable with merged prompt and essay cells
essay_score_summary |>
  as_flextable(
    max_row = 45,
    do_autofit = T,
    show_coltype = F
  ) |>
  set_header_labels(
    values = c(
      "Prompt",
      "Essay",
      "Temperature",
      "Clarity",
      "Communication Quality",
      "Professional Tone",
      "Relevance",
      "Total Score"
    )
  ) |>
  align_nottext_col(align = "center") |>
  merge_at(i = 1:15, j = 1) |>
  merge_at(i = 1:5, j = 2) |>
  merge_at(i = 6:10, j = 2) |>
  merge_at(i = 11:15, j = 2) |>
  hline(i = 15, j = 1:8, border = fp_border_default()) |>
  merge_at(i = 16:30, j = 1) |>
  merge_at(i = 16:20, j = 2) |>
  merge_at(i = 21:25, j = 2) |>
  merge_at(i = 26:30, j = 2) |>
  hline(i = 30, j = 1:8, border = fp_border_default()) |>
  merge_at(i = 31:45, j = 1) |>
  merge_at(i = 31:35, j = 2) |>
  merge_at(i = 36:40, j = 2) |>
  merge_at(i = 41:45, j = 2)
| Prompt | Essay | Temperature | Clarity | Communication Quality | Professional Tone | Relevance | Total Score |
|---|---|---|---|---|---|---|---|
| Basic | Weak | 0.0 | 6.0 | 6.0 | 5.0 | 7.0 | 24.0 |
| | | 0.2 | 6.0 | 6.0 | 5.0 | 7.0 | 24.0 |
| | | 0.5 | 6.0 | 6.0 | 5.0 | 7.0 | 24.0 |
| | | 0.7 | 6.0 | 6.0 | 5.0 | 7.0 | 24.0 |
| | | 1.0 | 6.0 | 6.0 | 5.0 | 7.0 | 24.1 |
| | Medium | 0.0 | 9.0 | 9.0 | 9.0 | 10.0 | 37.0 |
| | | 0.2 | 9.0 | 9.0 | 9.0 | 10.0 | 37.0 |
| | | 0.5 | 9.0 | 9.0 | 9.0 | 10.0 | 37.0 |
| | | 0.7 | 9.0 | 9.0 | 9.0 | 10.0 | 37.0 |
| | | 1.0 | 9.0 | 9.0 | 9.0 | 10.0 | 37.1 |
| | Strong | 0.0 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.2 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.5 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.7 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 1.0 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| Interim | Weak | 0.0 | 6.0 | 6.0 | 5.0 | 7.0 | 24.0 |
| | | 0.2 | 6.0 | 6.0 | 5.0 | 7.0 | 24.0 |
| | | 0.5 | 6.0 | 6.0 | 5.0 | 7.1 | 24.1 |
| | | 0.7 | 6.0 | 6.0 | 5.0 | 7.1 | 24.1 |
| | | 1.0 | 6.0 | 6.0 | 5.0 | 7.0 | 23.9 |
| | Medium | 0.0 | 8.0 | 8.0 | 8.0 | 9.0 | 33.0 |
| | | 0.2 | 8.0 | 8.0 | 8.0 | 9.0 | 33.0 |
| | | 0.5 | 8.0 | 8.0 | 8.0 | 9.0 | 33.0 |
| | | 0.7 | 8.0 | 8.0 | 8.0 | 9.0 | 33.0 |
| | | 1.0 | 8.0 | 8.0 | 8.1 | 9.0 | 33.3 |
| | Strong | 0.0 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.2 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.5 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.7 | 10.0 | 10.0 | 10.0 | 10.0 | 39.9 |
| | | 1.0 | 9.9 | 9.9 | 10.0 | 10.0 | 39.8 |
| Advanced | Weak | 0.0 | 5.0 | 5.0 | 5.0 | 6.0 | 21.0 |
| | | 0.2 | 5.0 | 5.0 | 5.0 | 6.0 | 21.0 |
| | | 0.5 | 5.0 | 5.0 | 5.0 | 6.0 | 21.0 |
| | | 0.7 | 5.0 | 5.0 | 5.0 | 6.0 | 21.0 |
| | | 1.0 | 5.0 | 5.0 | 5.0 | 6.0 | 21.0 |
| | Medium | 0.0 | 8.0 | 8.0 | 8.6 | 9.0 | 33.6 |
| | | 0.2 | 8.0 | 8.0 | 8.4 | 9.0 | 33.4 |
| | | 0.5 | 8.0 | 8.1 | 8.4 | 9.0 | 33.5 |
| | | 0.7 | 8.0 | 8.0 | 8.6 | 9.0 | 33.6 |
| | | 1.0 | 8.0 | 8.1 | 8.5 | 9.0 | 33.6 |
| | Strong | 0.0 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.2 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.5 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 0.7 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |
| | | 1.0 | 10.0 | 10.0 | 10.0 | 10.0 | 40.0 |

n: 45
Takeaway
This data reveals some important implications for AI scoring in recruitment:
- Word counts are unreliable: The use of word counts within the prompt, based on current LLMs, is dubious. Given the way such models work with tokens, specifying an exact number of words is problematic. There could be instances where a candidate's response is penalised simply because the model could not determine the correct word count.
- Reliability concerns: The fact that identical essays can receive different scores based on temperature settings and prompt design suggests AI scoring isn't as consistent as traditional methods. Even small variations (0.1-0.3 points) could affect candidate rankings when margins are tight.
- Prompt engineering becomes critical: The results show that how the scoring prompt is designed significantly impacts outcomes. The Advanced prompt's ability to better discriminate between medium-quality responses suggests that detailed rubrics improve assessment quality, but they also introduce more variability.
- The “good enough” problem: Strong essays consistently max out scores regardless of conditions, while weak essays cluster at the bottom. This suggests AI scoring works well for obvious cases but struggles with nuanced distinctions, precisely where human judgment is most valuable.
- Practical trade-offs: Organizations using AI scoring face a choice between consistency (Basic prompts) and discrimination (Advanced prompts). The Basic approach is more reliable but potentially less insightful, while Advanced prompts offer better assessment but with more variability.
- Audit and calibration needs: Since temperature and prompt design affect outcomes, organizations would need to carefully test and standardize their AI scoring systems. What works for one type of role or question might not work for another. A minimal sketch of such a repeatability check follows this list.
- Human oversight remains essential: The variability suggests AI scoring works best as a screening tool rather than a definitive assessment method. Human review becomes crucial for borderline cases where small score differences matter.
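As a starting point for that kind of audit, the saved simulation data can be run through a simple repeatability check: for every prompt, essay, and temperature combination, flag any cell where the 50 repeat runs did not agree within a chosen tolerance. A minimal sketch in Python, assuming the JSON written by the scoring script; the tolerance value is illustrative:

```python
import json
from collections import defaultdict

TOLERANCE = 1.0  # illustrative: maximum acceptable spread in total score per cell

with open("data/essay_scoring/scores/ai_essay_scores.json", "r") as f:
    records = json.load(f)

# Collect the total scores observed for each prompt/essay/temperature cell
cells = defaultdict(list)
for r in records:
    key = (r["prompt"]["prompt_type"], r["essay"]["essay_type"], r["temperature"])
    cells[key].append(r["essay_score"]["total_score"])

# Flag cells where repeat runs of the same essay disagree by more than the tolerance
for (prompt_type, essay_type, temperature), scores in sorted(cells.items()):
    spread = max(scores) - min(scores)
    if spread > TOLERANCE:
        print(f"Check {prompt_type} / {essay_type} @ temperature {temperature}: "
              f"scores ranged over {spread:.1f} points across {len(scores)} runs")
```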