The Evaluations framework in iOS 27 introduces a hill-climbing workflow that lets developers iteratively improve AI-powered features by measuring prompt quality over time. A key part of this is computing Cohen's kappa to quantify alignment between a human expert rater and the model judge, catching score drift before it undermines your evaluation pipeline.
⢠Drift between human and model judge ratings silently undermines AI feature quality ā Cohen's kappa gives you a principled, statistically robust alignment score to detect and fix it
⢠The hill-climbing loop (develop ā run ā analyze) gives teams a repeatable scientific process for improving prompts and other AI feature parameters with measurable confidence
⢠Xcode's new Evaluations report surfaces per-sample scores and aggregate charts directly in the IDE, cutting the feedback loop from days to minutes
Demonstrates how to compute Cohen's kappa coefficient between a human expert's usefulness ratings and those produced by a ModelJudgeEvaluator, surfacing drift in the Xcode Evaluations report.
import Evaluations
import FoundationModels
import Testing
// MARK: - Score dimensions shared with the production evaluation
struct TagScoreDimensions: ScoreDimensions {
@ScoreDimension(range: 1...4, description: "How well tags represent plot, theme, and book info")
var relevance: Score
@ScoreDimension(range: 1...4, description: "How useful the tags are as search terms")
var usefulness: Score
}
// MARK: - Dataset row: pre-generated tags + human expert ratings
struct TagAlignmentEntry: EvaluationInput {
let bookSummary: String
let generatedTags: [String]
let expertRelevance: Int // human ground-truth rating 1-4
let expertUsefulness: Int // human ground-truth rating 1-4
}
// MARK: - Subject simply surfaces the already-generated tags
struct TagSubject: EvaluationSubject {
typealias Input = TagAlignmentEntry
typealias Output = String
func run(input: TagAlignmentEntry) async throws -> String {
// Tags were produced in a prior evaluation run; return them as-is
input.generatedTags.joined(separator: ", ")
}
}
// MARK: - Cohen's kappa aggregation
struct KappaAggregation: EvaluationAggregation {
typealias Dimensions = TagScoreDimensions
func aggregate(scores: [TagScoreDimensions],
inputs: [TagAlignmentEntry]) -> [String: Double] {
// Pair judge scores with expert scores
let judgeUsefulness = scores.map(\.usefulness.value)
let expertUsefulness = inputs.map { Double($0.expertUsefulness) }
let kappa = cohensKappa(rater1: judgeUsefulness, rater2: expertUsefulness, levels: 4)
let meanJudge = judgeUsefulness.reduce(0, +) / Double(judgeUsefulness.count)
let meanExpert = expertUsefulness.reduce(0, +) / Double(expertUsefulness.count)
return [
"usefulness_kappa": kappa,
"judge_mean_usefulness": meanJudge,
"expert_mean_usefulness": meanExpert
]
}
// Standard Cohen's kappa for ordinal data
private func cohensKappa(rater1: [Double], rater2: [Double], levels: Int) -> Double {
guard rater1.count == rater2.count, !rater1.isEmpty else { return 0 }
let n = Double(rater1.count)
// Observed agreement (po)
let agreed = zip(rater1, rater2).filter { Int($0.0) == Int($0.1) }.count
let po = Double(agreed) / n
// Expected agreement (pe) weighted by marginal frequencies
var pe = 0.0
for level in 1...levels {
let p1 = rater1.filter { Int($0) == level }.count
let p2 = rater2.filter { Int($0) == level }.count
pe += (Double(p1) / n) * (Double(p2) / n)
}
guard (1.0 - pe) != 0 else { return 1.0 }
return (po - pe) / (1.0 - pe)
}
}
// MARK: - Evaluation definition
struct TagAlignmentEvaluation: Evaluation {
typealias Input = TagAlignmentEntry
typealias Subject = TagSubject
typealias Evaluators = ModelJudgeEvaluator<TagScoreDimensions>
typealias Aggregation = KappaAggregation
var dataset: [TagAlignmentEntry] = [
TagAlignmentEntry(
bookSummary: "Treasure Island: loyalty, betrayal, and a treacherous sea voyage.",
generatedTags: ["adventure", "pirates", "treasure"],
expertRelevance: 4, expertUsefulness: 2),
TagAlignmentEntry(
bookSummary: "Little Women: four sisters navigate love and ambition in Civil War America.",
generatedTags: ["poignant", "quiet-steadiness", "family"],
expertRelevance: 4, expertUsefulness: 2)
]
var subject: TagSubject { TagSubject() }
var evaluators: ModelJudgeEvaluator<TagScoreDimensions> {
ModelJudgeEvaluator(
prompt: "Rate the tags for a book given its summary.",
dimensions: TagScoreDimensions()
)
}
var aggregation: KappaAggregation { KappaAggregation() }
}
// MARK: - Swift Testing test with kappa expectation
@Suite struct AlignmentTests {
@Test func judgeAlignmentMeetsThreshold() async throws {
let evaluation = TagAlignmentEvaluation()
let results = try await evaluation.run()
let kappa = try #require(results.aggregated["usefulness_kappa"])
// 0.6 = "substantial agreement" per Cohen (1960)
#expect(kappa >= 0.6, "Judge drifted: kappa=\(kappa). Refine the judge prompt.")
}
}Cohen's kappa requires both the human expert ratings and the model judge ratings to evaluate the exact same pre-generated outputs ā you must serialize evaluation outputs from a prior run and load them as a dataset rather than re-running inference. An alignment score below 0.6 is considered insufficient by convention; targeting ā„0.6 before shipping is the recommended threshold. The Evaluations framework runs inside Swift Testing / XCTest targets only, not in production app targets.
Apple Intelligence device required for model judge evaluator; Cohen's kappa aggregation itself has no hardware restriction
More iOS 27 APIs land every week.
Get notified when new capabilities are published ā no noise, just signal.