The Evaluations framework is a new Swift testing system designed to measure the quality and reliability of intelligent features powered by generative AI models. It provides protocols, types, and Xcode test report integration to quantify how often your app's AI-driven features behave correctly across diverse input datasets.
⢠Traditional unit tests break with generative AI because the same input can produce different outputs ā Evaluations gives you statistical confidence instead of brittle assertions
⢠Built-in Swift Testing integration via the .evaluates trait and evaluation results bundles lets you enforce quality thresholds (e.g. "correct behavior ā„80% of the time") as CI-friendly test assertions
⢠Xcode's new Evaluations test report surfaces per-sample pass/fail details, model prompts, and full responses so you can diagnose regressions without manual inspection
Demonstrates how to build a basic Evaluation that measures whether an AI-powered book-tagging service generates an appropriate number of tags, then asserts a quality threshold using Swift Testing integration.
import Evaluations
import FoundationModels
import Testing
// MARK: - Model types
@Generable
struct BookTags {
@Guide("Between 3 and 8 short, single-word tags describing the book", count: 3...8)
var tags: [String]
}
// MARK: - The feature under evaluation
struct BookTaggingService {
let session = LanguageModelSession()
func generateTags(for review: String) async throws -> BookTags {
let prompt = "Generate tags for this book review: \(review)"
return try await session.respond(to: prompt, generating: BookTags.self).content
}
}
// MARK: - Evaluation definition
struct BookTaggingEvaluation: Evaluation {
typealias Input = String
typealias Output = BookTags
// Step 1: Define the code under measurement
func subject(from input: String) async throws -> BookTags {
try await BookTaggingService().generateTags(for: input)
}
// Step 2: Define input samples with expected outputs
var dataset: [ModelSample<String, BookTags>] {
[
ModelSample(
input: "A sweeping Regency-era romance about manners, marriage, and wit.",
expectedOutput: BookTags(tags: ["romance", "regency", "classic", "witty", "social"])
),
ModelSample(
input: "A gothic horror epistolary novel about a Transylvanian count who preys on the living.",
expectedOutput: BookTags(tags: ["horror", "gothic", "vampire", "classic", "suspense"])
),
ModelSample(
input: "A mother reads a swashbuckling treasure hunt adventure to her young son every night.",
expectedOutput: BookTags(tags: ["adventure", "pirates", "treasure", "classic", "family"])
)
]
}
// Step 3: Define per-sample metrics
var metrics: [any Evaluator<String, BookTags>] {
[
Metric(name: "TagCount") { sample, output in
let count = output.tags.count
return (3...8).contains(count) ? .passing : .failing
},
Metric(name: "NoMultiWordTags") { sample, output in
let hasMultiWord = output.tags.contains { $0.contains(" ") }
return hasMultiWord ? .failing : .passing
}
]
}
// Step 4: Summarise across all samples
func aggregateMetrics(using results: EvaluationResults<String, BookTags>) -> [AggregateMetric] {
[
AggregateMetric(name: "TagCountPassRate", value: results.averageScore(forMetric: "TagCount")),
AggregateMetric(name: "NoMultiWordPassRate", value: results.averageScore(forMetric: "NoMultiWordTags"))
]
}
}
// MARK: - Swift Testing integration
@Suite(.serialized)
struct BookTaggingEvaluationTests {
let evaluation = BookTaggingEvaluation()
let notes: [String: String] = ["model": "on-device", "guideVersion": "v2"]
@Test(.evaluates(BookTaggingEvaluation(), notes: ["model": "on-device"]))
func tagCountMeetsQualityBar(_ bundle: EvaluationResultsBundle) async throws {
let tagCountRate = bundle.aggregateValue(forMetric: "TagCountPassRate")
#expect(tagCountRate >= 0.8, "TagCount pass rate \(tagCountRate) is below the 80% threshold")
let noMultiWordRate = bundle.aggregateValue(forMetric: "NoMultiWordPassRate")
#expect(noMultiWordRate >= 0.9, "Multi-word tag rate \(noMultiWordRate) is below the 90% threshold")
}
}⢠Evaluations is a new framework ā import Evaluations must be added explicitly alongside Swift Testing ⢠The .evaluates test trait is only available inside Swift Testing @Test/@Suite contexts, not XCTest ⢠SampleGenerator uses a language model to synthesize samples, so results are non-deterministic and should be reviewed before committing to your dataset ⢠Optimization targets (e.g. 80%) are project-specific ā there is no universal threshold; set them based on your feature's risk tolerance
Apple Intelligence-capable device or simulator required when judging with on-device language models; dataset synthesis via SampleGenerator also requires Apple Intelligence hardware
More iOS 27 APIs land every week.
Get notified when new capabilities are published ā no noise, just signal.