The Evaluations framework in Xcode 27 lets you generate synthetic test data and build robust evaluations for intelligence-powered and agentic Swift app features, enabling scalable dataset creation and quality tracking over time.
• Synthetic data generation via makeSamples/SampleGenerator removes the bottleneck of hand-writing evaluation examples, scaling from a handful of seed samples to hundreds automatically.
• Built-in validation closures let you enforce structural rules (tag counts, character limits, casing) on every generated sample, catching bad data before it pollutes your evaluation.
• The new Xcode 27 Evaluations Report provides side-by-side comparison of evaluation runs so you can detect regressions or improvements as you iterate on prompts and models.
The example seeds a small set of book reviews, uses SampleGenerator to synthesize 100 total samples, and validates each generated sample for review length, tag count, and tag casing before appending it to an expanded dataset.
import Evaluations
import FoundationModels
// MARK: - Domain model
@Generable
struct BookSample: Sendable {
    @Guide(description: "A written review of the book, at least 100 characters long.")
    var review: String
    @Guide(description: "3 to 8 lowercase genre/mood tags, e.g. [\"gothic\", \"suspense\"]")
    var tags: [String]
}
// MARK: - Seed dataset
let seedSamples: [ModelSample<String, [String]>] = [
    ModelSample(
        prompt: "A sweeping love story set in 19th-century England.",
        expectedOutput: ["romance", "classic", "social-commentary"]
    ),
    ModelSample(
        prompt: "A scientist creates life and faces the consequences.",
        expectedOutput: ["gothic", "horror", "philosophy"]
    ),
]
// MARK: - Generator configuration
let generator = SampleGenerator<String, [String]>(
    prompt: """
        Generate diverse book review samples covering a wide range of genres,
        moods, and tones. Reviews must be at least 100 characters long and
        vary in length. Each sample must include between 3 and 8 lowercase tags.
        """,
    dataset: seedSamples,
    targetCount: 100,
    sessionProvider: {
        // Swap in PrivateCloudComputeLanguageModel for larger context windows.
        let model = SystemLanguageModel.default
        let session = LanguageModelSession(model: model)
        return session
    },
    samplingStrategy: .random,
    validator: { sample in
        // Rule 1: review must be at least 100 characters
        guard sample.prompt.count >= 100 else { return .invalid(reason: "Review too short") }
        // Rule 2: between 3 and 8 tags
        guard (3...8).contains(sample.expectedOutput.count) else { return .invalid(reason: "Tag count out of range") }
        // Rule 3: all tags must be lowercase
        guard sample.expectedOutput.allSatisfy({ $0 == $0.lowercased() }) else { return .invalid(reason: "Tags must be lowercase") }
        return .valid
    }
)
// MARK: - Run generation
func buildExpandedDataset() async throws -> [ModelSample<String, [String]>] {
    // Start from the seeds; they count toward targetCount.
    var expandedDataset = seedSamples
    for try await sample in generator.run() {
        expandedDataset.append(sample)
    }
    let invalidCount = generator.invalidSamples.count
    // The "Valid" figure is the full dataset size, seeds included.
    print("Generation complete. Valid: \(expandedDataset.count), Invalid: \(invalidCount)")
    return expandedDataset
}

targetCount is the size of the full resulting dataset, including your seed samples, not the number of new samples to add. The sessionProvider may be called more than once if the context window is exhausted mid-run, so the instructions given to each session must be self-contained. The validator closure runs on each sample in isolation, so it cannot check aggregate properties of the full dataset, such as diversity.
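Because the validator sees only one sample at a time, aggregate properties such as tag diversity have to be checked after generation finishes. The following is a minimal sketch in plain Swift of one such check, not part of the Evaluations framework: `TaggedSample`, `tagFrequencies`, `overrepresentedTags`, and the `maxShare` threshold are all hypothetical names introduced for illustration, and mapping your generated samples into `TaggedSample` is left to you.

```swift
// Hypothetical stand-in for a generated sample (an assumption for this sketch);
// it is not a type from the Evaluations framework.
struct TaggedSample {
    let review: String
    let tags: [String]
}

/// Counts in how many samples each tag appears across the whole dataset.
func tagFrequencies(in samples: [TaggedSample]) -> [String: Int] {
    var counts: [String: Int] = [:]
    for sample in samples {
        // A tag repeated within one sample is counted once for that sample.
        for tag in Set(sample.tags) {
            counts[tag, default: 0] += 1
        }
    }
    return counts
}

/// Returns the tags that appear in more than `maxShare` of all samples,
/// sorted alphabetically. `maxShare` is an illustrative threshold.
func overrepresentedTags(in samples: [TaggedSample], maxShare: Double) -> [String] {
    guard !samples.isEmpty else { return [] }
    let limit = Double(samples.count) * maxShare
    return tagFrequencies(in: samples)
        .filter { Double($0.value) > limit }
        .map { $0.key }
        .sorted()
}
```

A non-empty result from `overrepresentedTags` suggests the generated data is skewed toward a few tags, which may be a reason to adjust the generation prompt or add more varied seed samples.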
On-device generation uses Apple Intelligence hardware; PrivateCloudCompute sessions require entitlements and network access.
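To make the targetCount note concrete, the arithmetic can be sketched as a small helper. This is plain Swift, not a framework call: `samplesToGenerate` is a hypothetical name, and clamping at zero when the seeds already meet the target is an assumption of this sketch, not documented framework behavior.

```swift
/// Number of new samples a run would need to produce, given that
/// `targetCount` is the total dataset size including the seeds.
/// Clamping to zero is an assumption made for this sketch.
func samplesToGenerate(targetCount: Int, seedCount: Int) -> Int {
    max(0, targetCount - seedCount)
}

// With the two seed samples from the example and targetCount: 100,
// 98 new samples are needed.
let needed = samplesToGenerate(targetCount: 100, seedCount: 2)
```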