
AI Procurement and Vendor Evaluation

Choosing an AI vendor on benchmark performance alone is one of the most reliable ways to end up with the wrong vendor. This module gives you a complete evaluation framework covering quality, pricing, data handling, SLAs, and exit planning.

Layer 1: Surface

When you procure AI capabilities, you are making a decision that will be very difficult to reverse. The vendor’s model will become embedded in your prompts, your retrieval system, your integrations, and your team’s mental model of how things work. Switching vendors 18 months in is not a simple API swap: it is a project.

The evaluation criteria that matter most are not the ones vendors lead with. Published benchmarks show how a model performs on standard academic tasks. Your use case is not an academic task. The first rule of AI procurement: test on your data before you commit.

Beyond model quality, there are five dimensions that determine whether a vendor relationship will work:

  1. Pricing model: How are you charged? Per token, per seat, enterprise contract? What happens to cost as your volume grows?
  2. Data handling: Does the vendor train on your data? Where is it stored? What data retention and deletion policies apply?
  3. SLA: What uptime, latency, and support tier are guaranteed? What are the financial consequences of SLA breaches?
  4. Model stability: How does the vendor manage model versioning? How much notice do they give before deprecating a model? Can you pin to a specific version?
  5. Vendor viability: How financially stable is the vendor? What is the concentration risk if they are your only AI provider?

Why it matters

Organisations that evaluate vendors primarily on benchmark performance and price sign contracts they later regret: production quality turns out lower than expected, data policies prove problematic, or the relationship has no exit path.

Production Gotcha

Common Gotcha: Vendor benchmarks measure average-case performance on general tasks. Your use case is not average. Before signing an enterprise contract, run your own evaluation on a representative sample of your production queries: the performance gap between benchmark and production is routinely 15–30 percentage points.

The assumption: β€œIf this model is top-rated on the leaderboard, it will be the best for our use case.” The reality: leaderboards measure general tasks; your specific use case may favour a different model entirely.


Layer 2: Guided

The full vendor evaluation framework

Dimension 1: Model quality on your task

Do not rely on published benchmarks. Run your own evaluation:

  1. Collect 100–500 representative examples of your actual production queries
  2. Include your hardest cases, not just representative cases
  3. Define a clear scoring rubric: what does β€œgood” look like for your task?
  4. Run the evaluation on every vendor you are considering, under the same conditions
  5. Calculate quality scores per vendor, including failure distribution (where does each fail?)

Budget for this evaluation. It typically takes one to two weeks of engineering time and some inference cost. That is far cheaper than discovering, six months into a contract, that you chose the wrong vendor.
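The five steps above can be sketched as a small evaluation harness. This is an illustrative skeleton, not any vendor's SDK: `vendors` maps a name to a hypothetical callable that takes a query and returns a response, and `score` is a placeholder for your own rubric returning 0.0–1.0.

```python
"""Minimal vendor evaluation harness (illustrative sketch)."""
from dataclasses import dataclass, field
from statistics import mean
from collections import Counter


@dataclass
class EvalResult:
    vendor: str
    scores: list = field(default_factory=list)
    failures: Counter = field(default_factory=Counter)  # failure category -> count


def run_eval(vendors, examples, score, pass_threshold=0.7):
    """Run every vendor over the same examples under the same conditions.

    `examples` is a list of dicts with "query", "expected", and an
    optional "category" used to see *where* each vendor fails.
    """
    results = []
    for name, generate in vendors.items():
        result = EvalResult(vendor=name)
        for ex in examples:
            response = generate(ex["query"])
            s = score(response, ex["expected"])
            result.scores.append(s)
            if s < pass_threshold:
                result.failures[ex.get("category", "uncategorised")] += 1
        results.append(result)
    # Report mean quality and the failure distribution, not just an average.
    for r in sorted(results, key=lambda r: mean(r.scores), reverse=True):
        print(f"{r.vendor}: mean={mean(r.scores):.2f}, failures={dict(r.failures)}")
    return results
```

The key design point is that every vendor sees the identical examples and the identical rubric; comparing scores produced under different conditions tells you nothing.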

Dimension 2: Pricing model

| Pricing element | What to ask |
| --- | --- |
| Unit cost | Per-token, per-request, or per-seat? What is the all-in cost at your current volume and projected volume? |
| Volume discounts | At what volume does the price break? How does this align with your growth trajectory? |
| Burst pricing | What happens during usage spikes? Are there rate limits that affect your application? |
| Contract minimums | What is the minimum commitment? What happens if you exceed it or fall short? |
| Price stability | Can the vendor change pricing during the contract period? What notice is required? |
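Answering the "current volume and projected volume" question is simple arithmetic, but worth writing down, because tiered per-token pricing makes cost non-linear in volume. A minimal sketch, with an entirely hypothetical price list:

```python
def monthly_token_cost(tokens, tiers):
    """All-in monthly cost under tiered per-token pricing.

    `tiers` is a list of (volume_ceiling, price_per_million_tokens)
    tuples in ascending order; the last ceiling may be float("inf").
    All numbers below are hypothetical, not any vendor's actual prices.
    """
    cost, floor = 0.0, 0
    for ceiling, price_per_m in tiers:
        band = min(tokens, ceiling) - floor  # tokens billed in this tier
        if band <= 0:
            break
        cost += band / 1_000_000 * price_per_m
        floor = ceiling
    return cost


# Hypothetical tiers: first 100M tokens at $10/M, everything beyond at $6/M.
tiers = [(100_000_000, 10.0), (float("inf"), 6.0)]
monthly_token_cost(50_000_000, tiers)   # 500.0
monthly_token_cost(500_000_000, tiers)  # 3400.0
```

Run this at today's volume and at your projected volume; the gap between the two is the number that should anchor your contract-minimum negotiation.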

Dimension 3: Data handling

This is the dimension most organisations under-scrutinise during procurement and most regret later.

Data handling checklist:
β”œβ”€β”€ Training: Does the vendor use your data to train or fine-tune models?
β”‚   β”œβ”€β”€ Acceptable if: you consent, it's anonymised, you receive benefit
β”‚   └── Unacceptable if: without notice or consent; for general model training
β”‚
β”œβ”€β”€ Retention: How long does the vendor retain your prompts and outputs?
β”‚   β”œβ”€β”€ Ask for: specific retention periods; opt-out for logging
β”‚   └── Red flag: indefinite retention without deletion capability
β”‚
β”œβ”€β”€ Residency: Where is your data processed and stored?
β”‚   β”œβ”€β”€ Ask for: specific regions; commitment not to transfer outside agreed regions
β”‚   └── Red flag: no region guarantee; data processed in legally incompatible jurisdictions
β”‚
β”œβ”€β”€ DPA: Does the vendor offer a Data Processing Agreement?
β”‚   β”œβ”€β”€ Require for: any personal data; any regulated data; any confidential business data
β”‚   └── Push back on: DPAs that exclude liability; broad purpose clauses
β”‚
└── Subprocessors: Who else has access to your data?
    β”œβ”€β”€ Ask for: subprocessor list; notification of changes
    └── Red flag: no subprocessor disclosure

Dimension 4: SLA

| SLA element | Acceptable minimum | What to push for |
| --- | --- | --- |
| Uptime | 99.5% monthly | 99.9% monthly with credits |
| Latency | p95 defined | p99 defined; degradation alerts |
| Support | Email with SLA | Dedicated support with response SLA |
| Incident communication | Status page | Proactive notification to affected customers |
| SLA remedy | Service credits | Service credits plus exit right on repeated breach |
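An uptime percentage on its own hides how much downtime it actually permits. The conversion is one line of arithmetic (assuming a 30-day month, i.e. 720 hours):

```python
def allowed_downtime_minutes(uptime_pct, hours_in_month=720):
    """Downtime budget implied by a monthly uptime percentage (30-day month)."""
    return (1 - uptime_pct / 100) * hours_in_month * 60


allowed_downtime_minutes(99.5)  # 216.0 minutes, i.e. 3.6 hours per month
allowed_downtime_minutes(99.9)  # 43.2 minutes per month
```

Put concretely: the "acceptable minimum" of 99.5% permits three and a half hours of outage every month with no remedy owed, which is why the uptime number only means something alongside the credits and exit rights in the table above.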

Dimension 5: Model stability

Model versions change. Without version pinning, your application can change behaviour overnight when the vendor updates their model.

| Stability element | What to verify |
| --- | --- |
| Version pinning | Can you pin to a specific model version? How long is that version guaranteed? |
| Deprecation notice | How much advance notice is given before a model version is discontinued? (8–12 weeks is typical; push for 6 months for production-critical uses) |
| Rollback capability | If a new version causes quality regression, can you roll back? |
| Changelog | Does the vendor publish model change logs so you know what changed? |

Lock-in vectors and exit planning

Every vendor relationship creates lock-in. Identify yours and build portability in from the start:

| Lock-in type | Mitigation |
| --- | --- |
| Proprietary API | Abstract the vendor behind an internal interface layer; keep vendor-specific code isolated |
| Custom fine-tuned model | Store training data in portable format; document the training process independently |
| Proprietary data format | Export your data regularly in standard formats; test the export |
| Prompt optimised for one provider | Maintain eval suites; test prompts against alternative providers quarterly |
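The "internal interface layer" mitigation for proprietary APIs can be sketched in a few lines. Everything here is hypothetical (the vendor client and its `generate` method are stand-ins, not a real SDK); the pattern is what matters: application code depends only on your interface, and all vendor-specific translation lives in one adapter.

```python
from typing import Protocol


class CompletionProvider(Protocol):
    """Internal interface; application code depends only on this."""
    def complete(self, prompt: str, max_tokens: int = 256) -> str: ...


class VendorAAdapter:
    """Hypothetical adapter: all vendor-specific details are isolated here."""
    def __init__(self, client):
        self._client = client  # the vendor's SDK client, injected

    def complete(self, prompt: str, max_tokens: int = 256) -> str:
        # Translate the internal call into this vendor's (imagined) API shape.
        return self._client.generate(prompt, limit=max_tokens)


def answer_ticket(provider: CompletionProvider, ticket_text: str) -> str:
    """Application code: no vendor names, no vendor SDK imports."""
    return provider.complete(f"Draft a reply to this ticket:\n{ticket_text}")
```

Switching vendors then means writing one new adapter and re-running your eval suite, instead of hunting vendor calls through the whole codebase.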

Exit planning question to answer before signing: if this vendor became unavailable tomorrow, how many weeks would it take to switch to an alternative? If the answer is β€œmore than a few months,” you have a concentration risk that needs mitigation.


Layer 3: Deep Dive

The benchmark vs production gap: why it happens

Published benchmarks use carefully curated datasets designed to test specific capabilities in a controlled way. Production AI use cases differ in several dimensions that benchmarks do not capture:

  • Input noise: Production inputs contain typos, unconventional formatting, mixed languages, incomplete information, and ambiguity. Benchmark inputs are clean.
  • Distribution shift: Your use case has a specific input distribution that may not match the benchmark distribution. A model that excels at academic reasoning may underperform on your technical support queries.
  • Failure mode relevance: Benchmarks measure average performance; your use case may care disproportionately about specific failure modes that are rare in the benchmark distribution but common in yours.
  • Latency under load: Benchmarks measure quality, not latency. Your production requirement is both.

The performance gap between benchmark and production is consistently larger than organisations expect. Running your own evaluation before committing is not optional diligence: it is the only way to know what you are actually buying.

Data processing agreement: what to push back on

A well-structured DPA should give you:

  • Clear description of processing purposes (narrow, specific)
  • Data residency commitments with named regions
  • Retention periods with automated deletion commitments
  • Subprocessor list with notification of changes
  • Audit rights (at least the right to receive audit reports)
  • Liability clause that is not entirely excluded

Push back on:

  • Broad purpose clauses that allow the vendor to use your data for any β€œproduct improvement”
  • Liability exclusions that eliminate all vendor accountability for data breaches
  • No right to terminate for material data policy changes

Financial stability and concentration risk

Enterprise AI vendors range from well-capitalised public companies to pre-revenue startups. For production-critical use cases, vendor financial stability matters:

  • Startups can fail, pivot, or be acquired, taking your integration with them
  • Even well-funded vendors can discontinue products or change pricing dramatically
  • Concentration in a single vendor means a single point of failure for your AI-dependent features

Practical mitigations: diversify across vendors for critical use cases; maintain a list of alternatives that could substitute at reasonable effort; structure contracts with termination rights if the vendor is acquired or materially changes pricing.


AI Procurement and Vendor Evaluation: Check your understanding

Q1

Before signing an enterprise AI contract, you test the vendor's model on 20 example queries from your use case. The model performs well on all 20. Is this a sufficient vendor evaluation?

Q2

A vendor's data processing agreement states that they may use customer data for 'product improvement and model development.' Your application processes customer support tickets that may contain personal information. Should you sign this DPA as-is?

Q3

You evaluate three AI vendors on published benchmarks and choose the one with the highest score. Six months into production, your team's quality metrics are disappointing. What most likely explains the gap?

Q4

Your organisation's AI application is directly dependent on a vendor's proprietary API, with no abstraction layer. The vendor announces a significant API change in 6 weeks. What is the cost of not having built an abstraction layer?

Q5

A vendor guarantees 99.5% monthly uptime in their SLA. Your AI feature is used in a customer-facing workflow. What should you verify about this guarantee beyond the percentage?