I’m happy to share that you can now evaluate, compare, and select the best foundation models (FMs) for your use case in Amazon Bedrock. Model Evaluation on Amazon Bedrock is available today in preview.
Amazon Bedrock offers a choice of automatic evaluation and human evaluation. You can use automatic evaluation with predefined metrics such as accuracy, robustness, and toxicity. For subjective or custom metrics, such as friendliness, style, and alignment to brand voice, you can set up human evaluation workflows with just a few clicks.
Model evaluations are critical at all stages of development. As a developer, you now have evaluation tools available for building generative artificial intelligence (AI) applications. You can start by experimenting with different models in the playground environment. To iterate faster, add automatic evaluations of the models. Then, when you prepare for an initial launch or limited release, you can incorporate human reviews to help ensure quality.
Let me give you a quick tour of Model Evaluation on Amazon Bedrock.
Automatic model evaluation
With automatic model evaluation, you can bring your own data or use built-in, curated datasets and predefined metrics for specific tasks such as content summarization, question answering, text classification, and text generation. This takes away the heavy lifting of designing and running your own model evaluation benchmarks.
To get started, navigate to the Amazon Bedrock console, then select Model evaluation under Assessment & deployment in the left menu. Create a new model evaluation and choose Automatic.
Next, follow the setup dialog to choose the FM you want to evaluate and the type of task, for example, text summarization. Select the evaluation metrics and specify a dataset—either built-in or your own.
If you bring your own dataset, make sure it’s in JSON Lines format, with each line containing the key-value pairs needed for the task you want to evaluate. For example, to evaluate the model on a question-answer task, you would format your data as follows (category is optional):
{"referenceResponse":"Cantal","category":"Capitals","prompt":"Aurillac is the capital of"}
{"referenceResponse":"Bamiyan Province","category":"Capitals","prompt":"Bamiyan city is the capital of"}
{"referenceResponse":"Abkhazia","category":"Capitals","prompt":"Sokhumi is the capital of"}
…
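If you assemble the dataset programmatically, a small script helps get the format right. The following is a minimal Python sketch, using only the standard library and hypothetical example records and file name that you would replace with your own:

import json

# Hypothetical example records; replace with your own prompts, reference answers, and categories.
records = [
    {"referenceResponse": "Cantal", "category": "Capitals", "prompt": "Aurillac is the capital of"},
    {"referenceResponse": "Bamiyan Province", "category": "Capitals", "prompt": "Bamiyan city is the capital of"},
]

# Write one JSON object per line (JSON Lines), the format expected for custom evaluation datasets.
with open("qa-dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")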
Then, create and run the evaluation job to understand the model’s task-specific performance. Once the evaluation job is complete, you can review the results in the model evaluation report.
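The console flow above is all you need during the preview. For reference, if you prefer to script job creation, the boto3 Bedrock client exposes a create_evaluation_job operation; the sketch below is only an outline of what an automatic evaluation job could look like, with placeholder names, ARNs, bucket locations, and configuration values that you would replace with your own (and that may differ from the preview API shape):

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# All identifiers below are placeholders, not real resources.
response = bedrock.create_evaluation_job(
    jobName="capitals-qa-eval",
    roleArn="arn:aws:iam::123456789012:role/MyBedrockEvalRole",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "CapitalsDataset",
                        "datasetLocation": {"s3Uri": "s3://my-eval-bucket/qa-dataset.jsonl"},
                    },
                    "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [{"bedrockModel": {"modelIdentifier": "anthropic.claude-v2"}}]
    },
    outputDataConfig={"s3Uri": "s3://my-eval-bucket/results/"},
)
print(response["jobArn"])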
Human model evaluation
For human evaluation, you can have Amazon Bedrock set up human review workflows with a few clicks. You can bring your own datasets and define custom evaluation metrics, such as relevance, style, or alignment to brand voice. You also have the choice to either leverage your own internal teams as reviewers or engage an AWS managed team. This takes away the tedious effort of building and operating human evaluation workflows.
To get started, create a new model evaluation and select Human: Bring your own team or Human: AWS managed team.
If you choose an AWS managed team for human evaluation, describe your model evaluation needs, including task type, expertise of the work team, and the approximate number of prompts, along with your contact information. In the next step, an AWS expert will reach out to discuss your model evaluation project requirements in more detail. Upon review, the team will share a custom quote and project timeline.
If you choose to bring your own team, follow the setup dialog to choose the FMs you want to evaluate and the type of task, for example, text summarization. Then, select the evaluation metrics, upload your test dataset, and set up the work team.
For human evaluation, the example data shown earlier would again be formatted in JSON Lines, like this (category and referenceResponse are optional):
{"prompt":"Aurillac is the capital of","referenceResponse":"Cantal","category":"Capitals"}
{"prompt":"Bamiyan city is the capital of","referenceResponse":"Bamiyan Province","category":"Capitals"}
{"prompt":"Senftenberg is the capital of","referenceResponse":"Oberspreewald-Lausitz","category":"Capitals"}
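Before handing the dataset to your work team, it can be worth a quick sanity check that every line parses as JSON and contains a prompt. Here is a small Python sketch, assuming a hypothetical file name:

import json

# Verify that each line is valid JSON and includes the required "prompt" key;
# "referenceResponse" and "category" are optional for human evaluation.
with open("human-eval-dataset.jsonl") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)  # raises ValueError if the line is not valid JSON
        assert "prompt" in record, f"line {line_number} is missing 'prompt'"

print("Dataset looks well-formed.")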
Once the human evaluation is complete, Amazon Bedrock generates an evaluation report showing the model’s performance against your selected metrics.
Things to know
Here are a couple of important things to know:
Model support – During preview, you can evaluate and compare text-based large language models (LLMs) available on Amazon Bedrock. You can select one model for each automatic evaluation job and up to two models for each human evaluation job using your own team. For human evaluation using an AWS managed team, you can specify custom project requirements.
Pricing – During preview, AWS charges only for the model inference needed to perform the evaluation (processed input and output tokens at on-demand pricing). There are no separate charges for automatic or human evaluation. Amazon Bedrock Pricing has all the details.
Join the preview
Automatic evaluation and human evaluation using your own work team are available today in public preview in AWS Regions US East (N. Virginia) and US West (Oregon). Human evaluation using an AWS managed team is available in public preview in AWS Region US East (N. Virginia). To learn more, visit the Amazon Bedrock Developer Experience web page and check out the User Guide.
Get started
Log in to the AWS Management Console and start exploring model evaluation in Amazon Bedrock today!
— Antje