Build a Chatbot#

Now you’ve made your first request, you’re ready to build a simple chatbot in Python, on top of our unified endpoints!

The Agent#

Under the hood, chatbots are very simple to implement. All LLM endpoints are stateless, and therefore the entire conversation history is repeatedly fed as input to the model. All that is required of the local agent is to store this history, and correctly pass it to the model.

We define a simple chatbot class below, with the only public function being run. The example assumes that your API key is stored in the environment variable UNIFY_KEY.

import os
import sys
import openai
import requests

class Agent:

    def __init__(self, model: str):
        self._message_history = []
        self._model = model
        self._paused = False
            self._key = os.environ["UNIFY_KEY"]
        except KeyError:
            raise Exception("Please set your UNIFY_KEY environment variable")
        self._base_url = ""
        self._headers = {
            "accept": "application/json",
            "Authorization": "Bearer " + self._key,
        self._oai_client = openai.OpenAI(

    def _get_credits(self):
        response = requests.get(self._base_url + "get_credits", headers=self._headers)
        return eval(response.content.decode())["credits"]

    def _process_input(self, inp: str, show_credits: bool, show_provider: bool):
        response =
        words = ''
        model = ''
        for tok in response:
            delta = tok.choices[0].delta
            model = tok.model
            if not delta:
                    'role': 'assistant',
                    'content': words
            elif delta.content:
                words += delta.content
                yield delta.content
        if show_credits:
            sys.stdout.write("\n(spent {:.6f} credits)".format(tok.usage['cost']))
        if show_provider:
            sys.stdout.write("\n(provider: {})".format(model))

    def _update_message_history(self, inp):
            'role': 'user',
            'content': inp

    def model(self):
        return self._model

    def model(self, value):
        error_message = "model string must use the OpenAI format: <uploaded_by>/<model_name>@<provider_name>"
        assert type(value) is str, error_message
        model_name, provider_name = value.split('/')[-1].split("@")
        assert len(model_name) and len(provider_name), error_message
        self._model = value

    # Public Methods

    def clear_chat_history(self):

    def run(self, show_credits: bool = False, show_provider: bool = False):
        if not self._paused:
            sys.stdout.write("Let's have a chat. (Enter `pause` to pause and `quit` to exit)\n")
            sys.stdout.write("Welcome back! (Remember, enter `pause` to pause and `quit` to exit)\n")
        self._paused = False
        while True:
            sys.stdout.write('> ')
            inp = input()
            if inp == 'quit':
            elif inp == 'pause':
                self._paused = True
            for word in self._process_input(inp, show_credits, show_provider):

Let’s Chat!#

Provided our environment variable UNIFY_KEY is set correctly, we can now simply instantiate this agent and chat with it, using the format <model_name>@<provider_name>, like so:

agent = Agent("llama-2-70b-chat@anyscale")

This will start an interactive session, where you can converse with the model:

Let's have a chat. (Enter `pause` to pause and `quit` to exit)
> Hi, nice to meet you. My name is Foo Barrymore, and I am 25 years old.
 Nice to meet you too, Foo! I'm just an AI, I don't have a personal name, but I'm here to help you with any questions or concerns you might have. How has your day been so far?
> How old am I?
 You are 25 years old, as you mentioned in your introduction.
> Your memory is astounding
 Thank you! I'm glad to hear that.

You can also see how many credits your prompt used. This option is set in the constructor, but it can be overwritten during the run command, as follows:

Each response from the chatbot will then be appended with the credits spent:

Let's have a chat. (Enter `pause` to pause and `quit` to exit)
> What is the capital of Spain?
 The capital of Spain is Madrid.
(spent 0.000014 credits)

Finally, you can switch providers half-way through the conversation easily. This can be useful to handle prompt of varying complexity.

For examplen we can start with a small model for answering simple questions, such as recalling facts, and then move to a larger model for a more complex task, such as creative writing.

agent = Agent("llama-2-7b-chat@anyscale")
Let's have a chat. (Enter `pause` to pause and `quit` to exit)
> What is the capital of Portugal?
 The capital of Portugal is Lisbon.
(spent 0.000002 credits)
> My name is José Mourinho.
 Okay. Nice to meet you José.
(spent 0.000002 credits)
> pause
agent.model = "gpt-4-turbo@openai"
Welcome back! (Remember, enter `pause` to pause and `quit` to exit)
> Please write me a poem about my life in Lisbon, using my name in the poem.
 Sure, here's a short poem about your life in Lisbon, using your name:

José Mourinho, a man of great renown,
In Lisbon, a city of ancient crown.
You walked its streets, with purpose and drive,
A man on a mission, with a burning fire inside.

You breathed in the scent of the ocean's spray,
And felt the warmth of the sun on your face each day.
You marveled at the beauty of the city's old town,
And felt at home, in this place you'd found.

You savored the flavors of the local cuisine,
And quenched your thirst with a glass of fine wine.
You listened to the music of the street performers,
And felt the rhythm of the city's vibrant pulse.

José Mourinho, a man of great ambition,
In Lisbon, a city of endless tradition.
You found your place, in this ancient land,
And made your mark, with a helping hand.

For in Lisbon, you found your home,
A place where your heart would forever roam.
José Mourinho, a man of great fame,
In Lisbon, a city that would forever bear your name.
(spent 0.000260 credits)

Switching between providers mid-conversation makes it much easier to maximize quality and runtime performance based on the latest metrics, and also save on costs!

In fact, you can automatically optimize for a metric of your choice with our dynamic routing modes.

For example, you can optimize for speed as follows:

agent.model = "llama-2-70b-chat@highest-tks-per-sec"

The flag show_provider ensures that the specific provider is printed at the end of each response. For example, sometimes anyscale might be the fastest, and at other times it might be together-ai or fireworks-ai. This flag enables you to keep track of what provider is being used under the hood.