How to build a Llama 2 chatbot

Generative AI has been widely adopted, and the development of new, larger, and improved LLMs is advancing rapidly, making it an exciting time for developers.

You may have heard of the recent release of Llama 2, an open-source large language model (LLM) by Meta. Because the model is openly available, you can build on, modify, deploy, and use a local copy of it, or host it on cloud servers (e.g., Replicate).

While it’s free to download and use, it’s worth noting that self-hosting the Llama 2 model requires a powerful computer with high-end GPUs to perform computations in a timely manner. An alternative is to host the models on a cloud platform like Replicate and use the LLM via API calls. In particular, the three Llama 2 models (llama-7b-v2-chat, llama-13b-v2-chat, and llama-70b-v2-chat) are hosted on Replicate.

In this post, we’ll build a Llama 2 chatbot in Python using Streamlit for the frontend, while the LLM backend is handled through API calls to the Llama 2 model hosted on Replicate. You’ll learn how to:

  1. Get a Replicate API token
  2. Set up the coding environment
  3. Build the app
  4. Set the API token
  5. Deploy the app

What is Llama 2?

Meta released the second version of their open-source Llama language model on July 18, 2023. They’re democratizing access to this model by making it free to the community for both research and commercial use. They also prioritize the transparent and responsible use of AI, as evidenced by their Responsible Use Guide.

Here are the five key features of Llama 2:

  1. Llama 2 outperforms other open-source LLMs in benchmarks for reasoning, coding proficiency, and knowledge tests.
  2. The model was trained on almost twice the data of version 1, totaling 2 trillion tokens. Additionally, the training included over 1 million new human annotations and fine-tuning for chat completions.
  3. The model comes in three sizes: 7, 13, and 70 billion parameters.
  4. Llama 2 supports longer context lengths, up to 4096 tokens.
  5. Version 2 has a more permissive license than version 1, allowing for commercial use.

App overview

Here is a high-level overview of the Llama 2 chatbot app:

  1. The user provides two inputs: (1) a Replicate API token (if requested) and (2) a prompt (i.e., a question).
  2. An API call is made to the Replicate server, where the prompt input is submitted and the resulting LLM-generated response is obtained and displayed in the app.

Let's take a look at the app in action:

  1. Go to https://llama2.streamlit.app/
  2. Enter your Replicate API token if prompted by the app.
  3. Enter your message prompt in the chat box, as shown in the screencast below.

1. Get a Replicate API token

Getting your Replicate API token is a simple 3-step process:

  1. Go to https://replicate.com/signin/.
  2. Sign in with your GitHub account.
  3. Proceed to the API tokens page and copy your API token.

2. Set up the coding environment

Local development

To set up a local coding environment, enter the following command into a command line prompt:

pip install streamlit replicate

NOTE: Make sure to have Python version 3.8 or higher pre-installed.
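
Before launching the app, it can help to confirm that the interpreter and both packages are in place. The following is a minimal sketch using only the standard library. The helper name check_environment is hypothetical and not part of Streamlit or Replicate.

```python
import importlib.util
import sys

# Minimum interpreter version stated in the note above.
MIN_PYTHON = (3, 8)


def check_environment(packages=("streamlit", "replicate")):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append("Python %d.%d or higher is required" % MIN_PYTHON)
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        if importlib.util.find_spec(name) is None:
            problems.append(f"Missing package: {name} (try: pip install {name})")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Running the script prints nothing when everything is installed, and one line per problem otherwise.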

Cloud development

To set up a cloud environment, deploy using the Streamlit Community Cloud with the help of the Streamlit app template (read more here).

Add a requirements.txt file to your GitHub repo and include the following prerequisite libraries:

streamlit
replicate

3. Build the app

The Llama 2 chatbot app uses a total of 68 lines of code to build:

import streamlit as st
import replicate
import os
# App title
st.set_page_config(page_title="🦙💬 Llama 2 Chatbot")
# Replicate Credentials
with st.sidebar:
    st.title('🦙💬 Llama 2 Chatbot')
    if 'REPLICATE_API_TOKEN' in st.secrets:
        st.success('API key already provided!', icon='✅')
        replicate_api = st.secrets['REPLICATE_API_TOKEN']
    else:
        replicate_api = st.text_input('Enter Replicate API token:', type='password')
        if not replicate_api.startswith('r8_'):
            st.warning('Please enter your credentials!', icon='⚠️')
        else:
            st.success('Proceed to entering your prompt message!', icon='👉')
    st.markdown('📖 Learn how to build this app in this [blog](#link-to-blog)!')
os.environ['REPLICATE_API_TOKEN'] = replicate_api
# Store LLM generated responses
if "messages" not in st.session_state.keys():
    st.session_state.messages = [{"role": "assistant", "content": "How may I assist you today?"}]
# Display or clear chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])
def clear_chat_history():
    st.session_state.messages = [{"role": "assistant", "content": "How may I assist you today?"}]
st.sidebar.button('Clear Chat History', on_click=clear_chat_history)
# Function for generating LLaMA2 response
# Refactored from <https://github.com/a16z-infra/llama2-chatbot>
def generate_llama2_response(prompt_input):
    string_dialogue = "You are a helpful assistant. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'."
    for dict_message in st.session_state.messages:
        if dict_message["role"] == "user":
            string_dialogue += "User: " + dict_message["content"] + "\n\n"
        else:
            string_dialogue += "Assistant: " + dict_message["content"] + "\n\n"
    output = replicate.run('a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5', 
                           input={"prompt": f"{string_dialogue} {prompt_input} Assistant: ",
                                  "temperature":0.1, "top_p":0.9, "max_length":512, "repetition_penalty":1})
    return output
# User-provided prompt
if prompt := st.chat_input(disabled=not replicate_api):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)
# Generate a new response if last message is not from assistant
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = generate_llama2_response(prompt)
            placeholder = st.empty()
            full_response = ''
            for item in response:
                full_response += item
                placeholder.markdown(full_response)
            placeholder.markdown(full_response)
    message = {"role": "assistant", "content": full_response}
    st.session_state.messages.append(message)

Import necessary libraries

First, import the necessary libraries:

  • streamlit - a low-code web framework used for creating the web frontend.
  • replicate - the Python client for Replicate, an ML model hosting platform, used to call the model via the API.
  • os - the operating system module, used to store the API token in an environment variable.

import streamlit as st
import replicate
import os

Define the app title

The title of the app displayed in the browser is set using the page_title parameter of st.set_page_config():

# App title
st.set_page_config(page_title="🦙💬 Llama 2 Chatbot")

Define the web app frontend for accepting the API token

When designing the chatbot app, divide the app elements by placing the app title and the text input box for the Replicate API token in the sidebar, and the chat input in the main panel. To do this, place the sidebar statements under with st.sidebar: and follow these steps:

1. Define the app title using the st.title() method.

2. Use if-else statements to conditionally display either:

  • A success message in a green box that reads API key already provided! for the if statement.
  • A warning message in a yellow box along with a text input box asking for the API token, as none was detected in Secrets, for the else statement.

Use a nested if-else statement to detect whether an API token was entered into the text box (the code checks that it starts with r8_), and if so, display a success message:

# Replicate Credentials
with st.sidebar:
    st.title('🦙💬 Llama 2 Chatbot')
    if 'REPLICATE_API_TOKEN' in st.secrets:
        st.success('API key already provided!', icon='✅')
        replicate_api = st.secrets['REPLICATE_API_TOKEN']
    else:
        replicate_api = st.text_input('Enter Replicate API token:', type='password')
        if not replicate_api.startswith('r8_'):
            st.warning('Please enter your credentials!', icon='⚠️')
        else:
            st.success('Proceed to entering your prompt message!', icon='👉')
    st.markdown('📖 Learn how to build this app in this [blog](#link-to-blog)!')
os.environ['REPLICATE_API_TOKEN'] = replicate_api
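
To make the branching above easier to follow (and to test without a running app), the same decision can be sketched as a plain function. This is an illustrative sketch, not part of the original app: sidebar_token_status is a hypothetical name, and an ordinary dict stands in for st.secrets.

```python
def sidebar_token_status(secrets, typed_token=""):
    """Decide which sidebar message to show and which token to use.

    `secrets` is any mapping (a stand-in for st.secrets); `typed_token`
    is whatever the user entered in the text box.
    Returns a (level, message, token) tuple.
    """
    if "REPLICATE_API_TOKEN" in secrets:
        return "success", "API key already provided!", secrets["REPLICATE_API_TOKEN"]
    if not typed_token.startswith("r8_"):
        return "warning", "Please enter your credentials!", typed_token
    return "success", "Proceed to entering your prompt message!", typed_token


# Token found in Secrets: the text box is never consulted.
from_secrets = sidebar_token_status({"REPLICATE_API_TOKEN": "r8_example"})
# Nothing in Secrets and nothing typed yet: show the warning.
empty_input = sidebar_token_status({}, "")
# Nothing in Secrets, but the typed value has the expected prefix.
typed_ok = sidebar_token_status({}, "r8_typed")
```

Note that the prefix check only tests the token's shape. It does not confirm that Replicate will accept the token.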

Store, display, and clear chat messages

  1. The first code block creates an initial session state to store the LLM-generated responses as part of the chat message history.
  2. The next code block displays messages (via st.chat_message()) from the chat history by iterating through the messages variable in the session state.
  3. The last code block creates a Clear Chat History button in the sidebar, allowing users to clear the chat history by leveraging the callback function defined on the preceding line.

# Store LLM generated responses
if "messages" not in st.session_state.keys():
    st.session_state.messages = [{"role": "assistant", "content": "How may I assist you today?"}]
# Display or clear chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])
def clear_chat_history():
    st.session_state.messages = [{"role": "assistant", "content": "How may I assist you today?"}]
st.sidebar.button('Clear Chat History', on_click=clear_chat_history)
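
The reason for the "messages" not in check is that Streamlit reruns the script on every interaction, so the history must only be created once. Here is a small sketch of that behavior, assuming a plain dict as a stand-in for st.session_state. The helper names init_messages and clear_messages are hypothetical.

```python
GREETING = {"role": "assistant", "content": "How may I assist you today?"}


def init_messages(state):
    """Create the chat history on the first run; leave it untouched on reruns."""
    if "messages" not in state:
        state["messages"] = [dict(GREETING)]


def clear_messages(state):
    """Reset the chat history to the initial greeting (the button callback)."""
    state["messages"] = [dict(GREETING)]


state = {}  # plain dict standing in for st.session_state
init_messages(state)
state["messages"].append({"role": "user", "content": "Hi!"})
init_messages(state)  # simulated rerun: the existing history is kept
history_after_rerun = len(state["messages"])
clear_messages(state)  # simulated button click: the history is discarded
```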

Create the LLM response generation function

Next, create the generate_llama2_response() custom function to generate the LLM’s response. It takes a user prompt as input, builds a dialog string based on the existing chat history, and calls the model using the replicate.run() function.

The model returns a generated response:

# Function for generating LLaMA2 response
# Refactored from <https://github.com/a16z-infra/llama2-chatbot>
def generate_llama2_response(prompt_input):
    string_dialogue = "You are a helpful assistant. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'."
    for dict_message in st.session_state.messages:
        if dict_message["role"] == "user":
            string_dialogue += "User: " + dict_message["content"] + "\n\n"
        else:
            string_dialogue += "Assistant: " + dict_message["content"] + "\n\n"
    output = replicate.run('a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5', 
                           input={"prompt": f"{string_dialogue} {prompt_input} Assistant: ",
                                  "temperature":0.1, "top_p":0.9, "max_length":512, "repetition_penalty":1})
    return output
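
To see what prompt string the model actually receives, the dialogue-building step can be separated from the API call. The sketch below is one possible variation, not the original author's code: build_prompt is a hypothetical helper, and it relies on the chat history alone, since in the app the latest user message is already in the session state by the time the function runs.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. You do not respond as 'User' or "
    "pretend to be 'User'. You only respond once as 'Assistant'."
)


def build_prompt(messages, system_prompt=SYSTEM_PROMPT):
    """Flatten a list of {"role", "content"} dicts into one prompt string."""
    parts = [system_prompt]
    for message in messages:
        speaker = "User" if message["role"] == "user" else "Assistant"
        parts.append(f"{speaker}: {message['content']}")
    # Trailing cue so the model continues the dialogue as the assistant.
    parts.append("Assistant: ")
    return "\n\n".join(parts)


history = [
    {"role": "assistant", "content": "How may I assist you today?"},
    {"role": "user", "content": "What is Streamlit?"},
]
prompt = build_prompt(history)
```

The resulting string could then be passed as the "prompt" value in the input dictionary of replicate.run().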

Accept prompt input

The chat input box is displayed, allowing the user to enter a prompt. Any prompt entered by the user is added to the session state messages:

# User-provided prompt
if prompt := st.chat_input(disabled=not replicate_api):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

Generate a new LLM response

If the last message wasn’t from the assistant, the assistant will generate a new response. While it’s formulating a response, a spinner widget will be displayed. Finally, the assistant's response will be displayed in the chat and added to the session state messages:

# Generate a new response if last message is not from assistant
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = generate_llama2_response(prompt)
            placeholder = st.empty()
            full_response = ''
            for item in response:
                full_response += item
                placeholder.markdown(full_response)
            placeholder.markdown(full_response)
    message = {"role": "assistant", "content": full_response}
    st.session_state.messages.append(message)
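
The loop above works because the response arrives as a sequence of text chunks, and the placeholder is redrawn with the text accumulated so far. The following sketch shows that accumulation on its own. The fake_stream generator is a hypothetical stand-in for the iterator returned by the model call.

```python
def accumulate(chunks):
    """Yield the growing response text after each chunk arrives."""
    full_response = ""
    for chunk in chunks:
        full_response += chunk
        yield full_response


def fake_stream():
    """Hypothetical stand-in for the chunks produced by the model."""
    yield from ["Stream", "lit ", "is ", "a ", "Python ", "library."]


# Each snapshot is what the placeholder would display at that moment.
snapshots = list(accumulate(fake_stream()))
final_text = snapshots[-1]
```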

And that’s a wrap for the standard Llama 2 chatbot app. Next up, the Pro version...

Building the Pro version

Let's explore how we can add input widgets to the app, allowing users to experiment with different models and parameters to customize the chatbot for their specific use case.

To try the Pro version, go here or play with the app below:

Here is the code:

import streamlit as st
import replicate
import os
# App title
st.set_page_config(page_title="🦙💬 Llama 2 Chatbot")
# Replicate Credentials
with st.sidebar:
    st.title('🦙💬 Llama 2 Chatbot')
    if 'REPLICATE_API_TOKEN' in st.secrets:
        st.success('API key already provided!', icon='✅')
        replicate_api = st.secrets['REPLICATE_API_TOKEN']
    else:
        replicate_api = st.text_input('Enter Replicate API token:', type='password')
        if not replicate_api.startswith('r8_'):
            st.warning('Please enter your credentials!', icon='⚠️')
        else:
            st.success('Proceed to entering your prompt message!', icon='👉')
    # Refactored from <https://github.com/a16z-infra/llama2-chatbot>
    st.subheader('Models and parameters')
    selected_model = st.sidebar.selectbox('Choose a Llama2 model', ['Llama2-7B', 'Llama2-13B', 'Llama2-70B'], key='selected_model')
    if selected_model == 'Llama2-7B':
        llm = 'a16z-infra/llama7b-v2-chat:4f0a4744c7295c024a1de15e1a63c880d3da035fa1f49bfd344fe076074c8eea'
    elif selected_model == 'Llama2-13B':
        llm = 'a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5'
    else:
        llm = 'replicate/llama70b-v2-chat:e951f18578850b652510200860fc4ea62b3b16fac280f83ff32282f87bbd2e48'
    
    temperature = st.sidebar.slider('temperature', min_value=0.01, max_value=5.0, value=0.1, step=0.01)
    top_p = st.sidebar.slider('top_p', min_value=0.01, max_value=1.0, value=0.9, step=0.01)
    max_length = st.sidebar.slider('max_length', min_value=64, max_value=4096, value=512, step=8)
    
    st.markdown('📖 Learn how to build this app in this [blog](#link-to-blog)!')
os.environ['REPLICATE_API_TOKEN'] = replicate_api
# Store LLM generated responses
if "messages" not in st.session_state.keys():
    st.session_state.messages = [{"role": "assistant", "content": "How may I assist you today?"}]
# Display or clear chat messages
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])
def clear_chat_history():
    st.session_state.messages = [{"role": "assistant", "content": "How may I assist you today?"}]
st.sidebar.button('Clear Chat History', on_click=clear_chat_history)
# Function for generating LLaMA2 response
def generate_llama2_response(prompt_input):
    string_dialogue = "You are a helpful assistant. You do not respond as 'User' or pretend to be 'User'. You only respond once as 'Assistant'."
    for dict_message in st.session_state.messages:
        if dict_message["role"] == "user":
            string_dialogue += "User: " + dict_message["content"] + "\n\n"
        else:
            string_dialogue += "Assistant: " + dict_message["content"] + "\n\n"
    output = replicate.run(llm, 
                           input={"prompt": f"{string_dialogue} {prompt_input} Assistant: ",
                                  "temperature":temperature, "top_p":top_p, "max_length":max_length, "repetition_penalty":1})
    return output
# User-provided prompt
if prompt := st.chat_input(disabled=not replicate_api):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)
# Generate a new response if last message is not from assistant
if st.session_state.messages[-1]["role"] != "assistant":
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = generate_llama2_response(prompt)
            placeholder = st.empty()
            full_response = ''
            for item in response:
                full_response += item
                placeholder.markdown(full_response)
            placeholder.markdown(full_response)
    message = {"role": "assistant", "content": full_response}
    st.session_state.messages.append(message)

You may notice that we've only slightly modified the app by adding a code block to the sidebar. This allows users to select a specific model using st.sidebar.selectbox() and customize parameters using st.sidebar.slider().
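
As a possible alternative to the if/elif chain, the model choice can be expressed as a dictionary lookup, which keeps the labels and identifiers in one place. This is only a sketch: resolve_model and build_model_input are hypothetical helpers, while the identifiers and default parameter values are copied from the code above.

```python
# Model identifiers copied from the Pro version's code.
MODEL_VERSIONS = {
    "Llama2-7B": "a16z-infra/llama7b-v2-chat:4f0a4744c7295c024a1de15e1a63c880d3da035fa1f49bfd344fe076074c8eea",
    "Llama2-13B": "a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5",
    "Llama2-70B": "replicate/llama70b-v2-chat:e951f18578850b652510200860fc4ea62b3b16fac280f83ff32282f87bbd2e48",
}


def resolve_model(selected_model):
    """Return the model identifier for a selectbox label.

    Unknown labels fall back to the 70B entry, mirroring the else branch.
    """
    return MODEL_VERSIONS.get(selected_model, MODEL_VERSIONS["Llama2-70B"])


def build_model_input(prompt, temperature=0.1, top_p=0.9, max_length=512):
    """Assemble the input dict for the model call, using the app's defaults."""
    return {
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "max_length": max_length,
        "repetition_penalty": 1,
    }


llm = resolve_model("Llama2-13B")
model_input = build_model_input("User: Hello\n\nAssistant: ", temperature=0.7)
```

In the app, the selectbox options could then be taken from list(MODEL_VERSIONS), so adding a model means editing a single dictionary.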

4. Set the API token

Option 1. Set the API token in Secrets

If you want to give your users free access to your chatbot, you'll need to cover the costs yourself, because API usage is charged to the credit card tied to your Replicate account.

To set the API token in the Secrets management on Streamlit Community Cloud, click on the expandable menu at the far right, then click on Settings:

To define the REPLICATE_API_TOKEN environment variable, click on the Secrets tab and paste your Replicate API token:

Once the API token is defined in Secrets, users should be able to use the app without needing to use their own API key:

Option 2. Set the API token in the app

An alternative to setting the API token in Secrets is to prompt users to specify it in the app. This way, users will be notified to provide their own Replicate API token to proceed with using the app:

5. Deploy the app

Once the app is created, deploy it to the cloud in three steps:

  1. Create a GitHub repository for the app.
  2. In Streamlit Community Cloud, click on the New app button, then choose the repository, branch, and app file.
  3. Click Deploy! and the app will be live!

Wrapping up

Congratulations! You’ve learned how to build your own Llama 2 chatbot app using the LLM hosted on Replicate.

It’s worth noting that the LLM was set to the 13B version and that model parameters (such as temperature and top_p) were initialized with a set of arbitrary values. This post also includes the Pro version, which allows users to specify the model and parameters. I encourage you to experiment with this setup, adjust these parameters, and explore your own variations. This can be a great opportunity to see how these modifications might affect the LLM-generated response.

For additional ideas and inspiration, check out the LLM gallery. If you have any questions, let me know in the comments below or find me on Twitter at @thedataprof or on LinkedIn at Chanin Nantasenamat. You can also check out the Streamlit YouTube channel or my personal YouTube channel, Data Professor.

Happy chatbot-building! 🦙


This is a companion discussion topic for the original entry at https://blog.streamlit.io/how-to-build-a-llama-2-chatbot

Nice one @dataprofessor !

Hello. Thanks for the tutorial. If you use your own API token for the app, how does it cost you anything, since Streamlit and Llama are free?

Hi! Running the Pro version on Streamlit Cloud requires an API password from replicate, right?
Llama 2 is free to run on local hardware.

os.environ['REPLICATE_API_TOKEN'] = replicate_api

Regards, Sergiu

This is perfect – we love Streamlit … will adapt this repo right now to other LLM models. Thanks.

Hello, is it possible to run this using DeepInfra API key rather than Replicate?

Great one, @dataprofessor! I have even seen your video; it was absolutely nice and on point.

Yes that would be possible, you just need to switch the API call.

Thanks @Krishay_Nair, glad to hear that the content is helpful.

Hi @S_I

Yes, that is correct. The Llama 2 model is free to use through Meta’s release. However, to actually run LLM inference with such models, we need access to a powerful computer, which in our case is the one hosted on Replicate’s platform, and that does come at a cost.

Hi @SheriffMudasir

As also mentioned below in a related comment, Llama2 is free to use, however to run the LLM inference calls, you’ll need to host them on a powerful computer as they are quite computationally intensive. Running such API calls from a platform such as Replicate comes at a cost.

@dataprofessor I recently came across the 🦙💬 Llama 2 Chatbot app on Streamlit and am curious how the input query refreshes once we enter the text, because I am trying to work with a chatbot (not Llama), but the query is still displayed in the query input after submission as well.

def submit():
    record_timing()  # Record time before submitting message
    st.session_state.something = st.session_state.widget  # Save user input
    st.session_state.widget = ''  # Clear user input

if user_prompt := st.text_input("Your message here", on_change=submit, key="text_input"):
    # Code for handling user input

here even after user submits the query, the text_input is not being cleared out
