Chat2VIS: AI-driven visualisations with Streamlit and natural language

Hey, everyone! 📣

I'm Paula, an AI researcher and data scientist in New Zealand. Many great research projects are born out of our universities, and I've been privileged to be involved with some of them. But what really drives me is bringing my research to life.

The release of ChatGPT in late 2022 inspired me to research how large language models (LLMs) could generate data visualisations using natural language text. Nothing is more frustrating than hunting through menu items trying to find a command to change some plot element. Wouldn't it be nice to use everyday language to graph what you want to see?

So I decided to build Chat2VIS, to bring my research to you.

In this post, I'll cover:

  • What is Chat2VIS?
  • How to use Chat2VIS
  • How to build Chat2VIS

What is Chat2VIS?

Chat2VIS is an app that generates data visualisations from natural language using the GPT-3, ChatGPT-3.5, and GPT-4 LLMs. You can ask it to visualise anything from movies to cars to clothes, and even energy production.

Let me show how it works by using a fun example.

Have you heard of speedcubing? In speedcubing competitions, competitors race to solve the Rubik's Cube puzzle and beat their own personal best times. There are events for solving 3x3, 4x4, 5x5, 6x6, and 7x7 Rubik's Cubes, sometimes even solving them blindfolded!

The competition results database is publicly available,* so I created a subset of it with results up to 23 June 2023. I took each competitor's fastest best-solve time (as opposed to average-solve time) and I used the results from 2x2, 3x3, 4x4, 5x5, Clock, Megaminx, Pyraminx, Skewb, Square-1, and 3x3 blindfolded events. That's 195,980 competitors in total, giving a dataset of 585,154 rows. Each row lists the competitor's WCA ID, event name, best-solve time (in centiseconds), country, country ranking, continent ranking, and world ranking.

Here is what it looked like:

App overview

Let's see how the app works:

  1. Choose a pre-loaded dataset or upload one of your own.
  2. Write the query in the language of your preference (no worries about spelling or grammar!)
  3. Chat2VIS builds a unique prompt tailored to your dataset (the prompt template is generic enough so that each LLM understands the requirements without the customization).
  4. Submit the prompt (the beginnings of your Python script) to each LLM and get a continuation of your script (read more about it here).
  5. Build the Python script by amalgamating the beginnings of the script from your initial prompt and the continuation script from the LLM.
  6. Create the visualisation by rendering the script on the Streamlit interface. If you get no plot or a plot of something unexpected, the generated code probably contains errors (kind of like code from human programmers!). Just change your wording a bit and resubmit the request.
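
Steps 3 to 5 can be sketched roughly as follows. This is only an illustration of the idea, not the actual Chat2VIS prompt: the function names `build_prompt` and `assemble_script`, the wording of the description, and the dtype strings are all assumptions made for this example.

```python
def build_prompt(columns, question, table_name="df"):
    """Return (full_prompt, script_start) for a dataset and a user question.

    `columns` maps column names to dtype strings, e.g. {"Country": "object"}.
    The wording below is hypothetical, not the real Chat2VIS template.
    """
    # Describe the dataset in a docstring so the LLM reads it as context.
    lines = [f"Use a dataframe called {table_name} with columns:"]
    for name, dtype in columns.items():
        lines.append(f"- '{name}' (type {dtype})")
    lines.append("Label the x and y axes appropriately and add a title.")
    lines.append(f"Using Python and matplotlib, write a script to graph: {question}")
    description = '"""\n' + "\n".join(lines) + '\n"""\n'

    # The beginning of the script that the LLM is asked to continue (step 4).
    script_start = (
        "import pandas as pd\n"
        "import matplotlib.pyplot as plt\n"
        "fig, ax = plt.subplots(figsize=(10, 4))\n"
    )
    return description + script_start, script_start


def assemble_script(script_start, continuation):
    """Join the script beginning with the LLM's continuation (step 5)."""
    return script_start + continuation.strip("\n") + "\n"


# Example with two columns from the speedcubing dataset and a made-up continuation.
prompt, start = build_prompt(
    {"Event Name": "object", "Country": "object"},
    "the number of competitors by country",
)
script = assemble_script(start, "\ndf['Country'].value_counts().plot.bar(ax=ax)\n")
```

Keeping the script beginning separate from the description makes it easy to glue the model's continuation back onto exactly the code it was asked to continue.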

How to use Chat2VIS

To begin, follow these steps:

  1. Load the dataset.
  2. Enter your OpenAI API key (if you don't have one, get it here and add some credit).

Now you're ready!

Example 1

Let's start with a simple example.

Type in this query: "Show the number of competitors who have competed in the 3x3 event by country for the top 10 countries."

Both GPT-3 and ChatGPT-3.5 performed well in understanding the query text and displaying the results, complete with axis labels and a title. They even correctly identified the "3x3 event" as the "3x3x3 Cube" value in the "Event Name" column. The USA had the highest number of speedcubers at approximately 38,000. However, ChatGPT could improve readability by changing the orientation of the x-axis bar labels. You can let the model know the preferred label orientation.

Example 2

Let's try a more challenging example.

Type in this query: "For each event, show the fastest best single time and put the value above the bar line. The results are in centiseconds. Please convert them to seconds."

The LLMs are primarily trained in the English language but have knowledge of other languages as well.

Let's add some multilingual text:

  • "Dessinez le tracĆ© horizontalement" ("Draw the plot horizontal" in French)
  • "Whakamahia nga tae whero, kikorangi" ("Use red and blue for the plot" in te reo Māori, one of New Zealand's official languages)

How did Chat2VIS do? Pretty good. The values are above the bar lines, the results are converted to seconds, the plot is turned horizontal, and the colours are red and blue. It even got the axis labels and the title right. Just look at that 3x3 time... 3.13 seconds!

👀 For more multilingual examples, queries with spelling mistakes, and examples of refining plot elements, read this article.

How to build Chat2VIS

Here is how to set up the front end:

  • To center the titles and change the font, use st.markdown:
st.markdown("<h1 style='text-align: center; font-weight:bold; font-family:comic sans ms; padding-top: 0rem;'>Chat2VIS</h1>", unsafe_allow_html=True)
st.markdown("<h2 style='text-align: center; padding-top: 0rem;'>Creating Visualisations using Natural Language with ChatGPT </h2>", unsafe_allow_html=True)
  • Create a sidebar and load the available datasets into a dictionary. Storing them in the session_state object avoids unnecessary reloading. Use radio buttons to select the chosen dataset, but also include any manually uploaded datasets in the list. To do this, add an empty container to reserve the spot on the sidebar, add a file uploader, and add the uploaded file to the dictionary. Finally, add the dataset list of radio buttons to the empty container (I like to use emoji shortcodes on the labels!). If a dataset has been manually uploaded, ensure that the radio button is selected:
import pandas as pd
import streamlit as st

if "datasets" not in st.session_state:
    datasets = {}
    # Preload datasets
    datasets["Movies"] = pd.read_csv("movies.csv")
    datasets["Housing"] = pd.read_csv("housing.csv")
    datasets["Cars"] = pd.read_csv("cars.csv")
    st.session_state["datasets"] = datasets
else:
    # Use the datasets already loaded
    datasets = st.session_state["datasets"]
with st.sidebar:
    # Reserve the spot for the dataset choice; it is filled once any upload has been handled
    dataset_container = st.empty()
    # Add facility to upload a dataset
    uploaded_file = st.file_uploader(":computer: Load a CSV file:", type="csv")
    # By default, select the first radio button
    index_no = 0
    if uploaded_file:
        # Read in the data and add it to the available datasets. Give it a nice name.
        file_name = uploaded_file.name[:-4].capitalize()
        datasets[file_name] = pd.read_csv(uploaded_file)
        # Default the radio button to the uploaded dataset, wherever it sits in the dictionary
        index_no = list(datasets).index(file_name)
    # Radio buttons for dataset choice
    chosen_dataset = dataset_container.radio(":bar_chart: Choose your data:", datasets.keys(), index=index_no)
  • Add checkboxes in the sidebar to choose which LLM to use. The label will display the model name with the OpenAI model version in brackets. The models and their selected status will be stored in a dictionary:
available_models = {"ChatGPT-4": "gpt-4", "ChatGPT-3.5": "gpt-3.5-turbo", "GPT-3": "text-davinci-003"}
with st.sidebar:
    st.write(":brain: Choose your model(s):")
    # Keep a dictionary of whether models are selected or not
    use_model = {}
    for model_desc, model_name in available_models.items():
        label = f"{model_desc} ({model_name})"
        key = f"key_{model_desc}"
        use_model[model_desc] = st.checkbox(label, value=True, key=key)
  • In the main section, add a password input widget for the OpenAI API key 🔑. The help parameter provides information to ensure success when calling the LLMs. Also add a text area for the query 👀 and a "Go" button:
my_key = st.text_input(label = ":key: OpenAI Key:", help="Please ensure you have an OpenAI API account with credit. ChatGPT Plus subscription does not include API access.", type="password")
question = st.text_area(":eyes: What would you like to visualise?", height=10)
go_btn = st.button("Go...")
  • Finally, display the datasets using a tab widget.
# One tab per dataset, using the datasets dictionary built in the sidebar step
tab_list = st.tabs(list(datasets.keys()))
for dataset_num, tab in enumerate(tab_list):
    with tab:
        dataset_name = list(datasets.keys())[dataset_num]
        st.subheader(dataset_name)
        st.dataframe(datasets[dataset_name], hide_index=True)

To initiate the process, click on "Go..."!

Your communication with each model is facilitated through the openai Python library. With GPT-3, the prompt is presented as a sequence of tokens using the text completion endpoint API. ChatGPT models require the chat-completion endpoint and submission of a message sequence, which is then converted to tokens using ChatML (Chat Markup Language).

The following function illustrates this process, taking parameters for the prompt (question_to_ask), the model type (gpt-4, gpt-3.5-turbo, or text-davinci-003), and your OpenAI key. Calls to this function are placed within a try block with except statements to capture any errors returned from the LLMs (read more here):

import openai  # written against the pre-1.0 openai library (openai.ChatCompletion / openai.Completion)

def run_request(question_to_ask, model_type, key):
    openai.api_key = key
    if model_type in ("gpt-4", "gpt-3.5-turbo"):
        # ChatGPT models: chat-completion endpoint with a sequence of messages
        response = openai.ChatCompletion.create(
            model=model_type,
            messages=[
                {"role": "system", "content": "Generate Python Code Script."},
                {"role": "user", "content": question_to_ask}])
        res = response["choices"][0]["message"]["content"]
    else:
        # GPT-3: text-completion endpoint, stopping before the plot is shown
        response = openai.Completion.create(
            engine=model_type,
            prompt=question_to_ask,
            temperature=0,
            max_tokens=500,
            top_p=1.0,
            frequency_penalty=0.0,
            presence_penalty=0.0,
            stop=["plt.show()"]
            )
        res = response["choices"][0]["text"]
    return res
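
The try block mentioned above is not shown in this post, so here is a minimal sketch of one way to wrap the call. The helper `call_model_safely` and the stand-in `fake_request` are hypothetical names invented for this example. The sketch catches a broad `Exception` so it runs without an API key, whereas a real app would more likely catch the specific error classes of the openai library.

```python
def call_model_safely(request_fn, *args):
    """Call request_fn(*args).

    Returns (answer, None) on success or (None, error_message) on failure,
    so the interface can show the message instead of crashing.
    """
    try:
        return request_fn(*args), None
    except Exception as err:  # deliberately broad: this is only a sketch
        return None, f"{type(err).__name__}: {err}"


def fake_request(question_to_ask, model_type, key):
    """Hypothetical stand-in for run_request that needs no network access."""
    if not key:
        raise ValueError("missing API key")
    return "ax.set_title('ok')"


# A failing call returns an error message rather than raising.
answer, error = call_model_safely(fake_request, "Plot something", "gpt-4", "")
```

In the app, the error string could be passed to something like st.error so that one failing model does not stop the others from rendering.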

Dynamically create as many columns on the interface as you have models selected:

model_list = [model_name for model_name, choose_model in use_model.items() if choose_model]
if len(model_list) > 0:
    plots = st.columns(len(model_list))

After executing the final scripts, the results for each model are passed to the st.pyplot chart elements for rendering in the columns on the interface.
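
The execution step is not shown in this post. One possible approach, sketched here under assumptions, is to run each assembled script with Python's built-in exec in its own namespace and pick up the figure it creates. The function name `execute_script` and the convention that the script stores its figure in a variable called `fig` are assumptions for this example. Note that exec runs arbitrary code, so generated scripts should be treated with care.

```python
def execute_script(script, data):
    """Run a generated script with `data` bound to the name `df`.

    Returns the object the script stored in `fig`, or None if it created none.
    """
    namespace = {"df": data}
    # exec runs arbitrary code: only use it on scripts you trust or sandbox.
    exec(script, namespace)
    return namespace.get("fig")


# Demo with a placeholder instead of a real matplotlib figure, so the
# sketch runs without matplotlib or Streamlit installed.
demo_script = "fig = {'rows': len(df)}\n"
result = execute_script(demo_script, [1, 2, 3])

# In the app, the returned figure could then be rendered in its column, e.g.:
#     with plots[0]:
#         st.pyplot(fig)
```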

Wrapping up

You learned how to create a natural language interface that displays data visualisations using everyday language requests on a set of data. I didn't cover the details of engineering the prompt for the LLMs, but the referenced articles should give you more guidance. Since the development of Chat2VIS in January 2023, there have been significant advancements leveraging generative AI for visualisations and prompt engineering. There is so much more to explore!

Thank you to Streamlit for helping me build this app and to those of you who have contacted me to show me how you have used it with your own datasets. It's awesome to see! I'd love to answer any questions you have. Please post them in the comments below or connect with me on LinkedIn.

Happy Streamlit-ing! 🎈

*This information is based on competition results owned and maintained by the World Cube Association, published at https://worldcubeassociation.org/export/results as of June 23, 2023.


This is a companion discussion topic for the original entry at https://blog.streamlit.io/chat2vis-ai-driven-visualisations-with-streamlit-and-natural-language

Any tips for using open-source models instead of OpenAI in this use case?

At the time of development in Jan this year the OpenAI models were the only contenders for building this. At some point I may have a look at some of the open source models available now and see how they perform. Thanks for your question. :smiling_face:

Thanks for replying.
