LangChain tutorial #5: Build an Ask the Data app

Large language models (LLMs) have revolutionized how we process and understand text data, enabling a diverse array of tasks spanning text generation, summarization, classification, and much more. Pairing LangChain with Streamlit is a potent way to build LLM-powered applications, especially for developers interested in creating chatbots, personal assistants, and content creation apps.

In the previous four LangChain tutorials, you learned about three of the six key modules: model I/O (LLM model and prompt templates), data connection (document loader, text splitting, embeddings, and vector store), and chains (summarize chain and question-answering chain).

This tutorial explores the fourth LangChain module, Agents. Specifically, we'll use the pandas DataFrame Agent, which lets us work with a pandas DataFrame by simply asking questions.

We'll build the pandas DataFrame Agent app for answering questions on a pandas DataFrame created from a user-uploaded CSV file in four steps:

  1. Get an OpenAI API key
  2. Set up the coding environment
  3. Build the app
  4. Deploy the app

What are Agents?

According to Harrison Chase, agents "use an LLM to determine which actions to take and in what order." An action can refer to using tools, observing their output, or returning a response to the user. Tools are entities that take a string as input and return a string as output. Examples of tools include APIs, databases, search engines, LLMs, chains, other agents, shells, and Zapier.
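
To make the "string in, string out" idea concrete, here is a minimal sketch in plain Python. It does not use LangChain's own tool classes; the names word_count, shout, and TOOLS are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict

# A "tool" in the sense described above: a callable that maps str -> str.
Tool = Callable[[str], str]


def word_count(text: str) -> str:
    """Hypothetical tool: count whitespace-separated words."""
    return str(len(text.split()))


def shout(text: str) -> str:
    """Hypothetical tool: upper-case the input."""
    return text.upper()


# A registry that an agent could choose from by name.
TOOLS: Dict[str, Tool] = {"word_count": word_count, "shout": shout}

print(TOOLS["word_count"]("agents pick tools"))
```

Because every tool shares the same signature, an agent only has to pick a tool name and an input string; it never needs to know how the tool works internally.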

There are two types of agents:

  1. Action agents
  2. Plan-and-execute agents

Using Agents in LangChain

To use an agent in LangChain, you need to specify three key elements:

  1. LLM. The LLM is responsible for determining the course of action that an agent takes to fulfill its task of answering a user query. If you're using the OpenAI LLM, it's available via OpenAI() from langchain.llms.
  2. Tools. These are resources that an agent can use to accomplish its task, such as querying a database, accessing an API, or searching Google. You can load them via load_tools() from langchain.agents.
  3. Agent. The available agent types are action agents and plan-and-execute agents. You can select one via the AgentType enum from langchain.agents (for example, AgentType.OPENAI_FUNCTIONS, which is used later in this tutorial).
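
The interplay of these three elements can be sketched without LangChain at all. The toy loop below is an assumption-laden illustration, not LangChain's implementation: stub_llm stands in for a real LLM, row_counter is a hypothetical tool, and run_agent simplifies things by handing the same input string to every tool.

```python
from typing import Callable, Dict, List, Tuple


def row_counter(csv_text: str) -> str:
    """Hypothetical tool: count data rows in CSV text (header excluded)."""
    lines = [ln for ln in csv_text.strip().splitlines() if ln]
    return str(max(len(lines) - 1, 0))


def stub_llm(question: str, observations: List[str]) -> Tuple[str, str]:
    """Stand-in for a real LLM: returns (action, text)."""
    if not observations:
        # Nothing observed yet, so ask for a tool call first.
        return ("row_counter", "")
    # A tool result is available, so produce the final answer.
    return ("final_answer", f"There are {observations[-1]} rows.")


def run_agent(
    question: str,
    tool_input: str,
    tools: Dict[str, Callable[[str], str]],
    llm: Callable[[str, List[str]], Tuple[str, str]],
    max_steps: int = 3,
) -> str:
    """Let the LLM pick actions until it returns a final answer."""
    observations: List[str] = []
    for _ in range(max_steps):
        action, text = llm(question, observations)
        if action == "final_answer":
            return text
        # Run the chosen tool and record what it returned.
        observations.append(tools[action](tool_input))
    return "Stopped: step limit reached."


csv_text = "a,b\n1,2\n3,4\n"
print(run_agent("How many rows are there?", csv_text,
                {"row_counter": row_counter}, stub_llm))
```

The step limit is a design choice worth keeping in any real agent loop: it bounds how many LLM calls a single question can trigger.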

In this tutorial, we'll be using the pandas DataFrame Agent, which can be created using create_pandas_dataframe_agent() from langchain.agents.

App overview

Let's take a look at the general flow of the app.

Once the app is loaded, the user should perform the following steps in sequential order:

  1. Upload a CSV file. You can also tweak the underlying code to read in tabular formats such as Excel or tab-delimited files.
  2. Select an example query from the drop-down menu or provide your own custom query by selecting the "Other" option.
  3. Enter your OpenAI API key.
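
As a rough sketch of the tweak mentioned in step 1, a loader could choose the pandas reader from the file extension. The helper name load_table and the extension lists are assumptions made for illustration; they are not part of the app's code.

```python
import io

import pandas as pd


def load_table(file_obj, filename: str) -> pd.DataFrame:
    """Pick a pandas reader from the file extension (hypothetical helper)."""
    name = filename.lower()
    if name.endswith((".tsv", ".tab")):
        # Tab-delimited text is read with read_csv and an explicit separator.
        return pd.read_csv(file_obj, sep="\t")
    if name.endswith((".xls", ".xlsx")):
        # read_excel needs an extra engine installed (e.g. openpyxl for .xlsx).
        return pd.read_excel(file_obj)
    # Default: comma-separated values, as in the tutorial.
    return pd.read_csv(file_obj)


demo = load_table(io.StringIO("MolWt\tlogS\n100.5\t-1.2\n"), "demo.tsv")
print(demo.shape)
```

In the Streamlit app, the uploaded file object exposes its filename through its name attribute, and the type list passed to st.file_uploader would also need to include the extra extensions.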

That's all for the frontend! As for the backend, the pandas DataFrame Agent will work its magic on the data and return an LLM-generated answer.

Step 1. Get an OpenAI API key

You can find a detailed walkthrough on obtaining an OpenAI API key in LangChain Tutorial #1.

Step 2. Set up the coding environment

Local development

To set up a local coding environment with the necessary libraries, use pip install as shown below (make sure you have Python version 3.7 or higher):

pip install streamlit openai langchain pandas tabulate

Cloud development

In addition to developing on a local computer, you can develop and deploy apps in the cloud with Streamlit Community Cloud, starting from the Streamlit app template.

Next, add the following Python libraries to the requirements.txt file:

streamlit
openai
langchain
pandas
tabulate

Step 3. Build the app

App overview

The entire app consists of 47 lines of code, as shown below:

import streamlit as st
import pandas as pd
from langchain.chat_models import ChatOpenAI
from langchain.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType
# Page title
st.set_page_config(page_title='🦜🔗 Ask the Data App')
st.title('🦜🔗 Ask the Data App')
# Load CSV file
def load_csv(input_csv):
  df = pd.read_csv(input_csv)
  with st.expander('See DataFrame'):
    st.write(df)
  return df
# Generate LLM response
def generate_response(csv_file, input_query):
  llm = ChatOpenAI(model_name='gpt-3.5-turbo-0613', temperature=0.2, openai_api_key=openai_api_key)
  df = load_csv(csv_file)
  # Create Pandas DataFrame Agent
  agent = create_pandas_dataframe_agent(llm, df, verbose=True, agent_type=AgentType.OPENAI_FUNCTIONS)
  # Perform Query using the Agent
  response = agent.run(input_query)
  return st.success(response)
# Input widgets
uploaded_file = st.file_uploader('Upload a CSV file', type=['csv'])
question_list = [
  'How many rows are there?',
  'What is the range of values for MolWt with logS greater than 0?',
  'How many rows have a MolLogP value greater than 0?',
  'Other']
query_text = st.selectbox('Select an example query:', question_list, disabled=not uploaded_file)
openai_api_key = st.text_input('OpenAI API Key', type='password', disabled=not (uploaded_file and query_text))
# App logic
if query_text == 'Other':  # compare string values with ==, not identity with `is`
  query_text = st.text_input('Enter your query:', placeholder = 'Enter query here ...', disabled=not uploaded_file)
if not openai_api_key.startswith('sk-'):
  st.warning('Please enter your OpenAI API key!', icon='⚠')
if openai_api_key.startswith('sk-') and (uploaded_file is not None):
  st.header('Output')
  generate_response(uploaded_file, query_text)

Import libraries

To start, import the necessary libraries:

  • Streamlit. A low-code web framework used for creating the app's frontend.
  • pandas. A data wrangling library for loading the CSV file as a DataFrame.
  • LangChain. An LLM framework that coordinates the use of an LLM to generate a response based on the user-provided prompt.

import streamlit as st
import pandas as pd
from langchain.chat_models import ChatOpenAI
from langchain.agents import create_pandas_dataframe_agent
from langchain.agents.agent_types import AgentType

Display the app title

Next, display the title of the app:

# Page title
st.set_page_config(page_title='🦜🔗 Ask the Data App')
st.title('🦜🔗 Ask the Data App')

Load the CSV file

Since the CSV file is one of the app's inputs, along with the data query, you need to create a custom function to load it (using pandas' read_csv() function). Once loaded, display the DataFrame inside an expander box:

# Load CSV file
def load_csv(input_csv):
  df = pd.read_csv(input_csv)
  with st.expander('See DataFrame'):
    st.write(df)
  return df

Create the LLM response generation function

The next step is to process the data using the pandas DataFrame Agent and the LLM (GPT-3.5).

To create an instance of the LLM, use ChatOpenAI() and set gpt-3.5-turbo-0613 as the model_name. Next, create the pandas DataFrame Agent with the create_pandas_dataframe_agent() function, passing it the LLM (llm) and the input data (df).

NOTE: While creating and testing the app, I found that usage costs were significantly higher than for previous apps in this tutorial series, so I chose the GPT-3.5 model for its much lower cost.

# Generate LLM response
def generate_response(csv_file, input_query):
  llm = ChatOpenAI(model_name='gpt-3.5-turbo-0613', temperature=0.2, openai_api_key=openai_api_key)
  df = load_csv(csv_file)
  # Create Pandas DataFrame Agent
  agent = create_pandas_dataframe_agent(llm, df, verbose=True, agent_type=AgentType.OPENAI_FUNCTIONS)
  # Perform Query using the Agent
  response = agent.run(input_query)
  return st.success(response)

Next, create input widgets to accept various variables for data analysis. These include:

  • The user-provided CSV file (stored in the uploaded_file variable)
  • The input query (stored in the question_list and query_text variables)
  • The OpenAI API key (stored in the openai_api_key variable)

# Input widgets
uploaded_file = st.file_uploader('Upload a CSV file', type=['csv'])
question_list = [
  'How many rows are there?',
  'What is the range of values for MolWt with logS greater than 0?',
  'How many rows have a MolLogP value greater than 0?',
  'Other']
query_text = st.selectbox('Select an example query:', question_list, disabled=not uploaded_file)
openai_api_key = st.text_input('OpenAI API Key', type='password', disabled=not (uploaded_file and query_text))
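
For orientation, the three example queries correspond to ordinary pandas operations. The sketch below runs them by hand on a tiny made-up table; the values are invented, only the column names come from the queries, and the code the agent actually generates may look different (for instance, "range" could also be read as max minus min).

```python
import pandas as pd

# Tiny made-up table; column names borrowed from the example queries.
df = pd.DataFrame({
    "MolWt": [100.0, 250.5, 180.2, 320.0],
    "logS": [0.5, -1.2, 1.1, -3.4],
    "MolLogP": [-0.3, 2.1, 0.8, 4.0],
})

# "How many rows are there?"
n_rows = len(df)

# "What is the range of values for MolWt with logS greater than 0?"
subset = df.loc[df["logS"] > 0, "MolWt"]
molwt_range = (float(subset.min()), float(subset.max()))

# "How many rows have MolLogP value greater than 0?"
n_positive_logp = int((df["MolLogP"] > 0).sum())

print(n_rows, molwt_range, n_positive_logp)
```

Checking an agent's answer against a hand-written query like this is a cheap way to catch cases where the LLM misreads a question.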

Define the app logic

The app logic is defined in this last code block. Follow these steps:

  1. Check if the user has selected the Other option from the drop-down select box defined in query_text to provide a custom text query. If so, the user can enter their query text.
  2. Check if the user has provided their OpenAI API key. If not, a reminder message is displayed for the user to enter their API key.
  3. Perform a final check for the API key and the user-provided CSV file. If the check is successful (meaning the user has provided all necessary information), we proceed to generate a response from the pandas DataFrame Agent.

# App logic
if query_text == 'Other':  # compare string values with ==, not identity with `is`
  query_text = st.text_input('Enter your query:', placeholder = 'Enter query here ...', disabled=not uploaded_file)
if not openai_api_key.startswith('sk-'):
  st.warning('Please enter your OpenAI API key!', icon='⚠')
if openai_api_key.startswith('sk-') and (uploaded_file is not None):
  st.header('Output')
  generate_response(uploaded_file, query_text)
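
The same checks can be expressed as small pure functions, which makes them easy to test outside Streamlit. This is only a sketch: resolve_query and ready_to_run are hypothetical names, and the extra requirement that the query be non-empty is an assumption that the app itself does not enforce.

```python
from typing import Optional


def resolve_query(selected: str, custom: Optional[str]) -> str:
    """Return the query to run, honoring the 'Other' option."""
    # Strings are compared by value with ==; `is` tests object identity.
    if selected == "Other":
        return (custom or "").strip()
    return selected


def ready_to_run(api_key: str, has_file: bool, query: str) -> bool:
    """True once the key looks valid, a file is present, and a query exists."""
    return api_key.startswith("sk-") and has_file and bool(query)


query = resolve_query("Other", "  What is the mean MolWt?  ")
print(query, ready_to_run("sk-example", True, query))
```

Keeping the gating logic separate from the widgets means the Streamlit script only has to call these helpers and decide what to display.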

Step 4. Deploy the app

Once the app has been created, it can be deployed to the cloud in three steps:

  1. Create a GitHub repository to store the app files.
  2. Go to the Streamlit Community Cloud, click the New app button, and select the appropriate repository, branch, and application file.
  3. Finally, click Deploy!

After a few moments, the app should be ready to use!

Wrapping up

You've learned how to build an Ask the Data app that lets you ask questions to understand your data better. We used Streamlit as the frontend to accept user input (CSV file, questions about the data, and OpenAI API key) and LangChain for backend processing of the data via the pandas DataFrame Agent.

If you're looking for ideas and inspiration, check out the Generative AI page and the LLM gallery. And if you have any questions, please post them in the comments below or on Twitter at @thedataprof, on LinkedIn, on the Streamlit YouTube channel, or on my personal YouTube channel, Data Professor.

I can't wait to see what you'll build! 🎈


This is a companion discussion topic for the original entry at https://blog.streamlit.io/langchain-tutorial-5-build-an-ask-the-data-app
So… If I understand correctly, the df is never uploaded to the LLM API, rather the agent works with the DF to code an answer to your question (via the LLM API)… but is that correct? Seems it would need to know the df meta-data at least…

Would be good if you could dive into some more details like these…

This code doesn’t work anymore! Can you support? Thanks.

RuntimeError: no validator found for <class 're.Pattern'>, see arbitrary_types_allowed in Config

Traceback:
File "C:\ProgramData\Anaconda3\lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 534, in run_script
exec(code, module.__dict__)
File "C:\Users\Niko\AskDataAppp.py", line 10, in <module>
from langchain.agents.agent_types import AgentType
File "C:\ProgramData\Anaconda3\lib\site-packages\langchain\agents\__init__.py", line 35, in <module>
from langchain.agents.agent import (
File "C:\ProgramData\Anaconda3\lib\site-packages\langchain\agents\agent.py", line 34, in <module>
from langchain.chains.base import Chain
File "C:\ProgramData\Anaconda3\lib\site-packages\langchain\chains\__init__.py", line 24, in <module>
from langchain.chains.combine_documents.map_rerank import MapRerankDocumentsChain
File "C:\ProgramData\Anaconda3\lib\site-packages\langchain\chains\combine_documents\map_rerank.py", line 11, in <module>
from langchain.output_parsers.regex import RegexParser
File "C:\ProgramData\Anaconda3\lib\site-packages\langchain\output_parsers\__init__.py", line 37, in <module>
from langchain.output_parsers.xml import XMLOutputParser
File "C:\ProgramData\Anaconda3\lib\site-packages\langchain\output_parsers\xml.py", line 9, in <module>
class XMLOutputParser(BaseOutputParser):
File "pydantic\main.py", line 205, in pydantic.main.ModelMetaclass.__new__
File "pydantic\fields.py", line 491, in pydantic.fields.ModelField.infer
File "pydantic\fields.py", line 421, in pydantic.fields.ModelField.__init__
File "pydantic\fields.py", line 542, in pydantic.fields.ModelField.prepare
File "pydantic\fields.py", line 804, in pydantic.fields.ModelField.populate_validators
File "pydantic\validators.py", line 723, in find_validators

Hi @Niko 👋

It’s possibly due to a recent update of the Langchain library.

I had the same issue on several of my Langchain apps due to recent updates, solved by pinning to a legacy version in the requirements.txt.

(I’m also CC’ing @dataprofessor as he created the app :))

Best wishes,
Charly

Hi @Niko

I agree with @Charly_Wargnier that recent updates to the LangChain library may have caused the error. In the deployed demo app, I’ve downgraded langchain to the version that was current when the blog post went live (langchain==0.0.239) and rebooted the app; it now works as usual. Perhaps you can try doing this too.

Hope this helps!

That’s great, thanks Chanin!

Still seeing the issue after downgrading langchain in the notebook. I am curious how you were able to run it after the downgrade. Here’s the error:


TypeError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_26308\2878462945.py in <module>
1 import streamlit as st
2 import pandas as pd
----> 3 from langchain.chat_models import ChatOpenAI
4 from langchain.agents import create_pandas_dataframe_agent
5 from langchain.agents.agent_types import AgentType

C:\ProgramData\Anaconda3\lib\site-packages\langchain\__init__.py in <module>
4 from typing import Optional
5
----> 6 from langchain.agents import MRKLChain, ReActChain, SelfAskWithSearchChain
7 from langchain.cache import BaseCache
8 from langchain.chains import (

C:\ProgramData\Anaconda3\lib\site-packages\langchain\agents\__init__.py in <module>
1 """Interface for agents."""
----> 2 from langchain.agents.agent import (
3 Agent,
4 AgentExecutor,
5 AgentOutputParser,

C:\ProgramData\Anaconda3\lib\site-packages\langchain\agents\agent.py in <module>
23 Callbacks,
24 )
---> 25 from langchain.chains.base import Chain
26 from langchain.chains.llm import LLMChain
27 from langchain.input import get_color_mapping

C:\ProgramData\Anaconda3\lib\site-packages\langchain\chains\__init__.py in <module>
13 """
14
---> 15 from langchain.chains.api.base import APIChain
16 from langchain.chains.api.openapi.chain import OpenAPIEndpointChain
17 from langchain.chains.combine_documents.base import AnalyzeDocumentChain

C:\ProgramData\Anaconda3\lib\site-packages\langchain\chains\api\base.py in <module>
10 CallbackManagerForChainRun,
11 )
---> 12 from langchain.chains.api.prompt import API_RESPONSE_PROMPT, API_URL_PROMPT
13 from langchain.chains.base import Chain
14 from langchain.chains.llm import LLMChain

C:\ProgramData\Anaconda3\lib\site-packages\langchain\chains\api\prompt.py in <module>
1 # flake8: noqa
----> 2 from langchain.prompts.prompt import PromptTemplate
3
4 API_URL_PROMPT_TEMPLATE = """You are given the below API Documentation:
5 {api_docs}

C:\ProgramData\Anaconda3\lib\site-packages\langchain\prompts\__init__.py in <module>
10 SystemMessagePromptTemplate,
11 )
---> 12 from langchain.prompts.example_selector import (
13 LengthBasedExampleSelector,
14 MaxMarginalRelevanceExampleSelector,

C:\ProgramData\Anaconda3\lib\site-packages\langchain\prompts\example_selector\__init__.py in <module>
2 from langchain.prompts.example_selector.length_based import LengthBasedExampleSelector
3 from langchain.prompts.example_selector.ngram_overlap import NGramOverlapExampleSelector
----> 4 from langchain.prompts.example_selector.semantic_similarity import (
5 MaxMarginalRelevanceExampleSelector,
6 SemanticSimilarityExampleSelector,

C:\ProgramData\Anaconda3\lib\site-packages\langchain\prompts\example_selector\semantic_similarity.py in <module>
6 from pydantic import BaseModel, Extra
7
----> 8 from langchain.embeddings.base import Embeddings
9 from langchain.prompts.example_selector.base import BaseExampleSelector
10 from langchain.vectorstores.base import VectorStore

C:\ProgramData\Anaconda3\lib\site-packages\langchain\embeddings\__init__.py in <module>
31 from langchain.embeddings.octoai_embeddings import OctoAIEmbeddings
32 from langchain.embeddings.openai import OpenAIEmbeddings
---> 33 from langchain.embeddings.sagemaker_endpoint import SagemakerEndpointEmbeddings
34 from langchain.embeddings.self_hosted import SelfHostedEmbeddings
35 from langchain.embeddings.self_hosted_hugging_face import (

C:\ProgramData\Anaconda3\lib\site-packages\langchain\embeddings\sagemaker_endpoint.py in <module>
4
5 from langchain.embeddings.base import Embeddings
----> 6 from langchain.llms.sagemaker_endpoint import ContentHandlerBase
7
8

C:\ProgramData\Anaconda3\lib\site-packages\langchain\llms\__init__.py in <module>
55 from langchain.llms.textgen import TextGen
56 from langchain.llms.tongyi import Tongyi
---> 57 from langchain.llms.vertexai import VertexAI
58 from langchain.llms.writer import Writer
59

C:\ProgramData\Anaconda3\lib\site-packages\langchain\llms\vertexai.py in <module>
13 from langchain.llms.base import LLM, create_base_retry_decorator
14 from langchain.llms.utils import enforce_stop_tokens
---> 15 from langchain.utilities.vertexai import (
16 init_vertexai,
17 raise_vertex_import_error,

C:\ProgramData\Anaconda3\lib\site-packages\langchain\utilities\__init__.py in <module>
1 """General utilities."""
2 from langchain.requests import TextRequestsWrapper
----> 3 from langchain.utilities.apify import ApifyWrapper
4 from langchain.utilities.arxiv import ArxivAPIWrapper
5 from langchain.utilities.awslambda import LambdaWrapper

C:\ProgramData\Anaconda3\lib\site-packages\langchain\utilities\apify.py in <module>
3 from pydantic import BaseModel, root_validator
4
----> 5 from langchain.document_loaders import ApifyDatasetLoader
6 from langchain.document_loaders.base import Document
7 from langchain.utils import get_from_dict_or_env

C:\ProgramData\Anaconda3\lib\site-packages\langchain\document_loaders\__init__.py in <module>
44 UnstructuredEmailLoader,
45 )
---> 46 from langchain.document_loaders.embaas import EmbaasBlobLoader, EmbaasLoader
47 from langchain.document_loaders.epub import UnstructuredEPubLoader
48 from langchain.document_loaders.evernote import EverNoteLoader

C:\ProgramData\Anaconda3\lib\site-packages\langchain\document_loaders\embaas.py in <module>
52
53
---> 54 class BaseEmbaasLoader(BaseModel):
55 """Base class for embedding a model into an Embaas document extraction API."""
56

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\main.cp39-win_amd64.pyd in pydantic.main.ModelMetaclass.__new__()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\fields.cp39-win_amd64.pyd in pydantic.fields.ModelField.infer()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\fields.cp39-win_amd64.pyd in pydantic.fields.ModelField.__init__()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\fields.cp39-win_amd64.pyd in pydantic.fields.ModelField.prepare()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\fields.cp39-win_amd64.pyd in pydantic.fields.ModelField.populate_validators()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\validators.cp39-win_amd64.pyd in find_validators()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\validators.cp39-win_amd64.pyd in pydantic.validators.make_typeddict_validator()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\annotated_types.cp39-win_amd64.pyd in pydantic.annotated_types.create_model_from_typeddict()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\main.cp39-win_amd64.pyd in pydantic.main.create_model()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\main.cp39-win_amd64.pyd in pydantic.main.ModelMetaclass.__new__()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\fields.cp39-win_amd64.pyd in pydantic.fields.ModelField.infer()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\fields.cp39-win_amd64.pyd in pydantic.fields.ModelField.__init__()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\fields.cp39-win_amd64.pyd in pydantic.fields.ModelField.prepare()

C:\ProgramData\Anaconda3\lib\site-packages\pydantic\fields.cp39-win_amd64.pyd in pydantic.fields.ModelField._type_analysis()

C:\ProgramData\Anaconda3\lib\typing.py in __subclasscheck__(self, cls)
850 return issubclass(cls.__origin__, self.__origin__)
851 if not isinstance(cls, _GenericAlias):
--> 852 return issubclass(cls, self.__origin__)
853 return super().__subclasscheck__(cls)
854

TypeError: issubclass() arg 1 must be a class

Hey, after searching the web, I got it working by pinning these libraries:

typing-inspect==0.8.0
typing_extensions==4.5.0

python - import langchain => Error : TypeError: issubclass() arg 1 must be a class - Stack Overflow
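
For anyone debugging a similar environment, a small check like the following can confirm which versions are actually installed. It is only a sketch: check_pins is a hypothetical helper, and the PINS dictionary simply collects the versions mentioned in this thread, which may not be the right combination for every setup.

```python
from importlib import metadata
from typing import Dict

# Versions mentioned in this thread (an assumption, not a verified recipe).
PINS = {
    "langchain": "0.0.239",
    "typing-inspect": "0.8.0",
    "typing_extensions": "4.5.0",
}


def check_pins(pins: Dict[str, str]) -> Dict[str, str]:
    """Return {package: message} for every pin that is not satisfied."""
    problems: Dict[str, str] = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = f"not installed (wanted {wanted})"
            continue
        if found != wanted:
            problems[name] = f"found {found}, wanted {wanted}"
    return problems


# An empty dict means every pinned package matches.
print(check_pins(PINS))
```

Running this inside the same environment as the app (or notebook kernel) matters, since a notebook can use a different interpreter than the one where pip install was run.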

Fantastic! Thanks for the heads-up, @Niko! 🙌

Best,
Charly