Scalability of Streamlit in Snowflake

I have been playing with Streamlit in Snowflake and everything works flawlessly; it’s awesome. I’m currently trying to wrap my head around how to scale my app in terms of data volume, since Snowflake abstracts away most of the computing/scaling.

When working with Snowflake only, I tend to add cluster keys for the columns I expect the user will filter by (e.g. user_id, created_at, etc.), but with Streamlit in Snowflake, once I turn my table into a pandas DataFrame, I’m pretty much lost:

from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Pull the whole table into the app's memory as a pandas DataFrame
df = session.sql("SELECT * FROM MY_TABLE").to_pandas()

This becomes a bigger problem when I’m aggregating the data further in pandas (rough example of what I mean below).
Any advice here?
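To make it concrete, here’s roughly what that follow-up work looks like on the df from above (user_id, created_at, and amount are just placeholder column names for this example, not my real schema):

# All of this runs in pandas, i.e. in the app's memory, after the full table has been pulled in
filtered = df[df["user_id"] == 42]  # example filter value

daily = (
    filtered
    .groupby("created_at", as_index=False)
    .agg(total_amount=("amount", "sum"))  # further aggregation, also in pandas
)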

Hey @Felipe3, would it be possible to add a dropdown or similar to your Streamlit app for the columns you want to allow your users to filter on, do the initial querying in Snowflake based on that, and only do subsequent filtering in pandas? Something like:

import pandas as pd
import streamlit as st
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark.functions import col

session = get_active_session()

allowed_columns = ["created_at", "user_id"]
df = session.table("MY_TABLE")

with st.form("query_input"):
    for allowed_column in allowed_columns:
        col1, col2, col3 = st.columns(3)
        col1.code(f"{allowed_column}")
        col2.code("LIKE")
        col3.text_input("Filter Value", value="%", label_visibility="collapsed", key=f"{allowed_column}.value")
    submitted = st.form_submit_button("Execute Query")
    if submitted:
        # Apply one LIKE filter per allowed column, then run the query in Snowflake
        for allowed_column in allowed_columns:
            df = df.filter(col(allowed_column).like(st.session_state[f"{allowed_column}.value"]))
        res = df.collect()
        st.data_editor(res)
        st.session_state.query_res = res

if "query_res" in st.session_state:
    # do other things based on pandas now, e.g. turn the collected rows into a pandas DataFrame
    pdf = pd.DataFrame([row.as_dict() for row in st.session_state.query_res])

It’d look something like this:
[Screenshot: the filter form rendered in the Streamlit app]

I believe the Snowpark filter function already takes care of sanitizing the input to prevent SQL injection vulnerabilities, but perhaps you want to restrict the input even more. Also, the LIKE keyword in the example could be replaced with a dropdown of operators to allow for other kinds of filters (rough sketch below).
I am not a Snowpark expert, so maybe others have even better ideas :slight_smile:
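For example, the LIKE row could become something like this - just a rough sketch that reuses df, allowed_columns, and the imports from the snippet above, and the operator list is purely illustrative:

# Hypothetical mapping from an operator label to a Snowpark column expression
operators = {
    "LIKE": lambda c, v: col(c).like(v),
    "=": lambda c, v: col(c) == v,
    "!=": lambda c, v: col(c) != v,
}

with st.form("query_input_with_operators"):
    for allowed_column in allowed_columns:
        col1, col2, col3 = st.columns(3)
        col1.code(f"{allowed_column}")
        col2.selectbox("Operator", list(operators), label_visibility="collapsed", key=f"{allowed_column}.op")
        col3.text_input("Filter Value", value="%", label_visibility="collapsed", key=f"{allowed_column}.val")
    submitted = st.form_submit_button("Execute Query")
    if submitted:
        # Apply the chosen operator per column, then run the whole query in Snowflake
        for allowed_column in allowed_columns:
            op = st.session_state[f"{allowed_column}.op"]
            val = st.session_state[f"{allowed_column}.val"]
            df = df.filter(operators[op](allowed_column, val))
        st.data_editor(df.collect())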


Hey @raethlein - that’s a good approach.

I was trying to work with something slightly different; here’s the code for it:

# Python packages
import streamlit as st
import numpy as np
import datetime as datetime
import pandas as pd
from snowflake.snowpark.context import get_active_session

# Get the current credentials
session = get_active_session()

# Create initial dataframe
df = session.sql("SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS;").to_pandas()

# Sidebar with filters
st.sidebar.header('Filters')

st.sidebar.divider()

st.sidebar.subheader('Total price')

o_total_price_max = st.sidebar.number_input('Max', min_value=0.00, max_value=df["O_TOTALPRICE"].max(), value=df["O_TOTALPRICE"].max(), step=1.00)
o_total_price_min = st.sidebar.number_input('Min', min_value=0.00, max_value=df["O_TOTALPRICE"].max(), value=df["O_TOTALPRICE"].min(), step=1.00)


st.sidebar.divider()

st.sidebar.subheader('Priority')

o_order_priority = st.sidebar.multiselect(
    "Player tier:",
    options=df["O_ORDERPRIORITY"].unique(),
    default=df["O_ORDERPRIORITY"].unique()
)

# Filter dataframe
df_filtered = df.query(
    "O_ORDERPRIORITY in @o_order_priority & \
    O_TOTALPRICE >= @o_total_price_min & \
    O_TOTALPRICE <= @o_total_price_max" 
)

# Dataframe
st.title("Dataframe")

st.dataframe(df_filtered.head(1000))

# Aggregation
df_agg = df_filtered.groupby('O_ORDERSTATUS').agg(
    TOTAL_PRICE_AGG=('O_TOTALPRICE', 'sum')
).reset_index()

st.title("Dataframe agg")

st.dataframe(df_agg)

And here’s what I get: [screenshot of the resulting dashboard]

There are a few problems with it though (I think):

  • The filters and aggregations are applied directly to the in-memory pandas DataFrame (not sure if it would be better to push them down to the table instead - see the sketch after this list)
  • Whenever I change a filter, the dashboard reloads automatically - it would be nice to have a button that lets me change all the filters and then click “Go”
  • If my table has more than 100k rows, it starts to get slow
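For the first point, I guess the alternative would be to push the filter into the query itself so Snowflake only returns the matching rows. Just a sketch using bind parameters - I haven’t verified that it’s actually faster:

# Hypothetical pushed-down version of the total-price filter: the WHERE clause runs in Snowflake,
# so only the matching rows are pulled into the app as pandas
df_filtered = session.sql(
    "SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS WHERE O_TOTALPRICE BETWEEN ? AND ?",
    params=[o_total_price_min, o_total_price_max],
).to_pandas()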

Any recommendations?

Hi @Felipe3

Below is your code with improvements for both the filter-change behavior and the handling of large tables.
However, Streamlit in Snowflake does not yet support Pandas on Snowflake, so unfortunately the first version below does not work perfectly there yet.

  1. First, st.form is a useful UI for exactly this: change all the filters first, then click “Go” once to run everything.
  2. Next, when handling large amounts of data with Streamlit in Snowflake, we recommend the Snowpark DataFrame API or, if you are more familiar with pandas, Pandas on Snowflake. You write pandas-compatible code, and it is translated into Snowflake SQL and executed in the warehouse behind the scenes, which allows the processing to scale.

# Python packages
import streamlit as st
import numpy as np
import datetime as datetime
import modin.pandas as pd  ## ★ modify!
import snowflake.snowpark.modin.plugin  ## ★ Add!
from snowflake.snowpark.context import get_active_session

# Get the current credentials
session = get_active_session()

# Create initial dataframe
df = session.sql("SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS").to_snowpark_pandas() ## ★ modify!

# Sidebar with filters
with st.sidebar:
    st.header('Filters')
    
    st.divider()
    
    st.subheader('Total price')
    
    o_total_price_max = st.number_input('Max', min_value=0.00, max_value=df["O_TOTALPRICE"].max(), value=df["O_TOTALPRICE"].max(), step=1.00)
    o_total_price_min = st.number_input('Min', min_value=0.00, max_value=df["O_TOTALPRICE"].max(), value=df["O_TOTALPRICE"].min(), step=1.00)
    
    
    st.divider()
    
    st.subheader('Priority')
    
    with st.form("form"):  ## ★ Add!
        o_order_priority = st.multiselect(
            "Player tier:",
            options=df["O_ORDERPRIORITY"].unique(),
            default=df["O_ORDERPRIORITY"].unique()
        )

        st.form_submit_button("Go")
    

# Filter dataframe  ## ★ modify! (DataFrame.query is not implemented in pandas on Snowflake, so use boolean masks instead)
df_filtered = df[
    df["O_ORDERPRIORITY"].isin(o_order_priority) &
    (df["O_TOTALPRICE"] >= o_total_price_min) &
    (df["O_TOTALPRICE"] <= o_total_price_max)
]


# Dataframe
st.title("Dataframe")

st.dataframe(df_filtered.head(1000))

# Aggregation
df_agg = df_filtered.groupby('O_ORDERSTATUS').agg(
    TOTAL_PRICE_AGG=('O_TOTALPRICE', 'sum')
).reset_index()

st.title("Dataframe agg")

st.dataframe(df_agg)

Also, here is an example implemented with the Snowpark DataFrame API. This one works as-is, but if you are not familiar with Snowpark DataFrames it may be a little harder to write. In that case, let’s look forward to Pandas on Snowflake becoming usable with Streamlit in Snowflake :laughing:

# Python packages
import streamlit as st
import numpy as np
import datetime as datetime
from snowflake.snowpark.context import get_active_session
from snowflake.snowpark import functions as F

# Get the current credentials
session = get_active_session()

# Create initial dataframe
df = session.sql("SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.ORDERS") ## ★ modify!

# Sidebar with filters
with st.sidebar:
    st.header('Filters')
    
    st.divider()
    
    st.subheader('Total price')

    o_totalprice_max = float(df.agg(F.max("O_TOTALPRICE")).collect()[0][0])
    o_totalprice_min = float(df.agg(F.min("O_TOTALPRICE")).collect()[0][0])
    
    o_total_price_max = st.number_input('Max', min_value=0.00, max_value=o_totalprice_max, value=o_totalprice_max, step=1.00)
    o_total_price_min = st.number_input('Min', min_value=0.00, max_value=o_totalprice_max, value=o_totalprice_min, step=1.00)
    
    
    st.divider()
    
    st.subheader('Priority')

    unique_values = df.select("O_ORDERPRIORITY").distinct().collect()
    unique_values_list = [row["O_ORDERPRIORITY"] for row in unique_values]
    
    with st.form("form"):  ## ★ modify!
        o_order_priority = st.multiselect(
            "Player tier:",
            options=unique_values_list,
            default=unique_values_list
        )

        st.form_submit_button("Go")
    

# Filter dataframe
df_filtered = df.filter(
    (F.col("O_ORDERPRIORITY").isin(o_order_priority)) &
    (F.col("O_TOTALPRICE") >= o_total_price_min) &
    (F.col("O_TOTALPRICE") <= o_total_price_max)
)

# Dataframe
st.title("Dataframe")

st.dataframe(df_filtered.limit(1000))

# Aggregation
df_agg = (
    df_filtered.group_by('O_ORDERSTATUS')
    .agg(F.sum('O_TOTALPRICE').alias('TOTAL_PRICE_AGG'))
)

st.title("Dataframe agg")

st.dataframe(df_agg)