Can Streamlit handle big data analysis?

Hi Community,

I am new to Streamlit and would like to know whether it can handle big data analysis. If anybody is using Streamlit at production level, kindly comment. Also, can we build a multi-user web application with logins, as we would with HTML, CSS and JavaScript?

I don’t have the hard experience with it at scale that you’re looking for, but it’s so slow and unstable even with small data I don’t see how this could be used with Big Data at scale unless you want to pump a ton of money into paying for custom support.

Multi-user production use with logins and enterprise grade security? You saw the zero day that just happened this week right? I don’t think you’d want to use it for that unless it’s just doing viz of publicly available data (which is my use case but I am on the verge of giving up on it even for that).

I think it’s mostly good for making little tutorial posts and proof of concept examples for those who can’t be bothered to learn HTML, CSS and JS… I sound bitter right? Don’t go by me…

I have been personally using Streamlit in an enterprise setting processing Dataframes with hundreds of thousands of rows. We use ag-grid to display the data and we use Redis caching for our Dataframes that are generated via some Jupyter Notebooks. Some things to keep in mind:

  1. If possible, avoid doing things like groupby in Streamlit, as this is rather expensive.
  2. Compress your Dataframes when stored. In our case we use compressed pickled Dataframes that we load from Redis and only load from files when the data is unavailable in the cache.
  3. We don’t use memoization or streamlit caching as we ran into problems when our app is deployed in the cloud. We instead rely on Redis caching.
  4. If we need to load Dataframes from files we use compressed Parquet format and this seems to make loading and parsing much faster in Streamlit.
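
A minimal sketch of the cache-first pattern described in points 2 and 3, not the poster's actual code. It assumes a cache client that exposes `get(key)` and `set(key, value)` and returns bytes or `None`, which is how a default redis-py client behaves. The helper names (`store_compressed`, `load_cached`, `DictCache`) are hypothetical, and `zlib` is just one possible compression choice.

```python
import pickle
import zlib


def store_compressed(cache, key, obj):
    """Pickle obj, compress the bytes and store them under key."""
    blob = zlib.compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    cache.set(key, blob)


def load_cached(cache, key, load_from_file):
    """Return the cached object; on a miss, call load_from_file() and cache it."""
    blob = cache.get(key)
    if blob is not None:
        return pickle.loads(zlib.decompress(blob))
    obj = load_from_file()  # e.g. read a compressed Parquet file
    store_compressed(cache, key, obj)
    return obj


class DictCache:
    """In-memory stand-in with the same get/set calls, for local testing only."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
```

With a real Redis server this could be called as `load_cached(redis.Redis(), "sales_df", lambda: pd.read_parquet("sales.parquet"))`, where the key and file name are made up for illustration. Pandas DataFrames are picklable, so the same helpers should apply to them unchanged.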

Hi @hack-r ,

So you mean it is better to go with HTML, CSS and JS instead of Streamlit? Do you have any knowledge of Databricks? How can I connect Streamlit to Databricks to get data from it?

Hi @Steven_Atkin,

Thanks for the support. Do you have any knowledge of Databricks? I am looking for help connecting Databricks and Streamlit: we do the computation in Databricks, and those results need to be shown in Streamlit.

@sridharr Correct. Streamlit is just a Python library so it’s the same as connecting Python to Databricks:

@Steven_Atkin Quick question about your use of aggrid - did you determine how to set the default column widths and show/hide columns using code?

It’s really worth $750 per app for you? Why not just use something free?

@hack-r would love any examples you could share of Streamlit being slow and unstable with small data. Also can you elaborate on what you mean by slow? There are a few tricks under the hood to help speed up data processing and we are always looking for ways to improve Streamlit especially in terms of speed, so any examples you can provide will help!

In terms of Ag-Grid you can use the Streamlit component for free if you’re not using enterprise features. It’s also one of the highest priorities for us this year to upgrade st.dataframe to provide more native data interaction experiences.

@sridharr in terms of production, thousands of companies run Streamlit apps in production and host their apps themselves. Streamlit doesn’t offer an enterprise deployment option on our Cloud, as Streamlit Cloud is specifically meant to help users get started and share apps with the world, in order to help the broader open source data community and researchers, students, and hobbyists spread their data work. Many large companies (including over half the Fortune 50) host Streamlit themselves and add in their own logins and other features there. We are currently working with Snowflake to provide an enterprise deployment solution for companies that don’t want to do their own deployments.

@hack-r Thanks, I am using the same docs. In this case, I am sending the SQL query, getting the required DataFrame and showing it in Streamlit. In Databricks we have a workspace where all the SQL queries were run and the results are stored there as CSV. I don't know how to get that CSV, for example through a GET API of Databricks (as in Node.js, where we write APIs for CRUD operations).

@Amanda_Kelly Thanks for your reply. My company is using Databricks to handle very large datasets (e.g. 1 to 10 million rows, a 50GB CSV file). If someone on your team has expertise in Databricks, I would like to discuss this further with them to learn about connecting Streamlit with it.

I don’t have specific knowledge on DataBricks but maybe someone else in the community has experience they can share?

@Amanda_Kelly Thank you

@Steven_Atkin This goes a bit off the track but can you elaborate on “We don’t use memoization or streamlit caching as we ran into problems when our app is deployed in the cloud. We instead rely on Redis caching”? We’re revisiting caching right now, so would love to get any feedback you have!

@Amanda_Kelly Regarding speed - thanks for asking - I do have some examples and would love to hear the tips and tricks. Let me know if it’s appropriate for this thread or if I should open another question.

On that note - is there a recommended method for measuring page load times? There are some 3rd party web apps I normally use but in at least some cases (if not all) I’m not sure they are correctly measuring the full load time for Streamlit, but rather when Streamlit itself returns a page before it loads the content.

@sridharr are you actually trying to load 50GB into a browser?

I assume that’s not the case. So it must be that you just want to transfer or read 50GB raw data from DataBricks to whatever server you’re running streamlit on then let your DS’s or whoever run some Python scripts against it and display some of the results in Streamlit?

The details of how you’re processing the data may matter here. If you just need an example of how to get 50GB (or whatever size) data from DB to a server via Python I can find that for you but I assume you probably already have it. So maybe it’s just that you were thinking of this as a Streamlit operation when really the data pull doesn’t need to be a Streamlit step.

Are they trying to read this data dynamically with it updating all the time and always wanting the Streamlit app hitting the most recent version of the data? Maybe you can give an example. If the real use case is too secretive you can just explain by analogy like “the 50GB is a large dataset of transactions, the Streamlit app will use menus to let users graphically create DB queries which will then be summarized in a word cloud and a tabular table. The data is refreshed daily. The upstream database type is Snowflake.”.

@hack-r You are right, I am working with 50GB of data on Databricks, and the worked-out results are in the Databricks workspace. I want to connect Streamlit to those results. I don’t want to read the entire table. Is there any option, or a function that can be written, to fetch the results from Databricks?

Currently I am using this function:

import pandas as pd
from databricks import sql

def run_query(query):
  # Open a connection, run the query and return the result as a DataFrame.
  with sql.connect(server_hostname = "dbc-xxxxx-xxxx.cloud.databricks.com",
                  http_path       = "sql/protocolv1/o/123456789/0614-xxxxx-yyyyyy",
                  access_token    = "dapia12cdfertfg45467yghu9hjghdfher3") as connection:
    df = pd.read_sql(query, connection)
    return df

Here is the problem: I am sending a SQL query and getting the result from Databricks as a table, but what I want is to get the existing worked-out results. Is there any workaround?
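
One possible approach, sketched under assumptions: if the worked-out results are saved as a table in Databricks, the app can query that table and read only a bounded number of rows instead of the whole thing. The helper below relies only on standard Python DB-API cursor calls (`execute`, `description`, `fetchmany`, `close`), which Databricks SQL connector cursors also provide. The function name `fetch_rows`, the `results` table and the `demo` function are hypothetical, and `sqlite3` is used purely as a local stand-in so the sketch runs without a Databricks workspace.

```python
import sqlite3  # local stand-in only; a Databricks connection would replace it


def fetch_rows(connection, query, max_rows=1000):
    """Run query and return (column_names, rows), reading at most max_rows rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        columns = [col[0] for col in cursor.description]
        rows = cursor.fetchmany(max_rows)  # never pulls more than max_rows
    finally:
        cursor.close()
    return columns, rows


def demo():
    """Build a tiny in-memory 'results' table and fetch its first two rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE results (region TEXT, total INTEGER)")
        conn.executemany(
            "INSERT INTO results VALUES (?, ?)",
            [("north", 10), ("south", 20), ("west", 30)],
        )
        return fetch_rows(
            conn, "SELECT region, total FROM results ORDER BY total", max_rows=2
        )
    finally:
        conn.close()
```

In the function above, the same helper could be called with the `connection` opened by `sql.connect(...)`, and for tuple-like rows something like `pd.DataFrame(rows, columns=columns)` should give a DataFrame to display. Filtering and aggregating in the SQL itself keeps the heavy work on the Databricks side.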

I would suggest that you don’t attempt to use the caching feature that is available in Streamlit and instead rely on external sources of caching. The built-in Streamlit caching will not work with 50GB of data.

@Steven_Atkin Thanks for the reply. As per your suggestion, I will check out Redis caching…