CSV over 2 GB

Hello everyone!

I need to upload a CSV of over 2 GB to my Streamlit app, then filter it and download the result. Has anyone worked with such volumes? What settings do I need to add besides maxUploadSize? How does Streamlit behave?

Thanks!

Hi @Saveliy_Borkov,

You might be able to solve your problem using pandas by reading the CSV in chunks with pd.read_csv(…, chunksize=…): https://stackoverflow.com/questions/25962114/how-do-i-read-a-large-csv-file-with-pandas
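
As a rough sketch of that chunked approach (the file name, the 'price' column, and the threshold are placeholders, not anything from this thread), you can filter chunk by chunk so the whole file never sits in memory at once:

import pandas as pd

# Read the CSV in pieces and keep only the rows that pass the filter.
filtered_chunks = []
for chunk in pd.read_csv('big_file.csv', chunksize=1_000_000):
    filtered_chunks.append(chunk[chunk['price'] > 100])

result = pd.concat(filtered_chunks, ignore_index=True)
result.to_csv('filtered.csv', index=False)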

You might also want to check out Dask. I haven’t tried Dask with Streamlit yet, but I have a side project (on my ever-growing list of Streamlit apps I want to build) to do so: https://pythondata.com/dask-large-csv-python/
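
For reference, a minimal Dask sketch (file name and column are again placeholders) would look something like this:

import dask.dataframe as dd

# Dask reads the CSV lazily in partitions instead of loading it all into RAM.
ddf = dd.read_csv('big_file.csv')
filtered = ddf[ddf['price'] > 100]   # lazy filter, nothing computed yet
result = filtered.compute()          # materialize the filtered rows as a pandas DataFrame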

Thank you for your response, @Chad_Mitchell!

My question was more about Streamlit itself (how it copes with big data), but thanks a lot for the Dask advice; I’ll try it tomorrow.

Hi @Chad_Mitchell!

Dask wants a path to a file, but Streamlit can’t provide one.

Streamlit returns a StringIO object; maybe there is a solution that works with that (without read_csv)?
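
Maybe writing the buffer to a temporary file would work, so Dask gets a real path? Just a sketch, assuming the uploader hands back an in-memory StringIO/BytesIO-style buffer as described above:

import tempfile

import dask.dataframe as dd
import streamlit as st

uploaded = st.file_uploader('Upload a CSV', type='csv')
if uploaded is not None:
    data = uploaded.read()
    if isinstance(data, str):  # a StringIO buffer yields text, so encode it to bytes
        data = data.encode('utf-8')
    # Dump the upload to a temporary file so dd.read_csv has a path to work with.
    with tempfile.NamedTemporaryFile(delete=False, suffix='.csv') as tmp:
        tmp.write(data)
    ddf = dd.read_csv(tmp.name)

Note that this still pulls the whole upload into memory once before writing it out.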

Hi @Saveliy_Borkov,

Can you share your code? Below is a potential workaround.

import os
import streamlit as st
folder_path = '.'  # directory that holds the CSV files; adjust to your setup
filenames = os.listdir(folder_path)
selected_filename = st.selectbox('Select a file', filenames)

@Chad_Mitchell I need to let the user pick an arbitrary file, so my folder_path isn’t always the same.

There’s nothing special in my code: just st.file_uploader and read_csv, then some selectboxes with filters.

I think it could be instructive to take a step back: what is the desired user flow here?

Uploading large files through the browser is always going to have some inefficiency. Is the intent to allow a user other than the developer to upload large files to have them processed?

@randyzwitch Yes, I’d like to create a tool where people can upload a big data file, filter it, and get a brief overview (graphs, sums, etc.).
Expected usage is about 10 people per month.

OK, in that case, you just need to decide what “big” means for your use case. If you’re going to let people upload 100 GB CSV files, then you need a machine that can hold that much data in RAM.

When using file_uploader, we save the bytes of the file to RAM. You’ll need to change the configuration file (server.maxUploadSize) to allow uploads larger than the 200 MB default, but I think the 2 GB ceiling comes from our use of Protobuf to transfer messages. You can read the background on the 2 GB Protobuf limit in this StackOverflow post.
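
For reference, that setting lives in .streamlit/config.toml; a sketch (the 2048 MB value is just an illustration):

[server]
maxUploadSize = 2048  # maximum allowed upload size, in megabytes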

In general, if you are thinking about “big data” applications, I would suggest changing your interface to one where users provide a public URL to the file. That way you don’t transfer the data through the browser at all; instead, you’d use a library such as Requests to download the file straight to the Python backend, sidestepping any browser limitations.
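
A rough sketch of that URL-based flow (the text-input label and local file name are placeholders):

import requests
import streamlit as st

url = st.text_input('Public URL of the CSV file')
if url:
    local_path = 'downloaded.csv'
    # Stream the download so the whole file never has to sit in memory at once.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(local_path, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
    st.success(f'Saved to {local_path}; now filter it with pandas or Dask.')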
