Read_excel on uploaded_file is too slow

I uploaded a 1.4 MB file using Streamlit's st.file_uploader. When I read the uploaded file with pd.read_excel(), it takes a long time. However, the same read takes only a few seconds on my local machine. What could be the issue?

Appreciate any pointers.

NOTE: We had already fixed the nginx config, increasing client_max_body_size so that files larger than 1 MB can be uploaded at all.

Hi @Sarnath_K :wave:

Can you please share a link to your app and / or repo?

Thanks for the reply. Unfortunately, it is a private app inside our company's app infrastructure, which is based on EKS (Kubernetes).

No worries – Since I can’t see the code, I’ll proceed by elimination! :smiling_face:

First things first, have you tried reading the file using a BytesIO object instead of directly from the uploaded file? This may improve performance by reducing read latency in some environments. Here’s an example:

from io import BytesIO

import pandas as pd
import streamlit as st

uploaded_file = st.file_uploader("Upload Excel file", type=["xlsx", "xls"])
if uploaded_file is not None:
    # Copy the upload into an in-memory buffer, then parse from that buffer
    file_bytes = BytesIO(uploaded_file.read())
    df = pd.read_excel(file_bytes)

Best,
Charly

I have not tried this. I will try and let you know! Thanks!

Sounds good, @Sarnath_K!

Let me know how it goes.

Best,
Charly

Hi Charly,

I tried it, but after waiting 20 seconds or so I get the error below.

ValueError: Unable to read workbook: could not read manifest from None. This is most probably because the workbook source files contain some invalid XML. Please see the exception for more details.

UPDATE:

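  1. The error was my mistake. I had to call .seek(0) before using the buffer again (I first check the number of sheets before invoking read_excel).
    One improvement we can make is to use “getvalue()” instead of “read()”, since getvalue() returns the full contents regardless of the current stream position.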

  2. Now I see that the reads are much faster than before. Earlier they took 2 to 3 minutes; now they are down to 15 seconds. Is there a way we can reduce that to 5 seconds? :slight_smile:
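
The seek(0) pitfall from point 1 can be reproduced with a plain BytesIO buffer, which is the class Streamlit's UploadedFile subclasses. This is only a minimal sketch: the name `buffer` and its byte string are hypothetical stand-ins for the uploaded workbook, not code from the app.

```python
from io import BytesIO

# Hypothetical stand-in for the uploaded workbook's bytes.
buffer = BytesIO(b"example workbook bytes")

first = buffer.read()    # reads everything and leaves the position at the end
second = buffer.read()   # nothing left to read, so this is empty

buffer.seek(0)           # rewind before handing the buffer to another reader
third = buffer.read()    # full contents again

# getvalue() ignores the current position and always returns the full contents.
snapshot = buffer.getvalue()

print(len(first), len(second), len(third), len(snapshot))
```

In the app, passing BytesIO(uploaded_file.getvalue()) to each reader gives every call a fresh buffer positioned at the start, so no explicit seek(0) is needed between the sheet-count check and read_excel.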

FURTHER UPDATE

  1. Part of the low performance was also because our Docker container had meager resources. After doubling the resources and using BytesIO, we reached 15 seconds. The remaining bottleneck is the speed of the “pd.ExcelFile” / “pd.read_excel” calls on the BytesIO buffer. These complete in 5 seconds on my laptop but are slower in the Docker container. I am not sure why, but it is important to understand that these calls need a good allocation of CPU.
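
To compare the laptop and the container on equal terms, it may help to time the parse step in isolation. The helper below is a hedged sketch, not part of the app: `timed` is a hypothetical name, and the `sum` call is a stand-in workload where the real app would call pd.read_excel on the BytesIO buffer.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in the app this would be the read_excel call on the buffer.
result, seconds = timed(sum, range(1_000_000))
print(f"parse step took {seconds:.3f}s")
```

Running the same measurement in both environments shows whether the gap comes from the parse itself or from something around it, such as the upload.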

Thanks a lot for your inputs!

Glad to hear my input helped resolve your issues! :hugs:

Happy Streamlit-ing! :balloon:

Best,
Charly

Thank you for your time, Charly! Much appreciated!
