File_uploader test different encoding

Hello good people,
I have a specific use case where I need the user to upload xls files that might use one of two different encodings.
Unfortunately, using encoding = “auto” doesn’t do the trick with one of them, so I would like to try opening the file with one encoding and then move on to the next one if the first fails. Right now the only workaround I could think of is the following:

try:
    uploaded_file = st.file_uploader(type="xls", encoding = 'GB18030', key = 'a')
except:
    uploaded_file = st.file_uploader(type="xls", encoding = 'utf-16-le', key = 'b')

As you can imagine, this is not very clean, because it forces the user to select the file a second time if the decoding fails.

Any thoughts on how I could solve this?
Thank you!
f.

Hi @Fil, welcome to the Streamlit community!

I think your try/except solution could work structured a little differently. Could the file_uploader widget be moved out of the try block, and instead you take those bytes and try to convert the Excel file? Meaning, take the file upload in an encoding that covers both the GB18030 and utf-16-le encoding ranges (UTF-8?), then convert to each of the encodings you might get and see if it gives you the right answer?

Best,
Randy

Hi Randy,
thank you for your reply. UTF-8 doesn’t work in my case. Initially I tried something like this (if that’s what you mean):

> uploaded_file = st.sidebar.file_uploader(type="csv", encoding = None, key = 'a')
> 
>try:
>     encoding = 'utf-16-le'
>     df = pd.read_csv(uploaded_file, encoding = encoding )
>except:
>     encoding = 'gb18030'
>     df = pd.read_csv(uploaded_file, encoding = encoding )

If I do that, the upload doesn’t give any error, and if the first try is successful I get my data and everything works.
The problem appears when the right encoding is the second one (‘gb18030’ in the example above). In that case I get this error:

“EmptyDataError: No columns to parse from file”

If I instead try ‘gb18030’ first, everything works again.

It is almost like after the first attempt the variable uploaded_file is lost somehow (I’m sure this is not the right technical explanation).

Do I need somehow to cache the uploaded_file in order to try several things after?
Thanks again
F.

Yes, this is what I meant, and it looks like you are close. I think what’s happening here is that file_uploader returns a BytesIO buffer, which in most cases functions the same way as having the file itself. The one difference is that once you read the buffer, its read position stays at the end, so a second read returns nothing.

Try putting a statement like file_bytes = uploaded_file.read() after the uploaded_file line, then try to read the file_bytes object instead. My theory here is that file_bytes will be a bytestring, and that will persist across the try/except block.
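
To illustrate the theory above, here is a minimal sketch that uses a plain io.BytesIO object as a stand-in for the uploaded file (an assumption: the real object comes from file_uploader, and the sample CSV text is made up). It shows that a second read() returns nothing, while the bytes captured by the first read() stay available.

```python
import io

# Stand-in for the uploaded file; the sample content is invented for illustration.
uploaded_file = io.BytesIO("a,b\n1,2\n".encode("utf-16-le"))

file_bytes = uploaded_file.read()   # first read consumes the stream
second_read = uploaded_file.read()  # position is at the end, so this is b""

# file_bytes is an ordinary bytes object and can be decoded repeatedly.
text_once = file_bytes.decode("utf-16-le")
text_twice = file_bytes.decode("utf-16-le")

# Rewinding the buffer is another way to make the data readable again.
uploaded_file.seek(0)
after_rewind = uploaded_file.read()
```

Either approach should work: keep the bytes around, or call seek(0) on the buffer before each new attempt.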

Hi Randy,
Thank you again for taking the time, your solution works wonders!

I’ll leave my attempt here in case others encounter the same issue:

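from io import StringIO

import streamlit as st

uploaded_file = st.sidebar.file_uploader(type="xls", encoding=None, key='a')
# Read the buffer once; the resulting bytes can be decoded as often as needed
bytes_data = uploaded_file.read()

try:
    encoding = 'gb18030'
    s = str(bytes_data, encoding)
except UnicodeDecodeError:
    # The first encoding failed, so fall back to the second candidate
    encoding = 'utf-16-le'
    s = str(bytes_data, encoding)

# Wrap the decoded text in a file-like object
data = StringIO(s)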

Then you can simply read your data with Pandas as a normal CSV.

Thanks again for this! Streamlit is really great, and I'm looking forward to seeing what you put together for the team version.
Cheers
f.

Fantastic!

How do I read the data with Pandas as a normal CSV after calling the str() function?
Could you please post the code?
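
A minimal sketch of that last step, assuming the file content is comma-separated text. The helper name decode_with_fallback and the sample bytes are hypothetical; in the app, the bytes would come from uploaded_file.read() as in the post above.

```python
from io import StringIO

import pandas as pd


def decode_with_fallback(raw, encodings=("gb18030", "utf-16-le")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper, not part of Streamlit or pandas.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up sample standing in for the bytes read from the uploaded file.
raw = "name,city\n张伟,北京\n".encode("gb18030")

text, used = decode_with_fallback(raw)

# StringIO turns the decoded text into a file-like object that read_csv accepts.
df = pd.read_csv(StringIO(text))
```

One caveat: gb18030 accepts many byte sequences, so a decode that succeeds is not proof that the guess was right. The order of the candidates matters, and it is worth checking the resulting DataFrame. If the file is tab-separated, passing sep="\t" to read_csv should handle it.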