st.file_uploader gives a different PDF result than processing the same file from a local path

Hi everyone!

EDIT: Please see here for a better overview of the issue: st.file_uploader produces undesired results for some pdfs. · Issue #4878 · streamlit/streamlit · GitHub

____ Old Post:
I am running into a weird bug that I cannot figure out. I’ll try to explain.

I have a Streamlit app that allows users to upload documents (docx, pdf, txt) and automatically processes and cleans them into paragraphs using haystack and some custom functions.

The upload and processing work fine. However, I noticed that when I run the exact same function line by line in a notebook instead of in the Streamlit app, I get very different results for PDFs: with the Streamlit upload, some words are usually not processed correctly.

The only difference is that in the notebook I simply pass the local file path of the PDF instead of uploading it through st.file_uploader.
I assume that st.file_uploader somehow stores the PDF differently, and I am not sure how to fix this.
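
One way to test this assumption is to compare a hash of the uploaded bytes with a hash of the file on disk. This is only a minimal sketch: the helper names (`sha256_of_bytes`, `sha256_of_path`, `same_content`) are made up for illustration, and it assumes the uploaded bytes are obtained in the app with `uploaded_file.getvalue()`.

```python
import hashlib
from pathlib import Path


def sha256_of_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def sha256_of_path(path: str) -> str:
    """Return the hex SHA-256 digest of the file stored at `path`."""
    return sha256_of_bytes(Path(path).read_bytes())


def same_content(uploaded_bytes: bytes, local_path: str) -> bool:
    """Return True if the uploaded bytes are identical to the file on disk."""
    return sha256_of_bytes(uploaded_bytes) == sha256_of_path(local_path)
```

If `same_content(...)` returns True for the same PDF, the upload itself has not changed the file, and the difference more likely comes from how the processing code reads an in-memory object compared with a file path.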

Again, the correct output comes from defining local file paths in the notebook. The Streamlit output is faulty.

Please let me know if this is understandable. The preprocessing scripts and document uploading functions can be found here:

It seems to work fine for docx and txt.

Example difference

PDF uploaded via st.file_uploader:
2nd paragraph (words not correctly processed):

PDF loaded by defining a local file path in a Jupyter notebook:
2nd paragraph (correctly processed words):

Something (undesired) is happening to the pdf when it is stored in memory by st.file_uploader.
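
A possible workaround, sketched here under assumptions rather than confirmed as a fix, is to write the uploaded bytes to a temporary file and pass that path to the same function that works in the notebook. The helper name `write_to_temp_file` is hypothetical, and the bytes are assumed to come from `uploaded_file.getvalue()` in the app.

```python
import os
import tempfile


def write_to_temp_file(data: bytes, suffix: str = ".pdf") -> str:
    """Write `data` to a new temporary file and return its path.

    The caller is responsible for deleting the file afterwards.
    """
    # mkstemp returns an open OS-level file descriptor plus the file's path
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path
```

In the app this would be called as `path = write_to_temp_file(uploaded_file.getvalue())`, after which the path-based processing runs on `path` and the file is removed with `os.remove(path)`. If the output then matches the notebook, the problem is in how the in-memory upload is handed to the PDF conversion step rather than in the bytes themselves.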