Where does the data go when using file_uploader? When does it get deleted?

Working on a concept or two here and curious as to what happens to the file once a user uploads it through Streamlit.

Say a user uploads a CSV file to process. Once it’s processed, how long does that CSV file live on, and when does it get deleted?

Thanks!

Hi @felixdasilva -

What you are experiencing is the difference between web-based programming and “data programming” (however that’s defined)…in this case, Streamlit isn’t saving your file anywhere. The file uploader takes the stream of bytes coming from the widget and saves it in RAM (like any other piece of data). By not writing to disk, you’re removing a step that takes time, so you’re improving performance.

The data lives on until the Streamlit app re-runs from top to bottom, which happens on each widget interaction. If you need to keep the uploaded data between runs, you can cache it so that Streamlit persists it across re-runs.

Best,
Randy


Thank you!


@felixdasilva yeah - this has been a lot of trial and error for me too… converting streams into files and such because that’s easier with the documentation I have for the GIS processing libraries… I wonder if there might be a way to make GitHub file storage a simpler thing? So a stream comes in from some API, caches to GitHub, and becomes persistent for “a while”. idk, I’m sure there are a lot of problems with a solution like that… the nice thing about being in memory is that it is ephemeral… but it can make finding documentation for specific projects harder - especially if you’re a novice like myself

The important point here is that Streamlit doesn’t do it…but that doesn’t mean that Python can’t. If you have a stream in a BytesIO buffer, writing it to a file is done by:

import io

myio = io.BytesIO(b"example data")  # e.g. the bytes from the uploaded file

with open("out.txt", "wb") as outfile:
    # Copy the BytesIO stream to the output file
    outfile.write(myio.getbuffer())

When you say saving it in RAM, I’m assuming that’s the RAM on Streamlit’s server? I just want to make sure that any data passed is not kept forever - or, ideally, that no data is ‘uploaded’ at all, if that’s even possible when processing ‘local’ data.

@felixdasilva that is certainly my understanding… I think it makes the Streamlit sharing platform relatively safe by design…

RAM is only capable of temporary storage. The data is only available inside the container your app is running in, with each app running in separate containers.

@randyzwitch

Thanks again for this amazing product. I have a related question.

I have an app running with nginx. I automatically clear the cache after 30 minutes. For added security I would like to suggest to the user to clear the cache when done.

Does your answer “The data is only available inside the container your app is running in, with each app running in separate containers” mean that if a user clears the streamlit cache using the hamburger menu option she is clearing “her” cache or is she also clearing the cache of all other users?

Thanks

Fabio

This is in reference to the Streamlit sharing service. Streamlit the open-source library does not itself run as a container process.

The cache is global, relative to the arguments passed into the function call. If you are doing something with sensitive information per user account, you should consider alternative ways of authenticating your application.

Thanks Randy,

Mine is a free service without user authentication. I would like to keep it that way, at least for now. However, the information the users upload might indeed be sensitive.

I thought that by using nginx/apache I would get separate sessions and therefore separate caches. Is this not the case? If not, any suggestions on possible approaches?

Many thanks

Fabio

I’m not an nginx expert to be honest, but my expectation is that it would be the tornado server that’s managing the overall memory, not nginx. I think nginx would just be a reverse proxy?

Maybe @thiago has some ideas here?

Hi all!

Just jumping in real quick to clarify how files are stored:

Streamlit manages your uploaded files for you. Files are stored in memory (i.e. RAM, not disk), and they get deleted as soon as they’re not needed anymore.

This means we remove a file from memory when:

  • The user uploads another file, replacing the original one
  • The user clears the file uploader
  • The user closes the browser tab where they uploaded the file

LMK if this helps!

Thanks @thiago,

Very helpful. My question is what happens to the cache when there is more than one user at the same time. If one user clears ‘his’ cache and exits, does another user who was running another session (say, under nginx) also have her cache cleared, suffering a dip in performance?

Or say the first user exits without clearing the cache and the second user is a dangerous hacker. Will she be somehow able (we are assuming an app with no user credentials) to recover the sensitive info of her victim?

I know I am dramatizing a bit but just to give it color…,

Many thanks ! Great product!

Fabio

Oh, I see!

To clarify: no user has any access to the files uploaded by any other user.

This is true whether the two users are using the app concurrently or not. This is because our file manager data structure is a per-user-session structure.

(Of course, you can always program the ability to share files between users yourself if you want to. For example, by saving the file to disk and showing it to the other user. But I’m assuming you’re not doing that.)


One thing that isn’t clear from your response, though, is what you mean at some points when you talk about “cache”. If by “cache” you mean “uploaded files”, then what I said above is correct.

But if by “cache” you mean “st.cache”, then the behavior there is different.

st.cache is a global cache keyed by the input parameters to the cached function (among other things - but for the sake of this discussion, let’s simplify!). This means that if you call the same function with the same parameters for two different users, the returned value will be the same for both users. So this is one way you could inadvertently share information between users.
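That keying behavior can be sketched in plain Python with `functools.lru_cache` - an analogy, not Streamlit’s actual implementation:

```python
from functools import lru_cache

calls = {"count": 0}

@lru_cache(maxsize=None)
def expensive(x):
    # The body runs only on a cache miss for this argument;
    # every caller ("user") shares the same memo table.
    calls["count"] += 1
    return x * 2

expensive(21)  # computed: the body runs
expensive(21)  # same argument: served from the shared cache
```

No matter who calls `expensive(21)` next, the body never runs again - which is exactly why identical inputs can share a computed result across users.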

Another way you could share information between users is by storing information in other shared resources, like disk, databases, global module-level variables, and so on.

Hopefully this clarifies things!

Thanks @thiago, I am getting there :slight_smile:

To check if I understood correctly…

My assumptions: Streamlit app, running under nginx. No authentication mechanism (no userId, no login,…). Users upload a sensitive CSV dataset they do not want anybody else to see. The app uses many instances of st.cache to improve performance of different functions. The sensitive user data is always one of the input parameters of the cached functions.

So:

  1. If a user A uploads a dataset, this dataset cannot be accessed by user B, even if user B is a hacker. :slight_smile: Great.

  2. The global st.cache content that “belongs” to user A is not accessible to user B (since user B cannot access user A’s dataset, and therefore cannot call the function with the same parameters) :slight_smile: Great.

Last doubt.

User A and user B are working, at the same time, with different datasets, on the streamlit app. All is good and they cannot see the other user’s file, whether they are hackers or not.

At one point, user A, for whatever reason, clears the cache - which, if I understand correctly, also deletes the info in the global st.cache tied to user B.

User B will experience a loss in performance until the cache rebuilds itself? :frowning_face:

If this is true (I hope not), then once you have enough users, there will statistically always be someone clearing the cache, so the cache will more or less always be empty? Any ideas on how to avoid this, assuming it is true?

Thanks for the clarification!

Fabio