For some background, my specific use case is a dashboard where we can review some of the raw FTP files we receive. The dashboard downloads the data from S3, a 1 GB XML file, then parses it and displays it to the user in a prettified way (e.g. some of the XML entries are better displayed as a DataFrame table). The main thing I am trying to work out is how best to handle the download. Right now I am using the download_file method (Downloading files - Boto3 1.26.124 documentation), but I am worried that downloading the file to disk will leave it on the server indefinitely with nothing to clean it up. In the past I have downloaded files into memory instead of to disk, but I don't know whether that will cause memory problems, especially if the data is also cached in memory and the server risks running out of memory. I am curious how others have solved this.
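For reference, a minimal sketch of the in-memory variant mentioned above, assuming `s3_client` is a boto3 S3 client (e.g. created with `boto3.client("s3")`). The function name `download_to_memory` is just illustrative. Nothing touches disk, but the whole object sits in RAM until the returned bytes are released, which is exactly the memory concern with a 1 GB file.

```python
import io


def download_to_memory(s3_client, bucket: str, key: str) -> bytes:
    """Download an S3 object into memory and return its raw bytes.

    `s3_client` is assumed to be a boto3 S3 client. Nothing is written
    to disk, but the full object is held in RAM.
    """
    buffer = io.BytesIO()
    # download_fileobj writes the object into any writable binary file-like object.
    s3_client.download_fileobj(Bucket=bucket, Key=key, Fileobj=buffer)
    return buffer.getvalue()
```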
With that in mind, I was wondering whether cache_data with persist=True would help, but I still have questions about it.
I am looking for advice on the proper use of the cache_data persist parameter. I have the following questions:
- What are the main use cases for the persist parameter? Is it mainly for potentially large data that would take up too much space in memory, where disk may have more room? E.g. downloading a 1 GB XML file that I then do data manipulation on.
- Given there is no integration with the ttl functionality, does data cached with persist stay on the server's disk indefinitely, or is there a process that can clean up or delete it? I am mainly worried about using this parameter and having files saved to disk indefinitely, which raises long-term disk space concerns on the server.
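To make the second question concrete, this is the kind of cleanup I would otherwise have to run myself: a minimal sketch of an age-based sweep over a directory. The function name `remove_stale_files` and the directory passed to it are hypothetical; I have not confirmed where persisted cache entries are written or whether deleting them by hand is safe.

```python
import time
from pathlib import Path


def remove_stale_files(directory, max_age_seconds, now=None):
    """Delete regular files in `directory` last modified more than
    `max_age_seconds` ago. Returns the list of removed paths."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).iterdir():
        # Only top-level regular files are considered; subdirectories are skipped.
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path)
    return removed
```

Something like this would have to be triggered on a schedule (cron, a background thread, or at app start), since it only runs when called.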
Another idea: I don't know if there is some way to use a temporary directory, but I still don't know how to guarantee that it gets deleted after a certain amount of time, or what other safety features like that exist.
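A minimal sketch of the temporary-directory idea, assuming `s3_client` is a boto3 S3 client and using Python's standard `tempfile` module. The function name `parse_from_temp_download` and the file name `download.xml` are illustrative. Note that the directory is removed when the `with` block exits (even if parsing raises), not after a fixed amount of time, and that the parsed tree still lives in memory afterwards.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_from_temp_download(s3_client, bucket: str, key: str):
    """Download an S3 object to a temporary directory, parse it as XML,
    and return the root element. The file is gone once this returns."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = Path(tmp_dir) / "download.xml"
        # download_file writes the object to the given local filename.
        s3_client.download_file(Bucket=bucket, Key=key, Filename=str(local_path))
        # ET.parse builds the whole tree in memory before the file is deleted.
        root = ET.parse(str(local_path)).getroot()
    return root
```

This does not cover a hard crash of the process (e.g. the server being killed mid-download), in which case the temporary directory could be left behind.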