I'm using Streamlit to create an exploratory data tool. The application runs in a container on ECS Fargate, and this container runs two processes:
- python update_service.py, which monitors an S3 bucket for an updated vaex hdf5 file. Whenever a new file is found, it is downloaded into the container, where it is used by the Streamlit app;
- streamlit run app.py, which starts the Streamlit server.
Because vaex hdf5 files are memory-mapped, whenever update_service.py updates the file on the file system, Linux deletes the original file that is still in use but keeps its content in an orphaned (deleted) state, which still consumes storage. Each update shrinks the free disk space further. It seems that, depending on what you do with the data, a handle is not released, even if you close the data, delete the variable and call gc.collect(). The lsof command shows the opened hdf5 file in a “deleted” state.
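To make the symptom easier to check from inside the process (without lsof), here is a minimal sketch that lists files the current process still holds open or mapped after they were deleted. It assumes a Linux /proc file system; the function name deleted_open_files is just a name I made up for illustration.

```python
import os

DELETED_SUFFIX = " (deleted)"


def deleted_open_files(pid="self"):
    """Return sorted paths that the process still has open or memory-mapped
    although they were already deleted. Linux-only (reads /proc)."""
    found = set()

    # Open file descriptors: each entry is a symlink to the target path,
    # and Linux appends " (deleted)" once the file has been unlinked.
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # descriptor was closed while we were iterating
        if target.endswith(DELETED_SUFFIX):
            found.add(target[: -len(DELETED_SUFFIX)])

    # Memory mappings: the sixth column of /proc/<pid>/maps is the path,
    # which matters here because a mapping can outlive its descriptor.
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            parts = line.rstrip("\n").split(None, 5)
            if len(parts) == 6 and parts[5].endswith(DELETED_SUFFIX):
                found.add(parts[5][: -len(DELETED_SUFFIX)])

    return sorted(found)
```

Calling deleted_open_files() inside the Streamlit app after an update should list the old hdf5 path for as long as the handle or mapping is still held.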
I tracked the issue down to the function to_pandas_df(). I call it to transform a vaex dataframe into a pandas one before showing it in a graph or aggrid. It looks like when I do this, vaex loses track of the memory-mapped file handle. Fixing this permanently might require some changes to the vaex core, but in the meantime I was thinking that a temporary workaround would be to restart the Streamlit server altogether whenever the hdf5 file is modified. A “Rerun” or a browser refresh doesn't do the trick. It has to be a full restart of the Streamlit server (like stopping it and running the command “streamlit run app.py” again).
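To illustrate the kind of restart I have in mind, here is a rough sketch of a small supervisor that would replace the direct "streamlit run app.py" process in the container. It polls the file and restarts the child process when the file changes. This is only an assumption of how it could look, not something Streamlit provides: the names file_signature, stop and supervise, the polling interval and the max_restarts parameter are all made up for illustration.

```python
import os
import subprocess
import time


def file_signature(path):
    """Return (inode, mtime, size) for path, or None if it does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return (st.st_ino, st.st_mtime_ns, st.st_size)


def stop(proc, timeout=10.0):
    """Ask the child to terminate, and kill it if it does not exit in time."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def supervise(cmd, watched_path, poll_seconds=5.0, max_restarts=None):
    """Run cmd and restart it whenever watched_path changes or cmd exits.

    max_restarts is a hypothetical knob so the loop can end (handy for
    testing); None means run forever. Returns the number of restarts.
    """
    restarts = 0
    last = file_signature(watched_path)
    proc = subprocess.Popen(cmd)
    try:
        while max_restarts is None or restarts < max_restarts:
            time.sleep(poll_seconds)
            current = file_signature(watched_path)
            if current != last or proc.poll() is not None:
                last = current
                stop(proc)  # exiting the old process releases its handles
                proc = subprocess.Popen(cmd)
                restarts += 1
    finally:
        stop(proc)
    return restarts
```

It would be started with something like supervise(["streamlit", "run", "app.py"], "/data/current.hdf5"), where the hdf5 path is a placeholder. One drawback I can see is that every restart interrupts whoever is using the app at that moment, and the supervisor could restart the server while update_service.py is still writing the file unless the new file is moved into place atomically.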
How could I do such a server restart properly?