I developed a program that uploads a PDF and then extracts information in a specific format from it using a regular expression match. I've uploaded the code to GitHub and deployed it on Streamlit Cloud. When I upload many PDFs at the same time, the extraction is very slow. Is there any way to speed it up?
Would you share a link to your GitHub repository? Without seeing any code, the first thing that comes to mind is the multiprocessing module (see "multiprocessing: Process-based parallelism" in the Python 3.11.4 documentation), assuming that all the PDFs are processed independently of each other.
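Since I have not seen your code, here is only a minimal sketch of the idea. It assumes the text has already been pulled out of each PDF into a string; the pattern `INVOICE_RE` and the function names `extract_fields` and `extract_all` are made up for illustration, so swap in your own regular expression and extraction step.

```python
import re
from multiprocessing import Pool

# Hypothetical pattern; replace it with the expression your app actually uses.
INVOICE_RE = re.compile(r"Invoice No:\s*(\w+-\d+)")


def extract_fields(text):
    """Return every match of the pattern found in one document's text."""
    return INVOICE_RE.findall(text)


def extract_all(texts, workers=4):
    """Run extract_fields on each document using a pool of worker processes."""
    with Pool(processes=workers) as pool:
        # map() keeps the results in the same order as the input list.
        return pool.map(extract_fields, texts)


if __name__ == "__main__":
    # Stand-in strings; in the real app these would come from the uploaded PDFs.
    docs = [
        "Invoice No: INV-001 issued in March",
        "Invoice No: INV-002 and Invoice No: INV-003",
    ]
    print(extract_all(docs, workers=2))
```

The worker function has to be defined at module level so it can be pickled and sent to the worker processes. This only helps if each document really can be handled on its own, with no shared state between them.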
Thank you so much.
I used the multiprocessing approach you mentioned, and it was much faster.
But I have a new question: what is the CPU configuration when I run the program deployed on GitHub, and is multiprocessing supported?
On Streamlit Cloud you get 16 CPUs. However, I am not sure how many of those you are actually allowed to use concurrently. Some time ago I tested this for an image processing app, and it seemed that all 16 CPUs were available, at least for parallel tasks that did not take too long to finish.
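If you want to check this from inside your own app, a rough sketch like the following should work. The helper name `usable_cpus` is mine, not part of any library. `os.sched_getaffinity` only exists on some platforms (Linux, for example), so the code falls back to `os.cpu_count()` elsewhere. Neither call accounts for container CPU quotas, so treat the number as an upper bound rather than a guarantee.

```python
import os


def usable_cpus():
    """Best-effort count of the CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Set of CPU ids the current process (pid 0 = self) may be scheduled on.
        return len(os.sched_getaffinity(0))
    # cpu_count() can return None when the count cannot be determined.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print("CPUs reported by the OS:", os.cpu_count())
    print("CPUs usable by this process:", usable_cpus())
```

You could pass the result as the number of worker processes for your pool, and then time a batch of uploads with a few different values to see what the platform actually lets you use.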
Here are other system details that might be helpful: