Selenium web scraping on Streamlit Cloud

Hello. I was wondering whether Streamlit supports Selenium. I have an app that scrapes data from WholeFoods, cleans the data, and shows insights into different discounts. Everything works great locally; however, once deployed, I'm getting:

ModuleNotFoundError: No module named 'selenium'

If anyone has any tips on this, that would be great. Hopefully this app will help WholeFoods shoppers who have Amazon Prime and want to make the most of their membership and find the best discounts! (Highest discounts :sunglasses:)

Here is a link to the app. Currently, entering your own zip code does not work, because doing so runs a .py file that contains the scraping code, and that file requires Selenium. I have checked that Selenium installs fine on Streamlit Cloud at boot, and requirements.txt lists the latest version.

https://share.streamlit.io/youssefsultan/wholefoods-datascraping-project-deployment/main/Deployment/streamlit_app.py#live-wholefoods-on-sale-product-insights

Hi @YoussefSultan, welcome to the Streamlit community!

If I had to guess, I suspect your issue is with this line:

Streamlit Cloud runs on a Debian image, not a Windows one, so C:/ won't exist. I would explore how to install Selenium on Debian, and add those installations to a packages.txt file, as highlighted in the documentation.

Best,
Randy
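
To illustrate the point about the Windows path: a minimal sketch, assuming the driver binary has been installed somewhere on PATH, that looks the driver up at runtime instead of hard-coding C:/. The helper name find_executable is hypothetical and not from the app in question.

```python
import shutil
from typing import Iterable, Optional


def find_executable(candidates: Iterable[str]) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)  # works on Linux, macOS, and Windows
        if path:
            return path
    return None


# Try the Firefox driver first, then the Chrome one (names are assumptions).
driver_path = find_executable(["geckodriver", "chromedriver"])
print(driver_path or "no driver found on PATH")
```

If this prints "no driver found on PATH" on the deployed app, the driver still has to be installed there, whatever path the local machine uses.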

Hi Randy,

Thank you for the response. That is a great point that I definitely want to look into. However, since the error says that selenium was not found, I want to find the root of that issue first; once selenium loads properly, I would expect to see the error about the Chrome driver path.
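
One way to narrow down a ModuleNotFoundError like this is to report, from inside the running app, whether the module is importable and which interpreter is doing the importing. This is only a sketch using the standard library; the helper name module_report is hypothetical.

```python
import importlib.util
import sys


def module_report(name: str) -> dict:
    """Describe whether a top-level module is importable by this interpreter."""
    spec = importlib.util.find_spec(name)  # None when the module is missing
    return {
        "module": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
        "interpreter": sys.executable,
    }


print(module_report("selenium"))
```

If "found" is False in the deployed app but True locally, the two environments have different packages installed, which points at the requirements file or its location rather than at the scraping code.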

I think the Chrome driver path can be fixed easily by pointing a pathlib path at the driver in the GitHub repo. The selenium import is a different matter: Streamlit Cloud doesn't recognize that there is a module named 'selenium' at all, so I suppose it has to do with the install.
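
A minimal sketch of the pathlib idea, assuming the driver is committed under a drivers/ folder in the repo. The folder name and the helper repo_driver_path are hypothetical.

```python
from pathlib import Path


def repo_driver_path(repo_root, driver_name="chromedriver"):
    """Build the driver path relative to the repo root, not an absolute drive path."""
    return Path(repo_root) / "drivers" / driver_name


# In an app, repo_root would typically be Path(__file__).resolve().parent
print(repo_driver_path(Path.cwd()))
```

Note that this only fixes the path. A driver binary built for Windows would still not run on the Debian image mentioned above.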

Please let me know if you know any successful projects deployed that use the selenium module so I can compare and contrast and come to a fix! Hopefully this can help many others on the platform.

There is an issue locating google-chrome-stable from packages.txt when the server spins up. To fix it, I need to wget a Chrome Debian package from Google's website. Is there a way to supply a link so that, on startup, I can wget this package and make Chrome locatable?

In this case, upon startup, we would have:

Get:7 https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

From there, it could find the location of the Chrome installation.

Thanks.

Here’s a minimal example of running Selenium on Streamlit Cloud:

import os

import streamlit as st

@st.experimental_singleton
def installff():
  # Download geckodriver via SeleniumBase; the cache decorator makes this run once
  os.system('sbase install geckodriver')
  # Link the driver into the virtualenv's bin directory so it is on PATH
  os.system('ln -s /home/appuser/venv/lib/python3.7/site-packages/seleniumbase/drivers/geckodriver /home/appuser/venv/bin/geckodriver')

_ = installff()

from selenium import webdriver
from selenium.webdriver import FirefoxOptions

# Start Firefox without a display
opts = FirefoxOptions()
opts.add_argument("--headless")
browser = webdriver.Firefox(options=opts)

# Load a page and show its HTML in the app
browser.get('http://example.com')
st.write(browser.page_source)

The only Python requirement is installing seleniumbase; the only package required for packages.txt is firefox-esr.

If you absolutely have to use Chrome, you should be able to specify chrome in the sbase... line, and instead of firefox-esr, you can install chromium.

I will eventually make a public example of this, since it seems to be tripping a few people up.

Best,
Randy
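
The symlink command in the example hard-codes python3.7 in the site-packages path. A hedged sketch of computing that location from the running interpreter instead, assuming SeleniumBase keeps its drivers under seleniumbase/drivers as in the path above. The helper name seleniumbase_driver_path is hypothetical.

```python
import sysconfig
from pathlib import Path


def seleniumbase_driver_path(driver="geckodriver", site_packages=None):
    """Locate a SeleniumBase-installed driver without hard-coding the Python version."""
    # "purelib" is the site-packages directory of the current interpreter
    base = Path(site_packages or sysconfig.get_paths()["purelib"])
    return base / "seleniumbase" / "drivers" / driver


print(seleniumbase_driver_path())
```

The resulting path could then be used as the source of the ln -s command, so the example keeps working if the Python version of the image changes.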

The new Service install procedure works super well. Repaired my Selenium Scraping POC too (with Selenium, no SeleniumBase :slight_smile: ) thanks @randyzwitch !

import streamlit as st
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.firefox import GeckoDriverManager

URL = ""
TIMEOUT = 20

st.title("Test Selenium")

# Run Firefox without a display
firefoxOptions = Options()
firefoxOptions.add_argument("--headless")
# webdriver-manager downloads geckodriver and returns its path
service = Service(GeckoDriverManager().install())
driver = webdriver.Firefox(
    options=firefoxOptions,
    service=service,
)
driver.get(URL)

Source code: andfanilo/s4a-selenium: Test Selenium + Firefox on Streamlit Share (github.com)
App: Streamlit

Regarding performance and optimization: as far as I know, Streamlit-provisioned servers allocate very little shm (shared memory) per instance. Do you find your solution to be faster than using SeleniumBase? What main differences are you seeing, and why not use SeleniumBase? Just interested in your perspective. Thanks, and congrats on getting it to work!

I’ve not tested but IMO there should not be a difference. SeleniumBase is a testing framework wrapper around Selenium so you may find the API nicer to use :slight_smile: (I’m just personally more used to the low-level framework ahah)

Could you edit the code so that it works with Chrome, if you don't mind?

@jonsarz16 here you go

https://selenium.streamlit.app/

packages.txt:

chromium

requirements.txt:

streamlit
seleniumbase
webdriver-manager

streamlit_app.py:

import streamlit as st

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Run Chrome without a display
options = Options()
options.add_argument('--disable-gpu')
options.add_argument('--headless')

# Cache only the function that creates the driver
@st.experimental_singleton
def get_driver():
    return webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)

driver = get_driver()
driver.get('http://example.com')

st.code(driver.page_source)

Hi, I'm trying to use Selenium for web scraping on Streamlit Cloud, but I get an error.
Could you help me solve the problem?
This is my app:
APP_LINK

It gets comments from YouTube using Selenium and classifies those comments into categories using machine learning.

Screen Shot 2022-12-14 at 10.07.50

LINK TO GITHUB CODE

Could you check my requirement.txt and Dashboard.py to find the error in the Selenium Chrome driver?

Many thanks

@Franky1

Perhaps switch the st.experimental_singleton to st.experimental_memo here and reboot/redeploy the app?

Hi thank you for replying!

I’ve tried to do so but the error still won’t go away…

Could it be my packages.txt or requirements.txt?

Or is something not right with my Selenium driver code?

Any help would be fantastic!

@snehankekre @Franky1

This is the error. My Selenium code won't work on Streamlit Cloud.

Screen Shot 2022-12-14 at 10.07.50

Any help would be great!
@snehankekre @andfanilo

I will defer to the community. In my example, I applied the cache decorator only to the function returning the driver. You've applied caching to more than just that. Perhaps you could decorate only the bit that returns the driver? Beyond that, I would look to the community for help.
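
A minimal sketch of that suggestion, written so it runs without a browser. functools.lru_cache stands in for the Streamlit cache decorator, and FakeDriver is a hypothetical stand-in for the Selenium driver; neither comes from the app being discussed.

```python
from functools import lru_cache


class FakeDriver:
    """Hypothetical stand-in for a Selenium WebDriver (no browser needed)."""

    instances = 0  # counts how many drivers were ever created

    def __init__(self):
        FakeDriver.instances += 1
        self.page_source = ""

    def get(self, url):
        self.page_source = f"<html>{url}</html>"


@lru_cache(maxsize=1)  # stand-in for the Streamlit cache decorator
def get_driver():
    """Only this factory is cached, so the driver is created once."""
    return FakeDriver()


def scrape(url):
    """Not cached: runs on every call but reuses the single cached driver."""
    driver = get_driver()
    driver.get(url)
    return driver.page_source


scrape("http://example.com")
scrape("http://example.org")
print(FakeDriver.instances)
```

The scraping and classification steps stay outside the cached function, so the cache never has to store anything other than the driver itself.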

It works fine on my localhost but doesn't work on Streamlit Cloud.

By the way, thank you so much!

Hey, I have changed the code to return just the driver, but the error is still there.

Could you check the code in Dashboard.py?

Thank you!

My issue is that I cannot find all the HTML content. I'm looking for a table that does not appear in the deployed app, although locally it works perfectly.

s = BeautifulSoup(self.driver.page_source, features='lxml')
table = s.find(
    "table",
    class_="table table-primary table-forecast allSwellsActive msw-js-table msw-units-large"
)

I have a problem

Details: Can't pickle local object '_createenviron.<locals>.encode'
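
A possible reading of this message, offered as an assumption rather than a diagnosis: the name _createenviron.<locals>.encode matches a function defined inside another function in Python's os module, and pickle cannot serialize such local functions. An error like this would appear if an object holding os.environ, which a driver object may do, is returned from a cache that pickles its results. The sketch below only demonstrates the pickle behaviour; the helper can_pickle is hypothetical.

```python
import os
import pickle


def can_pickle(obj) -> bool:
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_encoder():
    # A function defined inside another function is a "local object"
    def encode(value):
        return value.encode()
    return encode


print(can_pickle({"a": 1}))        # plain data pickles fine
print(can_pickle(make_encoder()))  # local functions do not
# os.environ is expected to fail on CPython, since it holds such local functions
print(can_pickle(os.environ))
```

If this is the cause, applying a non-pickling cache decorator only to the function that returns the driver, as in the examples earlier in the thread, should avoid it.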

Don’t forget the packages.txt (see MWE).

In my case this solved the problem you have reported.