Selenium web scraping on Streamlit Cloud

Hello. I was wondering whether Streamlit supports Selenium. I have an app that scrapes data from WholeFoods, cleans it, and shows insights on the different discounts. Everything works great locally; however, once deployed, I’m getting:

ModuleNotFoundError: No module named 'selenium'

If anyone has any tips on this, that would be great. Hopefully this app will help WholeFoods shoppers who have Amazon Prime make the most of their membership and find the best discounts! (Highest discounts :sunglasses:)

Here is a link to the app. Currently, entering your own zip code does not work, because doing so runs a .py file that contains the scraping code, and that file requires selenium. I have checked that selenium installs fine on Streamlit Cloud when booting up, and requirements.txt lists the latest version.

https://share.streamlit.io/youssefsultan/wholefoods-datascraping-project-deployment/main/Deployment/streamlit_app.py#live-wholefoods-on-sale-product-insights
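
One way to narrow down a ModuleNotFoundError like this is to print, from inside the deployed app, which interpreter is running and whether it can see the package. The sketch below is only a diagnostic, not a fix. `describe_module` is a hypothetical helper name, and it uses nothing beyond the standard library. If the scraping .py file is launched as a separate process (an assumption about how the app is wired), comparing `sys.executable` in both places may show that the child process uses a different Python than the one requirements.txt was installed into.

```python
import importlib.util
import sys


def describe_module(name):
    """Return the file `name` would be imported from, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Interpreter that is actually running this script.
    print("python:", sys.executable)
    # None here means this interpreter cannot import selenium.
    print("selenium:", describe_module("selenium"))
```

Running the same two prints from the main app and from the scraping script should make any mismatch visible in the logs.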

Hi @YoussefSultan, welcome to the Streamlit community!

If I had to guess, I suspect your issue is with this line:

Streamlit Cloud runs a Debian image, not a Windows one, so C:/ won’t exist. I would explore how to install Selenium on Debian and add those installations to a packages.txt file, as highlighted in the documentation:

Best,
Randy


Hi Randy,

Thank you for the response. That is a great point and I definitely want to look into it. However, since the error says that selenium was not found, I want to find the root of that issue first; once selenium loads properly, I would expect the next error to be about the chrome driver path.

I think the chrome driver path can be fixed easily by pointing a pathlib path at the driver in the GitHub repo. The selenium problem is different: Streamlit Cloud doesn’t recognize that there is a module named ‘selenium’ at all, so I suppose it has to do with the install.
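
A minimal sketch of the pathlib idea, assuming the driver binaries are committed under a `drivers/` folder next to the script. The folder name and the `bundled_driver_path` helper are hypothetical, not taken from the repo.

```python
import sys
from pathlib import Path


def bundled_driver_path(base_dir, platform=sys.platform):
    """Build the path to a chromedriver binary stored under <base_dir>/drivers."""
    # Windows builds of chromedriver end in .exe; Linux/macOS builds do not.
    filename = "chromedriver.exe" if platform.startswith("win") else "chromedriver"
    return Path(base_dir) / "drivers" / filename
```

In the app this could be called as `bundled_driver_path(Path(__file__).parent)`. Note that a Windows chromedriver.exe will not run on the Debian image, so a Linux build of the driver would have to be in the repo as well.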

Please let me know if you know any successful projects deployed that use the selenium module so I can compare and contrast and come to a fix! Hopefully this can help many others on the platform.

There is an issue locating google-chrome-stable from packages.txt when the server spins up. To fix it, I need to wget the Chrome Debian package from Google’s website. Is there a way to supply a link so that the package is fetched with wget during startup and Chrome can then be located?


In this case, upon startup, we would have

Get:7 https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

Thus from there, it can find the location of the chrome installation.

Thanks.

Here’s a minimal example of running Selenium on Streamlit Cloud:

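import os

import streamlit as st


# Cache the install step so it runs once per server process, not on every rerun.
# Newer Streamlit releases replace st.experimental_singleton with st.cache_resource.
@st.experimental_singleton
def installff():
  # Download geckodriver via SeleniumBase, then link it into the venv's bin directory.
  # The python3.7 segment of the path must match the app's Python version.
  os.system('sbase install geckodriver')
  os.system('ln -s /home/appuser/venv/lib/python3.7/site-packages/seleniumbase/drivers/geckodriver /home/appuser/venv/bin/geckodriver')


_ = installff()

from selenium import webdriver
from selenium.webdriver import FirefoxOptions

opts = FirefoxOptions()
opts.add_argument("--headless")  # no display is available on the server
browser = webdriver.Firefox(options=opts)

browser.get('http://example.com')
st.write(browser.page_source)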

The only Python requirement is installing seleniumbase; the only package required for packages.txt is firefox-esr.
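
If the browser still fails to start, a quick pre-flight check can show whether the system packages actually landed on PATH. This is a rough sketch using only the standard library. `missing_binaries` is a hypothetical helper, and the executable names in the demo are assumptions that may differ from what the installed packages provide.

```python
import shutil


def missing_binaries(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust to whatever the installed packages provide.
    print("missing:", missing_binaries(["firefox", "geckodriver"]))
```

An empty list means every name was found; anything listed is worth checking against packages.txt or the driver install step.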

If you absolutely have to use Chrome, you should be able to specify chrome in the sbase... line, and instead of firefox-esr, you can install chromium.
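
For the Chrome route, here is a hedged sketch of the options side only. The flags are standard Chromium command-line switches, but whether each one is needed on Streamlit Cloud is an assumption, and `apply_flags` and `make_chrome_driver` are hypothetical helper names. The sketch also assumes a matching chromedriver is already installed and discoverable.

```python
HEADLESS_CHROME_FLAGS = (
    "--headless",               # no display is available on the server
    "--no-sandbox",             # often needed inside containers
    "--disable-dev-shm-usage",  # write shared memory files to /tmp instead of /dev/shm
)


def apply_flags(options, flags=HEADLESS_CHROME_FLAGS):
    """Add each flag to any options object that exposes add_argument()."""
    for flag in flags:
        options.add_argument(flag)
    return options


def make_chrome_driver():
    """Start Chrome/Chromium with the flags above (selenium is imported lazily)."""
    from selenium import webdriver

    return webdriver.Chrome(options=apply_flags(webdriver.ChromeOptions()))
```

Keeping the flag list separate from the driver start-up makes it easy to try the app with and without a given switch.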

I will eventually make a public example of this, since it seems to be tripping a few people up.

Best,
Randy


The new Service install procedure works super well. Repaired my Selenium Scraping POC too (with Selenium, no SeleniumBase :slight_smile: ) thanks @randyzwitch !

import streamlit as st
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.firefox import GeckoDriverManager

URL = ""
TIMEOUT = 20

st.title("Test Selenium")

firefoxOptions = Options()
firefoxOptions.add_argument("--headless")
service = Service(GeckoDriverManager().install())
driver = webdriver.Firefox(
    options=firefoxOptions,
    service=service,
)
driver.get(URL)

Source code: andfanilo/s4a-selenium: Test Selenium + Firefox on Streamlit Share (github.com)
App: Streamlit
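
Neither snippet in this thread closes the browser. On a small instance it may help to make sure the driver is always shut down, even when scraping raises. A possible sketch, where `quitting` is a hypothetical helper and not part of Selenium:

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield `driver`, then call driver.quit() even if the body raises."""
    try:
        yield driver
    finally:
        driver.quit()
```

With the code above, that would read `with quitting(webdriver.Firefox(options=firefoxOptions, service=service)) as driver:` followed by `driver.get(URL)` inside the block.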


In terms of performance and optimization: as far as I know, Streamlit-provisioned servers allocate very little shm (shared memory) per instance. Do you find your solution to be faster than using SeleniumBase? What main differences are you seeing, and why not use SeleniumBase? Just interested in your perspective. Thanks, and congrats on getting it to work!

I’ve not tested but IMO there should not be a difference. SeleniumBase is a testing framework wrapper around Selenium so you may find the API nicer to use :slight_smile: (I’m just personally more used to the low-level framework ahah)
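
On the shared-memory question, the size can be measured from inside the app rather than guessed. A small sketch, assuming the instance exposes shared memory at /dev/shm as Linux systems usually do; `disk_usage_mib` is a hypothetical helper name.

```python
import os
import shutil


def disk_usage_mib(path="/dev/shm"):
    """Return total and free space at `path` in MiB, or None if the path does not exist."""
    if not os.path.exists(path):
        return None
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return {"total_mib": usage.total // mib, "free_mib": usage.free // mib}


if __name__ == "__main__":
    print("/dev/shm:", disk_usage_mib())
```

Logging this once at startup gives a concrete number to compare against when the browser crashes or slows down.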