High Level Streamlit System Design Questions

:rotating_light: Before clicking “Create Topic”, please make sure your post includes the following information (otherwise, the post will be locked). :rotating_light:

  1. Are you running your app locally or is it deployed?
    No
  2. If your app is deployed:
    a. Is it deployed on Community Cloud or another hosting platform?
    b. Share the link to the public deployed app.
    No
  3. Share the link to your app’s public GitHub repository (including a requirements file).
    No link yet
  4. Share the full text of the error message (not a screenshot).
    No error message
  5. Share the Streamlit and Python versions.
    Not relevant

I’m in the design phase of creating a financial analysis app. While there is some very good info on the forum, most of it seems to address down-in-the-weeds issues. But I can’t find anything about a Streamlit high-level design that:

  • Uses financial metrics data from a back-end python app that I created

  • Pulls data from public and private data-provider websites

  • Can be used from my Windows dev workstation, my iPad, my Windows laptop, or my Mac laptop via WiFi, OR…

  • Can be used over the Web using standard security protocols from the Community Cloud.

I think I understand the Python basics (classes, libraries, Pandas, etc.), but one big issue is understanding the best way to get my custom data from my workstation up to the cloud. It seems like the suggested method is to check in the data to GitHub and that will resolve the problem. The problem is that my backend software produces multiple custom datasets, each with thousands or tens of thousands of data rows. I’m concerned about controlling the status of the process and having the ability to upload data ad-hoc without having to use GitHub.

Any suggestions on how I can deal with these issues? Or about where (links) I can find high-level design solutions?

Thanks,

Dan.

3 Likes

Hi @DcPublished

I know that @asehmi has some amazing Streamlit apps that are well architected to scale, where the frontend (Streamlit) and backend (FastAPI) are built separately. Perhaps you can get some inspiration from his apps.

Here are some GitHub repos of his apps:

Hope this helps!

3 Likes

Dataprofessor,

First, thanks for the links.

It’s a bit confusing, but it appears that his Streamlit app is spun up from a Windows PC, and that the media server also runs on his local Windows PC. Media comes from unsplash.com.

That said, this line indicates that some or all of this app can be run on Heroku Dynos and Heroku PostgreSQL. My backend app is written in Python and the database is PostgreSQL. This implies that:

  1. I could build and test the client app on my local Windows PC

  2. When I’m ready to deploy to the “net”, I can deploy the client app to Heroku Dyno(s) and upload the data to Heroku Postgres. At first glance, the cost of Heroku is reasonable. Of course, this needs more investigation and study, but it looks very interesting.

I’ve had 30+ years as a data engineer/scientist, database developer, blah, blah, blah… Bottom line: I love working with databases.

Here’s a link to their website: Cloud Application Platform | Heroku

I’m not sure yet how my backend app, running on my PC workstation, would securely upload data to a server every day. That said, this commercial service might well provide good answers.

Thanks,

Dan.

3 Likes

Hi,

Thanks @dataprofessor for the shout out. @DcPublished the media explorer app can be configured to run its backend as a monolith embedded in the Streamlit app, or in client-server mode as a standalone Streamlit frontend and FastAPI backend server.

The media explorer app only serves as an example of a flexible client-server design that you can follow. If you want to run it locally, then your media collections can point to local folders, but in the cloud you need to use URLs. You could write a small program that uploaded images to an S3 bucket and built the links list used in the media-service.toml file. Alternatively, you can change the appropriate media list generators in media_service.py. I have another version of this app which uses a sqlite database, for example.
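
If you go the S3 route, a minimal sketch could look like the following. This assumes boto3 and configured AWS credentials; the bucket name and the TOML layout are placeholders, not the media explorer’s actual config:

```python
# Sketch: upload local images to S3 and emit a links list.
# Assumes boto3 is installed and AWS credentials are configured.
# BUCKET and the TOML layout are placeholders, not the app's real config.
from pathlib import Path
import boto3

BUCKET = "my-media-bucket"      # hypothetical bucket name
REGION = "us-east-1"
LOCAL_DIR = Path("./media")

s3 = boto3.client("s3", region_name=REGION)
urls = []
for img in sorted(LOCAL_DIR.glob("*.jpg")):
    key = f"media/{img.name}"
    s3.upload_file(str(img), BUCKET, key)       # one upload per file
    urls.append(f"https://{BUCKET}.s3.{REGION}.amazonaws.com/{key}")

# Write the links list in TOML form; adjust the section/key names to
# whatever media-service.toml actually expects.
with open("media-service.toml", "w") as f:
    f.write("[media.links]\nurls = [\n")
    for u in urls:
        f.write(f'    "{u}",\n')
    f.write("]\n")
```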

Deploying on Heroku is easy. There are many posts in this forum showing what to do. Since you want to use PostgreSQL, I’d suggest you start with a pure Streamlit application and squeeze as much performance from it as you can using data caching and the new partial-run decorators. You’ll be able to upload data to Heroku PostgreSQL using the Heroku desktop client, or the tools available in the buildpack (I haven’t used it, so just guessing). Why not use Snowflake? There is a data connector that makes it really easy.
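
For illustration, the pure-Streamlit starting point could look something like this (a sketch only: it assumes Streamlit’s st.connection SQL connector with credentials in .streamlit/secrets.toml, and the table and column names are made up):

```python
# Sketch: pure Streamlit + PostgreSQL with data caching.
# Assumes credentials in .streamlit/secrets.toml under [connections.postgresql];
# the table and column names are illustrative only.
import streamlit as st

conn = st.connection("postgresql", type="sql")   # SQLAlchemy-backed connection

@st.cache_data(ttl=600)                          # reruns reuse the cached result
def load_metrics(ticker: str):
    return conn.query(
        "SELECT trade_date, close FROM daily_metrics WHERE ticker = :t",
        params={"t": ticker},
    )

ticker = st.selectbox("Ticker", ["SNOW", "MSFT", "AAPL"])
st.line_chart(load_metrics(ticker).set_index("trade_date"))
```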

Have a look at @C_Quang’s app too: Public-facing, enterprise-grade deployment of Streamlit

3 Likes

Asehmi and dataprofessor,

First, many thanks for the feedback.

Second, I’ve been thinking this through carefully. Although I love databases, I decided that a database adds a lot of unnecessary complexity. So I switched back to a file-based approach, i.e., checking some files into GitHub with the Streamlit Analysis App and uploading other files via OneDrive.

I want to use OneDrive because I have a lot of space on it and it’s easily accessible from my workstation. It also appears to be just as accessible from a Streamlit app, whether it’s running on a cloud server or on localhost.

Attached is a data flow drawing (DFD) of my to-be application environment. I know the Load System App well because it’s 95%+ built and works well. However, I’m fuzzy about the Analysis System App and the Streamlit environment. Here are a few questions:

  • Roughly speaking, do you see anything wrong with this overall design?
  • I’m assuming that Streamlit can consume files from OneDrive. But can files be dropped from Streamlit to OneDrive?
  • Suggested improvements to the design?
  • Will this work with Streamlit Community Cloud?
  • Or do I need to deploy my Streamlit app on a Heroku or Snowflake server?

Thanks in advance for any feedback.

Best,

Dan.

p.s. The need for a OneDrive load comes down to volume. Once well debugged, the Analysis System App will not be checked in daily, but there will likely be multiple files dropped to OneDrive daily, and some files removed from OneDrive daily. Files should not be huge (maybe a few thousand bytes each), but there may be a fairly large number of them - all dropped hands-free.

4 Likes

@DcPublished, thank you. I was looking for a similar solution.

2 Likes

My original plan for this application was to run everything entirely from the cloud, but I’ve been able to keep all of the various components running without any hiccups for about a month. Unless I’m making changes, it ties up only 10-15 minutes of my time each day the market is open, using a local environment as the hub/launching point for 23+ million rows of data with the following:

  • Streamlit
  • Microsoft Power BI
  • Microsoft Fabric
  • PostgreSQL
  • Rest API

Fin Stream

There are endless possibilities; hopefully you find the one that works best for what you’re trying to achieve.

Hi @DcPublished

If there’s a Python library that lets your app interface with OneDrive files, then sure, that works! You could also look into Google Drive; I think there’s a Python library that supports that as well.
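
For example, a rough (untested) sketch using the msal package and the Microsoft Graph API might look like this. The tenant/app IDs, user, and file path are placeholders, and you’d need an Azure AD app registration with the appropriate Files permissions:

```python
# Sketch: read a OneDrive file into a Streamlit app via Microsoft Graph.
# Assumes an Azure AD app registration with application permissions
# (e.g. Files.Read.All) and the msal + requests packages.
# TENANT_ID, CLIENT_ID, CLIENT_SECRET, USER, and the file path are placeholders.
import io
import msal
import pandas as pd
import requests

TENANT_ID = "your-tenant-id"
CLIENT_ID = "your-app-client-id"
CLIENT_SECRET = "your-app-secret"
USER = "you@yourtenant.onmicrosoft.com"

app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=CLIENT_SECRET,
)
token = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])

# Download a file dropped into OneDrive by the load process.
url = (
    f"https://graph.microsoft.com/v1.0/users/{USER}"
    "/drive/root:/AnalysisDrops/metrics.csv:/content"
)
resp = requests.get(url, headers={"Authorization": f"Bearer {token['access_token']}"})
resp.raise_for_status()
df = pd.read_csv(io.BytesIO(resp.content))

# Dropping a file back into OneDrive is a PUT to the same :/content
# endpoint with the file bytes as the request body (for small files).
```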

2 Likes

First, many thanks to all of you for the great feedback. After some analysis, I realized:

  1. That the benefits of running Streamlit on localhost far outweigh the benefits of running it in the cloud.

  2. One major benefit is that data in my database can be stored and accessed from the Streamlit app ad hoc whenever necessary. Another benefit is that managing the data is much easier with a database. And a major benefit is that it makes it easier to do ad hoc, iterative analysis. For example, pulling different types of data based on what the current point in the analysis tells us.

RepNot, your list of products is very interesting and caused me to consider them carefully. From my research, it appears that Microsoft Fabric and Power BI are meant for the corporate market. Given my 30 years in that market, my experience told me that internal customers needed clean, accurate, semantically correct, current, simple data. (“Simple” does NOT imply easy. Far from it.)

Your Fin Stream website is VERY nice but seems to support my point. Many investors and traders will love your Fin Stream website. On the other hand…

My needs are a bit different. I want to combine the metrics you’ve presented with other metrics like linear regressions for multiple datasets, and then match that with sentiment analysis and other metrics for a ticker’s industry and sector. The results would then be loaded into AI models for further crunching. (Yes, I’m a masochist :face_with_spiral_eyes:)

And this brings me back to Microsoft Fabric and Power BI. It seems like a perfect match for your needs, but I’m not sure how it matches my needs. These tools provide a wide scope of services. My needs on the other hand require a narrower but deeper scope of services.

My new DFD is attached. It still needs work, but it reflects that data will be stored in and accessed from a local PostgreSQL database. It’s a bit rough, but it should help explain what I’m trying to do.

Many thanks,

Dan.

1 Like

I forgot to mention that I’m considering PyGWalker instead of Plotly, which I’m currently using.

This is a library that integrates well with Jupyter (which can run under Streamlit). A nice feature is that it uses Pandas data frames for data retrieval, which can be loaded from delimited files or APIs.

It is being promoted as a competitor to Tableau. I don’t know how well it compares to PowerBI. That said, PyGWalker has a strong cost/benefit ratio. (smile)
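
From the docs, embedding it in a Streamlit page looks roughly like this (untested on my side; the CSV path is just an example and the exact API may differ between PyGWalker versions):

```python
# Sketch: embedding PyGWalker's explorer in a Streamlit page.
# Assumes the pygwalker package is installed; the CSV path is illustrative.
import pandas as pd
import streamlit as st
from pygwalker.api.streamlit import StreamlitRenderer

st.set_page_config(layout="wide")

@st.cache_resource
def get_renderer() -> StreamlitRenderer:
    df = pd.read_csv("daily_metrics.csv")   # or a DataFrame pulled from the database
    return StreamlitRenderer(df)

get_renderer().explorer()                   # Tableau-style drag-and-drop UI
```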

Best,

Dan.

2 Likes

Repnot,

I tried PyGWalker tonight. Unfortunately I was not impressed. What I was able to bring up did not match the demos I’ve seen. And the documentation was very limited.

After seeing your embedded code and screenshots, I searched further. It appears that Microsoft Fabric is very expensive. But looking closer, PowerBI looks like it’s within my budget.

Can you point me to some info about the Power BI tools that you used to integrate with Streamlit?

Best,

Dan

1 Like

[Chart: SNOW - CURRENT OBV - 02/09/2020 to 02/09/2024]

Hi,

I would create separate modules for each major piece of functionality, especially for the data processing parts. These should be written as plain old Python objects (POPOs). Then I would layer a CLI on top of those so that they can be executed manually as required. Use Typer to implement the CLI. Test your core functions fully at this stage via the CLI (you can build a separate CLI test harness, use Jupyter, or even a test Streamlit app). Next you can layer an API on the modules, using FastAPI. This API can perform similar functions to the CLI, and expose endpoints which can be called from your Streamlit app (or any web app). You’ll find the API is mostly just mapping between POPOs and JSON and calling your module functions (like the CLI).
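
As a bare-bones sketch of that layering (the function and field names below are purely illustrative, not from a real codebase):

```python
# Sketch of the layering: a plain-Python module, a Typer CLI on top of it,
# and a FastAPI endpoint on top of the same module.
from dataclasses import dataclass

import typer
from fastapi import FastAPI

# --- module layer: plain old Python objects, no web/CLI framework needed ---
@dataclass
class TickerMetrics:
    ticker: str
    mean_close: float

def compute_metrics(ticker: str) -> TickerMetrics:
    # real implementation would query your database / data files
    return TickerMetrics(ticker=ticker, mean_close=123.45)

# --- CLI layer: Typer wraps the module for manual or scheduled runs ---
cli = typer.Typer()

@cli.command()
def metrics(ticker: str):
    m = compute_metrics(ticker)
    typer.echo(f"{m.ticker}: mean close = {m.mean_close}")

# --- API layer: FastAPI maps the same POPOs to JSON for the Streamlit app ---
api = FastAPI()

@api.get("/metrics/{ticker}")
def metrics_endpoint(ticker: str) -> dict:
    m = compute_metrics(ticker)
    return {"ticker": m.ticker, "mean_close": m.mean_close}

if __name__ == "__main__":
    cli()   # run the CLI (try `python app.py --help`); serve the API with `uvicorn app:api`
```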

This structure gives you a lot of deployment flexibility and you can focus on adding different levels of capability in the API, CLI or Streamlit app as you see fit. For example, you can easily schedule jobs or build a data pipeline from your modules.

The key thing is the business logic (your modules) is encapsulated and reusable in different ways (in CLI, API, Streamlit app, etc.). Over time your modules can take on their own structure to reflect their independent responsibilities.

Start with a solid initial design and you’ll be able to maintain and evolve your system more easily over time. Streamlit encourages building monoliths, which is great up to a point. Thereafter, you should introduce a few simple modular design principles into the app architecture… before your app starts “doing things to you, rather than for you!”

Solely my opinions. Your mileage may vary.

Arvindra

1 Like

First, many thanks for all the great feedback. Great direct and indirect ideas.

This is a data analysis/reporting app with a small data mart as the base layer. A key characteristic of data warehouses and data marts is that they store data from multiple sources, transform it, and attempt to make the dataset consistent. There are multiple other characteristics, but those are the critical ones. And of course, it must meet business needs.

How the data is distributed and transmitted is important, but those are lesser requirements. Which brings me to the driving force that pushes me: generating signals for trading and investing.

The question is how? To help answer this, I created a simplified Process Flow/Data Flow Drawing (attached below) for discussion. Some details:

  • All current processing is on a robust PC workstation. It will be upgraded if necessary (memory, CPU, disks, GPU). Note that upgrading a PC workstation is not super-expensive these days. For example, a high-quality two-terabyte NVMe drive costs about $165 and runs more than three times faster than the same size drive did two years ago. And adding 128GB of RAM can cost less than $300. Certainly not cheap, but the bang for the buck is pretty high now.

  • A PostgreSQL database is used to store structured data - mostly financial transactions.

  • Moving forward, a MongoDB (or other NoSQL) database will be added to store unstructured data from external sources (sentiment, news, events, sector/industry data, etc.). Tools to analyze this are TBD. I need feedback on this.

  • Data is processed daily using downloaded daily and intraday data. The process iterates through approximately 1,300 tickers, looking for tickers that meet criteria, i.e., generating a “signal” ticker (see the sketch after this list).

  • These signal tickers are then analyzed as a second level using other methods and other data (both structured and unstructured). At this point, it’s not clear which tools to use - probably Jupyter and Power BI. I tried several graphical/analysis packages and libraries, but all had issues; these two fit my needs the best. HOW I use them is an open question. I appreciate any feedback.

  • The results can be presented for trading/investing using either Streamlit or PowerBI. Again, this is still TBD. Any feedback is appreciated. Note that virtually no other users will use this system. (Not 100% sure on this yet.)
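
Here is the sketch of the daily scan mentioned in the list above - the data loader and the criterion are placeholders for my real logic:

```python
# Sketch of the daily scan: iterate the ticker universe and flag "signal" tickers.
# The data loader and the criterion below are placeholders only.
import numpy as np
import pandas as pd

def load_daily(ticker: str) -> pd.DataFrame:
    # placeholder: the real system reads the downloaded daily/intraday data
    rng = np.random.default_rng(abs(hash(ticker)) % 2**32)
    return pd.DataFrame({"close": 100 + rng.standard_normal(250).cumsum()})

def scan_for_signals(tickers: list[str]) -> list[str]:
    signals = []
    for t in tickers:                          # ~1,300 tickers in the real run
        df = load_daily(t)
        ma50 = df["close"].rolling(50).mean().iloc[-1]
        if df["close"].iloc[-1] > ma50:        # example criterion only
            signals.append(t)
    return signals

print(scan_for_signals(["SNOW", "MSFT", "AAPL"]))
```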

Best,

Dan.

1 Like

You describe a classic data pipeline scenario (choose your flavour: ETL, ELT, Data Mesh). Your case sounds straightforward enough to hand-craft the pipeline, but I may be wrong. I imagine your system is largely unattended, so you’ll want decent error handling, telemetry, and observability; at that point, start using a workflow orchestration tool. In the Python world “celery” is very popular, but you can look at newer tools like “pathway.com”, “prefect.io”, and “kedro.org” (I wrote a Streamlit blog post on that). Building modular processing units, as I mentioned earlier, is ideal for orchestration with these tools. Those newer tools can also use Redis, which would be super handy in your system as a data staging area and as a pub/sub to trigger your workflows.
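
To give a flavour, the daily pipeline expressed with one of those tools (Prefect’s 2.x API, with placeholder task bodies) would look roughly like:

```python
# Sketch: the daily pipeline as a Prefect flow. Task bodies are placeholders.
from prefect import flow, task

@task(retries=2, retry_delay_seconds=60)
def download_daily_data() -> list[dict]:
    # pull daily/intraday data from the providers
    return [{"ticker": "SNOW", "close": 123.45}]

@task
def find_signals(rows: list[dict]) -> list[str]:
    # apply the signal criteria and return "signal" tickers
    return [r["ticker"] for r in rows if r["close"] > 100]

@task
def load_to_postgres(signals: list[str]) -> None:
    # write results to the staging database
    print(f"storing {len(signals)} signals")

@flow(log_prints=True)
def daily_pipeline():
    rows = download_daily_data()
    signals = find_signals(rows)
    load_to_postgres(signals)

if __name__ == "__main__":
    daily_pipeline()   # can also be scheduled via a Prefect deployment
```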

1 Like

Repnot,

Very nice looking webpage. Clean and clear. It shows the power (pardon the pun) of using a database to store and retrieve data.

I don’t know your schema, but I suspect that you have a calendar table, and that it can be (or is being) used for multiple applications. I have used calendar tables in multiple databases and have expanded them over the years to include multiple dimensions of data. This can save a ton of work.

One small issue… While the graphs on the right side are very understandable, I don’t understand the color coding for the two graphs on the left.

After going to your website, it’s a bit more understandable, but it’s still a little confusing to me.

Please consider adding legends and titles. When a user clicks on part of a map or table, the title (and maybe the legend) could change accordingly.

And if possible, keep the colors consistent. For example, clicking on Texas in the map shows the highest megawatt generation as a red bar with a “Natural Gas” label.

On the other hand, clicking on Washington State shows the highest generation to be hydroelectric, but it is also colored red. (I live in Washington State, so I believe your numbers are correct.) The colors confuse me.

Overall, this is an excellent website that demonstrates the strengths of Power BI. The static screenshot doesn’t do it justice. I strongly encourage others to visit the website at: Power BI Report.

FYI, I just figured out how to message you directly. Check messages.

Best,

Dan

1 Like

asehmi,

Yes, this is a classic pipeline scenario. But that is completely intentional because I’ve been creating pipelines since about 1998 on multiple enterprise systems (SQL Server, Azure, Teradata, Oracle, PostgreSQL and several others) using everything from simple batch scripts to C#, Visual Basic, Perl, SQL scripts, Azure pipelines, Databricks, and Python (and probably a couple others I’ve forgotten).

Looking at my data/process flow drawing, ALL of these solid lines are dataflows written in Python and PostgreSQL stored procedures. Many people think Python is slow. That is true to some extent, until you use tools like Pandas data frames, numpy, and other libraries, and leverage PostgreSQL functions and stored procedures. Then it can be very fast. And almost all of the processing is in batches - very little is in row-by-row processing (called “RBAR” by data warehouse pros - Row-By-Agonizing-Row).
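
A quick illustration of the batch-vs-RBAR point - a 20-day moving average computed both ways on synthetic data; the vectorized call is typically orders of magnitude faster on a large frame:

```python
# Sketch: batch (vectorized) vs row-by-agonizing-row on the same calculation.
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({"close": np.random.rand(100_000) * 100})

# RBAR: a Python loop over every row
t0 = time.perf_counter()
rbar = []
closes = df["close"].tolist()
for i in range(len(closes)):
    lo = max(0, i - 19)
    rbar.append(sum(closes[lo : i + 1]) / (i - lo + 1))
print(f"RBAR:       {time.perf_counter() - t0:.3f}s")

# Batch: one vectorized call, implemented in C under the hood
t0 = time.perf_counter()
df["ma20"] = df["close"].rolling(20, min_periods=1).mean()
print(f"vectorized: {time.perf_counter() - t0:.3f}s")
```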

PostgreSQL is my “staging” area. It holds master data and control data, as well as application data. Right now, it has something like 10 million rows in the tables, down from 100 million after some judicious trimming.

Virtually all of the processing is done in the Python app and PostgreSQL database. This makes it much easier to access any data whenever needed, knowing that it’s all in one place.

After Derek turned me on to Power BI, I can see that it will be very useful for what I’m doing. So, I’m about ready to pull the trigger on a Power BI Pro subscription - the lowest level subscription.

Workflow is done in Python, with the Python app being kicked off by Windows Task Scheduler (TS). Right now, TS runs the Python app directly. That said, it would be a simple matter of writing a wrapper app for more sophisticated control.
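
The wrapper wouldn’t need to be much more than something like this (paths are placeholders):

```python
# Sketch: a thin wrapper Task Scheduler could call instead of the app itself,
# adding logging and basic failure handling. All paths are placeholders.
import logging
import subprocess
import sys

logging.basicConfig(
    filename=r"C:\pipeline\logs\daily_run.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def main() -> int:
    logging.info("starting daily load")
    result = subprocess.run(
        [sys.executable, r"C:\pipeline\load_system.py"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        logging.error("load failed: %s", result.stderr)
        # could notify here, or write a status row to PostgreSQL
    else:
        logging.info("load finished ok")
    return result.returncode

if __name__ == "__main__":
    sys.exit(main())
```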

Note that I’ve used commercial workflow apps before. The major issue I’ve found is that they (like virtually everything else) have a brick wall you can run into. And when you do, it can be a bear to get around that brick wall. For example…

I once had a contract with a large, Seattle-based coffee company. They used a sophisticated IBM workflow system that could only be used if you licensed a “seat”. The problem was that it cost $4,000 per seat in the mid-2000’s. I was creating software for them and needed an algorithm they used to spit out a specific metric. But that algorithm was buried inside this IBM software and not documented anywhere. I had to wait a MONTH before the company finally realized that they needed to fork over $4,000 to get me the needed license.

When I finally got the license, it took about an hour to figure out how to use the system to find the algorithm. Then about 10 minutes to copy it down and exit. MAJOR wasted time. I can give further examples, but I’ve got some concerns with these systems.

Anyway…

After downloading Power BI Desktop (free version), it only took me a few minutes to get it running and connected to my PostgreSQL database. While I’ve had years of experience with reporting systems (including Tableau), I’d never used Power BI before. Unlike other tools I’ve tried, it was extremely stable, easy to use, and had lots of features.

Right now, I’m trying to figure out the right balance of Streamlit, Jupyter, PowerBI, Python, and some AI libraries. It’s going to be fun.

Best,

Dan.

1 Like

Yes, PBI is quite fabulous. I’ve built many dashboards and embedded PBI apps in my past life. We even built custom visuals to match our brand and specific ways of presenting charts (the company is in global economics consulting, modeling, and forecast data). I heard the company I worked for moved to Tableau though, because most of their customers use Tableau and as usual end users just want to download the dashboards rather than use them embedded online, and then they want to extract the data and put it into Excel. All the nice work you do to roll up the data and display it for insights and decision-making simply ends up being unraveled and used once in Excel and then forgotten forever (I’m being sarcastic).

The workflow tools I mentioned are intentionally lightweight and available as open source. They’re available as Python pip install packages.

1 Like

Cheers! :+1:

Push your (non-proprietary) data into Power BI cloud as a dataset, then make it available for others on the data market place. That way you may even be able to monetize your work. Those PBI datasets can be consumed directly by PBI.

By the way, this is a Streamlit forum, so we should probably mention how you can do this on Snowflake too.

1 Like