FAQ: How to improve performance of apps with large data

Problem

Your app uses a large database or large data files (CSV or JSON), and you’ve noticed that the app is slow and want to improve its speed.

Solution

Here are some suggestions that you can look into to speed up your app:

1. Removing unused data

An important point to consider is whether you really need the entire data file. Oftentimes, you may need only a small subset of the original dataset. So instead of loading the entire file, load only the columns you actually need, which consumes less memory and consequently improves the speed of the app.

Here’s how to load only the columns you need (say we need only the 3 columns x1, x2 and x3 instead of the entire dataset) from a CSV data file using Pandas:

import pandas as pd

# Read just the three required columns instead of the full file
df = pd.read_csv('data.csv', usecols=['x1', 'x2', 'x3'])

2. Caching the data

You can use st.cache_data to cache your data:

import pandas as pd
import streamlit as st

@st.cache_data
def load_csv_data():
    # Runs only on the first call; later calls return the cached result
    df = pd.read_csv('data.csv', usecols=['x1', 'x2', 'x3'])
    return df

Briefly, this loads the data from disk on the first run, and subsequent runs then load it from the in-memory cache. The savings can really add up if your app uses the same data more than once.
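As a quick illustration (with throwaway variable names), any call after the first one is served from the cache instead of re-reading the file:

df_summary = load_csv_data()  # first call: reads the CSV and caches the result
df_details = load_csv_data()  # subsequent calls: served from the cache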

Furthermore, there’s also the persist="disk" option that allows you to cache the data to the local disk.
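For example, the loader above can be made to persist its cache like so:

@st.cache_data(persist="disk")
def load_csv_data():
    # The cached result is also written to local disk
    df = pd.read_csv('data.csv', usecols=['x1', 'x2', 'x3'])
    return df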

3. Choosing an optimal data storage format

If you’re using a lot of data in CSV or JSON formats, consider switching to a more computer-friendly format like Apache Parquet or Apache Arrow IPC. While CSV and JSON are optimized for humans, they’re not the speediest for computers! Opting for a binary format like Parquet or Arrow removes the need to parse text into data types like integers, floats, and strings, which is a time-consuming process. Binary formats also usually come with metadata and logical partitioning, which Python can use for efficient data handling.
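As a sketch of the one-time conversion (this assumes pandas with the pyarrow package installed; the file names are illustrative):

import pandas as pd

# One-time conversion: read the CSV and write it back out as Parquet
pd.read_csv('data.csv').to_parquet('data.parquet')

# In the app: Parquet is column-oriented, so only the requested
# columns are read from disk
df = pd.read_parquet('data.parquet', columns=['x1', 'x2', 'x3'])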
