Loading Polars dataframe from Delta table in S3 results in ComputeError

Hi there,

I’m new to Streamlit; this is my first go at creating a simple app that can query some Delta tables on S3. I’m using the polars package to scan a number of Delta tables from S3, create a SQL context in which those tables are registered, execute a query from user input in this context, and show the resulting dataframe to the user. However, even though the exact same code runs fine locally outside of Streamlit, when used in a Streamlit app it throws the following exception:
“ComputeError: Object at location …/part-00001-747373c8-281b-4f70-937d-358e2a4e121d-c000.zstd.parquet not found: Client error with status 404 Not Found: No Body”
To reproduce, I have created a minimal example of the app:

import streamlit as st
import polars as pl

s3_storage_options = {
    "aws_access_key_id": st.secrets["AWS_SECRET_KEY_ID"],
    "aws_secret_access_key": st.secrets["AWS_SECRET_VALUE"],
    "aws_region": "eu-central-1",
}
s3_bucket = st.secrets["AWS_S3_BUCKET"]

def main():
    st.set_page_config(page_title="Delta Lake Explorer", page_icon="📊", layout="wide")
    st.title("Delta Lake Explorer")
    
    try:
        polars_df_djt = pl.scan_delta(f"s3://{s3_bucket}/data/gold/dim_jira_team", storage_options=s3_storage_options)
        ctx = pl.SQLContext().register("dim_jira_team", polars_df_djt)
        query = "select * from dim_jira_team"
        df = ctx.execute(query, eager=True)
        st.write("query succeeded")
        # st.dataframe(
        #     df,
        #     use_container_width=True,
        #     hide_index=False
        # )
    except Exception as e:
        st.write("query failed")
        st.write(e)
    
if __name__ == "__main__":
    main()

My app uses Python 3.11.9, streamlit==1.40.0, and polars==1.16.0.

Running the above example on my local machine, the app writes ‘query failed’ and shows the following exception:
ComputeError: Object at location data/gold/dim_jira_team/part-00001-747373c8-281b-4f70-937d-358e2a4e121d-c000.zstd.parquet not found: Client error with status 404 Not Found: No Body

Traceback:

File "C:\Users\aaluij\Documents\Delta Lake Explorer\delta-lake-explorer\streamlit_test_scan_delta.py", line 23, in main
    df = ctx.execute(query, eager = True)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^File "C:\Users\aaluij\Documents\Delta Lake Explorer\delta-lake-explorer\.venv\Lib\site-packages\polars\sql\context.py", line 439, in execute
    return res.collect() if (eager or self._eager_execution) else res
           ^^^^^^^^^^^^^File "C:\Users\aaluij\Documents\Delta Lake Explorer\delta-lake-explorer\.venv\Lib\site-packages\polars\lazyframe\frame.py", line 2029, in collect
    return wrap_df(ldf.collect(callback))

At first I thought it might have to do with scan_delta, but the exact same thing happens when I replace scan_delta with read_delta; it just fails earlier. read_delta fails at the initial dataframe definition (polars_df_djt = pl.read_delta(f"s3://{s3_bucket}/data/gold/dim_jira_team", storage_options=s3_storage_options)), whereas scan_delta builds a lazyframe and only errors at the moment the actual collect takes place, which is when the query is executed.
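
For clarity, here is a minimal sketch of the eager vs. lazy behaviour described above, reusing s3_bucket and s3_storage_options from the snippet earlier (the table path is just illustrative):

import polars as pl

# Eager: read_delta resolves the table snapshot and loads the data immediately,
# so a missing parquet file surfaces right at this line.
df_eager = pl.read_delta(f"s3://{s3_bucket}/data/gold/dim_jira_team", storage_options=s3_storage_options)

# Lazy: scan_delta only builds a query plan; the same error is deferred until
# the plan is collected, e.g. via SQLContext.execute(..., eager=True) or collect().
lf = pl.scan_delta(f"s3://{s3_bucket}/data/gold/dim_jira_team", storage_options=s3_storage_options)
df_lazy = lf.collect()  # this is where the error shows up in the lazy case
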
To reiterate, this exact piece of code runs fine locally, in the same virtual environment, just outside of Streamlit. For this, I created a test.py file that has the same main() method but without any Streamlit dependency, and it runs without any errors, printing the resulting dataframe to the console.
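
That non-Streamlit test script looks roughly like this; I’m assuming here that the credentials come from environment variables instead of st.secrets (variable names are illustrative):

import os
import polars as pl

s3_storage_options = {
    "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
    "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
    "aws_region": "eu-central-1",
}
s3_bucket = os.environ["AWS_S3_BUCKET"]

def main():
    # Same logic as the Streamlit app, but printing to the console instead of st.write.
    polars_df_djt = pl.scan_delta(f"s3://{s3_bucket}/data/gold/dim_jira_team", storage_options=s3_storage_options)
    ctx = pl.SQLContext().register("dim_jira_team", polars_df_djt)
    df = ctx.execute("select * from dim_jira_team", eager=True)
    print(df)

if __name__ == "__main__":
    main()
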
When I look at the error message returned in the Streamlit app, what is striking is that the location it refers to seems to be missing the actual S3 bucket; it just points to a partial path starting at /data. What is also weird is that the file it refers to (part-00001-747373c8-281b-4f70-937d-358e2a4e121d-c000.zstd.parquet) indeed no longer exists in the store, as it is part of a stale version and the table has been vacuumed since. I can see it being referenced as added in my version 99 commit file, and version 99 is also the latest checkpoint. The latest version of the table is 157, so I’m assuming the file got removed in one of the versions between 99 and 157.
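
To double-check which snapshot should be resolved, the table can be inspected directly with the deltalake package (which scan_delta uses under the hood); a minimal sketch, reusing the bucket and storage options from above:

from deltalake import DeltaTable

dt = DeltaTable(f"s3://{s3_bucket}/data/gold/dim_jira_team", storage_options=s3_storage_options)
print(dt.version())    # latest resolved version (157 in my case)
print(dt.files()[:5])  # parquet files that belong to that latest snapshot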

As I’m quite a novice with polars, deltalake, and streamlit, I’m posting this here, but it might well be a polars issue. It still seems weird, though, that the exact same code runs fine outside of Streamlit, where it finds the correct files belonging to the latest version without any problem.
Has anyone out there tried this as well and found a solution or explanation?
Thanks :-).

After some more poking and trying to create a minimal reproducible example that does not rely on a cloud store, I found out that, when polars is run from a Streamlit app (and only then), it is not able to properly read a delta table that has more than one version. When there are multiple versions of the table in the folder, polars read_delta returns a dataframe that contains the data from all versions in the table folder, instead of just the data from the latest version. Outside of Streamlit, polars read_delta always returns a dataframe with data from a single version (when the version parameter is left empty, it returns the latest version).
In my case, the file it was trying to find was part of an older version and has since been removed from the table (and also actually deleted from storage).
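
A rough sketch of that local reproduction (no cloud store involved); the path and data are just illustrative, and writing requires the deltalake package to be installed:

import polars as pl
import streamlit as st

table_path = "C:/temp/repro_delta_table"  # any local folder works

# Create two versions of the same table; the files from version 0 stay
# on disk until the table is vacuumed.
pl.DataFrame({"id": [1, 2], "value": ["a", "b"]}).write_delta(table_path)                      # version 0
pl.DataFrame({"id": [1, 2], "value": ["a2", "b2"]}).write_delta(table_path, mode="overwrite")  # version 1

df = pl.read_delta(table_path)
st.write(df)  # inside Streamlit this showed rows from both versions; outside, only the latest two rows
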
I have reported this as an issue with polars; see Read_delta does not return latest version of delta table when called in streamlit app · Issue #20253 · pola-rs/polars · GitHub.

Fixed in deltalake 0.22.3.