Read files with dask bag using file_uploader

Summary

I would like to read a txt file into a dask bag (db) from st.file_uploader, but I cannot because I get an error about the delimiter:
TypeError: 'linedelimiter' is an invalid keyword argument for StringIO()

Steps to reproduce

Code snippet:

import dask.bag as db
import io
import streamlit as st

uploaded_file = st.file_uploader("Upload file")
file = db.read_text(io.StringIO(uploaded_file.getvalue().decode("windows-1252"),linedelimiter='\n'))



Expected behavior:

Read the uploaded file in a dask bag.
I am using dask because my file size is 2 GB.

Actual behavior:

EXCEPTION: Traceback (most recent call last):
File "C:\Users\andres\data_fuente_streamlit.py", line 172, in main
file = db.read_text(io.StringIO(uploaded_file.getvalue().decode("windows-1252"),linedelimiter='\n'))
TypeError: 'linedelimiter' is an invalid keyword argument for StringIO()

Debug info

  • Streamlit version: 1.16.0
  • Python version: 3.9.12
  • Using PipEnv
  • OS version: Windows 10
  • Browser version: Microsoft Edge Version 108.0.1462.54

Requirements file

altair==4.2.0
altgraph==0.17.3
asttokens==2.2.1
attrs==22.2.0
backcall==0.2.0
blinker==1.5
cachetools==5.2.0
certifi==2022.12.7
charset-normalizer==2.1.1
click==8.1.3
cloudpickle==2.2.0
colorama==0.4.6
comm==0.1.2
commonmark==0.9.1
cx-Oracle==8.3.0
dask==2022.12.0
debugpy==1.6.4
decorator==5.1.1
entrypoints==0.4
et-xmlfile==1.1.0
executing==1.2.0
fsspec==2022.11.0
future==0.18.2
gitdb==4.0.10
GitPython==3.1.30
idna==3.4
importlib-metadata==5.2.0
ipykernel==6.20.0
ipython==8.7.0
jedi==0.18.2
Jinja2==3.1.2
jsonschema==4.17.3
jupyter_client==7.4.8
jupyter_core==5.1.1
locket==1.0.0
MarkupSafe==2.1.1
matplotlib-inline==0.1.6
nest-asyncio==1.5.6
numpy==1.23.5
openpyxl==3.0.10
packaging==22.0
pandas==1.5.2
parso==0.8.3
partd==1.3.0
pefile==2022.5.30
pickleshare==0.7.5
Pillow==9.3.0
platformdirs==2.6.0
prompt-toolkit==3.0.36
protobuf==3.20.3
psutil==5.9.4
pure-eval==0.2.2
pyarrow==10.0.1
pydeck==0.8.0
Pygments==2.13.0
pyinstaller==5.7.0
pyinstaller-hooks-contrib==2022.14
Pympler==1.0.1
PyQt5==5.15.7
PyQt5-Qt5==5.15.2
PyQt5-sip==12.11.0
pyrsistent==0.19.3
python-dateutil==2.8.2
pytz==2022.6
pytz-deprecation-shim==0.1.0.post0
pywin32==305
pywin32-ctypes==0.2.0
PyYAML==6.0
pyzmq==24.0.1
requests==2.28.1
rich==12.6.0
semver==2.13.0
six==1.16.0
smmap==5.0.0
stack-data==0.6.2
streamlit==1.16.0
toml==0.10.2
toolz==0.12.0
tornado==6.2
traitlets==5.8.0
typing_extensions==4.4.0
tzdata==2022.7
tzlocal==4.2
urllib3==1.26.13
validators==0.20.0
watchdog==2.2.0
wcwidth==0.2.5
zipp==3.11.0

Additional information

This is just part of my code; the full code compares SQL tables with the content of the txt file. Some of the packages in the requirements file are not used in the Streamlit app, but I decided to include them because they are in my virtual environment.

It looks like you have a typo in your code and are passing linedelimiter to StringIO() instead of to db.read_text().

You may want this instead:

uploaded_file = st.file_uploader("Upload file")
if uploaded_file is not None:
  text = uploaded_file.getvalue().decode("windows-1252")
  file = db.read_text(io.StringIO(text), linedelimiter='\n')

Now I am getting this error :c

That appears to be because you are passing a StringIO object to read_text instead of a filename, which is what it expects. You can see an example of using NamedTemporaryFile elsewhere on this forum: Get path from file_uploader() - #17 by ennui
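
In case it helps, here is a rough sketch of that NamedTemporaryFile approach, not a definitive implementation. The names save_upload_to_temp_file and run_app are hypothetical helpers made up for this example; it assumes the upload fits in memory (getvalue() returns all of the bytes at once) and that the encoding is passed to db.read_text rather than decoding by hand.

```python
import os
import tempfile


def save_upload_to_temp_file(data: bytes, suffix: str = ".txt") -> str:
    """Write raw uploaded bytes to a temporary file and return its path.

    Hypothetical helper for this example; not part of Streamlit or Dask.
    """
    # delete=False keeps the file on disk after the handle is closed, so it
    # can be reopened by name (a still-open temp file cannot be on Windows).
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
    return tmp.name


def run_app() -> None:
    """Streamlit page body (hypothetical name); call it at the end of the script."""
    # Imported here so the helper above works without these packages installed.
    import dask.bag as db
    import streamlit as st

    uploaded_file = st.file_uploader("Upload file")
    if uploaded_file is None:
        return

    path = save_upload_to_temp_file(uploaded_file.getvalue())
    try:
        # read_text takes a path; encoding and linedelimiter are its own keywords.
        bag = db.read_text(path, encoding="windows-1252", linedelimiter="\n")
        # Dask is lazy, so compute the result before the file is removed.
        st.write("Lines read:", bag.count().compute())
    finally:
        os.remove(path)
```

To try it, add a run_app() call as the last line of the script and start it with streamlit run. The temporary file is removed only after compute() has finished, because the bag does not read anything until then.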