Hi Streamlit team and community!!
I'm struggling with an issue while loading an embeddings model and a Llama model.
Goal
My goal is simply to feed some text to a Llama model and get back a response based on a prompt template I have set up. This prompt template includes documents retrieved from a PGVector vector database, which is why the embeddings model is involved (and why I'm running into issues with it too).
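For context, this is roughly how the two models are meant to be wired together. It's only a simplified sketch, not my exact code: the connection string, collection name, prompt and the build_chain helper below are placeholders, and the exact LangChain import paths may differ depending on the version.

from langchain.vectorstores.pgvector import PGVector
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

def build_chain(llm, emb_model):
    # PGVector retrieves the documents that get injected into the prompt template
    vectordb = PGVector(
        connection_string="postgresql+psycopg2://user:password@host:5432/db",  # placeholder
        collection_name="my_documents",  # placeholder
        embedding_function=emb_model,
    )
    prompt = PromptTemplate(
        template="Use the following context to answer.\n{context}\n\nQuestion: {question}",
        input_variables=["context", "question"],
    )
    return RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectordb.as_retriever(),
        chain_type_kwargs={"prompt": prompt},
    )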
Problem
The problem is that when the app loads the models for the first time, instead of loading them once, it loads them twice, so my GPU runs out of memory and the deployment crashes before anything else happens.
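To confirm the double load, something like the following sketch is what I have in mind: a print inside the cached loader, assuming it shows up in the Docker logs. If the message appears twice during a single cold start, the function body really is being executed twice. (return_llm_debug is just a hypothetical, stripped-down loader for illustration.)

import streamlit as st
from langchain.llms import LlamaCpp  # import path may vary with the LangChain version

@st.cache_resource
def return_llm_debug(model_path):
    # Count how many times this body actually runs on a single container start
    print(f"Loading LlamaCpp model from {model_path}...", flush=True)
    return LlamaCpp(model_path=model_path, verbose=True)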
Answers to the generic questions
- I'm running the app inside a Docker container deployed on an AWS machine with a GPU, but the problem also occurs when I deploy it locally on my own machine. It happened in the past too, but the models were lighter back then, so the memory wasn't really affected by the duplication.
- Not applicable.
- Not applicable.
- The error message is:
  fepd_streamlit | llm_load_tensors: offloaded 43/43 layers to GPU
  fepd_streamlit | llm_load_tensors: VRAM used: 13023.85 MB
  fepd_streamlit | …
  fepd_streamlit exited with code 139
  I checked online and exit code 139 is a SIGSEGV signal, i.e. the container is shut down because of a memory access violation.
- Python version is 3.11.5, and I tried Streamlit versions 1.27.2 and 1.28.1 with the same results.
The code
Below I'll show the different versions of the code I tried while debugging this issue and trying to figure out what was happening (without success).
First approach
This is the one I thought was the right way to do it, based on the latest updates to Streamlit's cache management (hopefully I'm not too far off).
The first thing I did was create two different functions:
- Load the embeddings model
- Load the LlamaCpp model
Load embeddings model

# Imports assumed from the rest of the app (exact paths may vary with the LangChain version)
import streamlit as st
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.llms import LlamaCpp

@st.cache_resource
def return_embeddings_model():
    emb_model = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large", model_kwargs={"device": "cuda"})
    return emb_model

Load LlamaCpp model

@st.cache_resource
def return_llm(model_path, model_n_ctx, model_n_batch, model_n_threads, temperature, top_p, top_k, max_tokens, n_gpu_layers):
    llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_batch=model_n_batch, verbose=True, streaming=True, n_threads=model_n_threads, temperature=temperature, top_p=top_p, top_k=top_k, max_tokens=max_tokens, n_gpu_layers=n_gpu_layers)
    return llm
Once these functions were created, I assigned the models to two keys in session_state (to make sure the variables themselves were only created once):
if 'llm_model' not in st.session_state:
    load_dotenv()
    model_path = os.getenv('MODEL_PATH')
    model_n_ctx = os.getenv('MODEL_N_CTX')
    model_n_batch = os.getenv('MODEL_N_BATCH')
    model_n_threads = os.getenv('MODEL_N_THREADS')
    temperature = os.getenv('TEMPERATURE')
    top_p = os.getenv('TOP_P')
    top_k = os.getenv('TOP_K')
    max_tokens = os.getenv('MAX_TOKENS')
    n_gpu_layers = os.getenv('N_GPU_LAYERS')
    st.session_state['llm_model'] = return_llm(model_path, model_n_ctx, model_n_batch, model_n_threads, temperature, top_p, top_k, max_tokens, n_gpu_layers)

if 'emb_model' not in st.session_state:
    st.session_state['emb_model'] = return_embeddings_model()
Second approach
I'm not so happy with this one, but I thought it would do the job.
Basically, I just deleted the functions decorated with st.cache_resource and initialized the models directly in session_state.
if 'llm_model' not in st.session_state:
    load_dotenv()
    model_path = os.getenv('MODEL_PATH')
    model_n_ctx = os.getenv('MODEL_N_CTX')
    model_n_batch = os.getenv('MODEL_N_BATCH')
    model_n_threads = os.getenv('MODEL_N_THREADS')
    temperature = os.getenv('TEMPERATURE')
    top_p = os.getenv('TOP_P')
    top_k = os.getenv('TOP_K')
    max_tokens = os.getenv('MAX_TOKENS')
    n_gpu_layers = os.getenv('N_GPU_LAYERS')
    st.session_state['llm_model'] = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_batch=model_n_batch, verbose=True, streaming=True, n_threads=model_n_threads, temperature=temperature, top_p=top_p, top_k=top_k, max_tokens=max_tokens, n_gpu_layers=n_gpu_layers)

if 'emb_model' not in st.session_state:
    st.session_state['emb_model'] = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large", model_kwargs={"device": "cuda"})
Third approach
As you can expect, this is exactly the opposite: use only st.cache_resource instead of session_state.
This is by far the one I hate the most, because the variables get initialized on every rerun, and I don't like that.
@st.cache_resource
def return_embeddings_model():
    emb_model = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large", model_kwargs={"device": "cuda"})
    return emb_model

@st.cache_resource
def return_llm(model_path, model_n_ctx, model_n_batch, model_n_threads, temperature, top_p, top_k, max_tokens, n_gpu_layers):
    llm = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_batch=model_n_batch, verbose=True, streaming=True, n_threads=model_n_threads, temperature=temperature, top_p=top_p, top_k=top_k, max_tokens=max_tokens, n_gpu_layers=n_gpu_layers)
    return llm

load_dotenv()
model_path = os.getenv('MODEL_PATH')
model_n_ctx = os.getenv('MODEL_N_CTX')
model_n_batch = os.getenv('MODEL_N_BATCH')
model_n_threads = os.getenv('MODEL_N_THREADS')
temperature = os.getenv('TEMPERATURE')
top_p = os.getenv('TOP_P')
top_k = os.getenv('TOP_K')
max_tokens = os.getenv('MAX_TOKENS')
n_gpu_layers = os.getenv('N_GPU_LAYERS')
llm_model = return_llm(model_path, model_n_ctx, model_n_batch, model_n_threads, temperature, top_p, top_k, max_tokens, n_gpu_layers)
emb_model = return_embeddings_model()
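For completeness, the two models end up being consumed further down roughly like this (shown with the third approach's variables; a simplified, hypothetical snippet where build_chain is the placeholder helper sketched in the Goal section, not my real code):

qa_chain = build_chain(llm_model, emb_model)  # hypothetical helper from the Goal sketch

query = st.text_input("Ask a question about the documents")
if query:
    # The retriever calls the embeddings model, the chain then calls the LLM
    st.write(qa_chain.run(query))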
I hope you can help me figure out what's going on here!!
Thanks in advance for your support,
Jacob.