Embedding models and LlamaCPP models getting duplicated on load

Hi Streamlit team and community!!

I’m struggling with an issue while loading an embeddings model and a Llama model.

Goal

My goal is simply to send some text to a Llama model and get back a response based on a prompt template I have set up. The prompt template includes documents retrieved from a PGVector vector database, which is why I'm running into issues with the embeddings model too.
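
For context, the pipeline I'm trying to build looks roughly like this. This is a simplified sketch rather than my exact code: the PG_CONNECTION_STRING env var, the collection name and the user_question variable are placeholders, and emb_model / llm_model are the model instances loaded further down.

import os

from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.vectorstores.pgvector import PGVector

# Prompt template whose {context} slot gets filled with the retrieved documents
prompt = PromptTemplate(
    template="Answer using only this context:\n{context}\n\nQuestion: {question}",
    input_variables=["context", "question"],
)

# Vector store backed by PGVector (connection string / collection name are placeholders)
db = PGVector(
    connection_string=os.getenv("PG_CONNECTION_STRING"),
    collection_name="docs",
    embedding_function=emb_model,   # the HuggingFaceInstructEmbeddings instance
)

# Retrieval chain that stuffs the retrieved docs and the question into the prompt
qa = RetrievalQA.from_chain_type(
    llm=llm_model,                  # the LlamaCpp instance
    chain_type="stuff",
    retriever=db.as_retriever(),
    chain_type_kwargs={"prompt": prompt},
)

answer = qa.run(user_question)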

Problem

The problem is that when the models are loaded for the first time, instead of each being loaded once, they get loaded twice. My GPU then runs out of memory, which stops the deployment before anything else happens.

Answers to the generic questions

  1. The app runs inside a Docker container deployed on an AWS machine with a GPU, but the problem also occurs when I deploy it locally on my own machine. It happened in the past too, but the models were lighter back then, so the duplication didn't really affect memory.

  2. Not applicable.

  3. Not applicable.

  4. The error message is:

    fepd_streamlit | llm_load_tensors: offloaded 43/43 layers to GPU
    fepd_streamlit | llm_load_tensors: VRAM used: 13023.85 MB
    fepd_streamlit | …fepd_streamlit exited with code 139

    From what I found online, exit code 139 corresponds to a SIGSEGV signal: the container is killed because of a memory access violation.

  5. Python version is 3.11.5; I tried Streamlit 1.27.2 and 1.28.1 with the same results.

The code

Below are the different versions of the code I tried while debugging this issue and figuring out what was happening (without success so far).

First approach

This is the approach I thought was the best way to do it, based on the latest updates to Streamlit's cache management (hopefully I'm not too far off).

The first thing I did was create two functions:

  1. Load embeddings model
  2. Load LlamaCPP model

Load embeddings model

(Imports are shown once here; they apply to all the snippets below.)

import os

import streamlit as st
from dotenv import load_dotenv
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.llms import LlamaCpp


@st.cache_resource
def return_embeddings_model():
    # Cached once per process and shared across sessions and reruns
    return HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large",
                                         model_kwargs={"device": "cuda"})

Load LlamaCPP model

@st.cache_resource
def return_llm(model_path, model_n_ctx, model_n_batch, model_n_threads,
               temperature, top_p, top_k, max_tokens, n_gpu_layers):
    # Cached once per unique combination of arguments per process
    return LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_batch=model_n_batch,
                    n_threads=model_n_threads, temperature=temperature, top_p=top_p,
                    top_k=top_k, max_tokens=max_tokens, n_gpu_layers=n_gpu_layers,
                    verbose=True, streaming=True)

Once these functions were created, I assigned the models to two entries in session_state (to make sure the variables themselves were only created once per session):

if 'llm_model' not in st.session_state:
    load_dotenv()

    model_path = os.getenv('MODEL_PATH')
    model_n_ctx = os.getenv('MODEL_N_CTX')
    model_n_batch = os.getenv('MODEL_N_BATCH')
    model_n_threads = os.getenv('MODEL_N_THREADS')
    temperature = os.getenv('TEMPERATURE')
    top_p = os.getenv('TOP_P')
    top_k = os.getenv('TOP_K')
    max_tokens = os.getenv('MAX_TOKENS')
    n_gpu_layers = os.getenv('N_GPU_LAYERS')

    st.session_state['llm_model'] = return_llm(model_path, model_n_ctx, model_n_batch,
                                               model_n_threads, temperature, top_p,
                                               top_k, max_tokens, n_gpu_layers)

if 'emb_model' not in st.session_state:
    st.session_state['emb_model'] = return_embeddings_model()

Second approach

I'm not so happy with this one, but I thought it would be good enough to work.

Basically, I removed the functions decorated with st.cache_resource and initialized the models directly in session_state.

if 'llm_model' not in st.session_state:
    load_dotenv()

    model_path = os.getenv('MODEL_PATH')
    model_n_ctx = os.getenv('MODEL_N_CTX')
    model_n_batch = os.getenv('MODEL_N_BATCH')
    model_n_threads = os.getenv('MODEL_N_THREADS')
    temperature = os.getenv('TEMPERATURE')
    top_p = os.getenv('TOP_P')
    top_k = os.getenv('TOP_K')
    max_tokens = os.getenv('MAX_TOKENS')
    n_gpu_layers = os.getenv('N_GPU_LAYERS')

    st.session_state['llm_model'] = LlamaCpp(model_path=model_path, n_ctx=model_n_ctx,
                                             n_batch=model_n_batch, n_threads=model_n_threads,
                                             temperature=temperature, top_p=top_p, top_k=top_k,
                                             max_tokens=max_tokens, n_gpu_layers=n_gpu_layers,
                                             verbose=True, streaming=True)

if 'emb_model' not in st.session_state:
    st.session_state['emb_model'] = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large",
                                                                  model_kwargs={"device": "cuda"})

Third approach

As you might expect, this is exactly the opposite: use only st.cache_resource instead of session_state.

This is by far the one I like the least, because the variables get assigned again on every rerun, and I don't like that.

@st.cache_resource
def return_embeddings_model():
    return HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large",
                                         model_kwargs={"device": "cuda"})


@st.cache_resource
def return_llm(model_path, model_n_ctx, model_n_batch, model_n_threads,
               temperature, top_p, top_k, max_tokens, n_gpu_layers):
    return LlamaCpp(model_path=model_path, n_ctx=model_n_ctx, n_batch=model_n_batch,
                    n_threads=model_n_threads, temperature=temperature, top_p=top_p,
                    top_k=top_k, max_tokens=max_tokens, n_gpu_layers=n_gpu_layers,
                    verbose=True, streaming=True)

load_dotenv()

model_path = os.getenv('MODEL_PATH')
model_n_ctx = os.getenv('MODEL_N_CTX')
model_n_batch = os.getenv('MODEL_N_BATCH')
model_n_threads = os.getenv('MODEL_N_THREADS')
temperature = os.getenv('TEMPERATURE')
top_p = os.getenv('TOP_P')
top_k = os.getenv('TOP_K')
max_tokens = os.getenv('MAX_TOKENS')
n_gpu_layers = os.getenv('N_GPU_LAYERS')

llm_model = return_llm(model_path, model_n_ctx, model_n_batch, model_n_threads,
                       temperature, top_p, top_k, max_tokens, n_gpu_layers)
emb_model = return_embeddings_model()

I hope you can help me find what’s going on here!!

Thanks in advance for your support,

Jacob.

Which version of Streamlit are you using?

Hi Joseph!

Thanks for your question.

I mentioned in the post that I was jumping between 1.27.2 and 1.28.1, but if there's any version you'd suggest trying, just tell me and I'll report back with the results!

Thanks in advance.

Kind regards,

The reason I ask is that I've had issues with my own code since upgrading to 1.28, due to Streamlit running a second time when I would not have expected it.

Your script is loading the models into memory twice when you would expect that particular segment of code to run only once.

In my case, the submit button would get pressed, the entire program would run, and then the button state would still be True, so the code inside the if statement ran again (after which it became False). This is not how my code behaved in 1.27.

Anyway, the point is that I think your code is rerunning. If you haven't tried 1.27.x, I would do that first; if it's happening with both versions, then it's unrelated to what was happening to me.

If you have access to the logs, maybe go old school and add a bunch of print statements so you can see which parts are and aren't running; it's probably different from what you're expecting.
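
Even something as simple as this, at the top of the script and inside the cached functions, would show the rerun pattern in the container logs (just an illustration, using your embeddings function as an example):

import datetime

print(f"--- script (re)run at {datetime.datetime.now()} ---", flush=True)

@st.cache_resource
def return_embeddings_model():
    # If this line prints more than once, the cached function itself is re-executing
    print("return_embeddings_model: actually loading the model", flush=True)
    return HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-large",
                                         model_kwargs={"device": "cuda"})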

Then, once you know the sequence, you can add a session flag that starts as False and becomes True on the first run, as a way of controlling the flow, if it turns out to be similar to what happened to me. See the sketch below.
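
Something along these lines is what I mean, reusing your own return_llm / return_embeddings_model functions (just a sketch, the flag name is made up):

if "models_loaded" not in st.session_state:
    st.session_state["models_loaded"] = False

if not st.session_state["models_loaded"]:
    print("First run for this session: loading models", flush=True)
    st.session_state["emb_model"] = return_embeddings_model()
    st.session_state["llm_model"] = return_llm(model_path, model_n_ctx, model_n_batch,
                                               model_n_threads, temperature, top_p,
                                               top_k, max_tokens, n_gpu_layers)
    st.session_state["models_loaded"] = True
else:
    print("Rerun: models already loaded for this session, skipping", flush=True)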

Hi Joseph,

Thanks for your response.

The version ended up not being the problem at all (I tried many different ones), but I just figured out what was going on.

My initial third approach turned out to be the one that prevents the models from being loaded more than once: with st.cache_resource the model is loaded only once per process, regardless of the user's session.

That said, I'm still having issues with st.cache_resource taking down my containers when trying to load llama-2-13b-chat.Q8_0.gguf.

I tried the same model and the same function with Chainlit instead of Streamlit (without any caching) and it worked smoothly.

I would love to get that same smoothness with Streamlit, since I need my users to input a lot of data, which wouldn't really be possible with Chainlit.

Thanks in advance for any help you can provide.