Does anyone have experience with Google's Gemini Pro Vision?

Hi,

Specifically, I’m asking about using the Google Vertex AI SDK with the Gemini Pro Vision generative model. I’ve built a solution for a client that does image object localization, which was straightforward using Google Cloud Vision. I recently tried to do something similar with Google’s newer multimodal model gemini-1.5-pro-preview-0409, sending it an image and a detailed prompt to identify objects and their bounding boxes.

I’m using these imports:

import json  # the snippet below parses the model's JSON response
import streamlit as st  # the snippet below uses st.spinner

import vertexai
import vertexai.generative_models as genai

I’m calling the model as simply as this:

# Code snippet

vertexai.init(project="MY PROJECT", location="us-east1")
MODEL = genai.GenerativeModel('gemini-1.5-pro-preview-0409')

def generate_vertexai_response(image_file: str, prompt: str) -> dict:
    prompt_prologue = "You are an expert in fashion and clothing. You have been asked to identify objects in an image."

    genai_image = genai.Image.load_from_file(image_file)
    with st.spinner("Generating Gemini result..."):
        if prompt:
            response = MODEL.generate_content([genai_image, f"{prompt_prologue}\n\n{prompt}"])
        else:
            response = MODEL.generate_content([genai_image, f"{prompt_prologue}\n\n{DEFAULT_PROMPT}"])

    # find_json_string and DEFAULT_PROMPT are defined elsewhere in my app
    resp_json_str = find_json_string(response.text)
    resp_json = json.loads(resp_json_str)
    return resp_json
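For completeness, `find_json_string` is a small helper in my app that pulls the JSON object out of the model's text; roughly something like this (a sketch, assuming the model sometimes wraps the JSON in markdown fences despite the instructions):

```python
import re

def find_json_string(text: str) -> str:
    """Extract the first JSON object from model output.

    Sketch: strips any ```json fences the model adds, then returns the
    substring from the first '{' to the last '}'.
    """
    text = re.sub(r"```(?:json)?", "", text)
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in model response")
    return text[start:end + 1]
```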

The prompt looks like this:

"""
    Identify as many objects as possible in this image.

    INSTRUCTIONS:

    - Objects include: any clothing items, jewelry and accessories, and footwear.
    - Each object annotation should be given a LABEL name, in lowercase.
    - Find the bounding box NORMALIZED_VERTICES for each object.
    - Report the NORMALIZED_VERTICES in the order of top-left, top-right, bottom-right, bottom-left.
    - The NORMALIZED_VERTICES should be a list of 4 pairs of float values between 0 and 1, representing 
        [[top_left_x, top_left_y], [top_right_x, top_right_y], [bottom_right_x, bottom_right_y], [bottom_left_x, bottom_left_y]].
        For example: [[0.24609375, 0.671875], [0.64453125, 0.671875], [0.64453125, 0.9296875], [0.24609375, 0.9296875]].
    - Report the IMAGE_PROPERTIES for the image, for example, the dominant colors, and width and height. 
        For example: {"dominant_colors": ["red", "white", "black"], "width": 1024, "height": 768}.
    - If an object is partially visible, not visible or not clear, you can skip it.
    - Your response should be a valid JSON object string containing each object label and its NORMALIZED_VERTICES.
    - Do not add any unnecessary markdown markup in your response.
    - The required JSON format is shown below:

        { "image_properties": IMAGE_PROPERTIES, \
            "objects": [ \
            {"label": LABEL, "normalized_vertices": NORMALIZED_VERTICES}, \
            {"label": LABEL, "normalized_vertices": NORMALIZED_VERTICES}, \
            {...}, ...] }
"""

The objects detected are correct, but the bounding box results (NORMALIZED_VERTICES) I’m getting are pretty rubbish.
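To sanity-check the boxes, I scale the normalized vertices back to pixel coordinates and draw them on the image; a rough sketch using Pillow (the field names follow the JSON format from my prompt):

```python
from PIL import Image, ImageDraw

def draw_boxes(image_path: str, resp_json: dict, out_path: str) -> None:
    """Draw each object's normalized-vertex polygon on the image."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    draw = ImageDraw.Draw(img)
    for obj in resp_json.get("objects", []):
        # Scale [0, 1] vertices to pixel coordinates
        pts = [(x * w, y * h) for x, y in obj["normalized_vertices"]]
        draw.polygon(pts, outline="red")
        draw.text(pts[0], obj["label"], fill="red")
    img.save(out_path)
```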

If you’ve been able to make genai.GenerativeModel work correctly for this use case, please let me know; I’d be happy to jump on a call with you.

(My progress is not blocked, as I’m using Google Cloud Vision instead, but I’d love to use the generative model for object localization and other things too.)

Thanks,
Arvindra

(Screenshot: processing results displayed in Streamlit)

Google Cloud Vision :white_check_mark:

Google Vertex AI generative model :negative_squared_cross_mark:


Hi brother,

Any update?

None… perhaps this forum isn’t the right place to ask this kind of question. I’m using Google Cloud Vision for localization. It’s much, much faster anyway, especially if you resize the images as recommended in their docs.
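For reference, a rough sketch of the Cloud Vision route with the resize step (the 640px target and helper names are illustrative; check the Cloud Vision docs for their recommended size):

```python
import io
from PIL import Image

def downscale_jpeg(image_path: str, max_dim: int = 640) -> bytes:
    """Return JPEG bytes with neither side larger than max_dim."""
    img = Image.open(image_path).convert("RGB")
    img.thumbnail((max_dim, max_dim))  # shrinks in place, preserving aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    return buf.getvalue()

def localize_objects(image_path: str):
    # Imported here so the resize helper is usable without the GCP SDK installed
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    image = vision.Image(content=downscale_jpeg(image_path))
    response = client.object_localization(image=image)
    # Each annotation carries a name, a confidence score, and
    # bounding_poly.normalized_vertices (the same [0, 1] coordinate space)
    return response.localized_object_annotations
```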