Does anyone have experience with Google's Gemini Pro Vision?

Hi,

Specifically, I’m asking about using the Google Vertex AI SDK with the Gemini Pro Vision generative model. I’ve built a solution for a client that does image object localization, which was straightforward using Google Cloud Vision. I recently tried to do something similar with Google’s newer multimodal model gemini-1.5-pro-preview-0409, sending it an image and a detailed prompt to identify objects and their bounding boxes.

I’m using these imports:

import json  # the snippet below parses the model's JSON response
import streamlit as st  # the snippet below uses st.spinner

import vertexai
import vertexai.generative_models as genai

I’m calling the model as simply as this:

# Code snippet

vertexai.init(project="MY PROJECT", location="us-east1")
MODEL = genai.GenerativeModel('gemini-1.5-pro-preview-0409')

def generate_vertexai_response(image_file: str, prompt: str) -> dict:
    prompt_prologue = "You are an expert in fashion and clothing. You have been asked to identify objects in an image."

    genai_image = genai.Image.load_from_file(image_file)
    with st.spinner("Generating Gemini result..."):
        if prompt:
            response = MODEL.generate_content([genai_image, f"{prompt_prologue}\n\n{prompt}"])
        else:
            response = MODEL.generate_content([genai_image, f"{prompt_prologue}\n\n{DEFAULT_PROMPT}"])

    # find_json_string and DEFAULT_PROMPT are defined elsewhere in my app
    resp_json_str = find_json_string(response.text)
    resp_json = json.loads(resp_json_str)
    return resp_json
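For completeness, `find_json_string` is a small helper in my app that pulls the JSON object out of the model's text; roughly something like this (a sketch, assuming the model sometimes wraps the JSON in markdown fences despite the instructions):

```python
import re

def find_json_string(text: str) -> str:
    """Extract the first JSON object from model output.

    Sketch: strips any ```json fences the model adds, then returns the
    substring from the first '{' to the last '}'.
    """
    text = re.sub(r"```(?:json)?", "", text)
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in model response")
    return text[start:end + 1]
```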

The prompt looks like this:

"""
    Identify as many objects as possible in this image.

    INSTRUCTIONS:

    - Objects include: any clothing items, jewelry and accessories, and footwear.
    - Each object annotation should be given a LABEL name, in lowercase.
    - Find the bounding box NORMALIZED_VERTICES for each object.
    - Report the NORMALIZED_VERTICES in the order of top-left, top-right, bottom-right, bottom-left.
    - The NORMALIZED_VERTICES should be a list of 4 pairs of float values between 0 and 1, representing 
        [[top_left_x, top_left_y], [top_right_x, top_right_y], [bottom_right_x, bottom_right_y], [bottom_left_x, bottom_left_y]].
        For example: [[0.24609375, 0.671875], [0.64453125, 0.671875], [0.64453125, 0.9296875], [0.24609375, 0.9296875]].
    - Report the IMAGE_PROPERTIES for the image, for example, the dominant colors, and width and height. 
        For example: {"dominant_colors": ["red", "white", "black"], "width": 1024, "height": 768}.
    - If an object is partially visible, not visible or not clear, you can skip it.
    - Your response should be a valid JSON object string containing each object label and its NORMALIZED_VERTICES.
    - Do not add any unnecessary markdown markup in your response.
    - The required JSON format is shown below:

        { "image_properties": IMAGE_PROPERTIES, \
            "objects": [ \
            {"label": LABEL, "normalized_vertices": NORMALIZED_VERTICES}, \
            {"label": LABEL, "normalized_vertices": NORMALIZED_VERTICES}, \
            {...}, ...] }
"""

The objects detected are correct, but the bounding box results (NORMALIZED_VERTICES) I’m getting are pretty rubbish.
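To sanity-check the boxes, I scale the normalized vertices back to pixel coordinates and draw them on the image; a rough sketch using Pillow (the field names follow the JSON format from my prompt):

```python
from PIL import Image, ImageDraw

def draw_boxes(image_path: str, resp_json: dict, out_path: str) -> None:
    """Draw each object's normalized-vertex polygon on the image."""
    img = Image.open(image_path).convert("RGB")
    w, h = img.size
    draw = ImageDraw.Draw(img)
    for obj in resp_json.get("objects", []):
        # Scale [0, 1] vertices to pixel coordinates
        pts = [(x * w, y * h) for x, y in obj["normalized_vertices"]]
        draw.polygon(pts, outline="red")
        draw.text(pts[0], obj["label"], fill="red")
    img.save(out_path)
```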

If you’ve been able to make genai.GenerativeModel work correctly for this use case, please let me know; I’d be happy to jump on a call with you.

(My progress is not blocked, as I’m using Google Cloud Vision instead, but I’d love to use the generative model for object localization and other things too.)

Thanks,
Arvindra

(Screenshot: processing results displayed in Streamlit)

Google Cloud Vision :white_check_mark:

Google Vertex AI generative model :negative_squared_cross_mark:


Hi brother,

Any update?

None… perhaps this forum isn’t the right place to ask this kind of question. I’m using Google Cloud Vision for localization. It’s much, much faster anyway, especially if you resize the images as recommended in their docs.
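For reference, a rough sketch of the Cloud Vision route with the resize step (the 640px target and helper names are illustrative; check the Cloud Vision docs for their recommended size):

```python
import io
from PIL import Image

def downscale_jpeg(image_path: str, max_dim: int = 640) -> bytes:
    """Return JPEG bytes with neither side larger than max_dim."""
    img = Image.open(image_path).convert("RGB")
    img.thumbnail((max_dim, max_dim))  # shrinks in place, preserving aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="JPEG")
    return buf.getvalue()

def localize_objects(image_path: str):
    # Imported here so the resize helper is usable without the GCP SDK installed
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    image = vision.Image(content=downscale_jpeg(image_path))
    response = client.object_localization(image=image)
    # Each annotation carries a name, a confidence score, and
    # bounding_poly.normalized_vertices (the same [0, 1] coordinate space)
    return response.localized_object_annotations
```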