Hi,
Specifically, I’m asking about using the Google Vertex AI SDK with the Gemini Pro Vision generative model. I’ve built a solution for a client that does image object localization, which was straightforward with Google Cloud Vision. Recently I tried to do something similar with Google’s newer multimodal model gemini-1.5-pro-preview-0409, sending it an image and a detailed prompt asking it to identify objects and their bounding boxes.
I’m using these imports (json and streamlit are also needed by the snippet below):
import json
import streamlit as st
import vertexai
import vertexai.generative_models as genai
I’m calling the model as simply as this (DEFAULT_PROMPT is the prompt shown further down, and find_json_string is a small helper sketched after the snippet):
# Code snippet
vertexai.init(project="MY PROJECT", location="us-east1")
MODEL = genai.GenerativeModel('gemini-1.5-pro-preview-0409')

def generate_vertexai_response(image_file: str, prompt: str) -> dict:
    prompt_prologue = "You are an expert in fashion and clothing. You have been asked to identify objects in an image."
    genai_image = genai.Image.load_from_file(image_file)
    with st.spinner("Generating Gemini result..."):
        if prompt:
            response = MODEL.generate_content([genai_image, f"{prompt_prologue}\n\n{prompt}"])
        else:
            response = MODEL.generate_content([genai_image, f"{prompt_prologue}\n\n{DEFAULT_PROMPT}"])
    # Extract the JSON object from the raw response text and parse it into a dict
    resp_json_str = find_json_string(response.text)
    resp_json = json.loads(resp_json_str)
    return resp_json
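For reference, find_json_string is just a small helper that pulls the JSON object out of the raw response text, since the model sometimes wraps it in markdown fences despite the instructions. A minimal version looks something like this:

def find_json_string(text: str) -> str:
    """Extract the outermost JSON object from the model's raw text."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model response")
    return text[start:end + 1]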
The prompt looks like this:
"""
Identify as many objects as possible in this image.
INSTRUCTIONS:
- Objects include: any clothing items, jewelry and accessories, and footwear.
- Each object annotation should be given a LABEL name, in lowercase.
- Find the bounding box NORMALIZED_VERTICES for each object.
- Report the NORMALIZED_VERTICES in the order of top-left, top-right, bottom-right, bottom-left.
- The NORMALIZED_VERTICES should be a list of 4 pairs of float values between 0 and 1, representing
[[top_left_x, top_left_y], [top_right_x, top_right_y], [bottom_right_x, bottom_right_y], [bottom_left_x, bottom_left_y]].
For example: [[0.24609375, 0.671875], [0.64453125, 0.671875], [0.64453125, 0.9296875], [0.24609375, 0.9296875]].
- Report the IMAGE_PROPERTIES for the image, for example, the dominant colors, and width and height.
For example: {"dominant_colors": ["red", "white", "black"], "width": 1024, "height": 768}.
- If an object is partially visible, not visible or not clear, you can skip it.
- Your response should be a valid JSON object string containing each object label and its NORMALIZED_VERTICES.
- Do not add any unnecessary markdown markup in your response.
- The required JSON format is shown below:
{ "image_properties": IMAGE_PROPERTIES, \
"objects": [ \
{"label": LABEL, "normalized_vertices": NORMALIZED_VERTICES}, \
{"label": LABEL, "normalized_vertices": NORMALIZED_VERTICES}, \
{...}, ...] }
"""
The objects detected are correct, but the bounding box results (NORMALIZED_VERTICES) I’m getting are pretty poor: the boxes rarely line up with the actual objects.
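In case anyone wants to reproduce the check, here is roughly how I eyeball the results: overlay the returned vertices on the image with Pillow (this is a hypothetical helper, assuming the dict returned by generate_vertexai_response above):

from PIL import Image, ImageDraw

def draw_boxes(image_file: str, resp_json: dict, out_file: str = "boxes.png") -> None:
    """Overlay the model's normalized vertices on the image to check accuracy."""
    img = Image.open(image_file).convert("RGB")
    w, h = img.size
    draw = ImageDraw.Draw(img)
    for obj in resp_json.get("objects", []):
        # Scale each normalized (x, y) pair back to pixel coordinates
        pts = [(x * w, y * h) for x, y in obj["normalized_vertices"]]
        draw.line(pts + [pts[0]], fill="red", width=3)  # close the polygon
        draw.text(pts[0], obj["label"], fill="red")
    img.save(out_file)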
If you’ve been able to make genai.GenerativeModel work correctly for this use case, please let me know; I’d be happy to jump on a call with you.
(My progress is not blocked as I’m using Google Cloud Vision instead, but would love to use the generative model for object localization and other things too.)
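For comparison, the Cloud Vision call that does work reliably for me is the standard object localization API; a sketch of the call (my actual code differs slightly):

from google.cloud import vision

def cloud_vision_objects(image_file: str) -> list:
    """Localize objects with the Cloud Vision API for comparison."""
    client = vision.ImageAnnotatorClient()
    with open(image_file, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.object_localization(image=image)
    # Each annotation carries a name, confidence score, and normalized bounding poly
    return [
        (obj.name, obj.score, [(v.x, v.y) for v in obj.bounding_poly.normalized_vertices])
        for obj in response.localized_object_annotations
    ]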
Thanks,
Arvindra
[Screenshots: processing results displayed in Streamlit, comparing Google Cloud Vision with the Google Vertex AI generative model]