Part 2: Hands-on Experience with Gemini - Unlocking Multimodal Magic

Initial Impressions

The second interview with Jason Quek, CTO, and Tristan Van Thielen, Head of Machine Learning, provides a fascinating insight into their hands-on experience with Gemini, Google’s latest AI language model. Initial impressions highlight the seamless integration of Gemini with the Google Cloud platform, leveraging familiar tools like service accounts and single sign-on. It is very beneficial to have the entire Google Cloud experience integrated into Gemini.

Real-World Applications

In a noteworthy use case, is the application of Gemini in retail intelligence. The scenario involves capturing images of SIM card plans from various retailers distributed across numerous stalls. Traditionally, this process required sending individuals to take pictures of plans, such as those offering five gigabytes for 50 Euros, with different pricing plans available. The conventional method involved using Optical Character Recognition (OCR) technology, followed by the integration with custom software to match prices with specific plans. However, with Gemini, a transformative shift occurred. The team experimented with Gemini’s capabilities and discovered its proficiency in understanding context. This meant that Gemini could discern the placement of prices in relation to products, providing a comprehensive response in a single iteration.

A noteworthy enhancement emerged in the seamless integration of image and text data. This stands out as a considerable advancement, particularly in contrast to the complexities associated with video processing in traditional methods. Previously, describing an image necessitated the involvement of a model like Imogen, followed by the translation of text through a separate model. However, Gemini streamlines this process by consolidating image and text data into a single prompt, efficiently managed by a high-quality machine-learning model. Gemini’s impact on combining image and text data, citing examples like virtual trials in retail and potential applications in manufacturing, where it can streamline processes and save time for technicians.

Tristan Van Thielen Devoteam G Cloud ML Tribe Lead

Comparing Gemini to Its Competitors

A notable aspect of Gemini is its superiority over competitors, such as ChatGPT, in handling multimodal inputs. Gemini’s capability to process both text and images in a single prompt is a feature lacking in some of its counterparts.

Gemini’s low inference time for image and video data is a significant advantage that enhances its usability in real-time conversations.

Addressing Challenges and Potential Limitations

The challenge associated with AI models is the importance of responsible use, which emphasises the need for filtering prompts and adopting a rule-based approach as an initial step to ensure ethical usage.

Google’s commitment to responsible AI, mentioning the responsible AI filter that flags potentially offensive or discriminatory output is great. The challenges in constraining models and prompts effectively.

In a retail context, deploying a chatbot introduces a challenge when users divert from the intended purpose. For instance, inquiries about unrelated topics incur a cost for the business. To address this, establishing clear boundaries and defining the chatbot’s specific task becomes crucial. Setting constraints ensures user interactions align with the intended scope, preventing misuse and keeping the chatbot focused on the retailer’s objectives. This management is essential for optimising user engagement and maintaining the chatbot’s effectiveness.

Future Outlook and Predictions

Anticipating the future of AI language models, particularly Gemini, there is a shared sense of optimism expressed by experts in the field. The flexibility in deployment options, including edge deployment and customisable model sizes, stands out as a key feature, hinting at a future where the emphasis is on utility and customisation rather than sheer model size. In addition, the concept of LLMOps, introduced by one of the experts, brings attention to the operational aspects of large language models. This includes advancements in managing prompts and knowledge bases and implementing ongoing monitoring for quality improvement. The vision painted by these insights suggests a dynamic future for AI language models, emphasising practicality, customisation, and efficient operational management.

Positive Impact on Society and Businesses

The intriguing possibilities, particularly in the realm of accessibility understanding. Noting current challenges for individuals with visual impairments, Gemini’s role in converting image content to text potentially enhances aspects of life like Sensory Processing Disorder (SPD).

The essence of time-saving in professions where efficiency is paramount. There are benefits for medical professionals, first responders, and public servants, foreseeing streamlined access to vital information. The cumulative impact of these efficiency gains, predicts a collective liberation of time globally.

Your future with AI? Experience the possibilities with a Gen AI Hackathon.

Talk to our experts