
Vision Recognition in Action: How AI "Sees" the World

  • sheharav
  • Jul 2
  • 3 min read

One of the most compelling capabilities of AI is how it interprets visual information. Computer vision encompasses facial recognition, object and label detection, optical character recognition (OCR), and contextual image understanding, all of which can work together in real time.


In this post, I’ll walk you through a practical, hands-on experiment using Google Cloud Vision and Microsoft Azure’s Computer Vision tools. These no-code experiments let you upload an image and instantly get back a detailed AI interpretation, making them a great way to understand how machines “see.”


Computer Vision is a core capability of artificial intelligence, enabling machines to:

  • Identify objects and faces

  • Detect emotions and text

  • Understand scenes and relationships

  • Label content for search, categorization, and automation

These capabilities underpin many real-world applications in healthcare, security, retail, education, and more.


What This Lab Demonstrates

By uploading an image and seeing the AI-generated output, you are:

  • Interacting with pre-trained convolutional neural networks

  • Observing supervised learning in action (labels learned from human-labeled data)

  • Understanding the importance of data quality and diversity

  • Learning about confidence scores and what makes models “sure” or “uncertain”

  • Comparing multimodal capabilities: label recognition, text extraction, object detection


For example, Google’s Cloud Vision model can identify a cat in an image by recognizing patterns such as fur texture, ear shape, and common color gradients, while Azure’s captioning tool can generate the phrase “a group of people playing soccer in a park” from a single frame, combining object, activity, and scene recognition.
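

To make confidence scores concrete, here’s an illustrative sketch of the kind of label output the Vision demo produces for the cat example, written as a Python dictionary. The labels and scores below are invented for illustration, not actual API output:

    # Illustrative shape of a label-detection result (values invented).
    # Each label carries a confidence score between 0 and 1.
    label_result = {
        "labelAnnotations": [
            {"description": "Cat",       "score": 0.98},  # very sure
            {"description": "Whiskers",  "score": 0.91},
            {"description": "Felidae",   "score": 0.87},
            {"description": "Carnivore", "score": 0.62},  # less sure
        ]
    }

Higher scores mean the model found stronger evidence for that label; a downstream application might keep only labels above some threshold.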


What Are Convolutional Neural Networks (CNNs)?

Convolutional Neural Networks (CNNs) are a type of deep learning algorithm specifically designed to process and analyze visual data. They use layers of convolutional filters to scan images, detecting features like edges, textures, and shapes. These features are then combined to recognize complex patterns and classify the content of images. CNNs are the backbone of modern computer vision systems and are trained on large labeled datasets using supervised learning.
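

The cloud tools hide these internals, but a tiny sketch helps show the layered idea. Here’s a minimal, deliberately toy CNN in PyTorch (my own illustration, not the architecture Google or Microsoft actually use):

    import torch
    import torch.nn as nn

    # Toy CNN: convolutions detect edges/textures, pooling shrinks the
    # feature maps, and a final linear layer classifies the image.
    class TinyCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level edges
                nn.ReLU(),
                nn.MaxPool2d(2),                              # 224 -> 112
                nn.Conv2d(16, 32, kernel_size=3, padding=1),  # textures, shapes
                nn.ReLU(),
                nn.MaxPool2d(2),                              # 112 -> 56
            )
            self.classifier = nn.Linear(32 * 56 * 56, num_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    # One random "image" (RGB, 224x224) produces one score per class.
    logits = TinyCNN()(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 10])

Production models behind services like Cloud Vision are vastly deeper and trained on millions of labeled images, but the principle is the same: stacked filters turning raw pixels into increasingly abstract features.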


Real-World Case Study: AI for Accessibility

Microsoft’s Seeing AI app uses computer vision to help blind or low-vision users navigate the world. It can read text aloud, recognize faces and emotions, and describe scenes using the same underlying vision APIs showcased in this post. This shows how AI can create deeply human-centered applications that extend independence and dignity.


These tools provide a useful demonstration of what vision models can currently achieve, and of where human oversight still plays a role. They’re also an engaging way to start “seeing like an AI.”


Vision Lab

Try It Yourself: AI Image Analysis in 3 Steps

  1. Go to the Google or Microsoft tool (both are linked below)

  2. Upload your chosen image (e.g., a selfie, family photo, or product shot)

  3. Review the AI’s predictions and consider:

    • What does it get right?

    • Where does it struggle?

    • What can you do with this output?


Tool 1: Google Cloud Vision Drag-and-Drop

Try it here →


What it does:

  • Upload or drag an image

  • Returns:

    • Labels (e.g., “Cat,” “Furniture,” “Table”)

    • Object detection (bounding boxes on recognized items)

    • Face detection (with emotional state estimations)

    • Text detection (OCR) — even handwritten

    • SafeSearch (flags inappropriate content)


What it teaches:

  • Computer vision fundamentals

  • Classification vs detection

  • Confidence scoring and thresholds

  • Multi-model outputs (e.g., labels + faces + text)

This tool uses Google’s pre-trained deep learning models. It’s fast, free, and visually intuitive — perfect for learning and experimentation.
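

If you outgrow the drag-and-drop page, the same models are callable from code. Here’s a minimal sketch using the official google-cloud-vision Python client; it assumes you’ve installed the library, created a Google Cloud project, and configured application credentials, and my_photo.jpg is a placeholder filename:

    from google.cloud import vision

    # Assumes: pip install google-cloud-vision, plus GOOGLE_APPLICATION_CREDENTIALS
    # pointing at a service-account key for your project.
    client = vision.ImageAnnotatorClient()

    with open("my_photo.jpg", "rb") as f:  # placeholder: any local image
        image = vision.Image(content=f.read())

    response = client.label_detection(image=image)  # same labels as the demo
    for label in response.label_annotations:
        print(f"{label.description}: {label.score:.0%}")  # label + confidence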


Tool 2: Microsoft Azure Computer Vision Playground

Try it here →


What it does:

  • Upload or paste an image URL

  • Returns:

    • Object detection with bounding boxes

    • Image description (e.g., “A man riding a bike on a street”)

    • Text reading (OCR)

    • Tags and categories

    • Celebrity and landmark recognition


What it teaches:

  • Pre-trained AI vision pipelines

  • Scene understanding and captioning

  • Text extraction and handwriting analysis

  • Metadata enrichment for media

Azure’s platform also supports testing of custom models, making it ideal if you’re progressing from no-code to low-code.
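

The playground’s output can likewise be reproduced in a few lines. Here’s a minimal sketch using the azure-cognitiveservices-vision-computervision Python SDK, where the endpoint, key, and image URL are placeholders you would replace with your own:

    from azure.cognitiveservices.vision.computervision import ComputerVisionClient
    from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
    from msrest.authentication import CognitiveServicesCredentials

    # Placeholders: use your own Azure resource endpoint and key.
    client = ComputerVisionClient(
        "https://<your-resource>.cognitiveservices.azure.com/",
        CognitiveServicesCredentials("<your-key>"),
    )

    image_url = "https://example.com/street_scene.jpg"  # placeholder URL
    analysis = client.analyze_image(
        image_url,
        visual_features=[VisualFeatureTypes.description, VisualFeatureTypes.tags],
    )

    for caption in analysis.description.captions:  # e.g. "a man riding a bike"
        print(f"Caption: {caption.text} ({caption.confidence:.0%})")
    for tag in analysis.tags:
        print(f"Tag: {tag.name} ({tag.confidence:.0%})")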


Tips for Success:

  • Look for confidence scores above 80% as indicators of higher accuracy (see the sketch after this list)

  • Keep your image sizes manageable (~1MB or less)

  • Consider the implications of AI mislabeling images in real-world use

  • Think critically about the AI’s limitations when applied to new contexts

  • Toggle between models and API versions to explore differences
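

As an example of the first tip, filtering by confidence takes only a couple of lines. This hypothetical snippet reuses the response object from the Google sketch above:

    # Keep only the labels the model is reasonably sure about.
    CONFIDENCE_THRESHOLD = 0.80

    confident_labels = [
        label.description
        for label in response.label_annotations
        if label.score >= CONFIDENCE_THRESHOLD
    ]
    print(confident_labels)  # e.g. ['Cat', 'Whiskers', 'Felidae']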

