
Vision Recognition in Action: How AI "Sees" the World

  • sheharav
  • Jul 2
  • 3 min read

One of the most compelling capabilities of AI is how it interprets visual information. Computer vision encompasses facial recognition, object and label detection, optical character recognition (OCR), and contextual image understanding, all of which can work together in real time.


In this post, I’ll walk you through a practical, hands-on experiment using Google Cloud Vision and Microsoft Azure’s Computer Vision tools. These no-code experiments let you upload an image and instantly get back a detailed AI interpretation, making them a great way to understand how machines “see.”


Computer Vision is a core capability of artificial intelligence, enabling machines to:

  • Identify objects and faces

  • Detect emotions and text

  • Understand scenes and relationships

  • Label content for search, categorization, and automation

These capabilities underpin many real-world applications in healthcare, security, retail, education, and more.


What This Lab Demonstrates

By uploading an image and seeing the AI-generated output, you are:

  • Interacting with pre-trained convolutional neural networks

  • Observing supervised learning in action (labels learned from human-labeled data)

  • Understanding the importance of data quality and diversity

  • Learning about confidence scores and what makes models “sure” or “uncertain”

  • Comparing multimodal capabilities: label recognition, text extraction, object detection


For example, Google’s Cloud Vision model can identify a cat in an image by recognizing patterns such as fur texture, ear shape, and common color gradients, while Azure’s captioning tool can generate the phrase “a group of people playing soccer in a park” from a single frame, combining object, activity, and scene recognition.
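

To make confidence scores concrete, here’s an illustrative sketch of the kind of label output the Vision demo produces for the cat example, written as a Python dictionary. The labels and scores below are invented for illustration, not actual API output:

    # Illustrative shape of a label-detection result (values invented).
    # Each label carries a confidence score between 0 and 1.
    label_result = {
        "labelAnnotations": [
            {"description": "Cat",       "score": 0.98},  # very sure
            {"description": "Whiskers",  "score": 0.91},
            {"description": "Felidae",   "score": 0.87},
            {"description": "Carnivore", "score": 0.62},  # less sure
        ]
    }

Higher scores mean the model found stronger evidence for that label; a downstream application might keep only labels above some threshold.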


What Are Convolutional Neural Networks (CNNs)?

Convolutional Neural Networks (CNNs) are a type of deep learning algorithm specifically designed to process and analyze visual data. They use layers of convolutional filters to scan images, detecting features like edges, textures, and shapes. These features are then combined to recognize complex patterns and classify the content of images. CNNs are the backbone of modern computer vision systems and are trained on large labeled datasets using supervised learning.
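

The cloud tools hide these internals, but a tiny sketch helps show the layered idea. Here’s a minimal, deliberately toy CNN in PyTorch (my own illustration, not the architecture Google or Microsoft actually use):

    import torch
    import torch.nn as nn

    # Toy CNN: convolutions detect edges/textures, pooling shrinks the
    # feature maps, and a final linear layer classifies the image.
    class TinyCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level edges
                nn.ReLU(),
                nn.MaxPool2d(2),                              # 224 -> 112
                nn.Conv2d(16, 32, kernel_size=3, padding=1),  # textures, shapes
                nn.ReLU(),
                nn.MaxPool2d(2),                              # 112 -> 56
            )
            self.classifier = nn.Linear(32 * 56 * 56, num_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    # One random "image" (RGB, 224x224) produces one score per class.
    logits = TinyCNN()(torch.randn(1, 3, 224, 224))
    print(logits.shape)  # torch.Size([1, 10])

Production models behind services like Cloud Vision are vastly deeper and trained on millions of labeled images, but the principle is the same: stacked filters turning raw pixels into increasingly abstract features.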


Real-World Case Study: AI for Accessibility

Microsoft’s Seeing AI app uses computer vision to help blind or low-vision users navigate the world. It can read text aloud, recognize faces and emotions, and describe scenes using the same underlying vision APIs showcased in this post. This shows how AI can create deeply human-centered applications that extend independence and dignity.


These tools provide a useful demonstration of what vision models can currently achieve, and of where human oversight still plays a role. They’re also an engaging way to start “seeing like an AI.”


Vision Lab

Try It Yourself: AI Image Analysis in 3 Steps

  1. Go to the Google or Microsoft tool (both are linked below)

  2. Upload your chosen image (e.g., a selfie, family photo, or product shot)

  3. Review the AI’s predictions and consider:

    • What does it get right?

    • Where does it struggle?

    • What can you do with this output?


Tool 1: Google Cloud Vision Drag-and-Drop

Try it here →


What it does:

  • Upload or drag an image

  • Returns:

    • Labels (e.g., “Cat,” “Furniture,” “Table”)

    • Object detection (bounding boxes on recognized items)

    • Face detection (with emotional state estimations)

    • Text detection (OCR) — even handwritten

    • SafeSearch (flags inappropriate content)


What it teaches:

  • Computer vision fundamentals

  • Classification vs detection

  • Confidence scoring and thresholds

  • Multi-model outputs (e.g., labels + faces + text)

This tool uses Google’s pre-trained deep learning models. It’s fast, free, and visually intuitive — perfect for learning and experimentation.
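

If you outgrow the drag-and-drop page, the same models are callable from code. Here’s a minimal sketch using the official google-cloud-vision Python client; it assumes you’ve installed the library, created a Google Cloud project, and configured application credentials, and my_photo.jpg is a placeholder filename:

    from google.cloud import vision

    # Assumes: pip install google-cloud-vision, plus GOOGLE_APPLICATION_CREDENTIALS
    # pointing at a service-account key for your project.
    client = vision.ImageAnnotatorClient()

    with open("my_photo.jpg", "rb") as f:  # placeholder: any local image
        image = vision.Image(content=f.read())

    response = client.label_detection(image=image)  # same labels as the demo
    for label in response.label_annotations:
        print(f"{label.description}: {label.score:.0%}")  # label + confidence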


Tool 2: Microsoft Azure Computer Vision Playground

Try it here →


What it does:

  • Upload or paste an image URL

  • Returns:

    • Object detection with bounding boxes

    • Image description (e.g., “A man riding a bike on a street”)

    • Text reading (OCR)

    • Tags and categories

    • Celebrity and landmark recognition


What it teaches:

  • Pre-trained AI vision pipelines

  • Scene understanding and captioning

  • Text extraction and handwriting analysis

  • Metadata enrichment for media

Azure’s platform also supports testing of custom models, making it ideal if you’re progressing from no-code to low-code.
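

The playground’s output can likewise be reproduced in a few lines. Here’s a minimal sketch using the azure-cognitiveservices-vision-computervision Python SDK, where the endpoint, key, and image URL are placeholders you would replace with your own:

    from azure.cognitiveservices.vision.computervision import ComputerVisionClient
    from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
    from msrest.authentication import CognitiveServicesCredentials

    # Placeholders: use your own Azure resource endpoint and key.
    client = ComputerVisionClient(
        "https://<your-resource>.cognitiveservices.azure.com/",
        CognitiveServicesCredentials("<your-key>"),
    )

    image_url = "https://example.com/street_scene.jpg"  # placeholder URL
    analysis = client.analyze_image(
        image_url,
        visual_features=[VisualFeatureTypes.description, VisualFeatureTypes.tags],
    )

    for caption in analysis.description.captions:  # e.g. "a man riding a bike"
        print(f"Caption: {caption.text} ({caption.confidence:.0%})")
    for tag in analysis.tags:
        print(f"Tag: {tag.name} ({tag.confidence:.0%})")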


Tips for Success:

  • Look for confidence scores above 80% as indicators of higher accuracy (see the sketch after this list)

  • Keep your image sizes manageable (~1MB or less)

  • Consider the implications of AI mislabeling images in real-world use

  • Think critically about the AI’s limitations when applied to new contexts

  • Toggle between models and API versions to explore differences
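

As an example of the first tip, filtering by confidence takes only a couple of lines. This hypothetical snippet reuses the response object from the Google sketch above:

    # Keep only the labels the model is reasonably sure about.
    CONFIDENCE_THRESHOLD = 0.80

    confident_labels = [
        label.description
        for label in response.label_annotations
        if label.score >= CONFIDENCE_THRESHOLD
    ]
    print(confident_labels)  # e.g. ['Cat', 'Whiskers', 'Felidae']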

