Navigating the New AI Landscape: A Developer’s Journey Through the Noise
June 16, 2025

The Fashion Trends Discovery Scenario
In this walkthrough, we’ll explore a sample application that demonstrates the power of combining Computer Use (CUA) models with Playwright browser automation to autonomously compile trend information from the internet, while leveraging MCP integration to intelligently catalog and store insights in Azure Blob Storage.
The User Experience
A fashion analyst simply provides a query like “latest trends in sustainable fashion” to our command-line interface. What happens next showcases the power of agentic AI—the system requires no further human intervention to:
- Autonomous Web Navigation: The agent launches Pinterest, intelligently locates search interfaces, and performs targeted queries
- Intelligent Content Discovery: Systematically identifies and interacts with trend images, navigating to detailed pages
- Advanced Content Analysis: Applies computer vision to analyze fashion elements, colors, patterns, and design trends
- Intelligent Compilation: Consolidates findings into comprehensive, professionally formatted markdown reports
- Contextual Storage: Recognizes the value of preserving insights and autonomously offers cloud storage options
Technical capabilities leveraged
Behind this seamless experience lies a coordination of AI models:
- Pinterest Navigation: The CUA model visually understands Pinterest’s interface layout, identifying search boxes and navigation elements with pixel-perfect precision
- Search Results Processing: Rather than relying on traditional DOM parsing, our agent uses visual understanding to identify trend images and calculate precise interaction coordinates
- Content Analysis: Each discovered trend undergoes detailed analysis using GPT-4o’s advanced vision capabilities, extracting insights about fashion elements, seasonal trends, and style patterns
- Autonomous Decision Making: The agent contextually understands when information should be preserved and automatically engages with cloud storage systems
Technology Stack Overview
At the heart of this solution lies an orchestration of several AI technologies, each serving a specific purpose in creating a truly autonomous agent.
The architecture used
```
┌─────────────────────────────────────────────────────────────────┐
│ Azure AI Foundry │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Responses API │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │ │
│ │ │ CUA Model │ │ GPT-4o │ │ Built-in MCP │ │ │
│ │ │ (Interface) │ │ (Content) │ │ Client │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Function Calling Layer │
│ (Workflow Orchestration) │
└─────────────────────────────────────────┘
│
▼
┌─────────────────┐ ┌──────────────────┐
│ Playwright │◄──────────────► │ Trends Compiler │
│ Automation │ │ Engine │
└─────────────────┘ └──────────────────┘
│
▼
┌─────────────────────┐
│ Azure Blob │
│ Storage (MCP) │
└─────────────────────┘
```
Azure OpenAI Responses API
At the core of the agentic architecture in this solution, the Responses API provides intelligent decision-making capabilities that determine when to invoke Computer Use models for web crawling versus when to engage MCP servers for data persistence. This API serves as the brain of our agent, contextually understanding user intent and autonomously choosing the appropriate tools to fulfill complex multi-step workflows.
Computer Use (CUA) Model
Our specialized CUA model excels at visual understanding of web interfaces, providing precise coordinate mapping for browser interactions, layout analysis, and navigation planning. Unlike general-purpose language models, the CUA model is specifically trained to understand web page structures, identify interactive elements, and provide actionable coordinates for automated browser control.
Playwright Browser Automation
Acting as the hands of our agent, Playwright executes the precise actions determined by the CUA model. This robust automation framework translates AI insights into real-world browser interactions, handling everything from clicking and typing to screenshot capture and page navigation with pixel-perfect accuracy.
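The post doesn't show the plumbing between CUA coordinates and Playwright clicks. As an illustrative sketch (not the repo's actual code): assuming the CUA model reports coordinates in screenshot space while the live viewport may use a different size (e.g. because of a device scale factor), a small pure helper can rescale a bounding-box center before the click is issued. `to_viewport_center` and its parameters are hypothetical names.

```python
from typing import Tuple

def to_viewport_center(
    box: Tuple[int, int, int, int],
    screenshot_size: Tuple[int, int],
    viewport_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a screenshot-space bounding box to a click point in viewport space.

    The CUA model sees the screenshot; Playwright clicks in the live viewport.
    If the two differ in size, coordinates must be rescaled before clicking.
    """
    x1, y1, x2, y2 = box
    sx = viewport_size[0] / screenshot_size[0]
    sy = viewport_size[1] / screenshot_size[1]
    return round(((x1 + x2) / 2) * sx), round(((y1 + y2) / 2) * sy)
```

With Playwright, the result feeds directly into `await page.mouse.click(x, y)`.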
GPT-4o Vision Model for Content Analysis
While the CUA model handles interface understanding, GPT-4o provides domain-specific content reasoning. This powerful vision model analyzes fashion trends, extracts meaningful insights from images, and provides rich semantic understanding of visual content—capabilities that complement rather than overlap with the CUA model’s interface-focused expertise.
Model Context Protocol (MCP) Integration
The application showcases the power of agentic AI through its autonomous decision-making around data persistence. The agent intelligently recognizes when compiled information needs to be stored and automatically engages with Azure Blob Storage through MCP integration, without requiring explicit user instruction for each storage operation.
Unlike traditional function calling patterns where custom applications must relay MCP calls through client libraries, the Responses API includes a built-in MCP client that directly communicates with MCP servers. This eliminates the need for complex relay logic, making MCP integration as simple as defining tool configurations.
Function Calling Orchestration
Function calling orchestrates the complex workflow between CUA model insights and Playwright actions. Each step is verified and validated before proceeding, ensuring robust autonomous operation without human intervention throughout the entire trend discovery and analysis process.
Let me walk you through the code used in the application.
Agentic Decision Making in Action
Let’s examine how our application demonstrates true agentic behavior through the main orchestrator in `app.py`:
```python
async def main() -> str:
    """Main entry point demonstrating agentic decision making."""
    conversation_history = []
    generated_reports = []
    while True:
        user_query = input("Enter your query for fashion trends:-> ")
        # Add user input to conversation context
        new_user_message = {
            "role": "user",
            "content": [{"type": "input_text", "text": user_query}],
        }
        conversation_history.append(new_user_message)
        # The agent analyzes context and decides on appropriate actions
        response = ai_client.create_app_response(
            instructions=instructions,
            conversation_history=conversation_history,
            mcp_server_url=config.mcp_server_url,
            available_functions=available_functions,
        )
        # Process autonomous function calls and MCP tool invocations
        for output in response.output:
            if output.type == "function_call":
                # Agent decides to compile trends
                function_to_call = available_functions[output.name]
                function_args = json.loads(output.arguments)
                function_response = await function_to_call(**function_args)
            elif output.type == "mcp_tool_call":
                # Agent decides to use MCP tools for storage
                print(f"MCP tool call: {output.name}")
                # MCP calls handled automatically by Responses API
```
Key Agentic Behaviors Demonstrated:
- Contextual Analysis: The agent examines conversation history to understand whether the user wants trend compilation or storage operations
- Autonomous Tool Selection: Based on context, the agent chooses between function calls (for trend compilation) and MCP tools (for storage)
- State Management: The agent maintains conversation context across multiple interactions, enabling sophisticated multi-turn workflows
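One detail the main loop elides is how a function's return value gets back to the model on the next turn. In the Responses API's function-calling convention, the usual pattern is to append the model's `function_call` item plus a matching `function_call_output` item (keyed by `call_id`) to the conversation before requesting the next response. The sketch below shows only that bookkeeping, with hypothetical data standing in for a live API call; it is an assumption about this app's internals, not code from the repo.

```python
import json

def record_function_result(conversation_history, call, result):
    """Append a function call and its output so the model sees the result
    on the next turn (Responses API function-calling convention)."""
    conversation_history.append({
        "type": "function_call",
        "call_id": call["call_id"],
        "name": call["name"],
        "arguments": call["arguments"],
    })
    conversation_history.append({
        "type": "function_call_output",
        "call_id": call["call_id"],
        "output": json.dumps(result),  # Output must be a string
    })
    return conversation_history

# Hypothetical function call as emitted by the model
call = {"call_id": "call_1", "name": "compile_trends",
        "arguments": '{"user_query": "sustainable fashion"}'}
history = record_function_result([], call, {"report": "## Trends"})
```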
Function Calling Orchestration: Autonomous Web Intelligence
The `TrendsCompiler` class in `compiler.py` demonstrates sophisticated autonomous workflow orchestration:
```python
class TrendsCompiler:
    """Autonomous trends compilation with multi-step verification."""

    async def compile_trends(self, user_query: str) -> str:
        """Main orchestration loop with autonomous step progression."""
        async with LocalPlaywrightComputer() as computer:
            state = {"trends_compiled": False}
            step = 0
            markdown_report = ""  # Ensure a value exists even if a step fails
            while not state["trends_compiled"]:
                try:
                    if step == 0:
                        # Step 1: Autonomous Pinterest navigation
                        await self._launch_pinterest(computer)
                        step += 1
                    elif step == 1:
                        # Step 2: CUA-driven search and coordinate extraction
                        coordinates = await self._search_and_get_coordinates(
                            computer, user_query
                        )
                        if coordinates:
                            step += 1
                    elif step == 2:
                        # Step 3: Autonomous content analysis and compilation
                        await self._process_image_results(
                            computer, coordinates, user_query
                        )
                        markdown_report = await self._generate_markdown_report(
                            user_query
                        )
                        state["trends_compiled"] = True
                except Exception as e:
                    print(f"Autonomous error handling in step {step}: {e}")
                    state["trends_compiled"] = True
            return markdown_report
```
Autonomous Operation Highlights:
- Self-Verifying Steps: Each step validates completion before advancing
- Error Recovery: Autonomous error handling without human intervention
- State-Driven Progression: The agent maintains its own execution state
- No User Prompts: Complete automation from query to final report
Pinterest’s Unique Challenge: Visual Coordinate Intelligence
One of the most impressive demonstrations of CUA model capabilities lies in solving Pinterest’s hidden URL challenge:
```python
async def _detect_search_results(self, computer) -> List[Tuple[int, int, int, int]]:
    """Use CUA model to extract image coordinates from search results."""
    # Take screenshot for CUA analysis
    screenshot_bytes = await computer.screenshot()
    screenshot_b64 = base64.b64encode(screenshot_bytes).decode()

    # CUA model analyzes visual layout and identifies image boundaries
    prompt = """
    Analyze this Pinterest search results page and identify all trend/fashion
    images displayed. For each image, provide the exact bounding box coordinates
    in the format: x1,y1,x2,y2

    Focus on the main content images, not navigation or advertisement elements.
    """
    response = await self.ai_client.create_cua_response(
        prompt=prompt,
        screenshot_b64=screenshot_b64
    )

    # Extract coordinates using specialized parser
    coordinates = self.coordinate_parser.extract_coordinates(response.content)
    print(f"CUA model identified {len(coordinates)} image regions")
    return coordinates
```
The Coordinate Calculation:
```python
def calculate_centers(self, coordinates: List[Tuple[int, int, int, int]]) -> List[Tuple[int, int]]:
    """Calculate center coordinates for precise clicking."""
    centers = []
    for x1, y1, x2, y2 in coordinates:
        center_x = (x1 + x2) // 2
        center_y = (y1 + y2) // 2
        centers.append((center_x, center_y))
    return centers
```
Key takeaways of this approach:
- No DOM Dependency: Pinterest’s hover-based URL revelation becomes irrelevant
- Visual Understanding: The CUA model sees what humans see—image boundaries
- Pixel-Perfect Targeting: Calculated center coordinates ensure reliable clicking
- Robust Navigation: Works regardless of Pinterest’s frontend implementation changes
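The `coordinate_parser` used in `_detect_search_results` isn't shown in the post. As a minimal sketch of what its `extract_coordinates` might do, assuming the CUA model replies with `x1,y1,x2,y2` lines as the prompt requests (the function name, regex, and sample reply are illustrative, not the repo's code):

```python
import re
from typing import List, Tuple

def extract_coordinates(text: str) -> List[Tuple[int, int, int, int]]:
    """Pull x1,y1,x2,y2 bounding boxes out of free-form model output."""
    boxes = []
    for m in re.finditer(r"(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)", text):
        x1, y1, x2, y2 = map(int, m.groups())
        if x2 > x1 and y2 > y1:  # Discard degenerate or reversed boxes
            boxes.append((x1, y1, x2, y2))
    return boxes

# Hypothetical model reply containing two valid boxes and one degenerate one
reply = "Found two images:\n120,80,360,420\n400,80,640,430\n(ignore 5,5,3,3)"
boxes = extract_coordinates(reply)
```

Parsing defensively like this matters because the model's reply is free text; the regex tolerates whitespace, and the sanity check drops boxes whose corners are out of order.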
Model Specialization: The Right AI for the Right Job
Our solution demonstrates sophisticated AI model specialization:
```python
async def _analyze_trend_page(self, computer, user_query: str) -> Dict[str, Any]:
    """Use GPT-4o for domain-specific content analysis."""
    # Capture the detailed trend page
    screenshot_bytes = await computer.screenshot()
    screenshot_b64 = base64.b64encode(screenshot_bytes).decode()

    # GPT-4o analyzes fashion content semantically
    analysis_prompt = f"""
    Analyze this fashion trend page for the query: "{user_query}"

    Provide detailed analysis of:
    1. Fashion elements and style characteristics
    2. Color palettes and patterns
    3. Seasonal relevance and trend timing
    4. Target demographics and style categories
    5. Design inspiration and cultural influences

    Format as structured markdown with clear sections.
    """
    # Note: Using GPT-4o instead of CUA model for content reasoning
    response = await self.ai_client.create_vision_response(
        model=self.config.vision_model_name,  # GPT-4o
        prompt=analysis_prompt,
        screenshot_b64=screenshot_b64
    )
    return {
        "analysis": response.content,
        "timestamp": datetime.now().isoformat(),
        "query_context": user_query
    }
```
Model Selection Rationale:
- CUA Model: Perfect for understanding “Where to click” and “How to navigate”
- GPT-4o: Excels at “What does this mean” and “How is this relevant”
- Specialized Strengths: Each model operates in its domain of expertise
- Complementary Intelligence: Combined capabilities exceed individual model limitations
Compilation and Consolidation
```python
async def _generate_markdown_report(self, user_query: str) -> str:
    """Consolidate all analyses into a comprehensive markdown report."""
    if not self.image_analyses:
        return "No trend data collected for analysis."

    # Intelligent report structuring
    report_sections = [
        f"# Fashion Trends Analysis: {user_query}",
        f"*Generated on {datetime.now().strftime('%B %d, %Y')}*",
        "",
        "## Executive Summary",
        await self._generate_executive_summary(),
        "",
        "## Detailed Trend Analysis"
    ]

    # Process each analyzed trend with intelligent categorization
    for idx, analysis in enumerate(self.image_analyses, 1):
        trend_section = [
            f"### Trend Item {idx}",
            analysis.get('analysis', 'No analysis available'),
            f"*Analysis timestamp: {analysis.get('timestamp', 'Unknown')}*",
            ""
        ]
        report_sections.extend(trend_section)

    # Add intelligent trend synthesis
    report_sections.extend([
        "## Trend Synthesis and Insights",
        await self._generate_trend_synthesis(),
        "",
        "## Recommendations",
        await self._generate_recommendations()
    ])
    return "\n".join(report_sections)
```
Intelligent Compilation Features:
- Automatic Structuring: Creates professional report formats automatically
- Content Synthesis: Combines individual analyses into coherent insights
- Temporal Context: Maintains timestamp and query context
- Executive Summaries: Generates high-level insights from detailed data
Autonomous Storage Intelligence
Note that there is no MCP Client code that needs to be implemented here. The integration is completely turnkey, through configuration alone.
```python
# In app_client.py - MCP tool configuration
def create_app_tools(self, mcp_server_url: str, available_functions: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Configure tools with automatic MCP integration."""
    tools = [
        {
            "type": "mcp",
            "server_label": "azure-storage-mcp-server",
            "server_url": mcp_server_url,
            "require_approval": "never",  # Autonomous operation
            "allowed_tools": ["create_container", "list_containers", "upload_blob"],
        }
    ]
    return tools

# Agent instructions demonstrate contextual intelligence
instructions = """
Step 1: Compile trends based on user query using computer use agent.
Step 2: Prompt user to store trends report in Azure Blob Storage.
        Use MCP Server tools to perform this action autonomously.
IMPORTANT: Maintain context of previously generated reports.
If user asks to store a report, use the report generated in this session.
"""
```
Turnkey MCP Integration:
- Direct API Calls: MCP tools called directly by Responses API
- No Relay Logic: No custom MCP client implementation required
- Autonomous Tool Selection: Agent chooses appropriate MCP tools based on context
- Contextual Storage: Agent understands what to store and when
Demo and Code reference
Here is the GitHub repo of the application described in this post.
See a demo of this application in action:
Conclusion: Entering the Age of Practical Agentic AI
The Fashion Trends Compiler Agent represents Agentic AI applications that work autonomously in real-world scenarios. By combining Azure AI Foundry’s turnkey MCP integration with specialized AI models and robust automation frameworks, we’ve created an agent that doesn’t just follow instructions but intelligently navigates complex multi-step workflows with minimal human oversight.
Ready to build your own agentic AI solutions? Start exploring Azure AI Foundry’s MCP integration and Computer Use capabilities to create the next generation of intelligent automation.