Building Your First Agentic AI System on GCP

Introduction

The shape of cloud workloads is changing. The unit of deployment used to be a microservice that responded to a request; increasingly, it is an agent that takes a goal, plans, calls tools, remembers what happened, and decides when it is finished. That shift is not just a new framework on top of existing infrastructure. It changes how you think about state, scaling, identity, and cost.

This article is a hands-on walkthrough for engineers and architects building their first production agentic system on Google Cloud. We will pick a concrete stack — LangGraph 1.x for orchestration, Gemini 3.1 Pro and 3.5 Flash for reasoning, Vector Search 2.0 for semantic memory, Firestore for durable state, and Cloud Run for serving — and step through the architecture choices, code patterns, and operational guardrails that matter once you go past a notebook demo. By the end you should have a clear mental model of how the pieces fit and where the sharp edges live.

A naming note up front, because Google’s branding has shifted under everyone’s feet. At Cloud Next on April 22, 2026, Google announced the Gemini Enterprise Agent Platform, the consolidated home for what used to be called Vertex AI plus Agentspace. As of May 21, 2026 the Vertex AI brand no longer appears in the Cloud Console — everything you used to do under Vertex AI now lives under Gemini Enterprise Agent Platform, though the underlying APIs, SDKs, and surface names still carry the vertex-ai-* and aiplatform prefixes for backward compatibility. Deprecated Vertex AI SDK modules stop receiving updates after June 24, 2026, so plan migrations accordingly. Throughout this article I will say “Gemini Enterprise Agent Platform” for the umbrella product and keep “Vertex AI” only where it still appears in API and product names (Vertex AI Vector Search, the aiplatform_v1 client library, and so on).

What “Agentic” Actually Means

Before any code, it is worth being precise about what we are building. An agent, in this context, is a system with four properties that a traditional API does not have:

Autonomy. It chooses its own next step from a set of available actions rather than executing a fixed sequence.
Tool use. It can invoke external functions — search, database queries, HTTP calls, code execution — and incorporate the results into subsequent reasoning.
Memory. It carries context across steps within a run and, often, across runs for the same user or workflow.
A control loop. It runs until a goal is achieved, a stop condition fires, or a budget is exhausted, rather than terminating after a single response.

Those four properties drive every infrastructure decision below. Autonomy means you cannot pre-route traffic; the agent’s runtime needs to survive long, variable-length executions. Tool use means you need a secure, auditable way to expose capabilities. Memory means you need at least two storage tiers — short-term working state and long-term semantic recall. The control loop means you need durable execution, because losing a container three steps into a ten-step plan is not acceptable.

The GCP Service Map

GCP offers a lot of overlapping services, and the choice between them matters more for agents than for stateless APIs. Here is the opinionated short list for a first system as of mid-2026.

Reasoning model. Use Gemini on the Gemini Enterprise Agent Platform. The practical defaults today are Gemini 3.1 Pro (in preview since February 19, 2026) for complex, multi-step reasoning, and Gemini 3.5 Flash (released May 19, 2026) for the high-volume nodes inside the graph where you do not need the top tier. Mixing models inside one graph is normal: route classification and summarization to 3.5 Flash, route planning and synthesis to 3.1 Pro. Avoid Gemini 2.0 Flash and 2.0 Flash-Lite — they reached end-of-life on June 1, 2026. Gemini 2.5 Flash is still GA but is on the retirement glide path: Google has said it will be discontinued no earlier than October 16, 2026, with a confirmed date once Gemini 3 Flash hits GA. If you are starting today, picking 3.5 Flash for the worker tier saves you one migration.

Orchestration. Use LangGraph 1.x. The 1.0 release shipped in October 2025 with a no-breaking-changes promise through 2.0, and as of June 2026 the library is on the 1.2.x line. The graph-as-runtime model maps cleanly onto things production systems actually need: explicit state, checkpointing, interrupts, and replay. Alternatives exist — Google’s own Agent Development Kit (ADK) 2.0, announced at Google I/O in May 2026, is a credible code-first option that integrates natively with the Gemini Enterprise Agent Platform and the Agent2Agent (A2A) protocol. ADK is a reasonable second choice if you want a Google-blessed stack end-to-end; LangGraph remains the most portable open framework.

Semantic memory. Use Vertex AI Vector Search 2.0. This is the GA successor to what was once called Matching Engine and then plain Vertex AI Vector Search. The 2.0 generation adds Collections (data and vectors together in one logical object), auto-embeddings for populating vector fields automatically, and a hybrid search mode that combines dense vectors, sparse embeddings, and a built-in semantic re-ranker in a single parallel query. It is powered by Google’s ScaNN algorithm — the same engine behind Google Search, YouTube, and Google Play — and delivers sub-10 ms latency on billion-scale indexes. For a first project you do not need the broader Vertex AI Search (“Search” without “Vector”); pick Vector Search 2.0 and graduate later if your retrieval needs widen.

Durable state. Use Firestore as the default for agent run state, checkpoints, and conversation history. It is serverless, scales without operator effort, and the document model maps naturally to a graph’s per-node state objects. Reach for Spanner only when you genuinely need cross-region strong consistency or relational joins across high-volume operational data — most first agents do not. Memorystore for Redis is worth adding for hot scratch state (rate limiting, in-flight tool call dedup), but not as the system of record.

Serving. Use Cloud Run. It supports streaming HTTP, WebSockets, and (importantly for agents) request timeouts up to 60 minutes and tunable concurrency. The platform’s built-in service identity is what you bind IAM permissions to, so the agent’s authority to read Firestore or call Vector Search is a property of the deployment, not a credential stuffed into an environment variable.

Tool integration. Standardize on the Model Context Protocol (MCP) for new tools and the Agent2Agent (A2A) protocol v1.0 for agent-to-agent communication. Google adopted MCP across its own services in December 2025 and now offers fully managed remote MCP servers for Google Maps, BigQuery, Compute Engine, and Kubernetes; Apigee acts as an MCP bridge that turns any standard API into a discoverable agent tool. A2A v1.0 reached 150+ organizations in production within its first year and is the default interoperability layer on the Gemini Enterprise Agent Platform — use it the moment you have more than one agent that needs to coordinate across services. The two protocols complement each other: MCP wires an agent to tools and data, A2A wires agents to each other.

Secrets. Use Secret Manager, accessed through the Cloud Run service identity. No keys in environment variables, no keys in code, no keys in container images. This is the single most important security control on the list, and it is essentially free to get right on day one.

A Reference Architecture

A practical first agent on GCP looks like this:

Client (web/CLI)
  → Cloud Run (FastAPI + LangGraph agent)
      → Gemini Enterprise Agent Platform (Gemini 3.1 Pro / 3.5 Flash)
      → Vertex AI Vector Search 2.0 (semantic memory)
      → Firestore (run state, checkpoints, history)
      → MCP tool servers on Cloud Run (domain tools)
      → A2A peers (other agents, optional)
      → Secret Manager (API keys, signing material)
  → Cloud Logging / Cloud Trace (observability)

Three properties of this layout are worth calling out. First, the agent process is the only thing on the hot path that holds intelligence — everything else is a managed dependency. Second, every storage tier is serverless or autoscaling, so the bill scales with usage rather than with peak provisioning. Third, identity flows through a single Cloud Run service account, which means the agent’s permissions are visible, auditable, and revocable from one place.

Orchestrating with LangGraph

LangGraph models an agent as a directed graph of nodes operating on a shared state object. Each node is a Python function that takes state, optionally calls an LLM or a tool, and returns a state update. Conditional edges decide where to go next based on the updated state.

A minimal planning agent looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
from typing import TypedDict, Annotated, Sequence
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.firestore import FirestoreSaver
from langchain_google_vertexai import ChatVertexAI

class AgentState(TypedDict):
    goal: str
    plan: list[str]
    observations: Annotated[Sequence[str], "append"]
    answer: str | None

planner = ChatVertexAI(model="gemini-3.1-pro", temperature=0.2)
worker  = ChatVertexAI(model="gemini-3.5-flash", temperature=0.0)

def plan_node(state: AgentState) -> AgentState:
    response = planner.invoke(
        f"Break down this goal into 3-5 concrete steps:\n{state['goal']}"
    )
    return {"plan": _parse_steps(response.content)}

def act_node(state: AgentState) -> AgentState:
    step = state["plan"][len(state["observations"])]
    result = worker.invoke(f"Execute: {step}\nContext so far: {state['observations']}")
    return {"observations": [result.content]}

def is_done(state: AgentState) -> str:
    return "synthesize" if len(state["observations"]) >= len(state["plan"]) else "act"

graph = StateGraph(AgentState)
graph.add_node("plan", plan_node)
graph.add_node("act", act_node)
graph.add_node("synthesize", synthesize_node)
graph.set_entry_point("plan")
graph.add_edge("plan", "act")
graph.add_conditional_edges("act", is_done, {"act": "act", "synthesize": "synthesize"})
graph.add_edge("synthesize", END)

agent = graph.compile(checkpointer=FirestoreSaver(collection="agent_checkpoints"))

Two things are doing real work here. The checkpointer (provided by the langgraph-checkpoint-firestore package) writes the state object to Firestore after every node, so a Cloud Run instance crash mid-run is recoverable — when the next request comes in with the same thread ID, LangGraph rehydrates state and resumes at the last completed node. The mixed-model setup uses the expensive planner only at the top of the loop and the cheaper worker model for the per-step grind. On a representative workload with five steps, this pattern typically lands 60-70% cheaper than running everything through Pro without measurably hurting quality.

Semantic Memory with Vector Search 2.0

Working memory lives in AgentState; long-term memory lives in Vector Search. A common pattern is to index, for each completed run, a short summary plus key facts, then retrieve the top-k most relevant prior summaries at the start of a new run with a similar goal.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
from google.cloud import aiplatform_v1
from langchain_google_vertexai import VertexAIEmbeddings

# text-embedding-005 is the current English-only Gecko-line model;
# switch to gemini-embedding-001 if you need multilingual or code retrieval.
embeddings = VertexAIEmbeddings(model_name="text-embedding-005")

def remember(run_id: str, summary: str, metadata: dict):
    vector = embeddings.embed_query(summary)
    index_client.upsert_datapoints(
        index=INDEX_NAME,
        datapoints=[{
            "datapoint_id": run_id,
            "feature_vector": vector,
            "restricts": [{"namespace": "tenant", "allow_list": [metadata["tenant"]]}],
        }],
    )
    firestore_client.collection("memory_payload").document(run_id).set({
        "summary": summary, **metadata,
    })

def recall(query: str, tenant: str, k: int = 5) -> list[dict]:
    vector = embeddings.embed_query(query)
    hits = index_client.find_neighbors(
        index_endpoint=INDEX_ENDPOINT,
        queries=[{"datapoint": {"feature_vector": vector},
                  "neighbor_count": k,
                  "restricts": [{"namespace": "tenant", "allow_list": [tenant]}]}],
    )
    return [_load_payload(h.datapoint.datapoint_id) for h in hits[0].neighbors]

The restricts field on each datapoint enforces tenant isolation at query time — a vector belonging to tenant A is never returned to tenant B even if the query is similar. For any multi-tenant system this is mandatory; bolting it on later means re-indexing everything. Note also that the heavy payload (the summary text, metadata) lives in Firestore, not in Vector Search. The vector index stores IDs and embeddings; Firestore stores the human-readable content. This separation keeps index size, and therefore cost, predictable. If your retrieval needs span keyword and semantic matches, Vector Search 2.0’s hybrid mode lets you run dense and sparse queries plus a built-in re-ranker in one round trip, which is materially better than chaining a keyword index and a vector index yourself.

Cloud Run Deployment

Cloud Run is well-suited to LangGraph agents for one specific reason: the request can be long. A multi-step plan with tool calls easily runs 20-60 seconds; with retries and human-in-the-loop interrupts it can run for minutes. Configure for that explicitly:

Timeout: raise to 600s or the platform maximum of 3600s (60 minutes), depending on agent depth.
Concurrency: start at 4-8 per instance; agents are I/O-bound on LLM calls but each one holds a chunk of memory while it waits. You can push higher (Cloud Run allows up to 1000 per instance) once you have measured memory pressure under load.
Min instances: 1 if cold-start latency matters; 0 if cost dominates and a 2-3 second cold start is acceptable.
CPU allocation: keep “CPU always allocated” off for cost unless you do meaningful synchronous work between LLM calls.
Service account: dedicated per agent; grant only the roles it actually needs (roles/aiplatform.user, roles/datastore.user, scoped Secret Manager access, and the specific tool APIs).

A Terraform sketch:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
resource "google_cloud_run_v2_service" "agent" {
  name     = "rlg-agent"
  location = var.region

  template {
    service_account = google_service_account.agent.email
    timeout         = "3600s"
    scaling { min_instance_count = 1, max_instance_count = 20 }

    containers {
      image = "${var.region}-docker.pkg.dev/${var.project}/agents/rlg-agent:${var.version}"
      resources { limits = { cpu = "2", memory = "2Gi" } }
      env {
        name = "VECTOR_INDEX_ENDPOINT"
        value = google_vertex_ai_index_endpoint.memory.id
      }
      env {
        name = "GEMINI_API_KEY"
        value_source {
          secret_key_ref {
            secret  = google_secret_manager_secret.gemini.secret_id
            version = "latest"
          }
        }
      }
    }
  }
}

You can lay this on top of an existing Terraform repo — at RLGeeX we keep agent infrastructure under $HOME/git/rlgeex/rlg-infra/ next to the rest of the platform so it shares state, naming, and IAM modules.

Error Handling and Recovery

Agents fail more interestingly than APIs do. Plan for three failure modes from day one:

Transient tool/model failures. Wrap every LLM and tool call in exponential backoff with jitter. The Vertex AI SDK does some of this already; for tools, do it yourself. Cap total retries per node so a flapping dependency cannot consume an entire token budget.
Mid-run crashes. LangGraph’s Firestore checkpointer covers this for you — every node commit is durable. Make sure your Cloud Run service is configured to accept a retry on the same thread ID rather than starting a fresh run.
Stuck loops. Enforce a step budget in the graph itself (if len(state["observations"]) > MAX_STEPS: route to synthesize) and a wall-clock budget in the entry handler. A 50-step infinite plan can easily eclipse the model bill of every other run that day.

Cost Optimization

Most of the cost in a working agent system is tokens, and most token spend is avoidable. A few patterns that move the needle:

Tier your models. Reserve Gemini 3.1 Pro for the planner and final synthesizer. Use 3.5 Flash for everything else, including summarization, classification, and tool argument formatting. (Avoid 2.5 Flash for new builds — its retirement window opens in October 2026.)
Cache aggressively. The platform supports context caching for repeated prompt prefixes; if your system prompt is 4 KB and you make 50 calls per session, caching it pays for itself within a few requests.
Compress memory. Do not feed the entire conversation history back into every step. Summarize older turns into a short rolling summary in state, and recall specifics from Vector Search only when needed.
Cap retrieval. A k of 5 is usually plenty for semantic memory; very few systems benefit from k=20, and the latency and token cost grows linearly.
Scale to zero where you can. Tool MCP servers that see bursty traffic should run on Cloud Run with min_instances=0. The agent itself often warrants min_instances=1 to dodge cold starts on the user-facing path.

Track per-run token cost as a first-class metric. Emit it from the agent into Cloud Logging, then alert on regressions; a code change that doubles average run cost is a bug worth catching the same day.

Security

Three controls are non-negotiable. First, no secrets in code or environment variables — pull from Secret Manager via the Cloud Run service identity. Second, scope IAM tightly: the agent’s service account should have exactly the roles it needs and no more, and any internal tools it calls should themselves require authentication so a prompt-injected tool call cannot reach a privileged endpoint. Third, tenant-isolate memory at the storage layer using Vector Search restricts and Firestore security rules — never trust the agent to filter its own queries.

Beyond those, treat user input as untrusted (which is obvious) and treat tool outputs as untrusted (which is less obvious — a tool that scrapes the web can return prompt-injection payloads that hijack the next node). Sanitize, length-limit, and where possible route untrusted content through a Flash model with a tight system prompt before letting it influence planning. If you wire an A2A peer into the loop, verify its Agent Card and treat its outputs the same way you treat a remote API: assume nothing, validate everything.

Wrapping Up

A first agentic system on GCP is not a moonshot. The stack is well-defined in mid-2026: LangGraph 1.x for orchestration, Gemini 3.1 Pro and 3.5 Flash on the Gemini Enterprise Agent Platform for reasoning, Vertex AI Vector Search 2.0 for semantic memory, Firestore for durable state, Cloud Run for serving, MCP for tools and A2A for peer agents, and Secret Manager and IAM for the guardrails. The hard parts are not which service to pick; they are deciding what your agent should not do, scoping its tools narrowly, instrumenting cost from day one, and treating durability and identity as design constraints rather than afterthoughts.

Build the smallest useful loop end-to-end first — a single planner, a single tool, Firestore checkpointing, one Vector Search recall — and run it on real traffic before adding nodes. Most of what you learn about an agent only shows up in production, and the architecture above is designed to let you ship that minimal version on day one and grow it without rewrites.