<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI XHIELD]]></title><description><![CDATA[AI Security and Safety newsletter by Alde]]></description><link>https://blog.aixhield.com</link><image><url>https://substackcdn.com/image/fetch/$s_!lW8x!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8f65eb9e-4c53-4c6f-95f0-09e3a632d060_2000x2000.jpeg</url><title>AI XHIELD</title><link>https://blog.aixhield.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 02 May 2026 11:46:29 GMT</lastBuildDate><atom:link href="https://blog.aixhield.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Alde Gonzalez]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[alde@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[alde@substack.com]]></itunes:email><itunes:name><![CDATA[Alde]]></itunes:name></itunes:owner><itunes:author><![CDATA[Alde]]></itunes:author><googleplay:owner><![CDATA[alde@substack.com]]></googleplay:owner><googleplay:email><![CDATA[alde@substack.com]]></googleplay:email><googleplay:author><![CDATA[Alde]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Understanding the Black Box - Part 2]]></title><description><![CDATA[Agents are opaque and we are embedding them into every digital interaction that we have]]></description><link>https://blog.aixhield.com/p/understanding-the-black-box-part-26d</link><guid isPermaLink="false">https://blog.aixhield.com/p/understanding-the-black-box-part-26d</guid><dc:creator><![CDATA[Alde]]></dc:creator><pubDate>Sun, 09 Nov 2025 17:08:17 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!q3OO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7551ae1d-9c8f-4ea0-a076-ae11c4b80948_2334x1182.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before we jump into the next steps, here is a quick recap of Part 1.</p><div><hr></div><p>Since their rise in 2022, LLMs built on the transformer architecture such as ChatGPT, Gemini, and Claude have revolutionized how humans interact with AI, software, and computers. By 2025, their influence has expanded into image and video generation with systems like OpenAI&#8217;s Sora, Meta&#8217;s Vibes, and xAI&#8217;s Grok. Yet, despite their transformative capabilities, the mechanisms driving their intelligence remain largely mysterious. Unlike traditional software, which follows explicit, human-written instructions, LLMs learn from vast amounts of text data. Through this training process, they develop a dense network of trillions of parameters capable of <strong>encoding knowledge, reasoning, and creativity</strong>, but with <strong>little </strong><em><strong>interpretability</strong></em>. This opacity has given rise to the field of mechanistic interpretability, which aims to uncover how these systems actually work.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.aixhield.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI XHIELD is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>The first article in this series introduced how transformers process information during training. It explained how text is first tokenized into numerical representations, how those tokens are transformed into embedding vectors that capture meaning, and how information flows through the residual stream, a shared workspace where <strong>each transformer layer refines understanding</strong>. Together, these steps form the foundation for how transformers represent meaning and context.</p><div><hr></div><h4>Step 4: How attention heads let transformers use context and move information between tokens</h4><p>The embedding matrix gives each word its standalone meaning, but understanding language requires more than that. The real breakthrough in transformers is the <strong>attention mechanism</strong>, <em>which enables models to connect words across a sentence and interpret them in context</em>.</p><p>Take the word <strong>&#8220;bank&#8221;</strong>. It means something entirely different in <em>I swam near the river bank</em> versus <em>I got cash from the bank</em>. Attention allows the model to figure out which meaning fits by relating words to one another.</p><p>An attention layer contains multiple attention heads that operate in parallel, each focusing on different relationships between tokens. Every head has two core components:</p><ul><li><p><strong>QK (Query&#8211;Key) circuit:</strong> Decides where to look for relevant information. For each token being processed (the query), it scores how related it is to every previous token (the keys). 
These scores turn into probabilities, effectively telling the model how much attention to give to each earlier token.</p></li><li><p><strong>OV (Output&#8211;Value) circuit:</strong> Determines what information to bring over. Each source token (key) produces a value vector. The destination token (query) then receives a weighted average of these values, with weights coming from the attention pattern learned by the QK circuit. This new information is added back into the residual stream at that token&#8217;s position.</p></li></ul><p>When a token gives another a high attention score, it&#8217;s like saying, &#8220;That&#8217;s the information I need.&#8221; </p><p>Importantly, a query token can only attend to tokens that came before it, never to future ones.</p><p>Intuition: Think of each query as asking a question about all earlier words, and the keys and values as providing the answers.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qnfe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde67aea-2776-4f00-a26a-f69007fcc06c_720x432.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qnfe!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde67aea-2776-4f00-a26a-f69007fcc06c_720x432.webp 424w, https://substackcdn.com/image/fetch/$s_!Qnfe!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde67aea-2776-4f00-a26a-f69007fcc06c_720x432.webp 848w, 
https://substackcdn.com/image/fetch/$s_!Qnfe!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde67aea-2776-4f00-a26a-f69007fcc06c_720x432.webp 1272w, https://substackcdn.com/image/fetch/$s_!Qnfe!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde67aea-2776-4f00-a26a-f69007fcc06c_720x432.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qnfe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde67aea-2776-4f00-a26a-f69007fcc06c_720x432.webp" width="720" height="432" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fde67aea-2776-4f00-a26a-f69007fcc06c_720x432.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:432,&quot;width&quot;:720,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Dot-product attention procedure&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Dot-product attention procedure" title="Dot-product attention procedure" srcset="https://substackcdn.com/image/fetch/$s_!Qnfe!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde67aea-2776-4f00-a26a-f69007fcc06c_720x432.webp 424w, https://substackcdn.com/image/fetch/$s_!Qnfe!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde67aea-2776-4f00-a26a-f69007fcc06c_720x432.webp 848w, 
https://substackcdn.com/image/fetch/$s_!Qnfe!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde67aea-2776-4f00-a26a-f69007fcc06c_720x432.webp 1272w, https://substackcdn.com/image/fetch/$s_!Qnfe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffde67aea-2776-4f00-a26a-f69007fcc06c_720x432.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>A key mechanism: Induction heads</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uqSk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5d923dd-81db-4619-8b05-1969aec904cc_1158x583.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uqSk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5d923dd-81db-4619-8b05-1969aec904cc_1158x583.png 424w, https://substackcdn.com/image/fetch/$s_!uqSk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5d923dd-81db-4619-8b05-1969aec904cc_1158x583.png 848w, https://substackcdn.com/image/fetch/$s_!uqSk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5d923dd-81db-4619-8b05-1969aec904cc_1158x583.png 1272w, https://substackcdn.com/image/fetch/$s_!uqSk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5d923dd-81db-4619-8b05-1969aec904cc_1158x583.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uqSk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5d923dd-81db-4619-8b05-1969aec904cc_1158x583.png" width="1158" height="583" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5d923dd-81db-4619-8b05-1969aec904cc_1158x583.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:583,&quot;width&quot;:1158,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;From Magic to Mechanics: The Induction Head Hypothesis Explained |  
DataDrivenInvestor&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="From Magic to Mechanics: The Induction Head Hypothesis Explained |  DataDrivenInvestor" title="From Magic to Mechanics: The Induction Head Hypothesis Explained |  DataDrivenInvestor" srcset="https://substackcdn.com/image/fetch/$s_!uqSk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5d923dd-81db-4619-8b05-1969aec904cc_1158x583.png 424w, https://substackcdn.com/image/fetch/$s_!uqSk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5d923dd-81db-4619-8b05-1969aec904cc_1158x583.png 848w, https://substackcdn.com/image/fetch/$s_!uqSk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5d923dd-81db-4619-8b05-1969aec904cc_1158x583.png 1272w, https://substackcdn.com/image/fetch/$s_!uqSk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5d923dd-81db-4619-8b05-1969aec904cc_1158x583.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 
4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>One particularly interesting type of attention head is the <strong>induction head</strong>, which powers what&#8217;s known as in-context learning, a model&#8217;s ability to pick up patterns or rules directly from examples in the prompt.</p><p>An induction head follows a simple algorithm:</p><p>If token A was followed by token B earlier in the text, then the next time A appears, predict that B will follow again.</p><p>This allows the model to generalize patterns it has never explicitly seen during training.</p><p>In practice, the induction circuit involves two heads:</p><ol><li><p>The previous-token head in the first layer copies information from one token to the next (for example, copying from sat to on).</p></li><li><p>The induction head in the second layer looks back to find where the current token appeared before, attends to the token that followed it (on in this case), and boosts the probability of generating that token next.</p></li></ol><p>This behaviour shows that transformers can learn algorithms, not just memorize data, and since induction heads only appear in models with at least two layers, they&#8217;re evidence that deeper models develop qualitatively new 
reasoning abilities.</p><p>In attention visualisations, induction heads appear as off-center diagonal patterns, showing how tokens in repeated phrases attend to the next token in their earlier counterparts.</p><h4>Understanding attention through indirect object identification (IOI)</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q3OO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7551ae1d-9c8f-4ea0-a076-ae11c4b80948_2334x1182.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q3OO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7551ae1d-9c8f-4ea0-a076-ae11c4b80948_2334x1182.jpeg 424w, https://substackcdn.com/image/fetch/$s_!q3OO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7551ae1d-9c8f-4ea0-a076-ae11c4b80948_2334x1182.jpeg 848w, https://substackcdn.com/image/fetch/$s_!q3OO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7551ae1d-9c8f-4ea0-a076-ae11c4b80948_2334x1182.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!q3OO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7551ae1d-9c8f-4ea0-a076-ae11c4b80948_2334x1182.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q3OO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7551ae1d-9c8f-4ea0-a076-ae11c4b80948_2334x1182.jpeg" width="1456" height="737" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7551ae1d-9c8f-4ea0-a076-ae11c4b80948_2334x1182.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:737,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!q3OO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7551ae1d-9c8f-4ea0-a076-ae11c4b80948_2334x1182.jpeg 424w, https://substackcdn.com/image/fetch/$s_!q3OO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7551ae1d-9c8f-4ea0-a076-ae11c4b80948_2334x1182.jpeg 848w, https://substackcdn.com/image/fetch/$s_!q3OO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7551ae1d-9c8f-4ea0-a076-ae11c4b80948_2334x1182.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!q3OO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7551ae1d-9c8f-4ea0-a076-ae11c4b80948_2334x1182.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><strong>Rowan Wang Tweet https://x.com/rowankwang/status/1587601532639494146</strong></figcaption></figure></div><p>Another fascinating example of how attention works comes from a task called indirect object identification (IOI), for instance:</p><p>When Mary and John went to the store, John gave a drink to...</p><p>The correct answer is Mary. 
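Stepping back for a moment to the QK/OV mechanics from Step 4: a single attention head can be sketched in a few lines of NumPy. This is a minimal illustration with invented dimensions and random placeholder weights, not any real model's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 4   # illustrative sizes, not from a real model

x = rng.normal(size=(seq_len, d_model))        # residual-stream vectors, one per token
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(d_head, d_model))

# QK circuit: score how related each query token is to every key token
scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head)

# Causal mask: a query may only attend to itself and earlier tokens, never future ones
scores = np.where(np.tri(seq_len, dtype=bool), scores, -np.inf)

# Softmax turns each query's scores into attention probabilities
pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
pattern /= pattern.sum(axis=-1, keepdims=True)

# OV circuit: each destination token receives a weighted average of value vectors,
# projected back and added into the residual stream at its position
out = x + (pattern @ (x @ W_V)) @ W_O
print(pattern[2].round(2))   # row 2 attends only to tokens 0, 1, and 2
```

Note how the causal mask zeroes out attention to future positions, matching the rule that a query token can only attend to tokens that came before it.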
In 2022, Redwood Research reverse-engineered how transformers solve this using a network of specialized attention heads arranged in a three-step circuit:</p><ol><li><p>Identify all names in the sentence (Mary, John, John).</p></li><li><p>Filter out duplicates (John).</p></li><li><p>Output the remaining name (Mary).</p></li></ol><p>These steps are carried out by three main groups of heads:</p><ul><li><p>Duplicate Token Heads: Detect repeated names and connect the later one to its earlier instance.</p></li><li><p>S-Inhibition Heads: Suppress duplicate tokens, preventing them from influencing the model&#8217;s next prediction.</p></li><li><p>Name Mover Heads: Copy the correct (non-duplicated) name to the final position, ensuring the model predicts Mary.</p></li></ul><p>This IOI circuit highlights how complex reasoning can emerge from the coordination of many attention heads, each performing a small, specialized role within the larger mechanism of understanding.</p><p><strong>Source</strong></p><ul><li><p>Indirect Object Identification in GPT-2: https://arxiv.org/abs/2211.00593</p></li><li><p>Neel Nanda: <em><a href="https://www.neelnanda.io/mechanistic-interpretability/walkthrough-ioi">A Walkthrough of Interpretability in the Wild (w/ authors Kevin Wang, Arthur Conmy &amp; Alexandre 
Variengien)</a></em></p></li><li><p>https://www.alignmentforum.org/posts/3ecs6duLmTfyra3Gp/some-lessons-learned-from-studying-indirect-object</p></li><li><p>https://transformer-circuits.pub/2021/framework/index.html</p></li><li><p>https://aignishant.medium.com/unraveling-the-magic-of-q-k-and-v-in-the-attention-mechanism-with-formulas-035cb0781905</p></li><li><p>https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html</p></li><li><p>https://medium.com/data-science/what-are-query-key-and-value-in-the-transformer-architecture-and-why-are-they-used-acbe73f731f2</p></li><li><p>https://www.lesswrong.com/posts/XGHf7EY3CK4KorBpw/understanding-llms-insights-from-mechanistic</p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.aixhield.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI XHIELD is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding the Black Box - Part 1]]></title><description><![CDATA[Agents are opaque and we are embedding them into every digital interaction that we have]]></description><link>https://blog.aixhield.com/p/understanding-the-black-box-part</link><guid isPermaLink="false">https://blog.aixhield.com/p/understanding-the-black-box-part</guid><dc:creator><![CDATA[Alde]]></dc:creator><pubDate>Sat, 18 Oct 2025 16:02:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3-Bj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539b895e-5508-4e14-bdea-dc11786166fd_394x512.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Since their launch in 2022, LLMs built on the transformer architecture such as ChatGPT, Gemini, and Claude have reshaped the world with their ability to produce remarkably human-like text. Today, in 2025, we are witnessing their rapid expansion into images and videos through OpenAI&#8217;s Sora, Meta&#8217;s Vibes, and xAI&#8217;s Grok. Yet behind this astonishing capability lies a deep mystery: </p><div class="pullquote"><p><em><strong>we still don&#8217;t fully understand how these systems function.</strong></em></p></div><p>Traditional software is explicitly programmed by humans, written line by line in interpretable code. LLMs, however, are not designed in this way; they are trained. 
Their behaviour emerges from learning to predict the next word across immense amounts of internet text, producing a dense web of trillions of parameters that somehow encode knowledge, reasoning, and creativity. This process yields extraordinary performance but little transparency. These models are undeniably powerful, yet the mechanisms driving their success remain largely opaque.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.aixhield.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI XHIELD is a reader-supported publication. To receive new posts and support my work, consider becoming a subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>As AI adoption accelerates, understanding why LLMs say what they say has become paramount. This is where <strong>mechanistic interpretability</strong> comes in: the field dedicated to uncovering the inner workings of these black boxes and bringing clarity to the most powerful technology of our time.</p><p>As an investor in AI, I often find it difficult to distinguish <strong>genuine innovation from noise</strong>. Every technological wave attracts opportunistic or casual entrepreneurs, and the AI boom is no exception. </p><p>With software itself becoming increasingly agentic, understanding the brain behind these agents has never been more crucial, so this series of essays explores the inner mechanics of LLMs: how they learn, represent knowledge, and generate meaning. 
</p><div class="pullquote"><p>My goal is to explore the inner mechanics of LLMs, both to deepen my understanding and to help navigate the AI investment landscape with greater insight.</p></div><p>Today, the transformer, an ML model architecture introduced in 2017, is the most popular architecture for building LLMs. How a transformer LLM works depends on whether the model is generating text (inference) or learning from training data (training).</p><h2>LLMs during Training</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3-Bj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539b895e-5508-4e14-bdea-dc11786166fd_394x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3-Bj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539b895e-5508-4e14-bdea-dc11786166fd_394x512.png 424w, https://substackcdn.com/image/fetch/$s_!3-Bj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539b895e-5508-4e14-bdea-dc11786166fd_394x512.png 848w, https://substackcdn.com/image/fetch/$s_!3-Bj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539b895e-5508-4e14-bdea-dc11786166fd_394x512.png 1272w, https://substackcdn.com/image/fetch/$s_!3-Bj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539b895e-5508-4e14-bdea-dc11786166fd_394x512.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3-Bj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539b895e-5508-4e14-bdea-dc11786166fd_394x512.png" width="394" height="512" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/539b895e-5508-4e14-bdea-dc11786166fd_394x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:512,&quot;width&quot;:394,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:52159,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.aixhield.com/i/175898507?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539b895e-5508-4e14-bdea-dc11786166fd_394x512.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3-Bj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539b895e-5508-4e14-bdea-dc11786166fd_394x512.png 424w, https://substackcdn.com/image/fetch/$s_!3-Bj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539b895e-5508-4e14-bdea-dc11786166fd_394x512.png 848w, https://substackcdn.com/image/fetch/$s_!3-Bj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539b895e-5508-4e14-bdea-dc11786166fd_394x512.png 1272w, https://substackcdn.com/image/fetch/$s_!3-Bj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F539b895e-5508-4e14-bdea-dc11786166fd_394x512.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>During training, the transformer produces predictions for every token in a sentence. For each input position <em>i</em>, the model predicts the token that follows it, <em>i + 1</em>. 
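</p><p>A minimal sketch of this shifted-target setup (toy token IDs, not produced by a real tokenizer):</p><pre><code># Toy illustration: during training, the target at each position is
# simply the next token, so one sentence yields a prediction task
# at every position in parallel.
token_ids = [21, 58, 77, 204]   # "Bright stars shine tonight"

inputs = token_ids[:-1]    # what the model sees at each position
targets = token_ids[1:]    # what it must predict (shifted by one)

for position, (inp, tgt) in enumerate(zip(inputs, targets)):
    print(f"position {position}: given token {inp}, predict token {tgt}")</code></pre><p>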
Generating multiple predictions simultaneously allows for more efficient training.</p><p>These predictions are compared against the actual tokens in the training data, and the resulting errors are used to adjust the model&#8217;s parameters to improve its performance.</p><h4>Step 1 - Tokenization: converting text into tokens </h4><p>When the model receives a sentence such as &#8220;Bright stars shine tonight,&#8221; it first splits the text into smaller units called tokens. A token could be a complete word (e.g., &#8220;bright&#8221;), a segment of a word (e.g., &#8220;shine&#8221; and &#8220;s&#8221; from &#8220;shines&#8221;), or punctuation.</p><p>Each token in the model&#8217;s vocabulary is then mapped to a unique numeric ID. For example, &#8220;Bright stars shine tonight&#8221; might be represented as [21, 58, 77, 204]. This numeric sequence is the tokenized version of the text, produced by the tokenizer.</p><p>The model then adds positional embeddings to these token vectors to encode the order in which the tokens appear in the sentence.</p><h4><strong>Step 2 - Embeddings: giving meaning to tokens</strong></h4><p>After text is broken into tokens, each token is turned into an embedding vector, which is a list of numbers that represents its meaning. This is done by using each token ID to select the corresponding row of an embedding matrix. 
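</p><p>As a toy sketch (hypothetical four-word vocabulary and three-dimensional vectors; real models use vocabularies of tens of thousands of tokens and much larger dimensions):</p><pre><code># Toy sketch: map words to token IDs, then look up each ID as a row
# of an embedding matrix. All numbers here are made up.
vocab = {"Bright": 21, "stars": 58, "shine": 77, "tonight": 204}

embedding_matrix = {
    21: [0.1, 0.7, 0.3],
    58: [0.9, 0.2, 0.4],
    77: [0.5, 0.5, 0.1],
    204: [0.2, 0.8, 0.6],
}

token_ids = [vocab[word] for word in "Bright stars shine tonight".split()]
embeddings = [embedding_matrix[tid] for tid in token_ids]
print(token_ids)   # [21, 58, 77, 204]</code></pre><p>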
</p><p>The embedding matrix has a size of <code>[vocabulary size, embedding dimension]</code>, meaning:</p><p>&#8226; Each word in the vocabulary has one row.</p><p>&#8226; Each row is the embedding vector for that token.</p><h5>So how do these vectors capture meaning?</h5><p>During training, the model learns to assign similar vectors to words with similar meanings such as &#8220;see,&#8221; &#8220;look,&#8221; and &#8220;watch.&#8221; In this high-dimensional space, similar words end up pointing in similar directions, so the angle between their vectors is small.</p><h4><strong>Step 3 - The residual stream: How data flows</strong></h4><p>Inside a transformer, information moves through something called the residual stream. It is a shared workspace where different parts of the model write down and read information. Each transformer layer takes the current information in the stream, updates it, and passes it along.</p><p>At the start, the residual stream only contains the individual meaning of each word, without context. As the data flows through the transformer blocks, each layer refines those meanings by taking previous words into account. Over time, the model builds a richer understanding of each token in context.</p><h5>Residual Stream Technical Architecture:</h5><p>This stream is simply a list of vectors, one for each token in the input. 
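</p><p>A minimal sketch of that flow (toy numbers; in a real transformer the update comes from attention and MLP blocks):</p><pre><code># Toy residual stream: one vector per token. Each "layer" reads the
# stream and adds its update back in, so information accumulates.
stream = [[0.1, 0.2], [0.4, 0.3], [0.0, 0.5]]  # sequence length 3, model dimension 2

def layer_update(vector):
    # Stand-in for what an attention/MLP block would compute.
    return [0.01 * v for v in vector]

for i, vec in enumerate(stream):
    update = layer_update(vec)
    stream[i] = [v + u for v, u in zip(vec, update)]  # residual addition</code></pre><p>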
Its shape is <code>[sequence length, model dimension],</code> which matches the shape of the embedding layer&#8217;s output.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aSeP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28ae06c-d42c-4c3a-b95b-61d84af2c58c_615x813.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aSeP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28ae06c-d42c-4c3a-b95b-61d84af2c58c_615x813.png 424w, https://substackcdn.com/image/fetch/$s_!aSeP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28ae06c-d42c-4c3a-b95b-61d84af2c58c_615x813.png 848w, https://substackcdn.com/image/fetch/$s_!aSeP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28ae06c-d42c-4c3a-b95b-61d84af2c58c_615x813.png 1272w, https://substackcdn.com/image/fetch/$s_!aSeP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28ae06c-d42c-4c3a-b95b-61d84af2c58c_615x813.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aSeP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28ae06c-d42c-4c3a-b95b-61d84af2c58c_615x813.png" width="615" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f28ae06c-d42c-4c3a-b95b-61d84af2c58c_615x813.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:615,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Refer to caption&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Refer to caption" title="Refer to caption" srcset="https://substackcdn.com/image/fetch/$s_!aSeP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28ae06c-d42c-4c3a-b95b-61d84af2c58c_615x813.png 424w, https://substackcdn.com/image/fetch/$s_!aSeP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28ae06c-d42c-4c3a-b95b-61d84af2c58c_615x813.png 848w, https://substackcdn.com/image/fetch/$s_!aSeP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28ae06c-d42c-4c3a-b95b-61d84af2c58c_615x813.png 1272w, https://substackcdn.com/image/fetch/$s_!aSeP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff28ae06c-d42c-4c3a-b95b-61d84af2c58c_615x813.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h6><em>                                                            Source: <a href="https://arxiv.org/html/2312.12141v1">Exploring the Residual Stream of Transformers</a></em></h6><p></p><div><hr></div><p><em>LLM architectures are advanced deep learning models. My intention is that this does not become overwhelming for readers, so I will keep dissecting their inner workings in following posts.</em></p><div><hr></div><h6>If you want to dig deeper and check sources:</h6><ul><li><p>https://arxiv.org/html/2312.12141v1</p></li><li><p>https://arbs.io/2024-01-14-demystifying-tokens-and-embeddings-in-llm</p></li><li><p>https://www.lesswrong.com/posts/XGHf7EY3CK4KorBpw/understanding-llms-insights-from-mechanistic</p><div id="youtube2-7xTGNNLPyMI" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;7xTGNNLPyMI&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe 
src="https://www.youtube-nocookie.com/embed/7xTGNNLPyMI?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.aixhield.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Project Castellana: Safety implementation of a VC Agent]]></title><description><![CDATA[Implementation of an AI Agent with Open Source Models and Single-Agent Safety]]></description><link>https://blog.aixhield.com/p/project-castellana-safety-implementation</link><guid isPermaLink="false">https://blog.aixhield.com/p/project-castellana-safety-implementation</guid><dc:creator><![CDATA[Alde]]></dc:creator><pubDate>Sun, 25 May 2025 06:15:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/405640fe-e74a-4762-a40e-739d9a7e681a_500x500.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI Agents are rapidly transforming how software is built. Since this is the agentic era and software engineering has changed, I wanted to create a project that would teach me how to write software securely.</p><p>Software ate the world, AI is eating software, and venture capital is no exception. 
At our firm, writing memos is a core part of our investment process. These memos explain our analysis, due diligence, and investment thesis, regardless of the startup&#8217;s stage.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.aixhield.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI XHIELD is a newsletter about AI Security. It is not investment advice</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>To support this process, I&#8217;ve been using tools like Perplexity to assist with market analysis. It has significantly reduced research time from around one week to just a few hours. 
However, while Perplexity is great for accelerating research, it doesn&#8217;t meet all the requirements needed for seamless integration into our internal investment workflows.</p><p>This led to <em><strong>Project Castellana</strong></em>, a prototype AI agent that can help write investment memos, built with safety engineering principles from day one.</p><h2><strong>The Problem: How Do We Actually Build Useful, Safe AI Agents?</strong></h2><p>To build a functioning AI agent, we need a few key components:</p><ol><li><p>An agentic framework &#8211; a software development kit (SDK) that lets us orchestrate interactions between tools and large language models (LLMs).</p></li><li><p>A clear role and task division &#8211; defining what each agent in the system should do.</p></li><li><p>Tools &#8211; custom-built or external tools that each agent can use to complete its tasks safely and accurately.</p></li></ol><p>Some of the most popular open-source agentic frameworks include:</p><ul><li><p>LangChain</p></li><li><p>LlamaIndex</p></li><li><p>CrewAI</p></li><li><p>AgentStack</p></li></ul><p>For Project Castellana, I chose CrewAI because it allows for structured, multi-agent collaboration in a modular way.</p><pre><code><code>from crewai import Agent
from langchain_openai import ChatOpenAI
from langchain.tools import Tool
from crewai_tools import EXASearchTool</code></code></pre><div><hr></div><h2><strong>The Agent Architecture</strong></h2><p>The system follows a hierarchical multi-agent approach, where each agent has a well-defined responsibility:</p><h3>Strategic Advisor Agent</h3><ul><li><p>Goal: Oversee and coordinate the crew&#8217;s work, ensuring high-quality, relevant, and non-generic outputs.</p></li><li><p>Context: Acts as an experienced project manager focused on aligning output with market-specific investment needs.</p></li></ul><pre><code>def get_strategy_advisor(trace_id=None):</code></pre><pre><code>return create_agent(
  role='Project Manager',
  goal='Efficiently manage the crew and ensure high-quality task completion, with a focus on results that are specific and relevant rather than generic or too zoomed out',
  backstory="""You're an experienced project manager, skilled in overseeing complex projects and guiding teams to success. Your role is to coordinate the efforts of the crew members, ensuring that each task is completed on time and that the results are relevant and specific to the market.""",
  tools=[],
  trace_id=trace_id,
  agent_name='strategy_advisor'
)</code></pre><h3>Competitor Research Agent</h3><ul><li><p>Goal: Identify and analyze real startups in defined AI subsegments.</p></li><li><p>Context: Specialized in spotting emerging, verifiable startups excluding well-known players like Google, Meta, Anthropic, OpenAI, etc.</p></li></ul><pre><code>def <strong>get_competitor_analyst</strong>(<em>trace_id</em>=None):
  return create_agent(
    role='AI Startup Intelligence Specialist',
    goal='Identify and analyze relevant AI startups within specific AI subsegment markets',
    backstory="""Expert in mapping competitive landscapes for specific AI verticals. Specialized in identifying real, named emerging startups and scale-ups rather than tech giants like IBM, OpenAI, Google, META, Anthropic, HuggingFace. Known for finding verifiable information about startups' funding, technology, and market focus.""",
    tools=[exa_search_tool],
    trace_id=trace_id,
    agent_name='competitor_analyst'
)</code></pre><h3><strong>Tools the Agents Use</strong></h3><p>To support the above agents, I developed the following tools:</p><ul><li><p>Market Size Tool &#8211; Estimates the total addressable market for a given segment.</p></li></ul><pre><code>def <strong>estimate_market_size</strong>(<em>data</em>: str) -&gt; str:
  return f"Estimated market size based on: {data}"

market_size_tool = Tool(
  name="Market Size Estimator",
  func=estimate_market_size,
  description="Estimates market size based on provided data."
)</code></pre><ul><li><p>CAGR Calculator &#8211; Automatically computes compound annual growth rates from public or private data sources.</p></li></ul><pre><code>def <strong>calculate_cagr</strong>(<em>initial_value</em>: float, <em>final_value</em>: float, <em>num_years</em>: int) -&gt; float:
  cagr = (final_value / initial_value) ** (1 / num_years) - 1
  return cagr</code></pre><pre><code>cagr_tool = Tool(
  name="CAGR Calculator",
  func=calculate_cagr,
  description="Calculates CAGR given initial value, final value, and number of years."
)</code></pre><ul><li><p>Search Tool (via Exa) &#8211; Allows agents to access real-time web search results, optimized for sourcing startup-specific information.</p></li></ul><pre><code>class CustomEXASearchTool(EXASearchTool):
  def __init__(self):
    super().__init__(
      type='neural',
      use_autoprompt=True,
      startPublishedDate='2021-10-01T00:00:00.000Z',
      endPublishedDate='2023-10-31T23:59:59.999Z',
      excludeText=['OpenAI', 'Anthropic', 'Google', 'Mistral', 'Microsoft', 'Nvidia', 'general AI market', 'overall AI industry', 'IBM'],
      numResults=10
    )

exa_search_tool = CustomEXASearchTool()</code></pre><p><strong>Embedding Safety Engineering Principles in Project Castellana</strong></p><p>The objective of Project Castellana is to build the agentic system with <strong>safety engineering principles</strong> so that AI agents are reliable and deployable in high-stakes professional contexts like investment decision-making.</p><div><hr></div><h3><strong>Risk Decomposition</strong></h3><p>Project Castellana starts by identifying potential failure points:</p><ul><li><p><strong>Data inaccuracy</strong> (e.g., hallucinated market size)</p></li><li><p><strong>Non-compliant output</strong> (e.g., biased or misleading content)</p></li><li><p><strong>Oversight failures</strong> (e.g., one agent missing red flags)</p></li></ul><p>These are broken down in terms of <em>likelihood</em>, <em>severity</em>, and <em>exposure</em>, allowing the design to target the most impactful risks early.</p><div><hr></div><h3><strong>Safe Design Principles</strong></h3><p><strong>Redundancy</strong></p><p>Agent outputs support cross-verification of key findings by triggering human-in-the-loop reviews of the sources the agent used.</p><p><strong>Separation of Duties</strong></p><p>The multi-agent structure ensures no single agent performs all tasks. Each agent has a tightly scoped responsibility, which limits cascading failure risks.</p><p><strong>Principle of Least Privilege</strong></p><p>Agents only have access to the tools and data relevant to their roles. 
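</p><p>A minimal sketch of how such a constraint could be enforced (hypothetical names, not the actual Castellana code):</p><pre><code># Hypothetical sketch: a deny-by-default registry of which tools each
# agent may use, checked before any tool call is dispatched.
ALLOWED_TOOLS = {
    "strategy_advisor": [],                      # coordinates only, no direct tool access
    "competitor_analyst": ["exa_search_tool"],
}

def check_tool_access(agent_name, tool_name):
    allowed = ALLOWED_TOOLS.get(agent_name, [])
    if tool_name not in allowed:
        raise PermissionError(f"{agent_name} may not use {tool_name}")
    return True

check_tool_access("competitor_analyst", "exa_search_tool")  # allowed</code></pre><p>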
For instance, the Strategic Advisor cannot directly query Exa&#8212;it relies on outputs from specialized agents.</p><p><strong>Fail-Safes</strong> (In Progress)</p><p>Future iterations may include uncertainty estimates that flag outputs for human review if the confidence falls below a defined threshold.</p><p><strong>Transparency</strong></p><p>Outputs include tool provenance (e.g., &#8220;Market Size sourced from X, calculated via Y&#8221;), and internal reasoning steps can be logged and reviewed. This improves human interpretability.</p><p><strong>Defense in Depth</strong></p><p>The system is being designed to include multiple validation layers before an output is accepted into a memo&#8212;agent-level verification, tool-level checks, and optional human review.</p><div><hr></div><h3><strong>Systemic Safety and Accident Models</strong></h3><p>Rather than focusing solely on the reliability of individual components&#8212;such as the Get Competitors Agent&#8212;Project Castellana is being developed with <strong>systemic risk</strong> in mind: the kinds of failures that emerge not from a single malfunction, but from the interactions and dependencies between agents, tools, and user feedback loops.</p><p>This mirrors safety models used in high-stakes domains like aviation, where accidents typically arise from a chain of events rather than one isolated breakdown. In complex systems, failures rarely occur in isolation; they are often the result of cascading errors, misaligned assumptions, or silent coordination breakdowns.</p><p>Castellana applies principles from systems engineering and accident modeling to proactively manage these risks, ensuring the entire agentic workflow behaves robustly and predictably&#8212;even under pressure.</p><p>Here's how:</p><p><em><strong>1. Agent-to-Agent Communication Monitoring</strong></em></p><p>Each agent in Castellana operates with a well-defined role, but their outputs are often inputs for others. 
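</p><p>One way to sketch such a handoff is an envelope that carries metadata alongside the findings (hypothetical field names):</p><pre><code>import time

# Hypothetical handoff envelope: agents pass findings together with
# metadata so the downstream agent can judge how much to trust them.
def make_handoff(agent_name, findings, source_quality, uncertainty):
    return {
        "agent": agent_name,
        "findings": findings,
        "source_quality": source_quality,  # e.g. "verified" or "unverified"
        "uncertainty": uncertainty,        # 0.0 (confident) to 1.0 (a guess)
        "timestamp": time.time(),
    }

handoff = make_handoff("competitor_analyst", ["Startup A", "Startup B"],
                       source_quality="verified", uncertainty=0.2)</code></pre><p>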
For example, the Get Competitors Agent provides findings to the Strategic Advisor, who integrates them into the memo. Systemic risks arise if:</p><ul><li><p>The Get Competitors Agent misinterprets the prompt and outputs incomplete data.</p></li><li><p>The Strategic Advisor assumes the data is comprehensive and doesn't seek corroboration.</p></li></ul><p>To counteract this, Castellana introduces explicit handoff protocols, where agents pass metadata along with their output (e.g., source quality, timestamp, uncertainty), giving downstream agents richer context to assess validity.</p><p><em><strong>2. Tool-Agent Interaction Governance</strong></em></p><p>Agents rely on external tools&#8212;like Exa for search or a CAGR calculator&#8212;for critical data. Systemic risk surfaces when tools fail silently, return outdated data, or are misused. For example:</p><ul><li><p>If Exa delivers results from 2020 without date metadata, an agent might incorrectly interpret them as current.</p></li><li><p>A parsing error in the Market Size Tool could propagate false estimates across the memo.</p></li></ul><p>Castellana addresses this by:</p><ul><li><p>Adding tool wrappers that enforce input/output validation and context tagging.</p></li><li><p>Logging all tool interactions so anomalies can be traced post-hoc.</p></li></ul><div><hr></div><h3><strong>Tail Events and Black Swans</strong></h3><p>Even if 99% of memos are accurate, the 1% that are confidently wrong pose significant reputational or financial risk. 
Black swan scenarios could include:</p><ul><li><p>A flawed valuation that makes it into a partner meeting</p></li><li><p>A hallucinated startup cited as a key competitor</p></li><li><p>An inappropriate thesis generated from faulty data</p></li></ul><p>By embracing the <strong>precautionary principle</strong> and horizon scanning (e.g., agents flagging &#8220;unknown unknowns&#8221; or anomalous outputs), Castellana aims to mitigate such risks even if they can&#8217;t be predicted.</p><div><hr></div><h3><strong>Implementation Gaps and Next Steps</strong></h3><p>While the <strong>structure and intent</strong> of Project Castellana align strongly with safety engineering principles, not all principles are fully implemented yet. For instance:</p><ul><li><p><strong>Fail-safe mechanisms and confidence thresholds</strong> are being explored.</p></li><li><p><strong>Redundancy and defense in depth</strong> are currently manual but will be automated.</p></li><li><p><strong>Comprehensive logging and explainability</strong> will require further development.</p></li></ul><div><hr></div><h3><strong>Application of Single Agent Safety</strong></h3><p>Beyond classical safety engineering, the sources describe AI-specific safety concerns such as monitoring, robustness, alignment, and systemic safety. 
Here&#8217;s how these apply to <em>Project Castellana</em>:</p><h4><strong>Monitoring</strong></h4><p>Monitoring involves identifying hazards, reducing exposure, understanding internal representations, detecting anomalies, and increasing transparency.</p><ul><li><p><em>Project Castellana</em> already emphasizes <strong>transparency</strong> as a safety feature, with outputs indicating the <strong>tool provenance</strong> (e.g., &#8220;Market Size sourced from X&#8221;) to improve human interpretability and accountability.</p></li></ul><p>To support monitoring and observability, <em>Project Castellana</em> uses<a href="https://portkey.ai/"> Portkey.ai</a>, a platform for managing and monitoring LLM-based agents in production. Portkey provides telemetry, error tracking, and prompt/response inspection capabilities that align with the <strong>monitoring</strong> and <strong>systemic safety</strong> goals described above. This operational layer helps bridge theory (AI safety principles) and practice (safe deployment of Castellana agents)</p><pre><code>try:
   from portkey_ai import createHeaders, PORTKEY_GATEWAY_URL
   PORTKEY_AVAILABLE = True
except ImportError:
   PORTKEY_AVAILABLE = False
   print("Portkey not available, falling back to direct OpenAI usage")


def get_portkey_llm(trace_id=None, span_id=None, agent_name=None):
   if PORTKEY_AVAILABLE:
       headers = createHeaders(
           provider="openai",
           api_key=os.getenv("PORTKEY_API_KEY"),
           trace_id=trace_id,
       )
       if span_id:
           headers['x-portkey-span-id'] = span_id
       if agent_name:
           headers['x-portkey-span-name'] = f'Agent: {agent_name}'


       return ChatOpenAI(
           model="gpt-4o",
           base_url=PORTKEY_GATEWAY_URL,
           default_headers=headers,
            api_key=os.getenv("OPENAI_API_KEY")
       )
   else:
       # Fallback to direct OpenAI usage
       return ChatOpenAI(
           model="gpt-4",
           api_key=os.getenv("OPENAI_API_KEY")
       )</code></pre><p>Future enhancements could include:</p><ul><li><p>Developing <strong>benchmarks and evaluations</strong> to assess the accuracy and quality of investment memo outputs.</p></li><li><p>Implementing <strong>anomaly detection</strong> to flag unexpected or potentially hazardous agent behavior.</p></li><li><p>Exploring <strong>mechanistic interpretability</strong> to better understand agents&#8217; decision processes, though this remains a challenging area.</p></li></ul><h4><strong>Robustness</strong></h4><p>Robustness addresses vulnerabilities in AI systems, including resistance to adversarial examples and Trojans.</p><ul><li><p><em>Project Castellana</em> acknowledges key risks like <strong>data inaccuracies</strong> and <strong>non-compliant outputs</strong>.</p></li><li><p>It applies <strong>redundancy</strong> (cross-verifying information across sources) and <strong>defense in depth</strong> (multiple validation layers, such as automated consistency checks and human-in-the-loop reviews), both critical in mitigating robustness failures.</p></li><li><p>Further steps could involve:</p><ul><li><p>Ensuring <strong>adversarial robustness</strong> for the models and tools used.</p></li><li><p>Auditing against <strong>Trojans</strong>, especially if open-source or externally trained models are incorporated.</p></li></ul></li></ul><h4><strong>Alignment</strong></h4><p>Alignment is about ensuring that AI agents act in line with human intent, avoiding deceptive or unintended behavior.</p><ul><li><p><em>Castellana</em> uses <strong>separation of duties</strong> and the <strong>principle of least privilege</strong> to constrain agent behavior.</p></li></ul><p>A <strong>Strategic Advisor Agent</strong> oversees outputs for quality and specificity, supporting high-level alignment with the memo-writing goal.<br></p><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://blog.aixhield.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">AI XHIELD is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[AI Xhield]]></title><description><![CDATA[AI for Security, Security for AI and AI Infrastructure insights]]></description><link>https://blog.aixhield.com/p/ai-shield</link><guid isPermaLink="false">https://blog.aixhield.com/p/ai-shield</guid><dc:creator><![CDATA[Alde]]></dc:creator><pubDate>Mon, 27 Jan 2025 16:50:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0Jj4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17422975-59cf-4034-9c92-165cf24cefdf_1488x834.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.aixhield.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.aixhield.com/subscribe?"><span>Subscribe now</span></a></p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;07b54a6d-4b4e-4319-a94d-ea495a33f4e6&quot;,&quot;duration&quot;:null}"></div><blockquote><p><strong>Hello, 
network!</strong></p><p>It&#8217;s been two years since <a href="https://www.linkedin.com/article/edit/7253384228376129536/#">OpenAI</a>'s ChatGPT was launched, and the world has embraced AI like never before.</p><p>We&#8217;ve seen tech giants investing heavily in infrastructure, particularly by purchasing NVIDIA H100s and making them available in their cloud services.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Jj4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17422975-59cf-4034-9c92-165cf24cefdf_1488x834.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Jj4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17422975-59cf-4034-9c92-165cf24cefdf_1488x834.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Jj4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17422975-59cf-4034-9c92-165cf24cefdf_1488x834.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Jj4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17422975-59cf-4034-9c92-165cf24cefdf_1488x834.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Jj4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17422975-59cf-4034-9c92-165cf24cefdf_1488x834.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Jj4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17422975-59cf-4034-9c92-165cf24cefdf_1488x834.jpeg" width="1456" 
height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17422975-59cf-4034-9c92-165cf24cefdf_1488x834.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0Jj4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17422975-59cf-4034-9c92-165cf24cefdf_1488x834.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0Jj4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17422975-59cf-4034-9c92-165cf24cefdf_1488x834.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0Jj4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17422975-59cf-4034-9c92-165cf24cefdf_1488x834.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0Jj4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17422975-59cf-4034-9c92-165cf24cefdf_1488x834.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">https://www.reddit.com/r/pcmasterrace/comments/1awtso6/nvidia_made_29b_from_gaming_last_quarter_vs_184b/</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q4Jm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bf0eba-5eff-45d3-87af-141cf65bfda9_1473x825.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q4Jm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bf0eba-5eff-45d3-87af-141cf65bfda9_1473x825.png 424w, 
https://substackcdn.com/image/fetch/$s_!q4Jm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bf0eba-5eff-45d3-87af-141cf65bfda9_1473x825.png 848w, https://substackcdn.com/image/fetch/$s_!q4Jm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bf0eba-5eff-45d3-87af-141cf65bfda9_1473x825.png 1272w, https://substackcdn.com/image/fetch/$s_!q4Jm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bf0eba-5eff-45d3-87af-141cf65bfda9_1473x825.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q4Jm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bf0eba-5eff-45d3-87af-141cf65bfda9_1473x825.png" width="1456" height="815" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23bf0eba-5eff-45d3-87af-141cf65bfda9_1473x825.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!q4Jm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bf0eba-5eff-45d3-87af-141cf65bfda9_1473x825.png 424w, 
https://substackcdn.com/image/fetch/$s_!q4Jm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bf0eba-5eff-45d3-87af-141cf65bfda9_1473x825.png 848w, https://substackcdn.com/image/fetch/$s_!q4Jm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bf0eba-5eff-45d3-87af-141cf65bfda9_1473x825.png 1272w, https://substackcdn.com/image/fetch/$s_!q4Jm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23bf0eba-5eff-45d3-87af-141cf65bfda9_1473x825.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">https://www.jika.io/post/c90a9ecc-427f-11ee-8080-80013ec0134c</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!C5eO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e61665e-eede-4e83-bddf-db35125382f0_1213x1000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C5eO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e61665e-eede-4e83-bddf-db35125382f0_1213x1000.png 424w, https://substackcdn.com/image/fetch/$s_!C5eO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e61665e-eede-4e83-bddf-db35125382f0_1213x1000.png 848w, https://substackcdn.com/image/fetch/$s_!C5eO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e61665e-eede-4e83-bddf-db35125382f0_1213x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!C5eO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e61665e-eede-4e83-bddf-db35125382f0_1213x1000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C5eO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e61665e-eede-4e83-bddf-db35125382f0_1213x1000.png" width="1213" height="1000" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e61665e-eede-4e83-bddf-db35125382f0_1213x1000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1000,&quot;width&quot;:1213,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!C5eO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e61665e-eede-4e83-bddf-db35125382f0_1213x1000.png 424w, https://substackcdn.com/image/fetch/$s_!C5eO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e61665e-eede-4e83-bddf-db35125382f0_1213x1000.png 848w, https://substackcdn.com/image/fetch/$s_!C5eO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e61665e-eede-4e83-bddf-db35125382f0_1213x1000.png 1272w, https://substackcdn.com/image/fetch/$s_!C5eO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e61665e-eede-4e83-bddf-db35125382f0_1213x1000.png 1456w" sizes="100vw"></picture></div></a><figcaption class="image-caption">https://sherwood.news/tech/meta-amazon-microsoft-massive-ai-capex-spending-quarterly-earnings/</figcaption></figure></div><blockquote><p>But what really stands out to me is Meta&#8217;s approach. They&#8217;re acquiring the infrastructure (H100s) to train models and seamlessly integrating them into consumer apps, offering these models almost for free in a semi-open-source format. This strategy is pushing the industry forward in a big way.</p><p>From an investment perspective, we&#8217;ve witnessed an explosion of startups focused on both the application layer and infrastructure, simplifying the creation of AI-native companies and products with GenAI through various models and techniques.</p><p>However, the topic I&#8217;m most interested in is <strong>security in AI systems</strong>. 
This is where we need more innovation to enable enterprises, mid-sized businesses, and SMEs to integrate AI securely.</p><p>After meeting with many companies and reading cybersecurity blogs such as <a href="https://www.linkedin.com/article/edit/7253384228376129536/#">Francis Odum</a>, <a href="https://www.linkedin.com/article/edit/7253384228376129536/#">Ross Haleliuk</a>, <a href="https://www.linkedin.com/article/edit/7253384228376129536/#">Return on Security</a>, <a href="https://www.linkedin.com/article/edit/7253384228376129536/#">Strategy of Security</a>, and <a href="https://www.linkedin.com/article/edit/7253384228376129536/#">Altitude Cyber</a>, I believe there is a growing opportunity in <strong>AI Security</strong> and <strong>Security for AI</strong>. As AI becomes the next compute platform, this focus will be critical. The resources mentioned already provide great coverage of the current cybersecurity ecosystem, but more attention is needed in this specific area.</p></blockquote><div><hr></div><blockquote><p>Please let me know in the comments which topics you want us to discuss:</p></blockquote><ul><li><p>Generative AI Regulation and Compliance</p></li><li><p>State of the Art on AI Explainability</p></li><li><p>Data Security for Generative AI</p></li><li><p>Other</p></li></ul><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.aixhield.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AI XHIELD! 
Subscribe for free to receive new posts and support my work.</p></div></div></div>]]></content:encoded></item></channel></rss>