My research direction

The research I do is basically thinking about the papers I read; I do not have the math background yet to create my own research work. So do not consider this the work of an AI researcher :) It is just an informal log of what I have been thinking, reading, and testing.

I think (and have read) that XAI should answer these questions:

  • “Why did the AI output this result for a given input?” - this is what users and testers should ask.

  • “Which input features or factors contributed the most to this decision?” - what mattered most to the model.

  • “Under what conditions is the AI reliable (or unreliable)?” - this is a big one, as it will help us detect misalignment.

  • “What are the limitations, biases, or risks of using this model in production?” - this is more for others

  • “How can we debug, audit, or improve the model behavior (especially for fairness / safety)?”
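
The second question, about which inputs contributed most, can be made concrete with a small experiment. Below is a minimal sketch of occlusion (leave-one-out) attribution, which is one common way to approach it and not necessarily what any particular paper uses. `toy_model`, its weights, and the zero baseline are all made up for illustration; a real trained model would take their place.

```python
# Toy occlusion (leave-one-out) attribution.
# Everything here is hypothetical: `toy_model` stands in for a real model.

def toy_model(x):
    """Made-up scoring function: a weighted sum plus one interaction term."""
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[2] + 1.5 * x[0] * x[2]


def occlusion_attribution(model, x, baseline=0.0):
    """Score each feature by how much the output changes when that
    feature alone is replaced by `baseline`."""
    full = model(x)
    scores = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline
        scores.append(full - model(occluded))
    return scores


x = [1.0, 2.0, 3.0]
scores = occlusion_attribution(toy_model, x)

# Rank feature indices from most to least influential (by magnitude).
ranked = sorted(range(len(x)), key=lambda i: abs(scores[i]), reverse=True)

print("output:", toy_model(x))
print("attributions:", scores)
print("features ranked by importance:", ranked)
```

In this toy case features 0 and 2 dominate, partly because of the interaction term between them. The scores also do not add up to the output, which is a known limitation of leave-one-out: it measures each feature in isolation and double-counts interactions.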

We can trace the path of every neuron, as I learnt - this is why we need that. I have tried to divide the study into 4 parts - I don't know if that is good, but it works for me. For now I am only looking into interpreting an already-trained model as-is. These are the broad ways:

Close-up of a researcher analyzing AI model data on multiple screens.
Understand the dependencies

Study conditional activation and attention dependency: what varies with what?
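
A rough sketch of the "what varies with what" idea for a single neuron, assuming a made-up unit and a small grid of inputs. It only covers the conditional-activation half (attention dependency would need a real attention layer). `hidden_unit` and its weights are hypothetical; the point is just to compare how strongly the activation moves with each input, and how it behaves under a condition.

```python
import math

# Hypothetical neuron: ReLU of a weighted sum that happens to ignore x1.
def hidden_unit(x0, x1):
    return max(0.0, 1.0 * x0 + 0.0 * x1)


def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# A small grid of inputs: every combination of values from -2.0 to 2.0.
grid = [i * 0.5 for i in range(-4, 5)]
inputs = [(x0, x1) for x0 in grid for x1 in grid]
acts = [hidden_unit(x0, x1) for x0, x1 in inputs]

# How does the activation vary with each input?
corr_x0 = pearson([x0 for x0, _ in inputs], acts)
corr_x1 = pearson([x1 for _, x1 in inputs], acts)

# Conditional view: average activation when x0 > 0 versus when it is not.
on = [a for (x0, _), a in zip(inputs, acts) if x0 > 0]
off = [a for (x0, _), a in zip(inputs, acts) if x0 <= 0]
mean_on = sum(on) / len(on)
mean_off = sum(off) / len(off)

print("corr(activation, x0):", round(corr_x0, 3))
print("corr(activation, x1):", round(corr_x1, 3))
print("mean activation | x0 > 0 :", mean_on)
print("mean activation | x0 <= 0:", mean_off)
```

Correlation only captures linear co-variation, so the conditional averages are often the more telling view for units with a ReLU-style on/off behavior.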

Visualization of AI decision pathways highlighting safety checkpoints.
Understand the path in layers

Tracing and connecting the layers: basically, understanding what sits between the layers and influences the path of activations.
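
One way to make "what influences the path" testable is activation patching: run the model on two inputs, copy one intermediate value from the first run into the second, and see how much of the output difference comes back. The sketch below does this on a tiny hand-written 2 -> 3 -> 1 network. The weights, the inputs, and the `forward` helper with its `patch` argument are all assumptions for illustration, not any library's API.

```python
def relu(v):
    return [max(0.0, t) for t in v]


def matvec(W, v):
    return [sum(w * t for w, t in zip(row, v)) for row in W]


# Hypothetical weights for a tiny 2 -> 3 -> 1 network.
W1 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, -1.0]]
W2 = [[2.0, 0.5, -1.0]]


def forward(x, patch=None):
    """Run the network and return (output, trace of every layer).

    `patch` is an optional (unit_index, value) pair that overwrites one
    hidden unit before the second layer reads it.
    """
    hidden = relu(matvec(W1, x))
    if patch is not None:
        idx, value = patch
        hidden = list(hidden)
        hidden[idx] = value
    out = matvec(W2, hidden)[0]
    return out, {"input": list(x), "hidden": hidden, "output": out}


clean_input = [1.0, 2.0]
corrupt_input = [0.0, 2.0]

clean_out, clean_trace = forward(clean_input)
corrupt_out, _ = forward(corrupt_input)

# Patch each hidden unit in turn with its value from the clean run and
# record how much of the clean behavior is restored.
restored = []
for j, clean_value in enumerate(clean_trace["hidden"]):
    patched_out, _ = forward(corrupt_input, patch=(j, clean_value))
    restored.append(patched_out - corrupt_out)

print("clean output:  ", clean_out)
print("corrupt output:", corrupt_out)
print("output restored by patching each hidden unit:", restored)
```

Here only hidden unit 0 restores the output, so for this pair of inputs the whole difference travels through that one unit. In a real model the same loop would run over layers, heads, or positions instead of three hand-written units.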

Team meeting discussing reconnaissance techniques around a table.
Graphs and charts showing AI model vulnerability assessments.
Understand the Structure

What computation is happening? Look at the residual stream and break it down.
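
The residual stream is additive: each block reads the current stream and adds its update back in. That is what makes it possible to break the final result into per-component pieces. A minimal sketch, assuming two made-up blocks (`block_a`, `block_b`), a made-up embedding, and a linear readout vector `unembed`; none of these come from a real model:

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def add(a, b):
    return [p + q for p, q in zip(a, b)]


# Hypothetical blocks: each reads the stream and returns an update to add.
def block_a(stream):
    return [0.5 * stream[1], 0.0, 0.0]


def block_b(stream):
    return [0.0, max(0.0, stream[0]), -stream[2]]


embedding = [1.0, 2.0, 3.0]   # made-up token embedding
unembed = [1.0, -1.0, 2.0]    # made-up linear readout for one output logit

# Forward pass, remembering what each component wrote into the stream.
stream = list(embedding)
contributions = {"embedding": list(embedding)}
for name, block in [("block_a", block_a), ("block_b", block_b)]:
    update = block(stream)
    contributions[name] = update
    stream = add(stream, update)

logit = dot(unembed, stream)

# Because the readout is linear, the logit splits exactly into one term
# per component that wrote to the stream.
per_component = {name: dot(unembed, vec) for name, vec in contributions.items()}

# "Breaking" the stream: drop block_b's write and read out again.
stream_without_b = add(embedding, contributions["block_a"])
logit_without_b = dot(unembed, stream_without_b)

print("logit:", logit)
print("per-component contributions:", per_component)
print("logit with block_b removed:", logit_without_b)
```

The per-component terms sum to the logit exactly in this toy. Many real transformers apply a normalization step before the readout, so there the split is only approximate, and removing an early block would also change what later blocks compute.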

Unchanged stuff

What stays unchanged between prompts?
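
A simple way to look for what stays unchanged is to collect the same activations over several prompts and check which units barely move. The sketch below assumes a stand-in `toy_activations` function (hypothetical, it just computes three hand-made features from the text) in place of a real model's hidden state, and made-up prompts.

```python
# Hypothetical stand-in for "activations of one layer for this prompt".
def toy_activations(prompt):
    words = prompt.split()
    return [
        1.0,                                    # constant, bias-like unit
        float(len(words)),                      # varies with prompt length
        1.0 if prompt.endswith("?") else 0.0,   # fires on questions
    ]


prompts = [
    "Why is the sky blue?",
    "Why is grass green?",
    "Why do cats purr?",
]

rows = [toy_activations(p) for p in prompts]


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Variance of each unit across prompts (column-wise).
variances = [variance(list(column)) for column in zip(*rows)]

# Units whose activation is effectively constant over this prompt set.
invariant = [i for i, v in enumerate(variances) if v < 1e-12]

print("per-unit variance:", variances)
print("units unchanged across prompts:", invariant)
```

Units 0 and 2 come out as unchanged, but for different reasons: unit 0 is constant by construction, while unit 2 only looks constant because every prompt here is a question. So "unchanged" always means unchanged over the prompts that were tried, and a wider prompt set is needed before calling something a true invariant.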