my proposed techniques for safe AI

Custom reconnaissance techniques to enhance AI model safety and reliability

in progress

I am building models whose sole purpose is to draw circuit heat maps and flag issues in the model under monitoring.

in progress

I am building an array of visualization tools for analyzing activation paths. That requires extracting a reduced set of data from the model, so I am working on a tool for it => https://github.com/modelrecon/mr-recon-tracer
This tool will extract data in my proposed Activity Cube format.

some existing techniques

Shapley Values

It is weird - this technique dates back to 1953 - so old! It is a concept from cooperative game theory that has been applied to AI. I don't understand game theory very well, but what I know is that here it takes a feature and finds out how much it contributes to the output by averaging the change it makes across different combinations of the other features.
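
To make this concrete, here is a minimal sketch in Python. It assumes a toy three-feature model and an all-zeros baseline, both made up for illustration, as are the names `shapley_values`, `value_fn` and `model`. It computes exact Shapley values by trying every subset of the other features, which is only practical for a handful of features; real tools approximate this by sampling.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values for a value function defined on feature subsets."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight shared by every coalition S of this size: |S|! (n-|S|-1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition S
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

# Hypothetical toy model: a linear part plus one interaction term
def model(x):
    return 2.0 * x[0] + 1.0 * x[1] + x[0] * x[2]

instance = [1.0, 3.0, 2.0]   # the input being explained
baseline = [0.0, 0.0, 0.0]   # assumed "feature absent" reference values

def value_fn(subset):
    # Features in the subset take the instance value; the rest stay at the baseline
    x = [instance[j] if j in subset else baseline[j] for j in range(len(instance))]
    return model(x)

phi = shapley_values(value_fn, len(instance))
print(phi)
```

For this toy model the values come out at about 3, 3 and 1: the interaction between features 0 and 2 is split evenly between them. They add up to the gap between the model's output on the instance (7) and on the baseline (0).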

ViTmiX

This one is just for image models. It is a mix and match of various explanation techniques. That is good enough for now.

XAI‑Guided Context‑Aware Data Augmentation

It is just a performance-improvement technique that works on the DATA - it may not be right to count it as an interpretability technique. What it does is iteratively make the least important features more important by changing tokens.

Mechanistic interpretability / Circuit tracing & sparse decomposition

This is my favourite one. It is a set of techniques that attempt to peer inside deep networks (especially large language models or transformers) and decompose them into simpler, interpretable sub-components (e.g. “circuits,” “features,” “concepts”) rather than treating them as opaque black boxes.
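
One basic building block in this line of work is ablation: switch off one internal unit and see how the output moves. The sketch below is only an illustration under assumptions. The tiny network, its hand-set weights `W1` and `W2`, and the `forward` helper are all hypothetical, and real circuit tracing works on far larger models with more careful interventions.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# Hypothetical hand-set weights for a toy 3-input, 3-hidden, 1-output network
W1 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]]
W2 = [[2.0, 0.0, -1.0]]

def forward(x, ablate=None):
    hidden = relu(matvec(W1, x))
    if ablate is not None:
        hidden[ablate] = 0.0  # zero-ablation: knock out one hidden unit
    return matvec(W2, hidden)[0]

x = [1.0, 5.0, 2.0]
clean = forward(x)
# Effect of each hidden unit = clean output minus output with that unit removed
effects = [clean - forward(x, ablate=i) for i in range(3)]
print(clean, effects)
```

Here hidden unit 1 has the largest activation (5.0) but an effect of zero, because nothing downstream reads it. That is the kind of thing looking inside the network shows and a plain activation plot would hide.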

LIME (Local Interpretable Model-agnostic Explanations)

This is my second favourite. It creates a surrogate model trained on the inputs and outputs of the core model:

  • Takes one input instance

  • Creates many small changes around it

  • Gets predictions from the black-box model (the core model)

  • Fits a simple model on those perturbations - call this the surrogate model

  • Uses the surrogate for the explanation - pretty simple
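
The steps above can be sketched in a few lines of Python. This is a simplified illustration for numeric features, not the reference LIME implementation: `black_box` is a made-up stand-in for the core model, and the Gaussian perturbations, the proximity kernel and the parameter values (`scale`, `kernel_width`, `n_samples`) are assumptions chosen for the example. It uses only the standard library so it runs anywhere.

```python
import math
import random

def black_box(x):
    # Hypothetical stand-in for the core model: nonlinear in both features
    return math.sin(x[0]) + x[1] ** 2

def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def lime_explain(predict, instance, n_samples=2000, scale=0.1,
                 kernel_width=0.2, seed=0):
    rng = random.Random(seed)
    d = len(instance)
    # Accumulate the weighted normal equations (X^T W X) beta = X^T W y
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        # Create a small change around the instance
        offsets = [rng.gauss(0.0, scale) for _ in range(d)]
        sample = [xi + oi for xi, oi in zip(instance, offsets)]
        # Get the prediction from the black-box model
        y = predict(sample)
        # Weight the sample by how close it is to the instance
        dist2 = sum(o * o for o in offsets)
        w = math.exp(-dist2 / kernel_width ** 2)
        row = [1.0] + offsets  # intercept term + features centred on the instance
        for i in range(d + 1):
            xtwy[i] += w * row[i] * y
            for j in range(d + 1):
                xtwx[i][j] += w * row[i] * row[j]
    # Fit the simple (linear) surrogate model on the perturbations
    beta = solve_linear(xtwx, xtwy)
    return beta[0], beta[1:]  # intercept, per-feature local weights

intercept, weights = lime_explain(black_box, [0.0, 1.0])
print(intercept, weights)
```

The returned weights are the explanation: they say how much each feature pushes the prediction near this one instance. For the toy model they should land close to 1 and 2, the local slopes of sin(x0) and x1² at the point (0, 1). In practice the `lime` package handles categorical features, text and images as well.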