my proposed techniques for safe AI
Custom reconnaissance techniques to enhance AI model safety and reliability
in progress
I am building models whose sole purpose is to draw circuit heat maps and flag issues in the model under monitoring
in progress
I am building an array of visualization tools for analyzing activation paths. Reduced data extraction is necessary for this, so I am working on a tool to do it. => https://github.com/modelrecon/mr-recon-tracer
This tool will extract data in my proposed Activity Cube format
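A rough sketch of the kind of extraction I have in mind. The toy "network" (a stack of linear layers), the layer sizes, and the cube axes (layer x token x feature) are all illustrative assumptions, not the actual tracer or the final Activity Cube spec:

```python
import numpy as np

# Toy stand-in for a real network: a stack of linear+ReLU layers.
# The real tracer would hook into actual model layers; everything here
# (layer count, sizes, the cube axes) is an illustrative assumption.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) for _ in range(4)]

def trace_activations(x):
    """Run a forward pass, recording every layer's activations."""
    recorded = []
    for w in layers:
        x = np.maximum(x @ w, 0.0)  # ReLU
        recorded.append(x)
    # Stack the recordings into a cube: (layer, token, feature)
    return np.stack(recorded)

tokens = rng.standard_normal((5, 8))   # 5 tokens, 8 features each
cube = trace_activations(tokens)
print(cube.shape)  # (4, 5, 8): layers x tokens x features
```

Reducing along any one axis of such a cube (mean over tokens, max over layers, etc.) is what I mean by "reduced data extraction" for the visualizations.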
some existing techniques
Shapley Values
It is weird - this technique dates back to 1953, when Lloyd Shapley introduced it - so old! It is a game theory concept applied to AI. I don't understand game theory very well, but what I know is that it takes a feature and finds out how much it contributes to the output by assigning it different values.
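The idea of "assigning different values" can be made concrete with an exact computation on a tiny model: each feature's Shapley value is its marginal contribution averaged over all orderings in which features are revealed. The model f and the all-zero baseline for "absent" features are toy assumptions:

```python
from itertools import permutations
from math import factorial

# Exact Shapley values for a toy 3-feature model: average each feature's
# marginal contribution over all n! orderings. "Absent" features are set
# to a baseline of 0 (an illustrative choice).
def f(x):
    return 2 * x[0] + x[1] * x[2]

def shapley(f, x, baseline=(0, 0, 0)):
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]       # reveal feature i
            new = f(current)
            phi[i] += new - prev    # marginal contribution in this order
            prev = new
    return [p / factorial(n) for p in phi]

print(shapley(f, (1, 2, 3)))  # [2.0, 3.0, 3.0]
```

Note the attributions sum to f(x) - f(baseline) = 8, which is the "efficiency" property that makes Shapley values attractive; the catch is the n! cost, which is why practical tools approximate them.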
ViTmiX
This one is just for image models. It is a mix and match of various visualization techniques. That is good enough for now.
XAI‑Guided Context‑Aware Data Augmentation
It is just a performance-improvement technique for data - it may not be appropriate to count it as an interpretability technique. What it does is iteratively make the least important features more important by changing tokens.
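A minimal sketch of the idea as I understand it: score tokens by importance, then perturb the least important ones to create augmented samples. The importance function here is a dummy proxy (in practice the score would come from an XAI method like LIME or SHAP), and all names are illustrative:

```python
import random

random.seed(0)

def importance(token):
    # Dummy proxy for an XAI-derived importance score.
    return len(token)

def augment(tokens, n_swap=2, vocab=("the", "a", "an", "this")):
    """Replace the n_swap least-important tokens with random vocab words."""
    scored = sorted(range(len(tokens)), key=lambda i: importance(tokens[i]))
    out = list(tokens)
    for i in scored[:n_swap]:   # only the least-important positions change
        out[i] = random.choice(vocab)
    return out

sample = ["interpretability", "is", "a", "fascinating", "field"]
print(augment(sample))
```

The important tokens survive every augmented copy, so the model is nudged to rely less on whichever low-importance tokens it was overfitting to.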
Mechanistic interpretability / Circuit tracing & sparse decomposition
This is my favourite one. It is a set of techniques that attempt to peer inside deep networks (especially large language models or transformers) and decompose them into simpler, interpretable sub-components (e.g. “circuits,” “features,” “concepts”) rather than treating them as opaque black boxes.
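To make "sparse decomposition" concrete: express an activation vector as a sparse combination of feature directions from a dictionary, here via greedy matching pursuit. The dictionary is a toy assumption (the identity), whereas real work learns it, e.g. with sparse autoencoders:

```python
import numpy as np

# Toy dictionary of 6 orthonormal "feature" directions; a real one would
# be learned from activations and overcomplete.
dictionary = np.eye(6)
activation = 3.0 * dictionary[1] + 1.5 * dictionary[4]

def matching_pursuit(x, D, n_features=2):
    """Greedily pick the dictionary rows that best explain x."""
    residual = x.astype(float).copy()
    coeffs = {}
    for _ in range(n_features):
        scores = D @ residual                 # correlation with each feature
        best = int(np.argmax(np.abs(scores)))
        coeffs[best] = coeffs.get(best, 0.0) + float(scores[best])
        residual -= scores[best] * D[best]    # remove what was explained
    return coeffs

print(matching_pursuit(activation, dictionary))  # {1: 3.0, 4: 1.5}
```

The output names which features fired and how strongly - exactly the kind of "circuit heat map" input I want for the monitoring models above.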
LIME (Local Interpretable Model-agnostic Explanations)
This is my second favourite. It builds a surrogate model trained on the inputs and outputs of the core model:
Takes one input instance
Creates many small changes around it
Gets predictions from the black-box model (the core model)
Fits a simple model on those perturbations - call this the surrogate model
Uses the surrogate for the explanation - pretty simple
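The steps above can be sketched end-to-end like this. The black-box model and the perturbation scheme are toy assumptions, and real LIME additionally weights each perturbation by its proximity to the instance:

```python
import numpy as np

rng = np.random.default_rng(42)

def black_box(X):
    # Pretend this is the core model we cannot inspect.
    return 3.0 * X[:, 0] - 2.0 * X[:, 1] + np.sin(X[:, 2])

instance = np.array([1.0, 2.0, 0.5])

# Steps 1-2: create many small changes around the instance
perturbations = instance + rng.normal(scale=0.1, size=(500, 3))
# Step 3: get predictions from the black-box model
preds = black_box(perturbations)
# Step 4: fit a simple linear surrogate on those perturbations
A = np.column_stack([perturbations, np.ones(len(perturbations))])
coefs, *_ = np.linalg.lstsq(A, preds, rcond=None)
# Step 5: the surrogate's weights are the local explanation
print(np.round(coefs[:3], 1))  # close to [3.0, -2.0, cos(0.5)]
```

The surrogate recovers the local slopes of the black box around the instance, including ~cos(0.5) for the nonlinear sine term - which is the whole point: the explanation is only valid locally.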
