I'm an AI Engineer with 5+ years building generative AI and agentic systems for workflows where a hallucination has a dollar cost. Most of my time goes to the unglamorous parts of the LLM stack: retrieval that won't invent policy numbers, eval harnesses that gate every rollout, inference paths that turn a 2.1s p95 into 780ms without changing the model, and agents that know when to route decisions to a human.
Recent work spans a LangGraph mortgage-risk agent at Fannie Mae over a governed policy corpus, a fraud detection pipeline for QuickBooks Online at Intuit serving millions of users on SageMaker, and a multi-agent fraud investigation system with calibrated verdicts and human-in-the-loop approval gates.
I care about retrieval that abstains, evals that gate, inference that's cheap, agents that don't lose the plot at 5K items of context, and high-recall systems that respect the analyst's time.
Each of these solves a problem that doesn't show up in vendor demos.
Autonomous cloud engineer: A LangGraph agent that scans an AWS account, reasons about misconfigurations across 14 resource types, and proposes least-privilege fixes. The interesting problems were working memory at scale and cost of reasoning.
Personal AI research analyst: Retrieval over 340 ML papers that knows when not to answer. Built after getting frustrated with RAG demos that confidently cite nonexistent sections.
LLM-powered payment fraud investigation: A multi-agent fraud investigation assistant that replaces single-shot classification with a reasoning pipeline an analyst can audit. Three specialist agents, hybrid retrieval across three stores, and a human-in-the-loop approval gate for any freeze or escalate action.
Architected a LangGraph agent for mortgage risk analysis that chains document retrieval, policy lookup, and decision tools in a single reasoning loop, processing thousands of documents daily and cutting manual analyst review by 30%. Designed a hybrid BM25 + dense retrieval pipeline in OpenSearch that fixed dense-only hallucinations on policy numbers and lifted factuality by 52% on a 500-question compliance eval built with underwriters. Fine-tuned a domain-adapted open-weights LLM with QLoRA on SageMaker under strict data governance, cutting GPU hours by 60% vs full fine-tuning. Deployed vLLM on EKS with request batching and KV caching, dropping p95 latency from 2.1s to 780ms across 10M+ monthly document records. Built an LLM-as-judge eval harness with a 200-prompt regression suite that gated every rollout, catching 4 silent regressions across 2 release cycles.
Owned the end-to-end fraud detection pipeline for QuickBooks Online on SageMaker, training XGBoost on 400+ behavioral features and improving fraud precision by 18% at fixed recall. Re-architected feature computation with distributed PySpark on EMR, reducing runtime from 6h to 90min and enabling daily refresh for churn and LTV models. Built causal inference and uplift modeling pipelines with DoWhy/EconML, improving campaign targeting by 15% over traditional A/B testing. Designed a staged release framework with shadow testing, canary deployments, and automated rollback on drift metrics, reducing production incidents by 25%.
Built visual search and recommendation pipelines using PyTorch ResNet and OpenCV, improving recommendation CTR by 15%. Deployed real-time image classification and visual similarity models behind FastAPI. Engineered PySpark pipelines on EMR, reducing feature computation time by 25%.
Problems I find genuinely interesting right now. If you're working on any of these, reach out.
Abstention as a first-class capability. Most RAG systems optimize answer rate. The interesting metric is calibrated abstention — knowing when retrieved context isn't sufficient and saying so. PaperMind's confidence scorer is my current cut at this. I don't think it's solved.
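To make the abstention idea concrete, here is a minimal sketch of a retrieval gate that refuses to answer when the evidence looks thin. It is a plain threshold check, not a calibrated scorer and not PaperMind's actual implementation; the `Passage` type, the 0-to-1 score scale, and every threshold value are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Passage:
    text: str
    score: float  # retriever similarity, assumed to be normalized to [0, 1]


def answer_or_abstain(
    passages: List[Passage],
    min_top_score: float = 0.6,   # hypothetical threshold
    support_score: float = 0.5,   # hypothetical threshold
    min_support: int = 2,         # hypothetical threshold
) -> dict:
    """Return the context to answer from, or an abstention with a reason."""
    if not passages:
        return {"abstain": True, "reason": "no passages retrieved"}

    top = max(p.score for p in passages)
    if top < min_top_score:
        return {"abstain": True, "reason": "best passage is too weak"}

    # Require more than one passage above the support bar, so a single
    # lucky match is not enough to answer.
    supporting = [p for p in passages if p.score >= support_score]
    if len(supporting) < min_support:
        return {"abstain": True, "reason": "too few supporting passages"}

    return {"abstain": False, "context": supporting}
```

In practice the thresholds would be tuned on a labeled set of answerable and unanswerable questions, which is where the calibration part of the problem actually lives.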
Working memory for long-horizon agents. Context windows keep growing, but throwing everything at the model is the wrong move. I want agents that maintain an external working memory with principled eviction. CloudPilot's Qdrant-backed streaming memory is one approach; there are others worth trying.
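As an illustration of what "principled eviction" could mean, the sketch below keeps a bounded in-process memory and evicts the item with the lowest retention score, defined here as importance minus an age penalty. This is an assumption-laden toy, not CloudPilot's Qdrant-backed design; the class name, the scoring rule, and the `age_penalty` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemoryItem:
    key: str
    content: str
    importance: float  # assumed to be supplied by the agent, higher = keep longer
    last_used: int     # logical clock value at last write or read


class WorkingMemory:
    """Bounded memory that evicts the item with the lowest retention score."""

    def __init__(self, capacity: int, age_penalty: float = 0.1):
        self.capacity = capacity
        self.age_penalty = age_penalty  # hypothetical decay per clock tick
        self.clock = 0
        self.items: Dict[str, MemoryItem] = {}

    def _retention(self, item: MemoryItem) -> float:
        # Older, less important items score lower and are evicted first.
        return item.importance - self.age_penalty * (self.clock - item.last_used)

    def add(self, key: str, content: str, importance: float) -> List[str]:
        """Insert an item and return the keys evicted to stay within capacity."""
        self.clock += 1
        self.items[key] = MemoryItem(key, content, importance, self.clock)
        evicted = []
        while len(self.items) > self.capacity:
            victim = min(self.items.values(), key=self._retention)
            del self.items[victim.key]
            evicted.append(victim.key)
        return evicted

    def touch(self, key: str) -> Optional[MemoryItem]:
        """Read an item and refresh its recency."""
        self.clock += 1
        item = self.items.get(key)
        if item is not None:
            item.last_used = self.clock
        return item
```

The point of an explicit retention score is that eviction becomes something you can test and tune, rather than a side effect of whatever happens to fall off the end of the context window.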
Human-in-the-loop as infrastructure. In high-cost-asymmetry domains (fraud, healthcare, finance), the right agent architecture is not one that decides confidently — it's one that knows when to route a decision to a human with the right evidence pre-assembled. PayGuard's approval gate and audit log are my current cut at this.
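A routing rule of this kind can be sketched in a few lines. The example below assumes a hypothetical `Verdict` record and sends every freeze or escalate action to a human, along with any low-confidence verdict, carrying the evidence with it. It illustrates the pattern only and is not PayGuard's code; the action names other than freeze and escalate and the confidence threshold are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

# Actions that always need human approval, regardless of model confidence.
HIGH_IMPACT_ACTIONS = {"freeze", "escalate"}


@dataclass
class Verdict:
    case_id: str
    action: str        # e.g. "allow", "freeze", "escalate" (names are illustrative)
    confidence: float  # assumed to be a calibrated probability in [0, 1]
    evidence: List[str] = field(default_factory=list)


def route(verdict: Verdict, min_auto_confidence: float = 0.9) -> dict:
    """Decide whether a verdict can be applied automatically or needs review."""
    if verdict.action in HIGH_IMPACT_ACTIONS:
        reason = "high-impact action requires approval"
    elif verdict.confidence < min_auto_confidence:
        reason = "confidence below auto-approve threshold"
    else:
        return {"route": "auto", "case_id": verdict.case_id}

    # Hand the reviewer the evidence up front so they do not have to rebuild it.
    return {
        "route": "human",
        "case_id": verdict.case_id,
        "reason": reason,
        "evidence": list(verdict.evidence),
    }
```

Checking the action before the confidence is deliberate: a very confident freeze is still a freeze, and the cost asymmetry is what justifies the gate.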
Evals as deployment gates, not dashboards. An eval suite that runs offline and produces a slide deck is theater. An eval suite that blocks a rollout is infrastructure. The latter is harder and more important.
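The difference between a dashboard and a gate comes down to whether the eval result can stop a release. A minimal sketch, assuming per-prompt pass/fail results and a stored baseline pass rate (the function names and the `max_drop` tolerance are hypothetical):

```python
from typing import List, Tuple


def pass_rate(results: List[bool]) -> float:
    """Fraction of eval prompts that passed."""
    if not results:
        raise ValueError("eval suite produced no results")
    return sum(results) / len(results)


def rollout_allowed(
    candidate_results: List[bool],
    baseline_rate: float,
    max_drop: float = 0.02,  # hypothetical tolerance for judge noise
) -> Tuple[bool, float]:
    """Allow the rollout only if the candidate stays within max_drop of baseline."""
    rate = pass_rate(candidate_results)
    return rate >= baseline_rate - max_drop, rate
```

In a CI job, the caller would exit with a non-zero status when `rollout_allowed` returns `False`, so the pipeline stops instead of producing a report that someone may or may not read.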
Inference cost as a modeling problem. The gap between "works on one request" and "works at $0.09 across 5K items" is where most of the real engineering lives.
M.S. Data Science · University of Central Oklahoma · 2023 – 2025
Oracle Cloud Infrastructure Data Science Professional · 2025
AWS Machine Learning Specialty
Best ways to reach me: LinkedIn DM or email. Reply time is usually within 24 hours. I'm open to senior AI engineer roles in the Bay Area or remote.
"Production LLM systems are 10% prompting and 90% retrieval, evaluation, and knowing when to abstain."