excel.aws.monce.ai / paper
A SAT-based predictive layer for the spreadsheet — seven sections, measured.
Microsoft Excel ships three regression families: =TREND, =FORECAST.LINEAR, and
=FORECAST.ETS, the oldest dating back to the early 1990s. None of them accepts text features,
none classifies, and none exposes per-row reasoning. We propose adding a SAT-based predictive layer to the
spreadsheet: five custom functions (=PREDICT, =PREDICT.CANDLE, =FILL,
=AUDIT, =POTENTIAL) backed by the SnakeBatch distributed classifier. We argue
that the right mental model is not "ML in Excel" but "Excel formulas, more of them" — the
function returns a value, spills cleanly, and re-evaluates with the same recompute semantics as
=FILTER or =XLOOKUP.
Excel's native predictors share the same limitations:
| Function | Accepts text? | Classifies? | Per-row audit? | Held-out validation? |
|---|---|---|---|---|
| =LINEST | No | No | No | No (in-sample R²) |
| =TREND | No | No | No | No |
| =FORECAST.ETS | No | No | Confidence band only | Implicit |
| Trendlines | No | No | Equation only | No |
| Solver | No | Indirectly | Manual | No |
| Data Analysis ToolPak Regression | No | No | Residuals dump | No |
| =PREDICT (proposed) | Yes | Yes | =AUDIT sibling | =POTENTIAL sibling |
Real-world workbooks are mixed-type. A glass quote sheet has "44.2 Silence" next to
180 next to "LGB Menuiserie SAS". =LINEST chokes on columns 1 and 3.
Snake doesn't.
```
=PREDICT(train, target, test)         # SAT classification, spills predictions
=PREDICT.CANDLE(train, target, test)  # OHLC over lookalike values for Stock chart
=FILL(table)                          # one Snake per blank column
=AUDIT(train, target, test_row)       # top-5 lookalike rows + reasoning
=POTENTIAL(table, target)             # 80/20 AUROC + optimal accuracy
```
Each formula is a thin Office.js custom function. Each posts same-origin to
excel.aws.monce.ai/v6/*, where a small dispatcher decides on the spot: under 500 rows the
job runs in-process on the add-in box (no Lambda fee, no cold start); at or above 500 rows it fans out
to snakebatch.aws.monce.ai via the monceai SDK. The user types one formula. The system
picks the cheapest path that answers in seconds.
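The routing rule above can be sketched in a few lines. This is an illustration of the decision, not the production dispatcher; the names `route_job` and `LOCAL_ROW_LIMIT` are hypothetical, and only the 500-row threshold comes from the text:

```python
# Hypothetical sketch of the dispatcher's routing rule: run small jobs
# in-process on the add-in box, fan larger ones out to SnakeBatch.
from dataclasses import dataclass

LOCAL_ROW_LIMIT = 500  # at or above this, fan out to snakebatch.aws.monce.ai

@dataclass
class Route:
    where: str   # "local" or "cloud"
    reason: str

def route_job(n_rows: int) -> Route:
    """Pick the cheapest backend that answers in seconds."""
    if n_rows < LOCAL_ROW_LIMIT:
        return Route("local", f"{n_rows} rows: in-process, no Lambda fee, no cold start")
    return Route("cloud", f"{n_rows} rows: Lambda fan-out via the monceai SDK")
```

The user never sees this choice; both paths return the same spill range.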
The dispatcher is not theory: the table below shows wall-clock timings from today's add-in, hitting the production endpoint over HTTPS:
| Rows in =POTENTIAL | Where | Wall-clock | AUROC | Optimal accuracy |
|---|---|---|---|---|
| 100 | local | 200 ms | 0.95 | 0.95 |
| 500 | local | 3.2 s | 0.997 | 0.97 |
| 1,000 | cloud | 1.6 s | 0.997 | 0.98 |
| 2,000 | cloud | 2.2 s | 0.995 | 0.97 |
| 5,000 | cloud | 4.5 s | 0.996 | 0.97 |
| 10,000 | cloud | 8.6 s | 0.998 | 0.98 |
The 500 → 1,000 transition is the headline: at the threshold, going to cloud is faster than staying local (3.2 s → 1.6 s). Local mode hits the t4g.micro CPU wall around 500 rows; cloud Lambda fan-out parallelizes across layers. AUROC stays at or above 0.95 everywhere, and at ~0.997 from 500 rows up, so the user gets the same answer either way, just on the cheapest hardware that finishes in time.
The intellectual foundation is the Dana Theorem (2024): any indicator function over a finite
discrete domain can be encoded as a SAT instance in polynomial time, and decision-tree bucketing
reduces this to linear time. The construction is direct: for each non-member f ∉ C,
build a clause that excludes f while preserving every member of C, then conjoin
those clauses across the uncovered non-members. The result is a CNF formula that is the classifier, with
no satisfiability search at inference.
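The construction can be made concrete on a toy bit-vector domain. This is a minimal sketch of the "exclude each non-member" encoding, not the Snake encoder; `build_cnf` and `evaluate` are illustrative names:

```python
# One clause per non-member f: "at least one bit differs from f".
# The resulting CNF is satisfied exactly by the members of C.
from itertools import product

def build_cnf(members, width):
    """Literal +i means x_i must be 1, -i means x_i must be 0 (1-indexed)."""
    domain = set(product([0, 1], repeat=width))
    cnf = []
    for f in domain - set(members):
        # Every literal in this clause is false under f, so f is excluded;
        # any assignment differing from f in at least one bit satisfies it.
        clause = [-(i + 1) if bit else (i + 1) for i, bit in enumerate(f)]
        cnf.append(clause)
    return cnf

def evaluate(cnf, x):
    """Direct CNF evaluation -- no satisfiability search at inference."""
    return all(any((lit > 0) == bool(x[abs(lit) - 1]) for lit in clause)
               for clause in cnf)
```

Each member of C differs from every non-member in at least one bit, so each clause preserves all of C while killing exactly one non-member; inference is a linear scan over the clauses.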
Snake regression doesn't return one number. Every prediction is backed by N lookalikes, each with
their own target value. That's a distribution. Collapsing it to a mean throws away the most
useful information: how confident, how wide, how the SAT-routed peers compare to the population
baseline. =PREDICT.CANDLE maps the distribution onto OHLC:
| Candle part | Source | Meaning |
|---|---|---|
| Open | Mean of noise lookalikes (n) | Population baseline. |
| Close | Mean of core lookalikes (c) | SAT-routed peer estimate. |
| High | P95 of all lookalikes | Upper plausible. |
| Low | P05 of all lookalikes | Lower plausible. |
| Color | Close > Open → green; Close < Open → red | Specificity premium direction. |
The output drops straight into Excel's built-in Stock chart with zero glue code. The salesperson reads a quote price the way a trader reads a chart.
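The mapping in the table above is a few lines of arithmetic. A minimal sketch, assuming the core/noise lookalike split arrives as two lists of target values; the nearest-rank percentile and the `candle` name are illustrative:

```python
# Collapse a prediction's lookalike values into OHLC for the Stock chart.
import statistics

def candle(core_vals, noise_vals):
    all_vals = sorted(core_vals + noise_vals)
    # Nearest-rank percentile: crude, but enough for a sketch.
    pct = lambda p: all_vals[min(len(all_vals) - 1, int(p * len(all_vals)))]
    open_ = statistics.mean(noise_vals)   # population baseline
    close = statistics.mean(core_vals)    # SAT-routed peer estimate
    return {
        "open": open_,
        "close": close,
        "high": pct(0.95),                # upper plausible
        "low": pct(0.05),                 # lower plausible
        "color": "green" if close > open_ else "red",
    }
```

For example, core lookalikes [100, 110] against noise lookalikes [90, 95] give a green candle: the SAT-routed peers quote above the population baseline.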
The .xlsx format is a zip of XML parts. Microsoft provides Custom XML Parts as
the official mechanism for embedding arbitrary application data inside a workbook. The plan is to
embed a stripped Snake model into every workbook that has trained one:
```
workbook.xlsx (zip)
├─ /xl/worksheets/sheet1.xml     — the data
├─ /xl/customXml/item1.xml       — model_stripped.json (gzip+b64) + SHA-256 header
└─ /xl/customXml/itemProps1.xml  — namespace urn:monceai:snake:v1
```
Train via the cloud once. Predict locally forever. Email the file — the model travels with it. This is what would make the artifact industrial: a quote-validation workbook trained on a factory's history could be sent to the sales team and used offline. The factory knowledge would ship with the file.
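Because .xlsx is a zip, the planned embedding needs nothing beyond the standard library. A sketch under the layout above; the XML element shape and the `embed_model` name are assumptions, and a real implementation would also register the part in `itemProps1.xml` and the package relationships:

```python
# Gzip + base64 the stripped model, attach a SHA-256 header, and append
# the Custom XML Part to the workbook zip.
import base64, gzip, hashlib, json, zipfile

def embed_model(xlsx_path, model: dict):
    raw = json.dumps(model).encode()
    payload = base64.b64encode(gzip.compress(raw)).decode()
    digest = hashlib.sha256(raw).hexdigest()
    item = (f'<snake xmlns="urn:monceai:snake:v1" sha256="{digest}">'
            f'{payload}</snake>')
    with zipfile.ZipFile(xlsx_path, "a") as z:
        z.writestr("xl/customXml/item1.xml", item)
```

The checksum lets the local predictor refuse a tampered model before decoding it.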
Today (v0.2) the model lives in S3 keyed by model_id; the workbook stores only the key.
The Custom XML Part embedding is on the v0.6 roadmap, behind the per-user S3 prefix work that comes
with the auth migration.
| Capability | What Excel ships in 2026 | What this proposal adds |
|---|---|---|
| Linear extrapolation | =TREND | (unchanged) |
| Time-series forecast | =FORECAST.ETS | (unchanged) |
| Mixed-type prediction | — | =PREDICT |
| Distributional regression | — | =PREDICT.CANDLE |
| Missing-data inference | — | =FILL |
| Per-row reasoning | — | =AUDIT |
| Predictability probe | — | =POTENTIAL |
The argument isn't that Excel needs to become a data-science platform. It's that five formulas, each native to the cell, each five seconds away from a result, are enough to absorb the bulk of "can you predict this column?" decisions that happen in factories, sales offices, and finance teams every day. Excel is where those decisions are already made. The proposal is to give them a better predictor, on the surface where the work happens.