excel.aws.monce.ai / paper
A SAT-based predictive layer for the spreadsheet — seven sections, measured.
Microsoft Excel ships three regression families: =TREND, =FORECAST.LINEAR, and
=FORECAST.ETS, the oldest dating back to the early 1990s. None of them accepts text features,
none classifies, and none exposes per-row reasoning. We propose adding a SAT-based predictive layer to the
spreadsheet: five custom functions (=PREDICT, =PREDICT.CANDLE, =FILL,
=AUDIT, =POTENTIAL) backed by the SnakeBatch distributed classifier. We argue
that the right mental model is not "ML in Excel" but "Excel formulas, more of them" — the
function returns a value, spills cleanly, and re-evaluates with the same recompute semantics as
=FILTER or =XLOOKUP.
Excel's native predictors share the same limitations:
| Function | Accepts text? | Classifies? | Per-row audit? | Held-out validation? |
|---|---|---|---|---|
| =LINEST | No | No | No | No (in-sample R²) |
| =TREND | No | No | No | No |
| =FORECAST.ETS | No | No | Confidence band only | Implicit |
| Trendlines | No | No | Equation only | No |
| Solver | No | Indirectly | Manual | No |
| Data Analysis ToolPak Regression | No | No | Residuals dump | No |
| =PREDICT (proposed) | Yes | Yes | =AUDIT sibling | =POTENTIAL sibling |
Real-world workbooks are mixed-type. A glass quote sheet has "44.2 Silence" next to
180 next to "LGB Menuiserie SAS". =LINEST chokes on columns 1 and 3.
Snake doesn't.
```
=PREDICT(train, target, test)         # SAT classification, spills predictions
=PREDICT.CANDLE(train, target, test)  # OHLC over lookalike values for Stock chart
=FILL(table)                          # one Snake per blank column
=AUDIT(train, target, test_row)       # top-5 lookalike rows + reasoning
=POTENTIAL(table, target)             # 80/20 AUROC + optimal accuracy
```
Each formula is a thin Office.js custom function. Each posts same-origin to
excel.aws.monce.ai/v6/*, where a small dispatcher decides on the spot: under 500 rows the
job runs in-process on the add-in box (no Lambda fee, no cold start); at or above 500 rows it fans out
to snakebatch.aws.monce.ai via the monceai SDK. The user types one formula. The system
picks the cheapest path that answers in seconds.
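The routing rule above can be sketched in a few lines. This is an illustration of the decision, not the production dispatcher; the names `route_job` and `LOCAL_ROW_LIMIT` are hypothetical, and only the 500-row threshold comes from the text:

```python
# Hypothetical sketch of the dispatcher's routing rule: run small jobs
# in-process on the add-in box, fan larger ones out to SnakeBatch.
from dataclasses import dataclass

LOCAL_ROW_LIMIT = 500  # at or above this, fan out to snakebatch.aws.monce.ai

@dataclass
class Route:
    where: str   # "local" or "cloud"
    reason: str

def route_job(n_rows: int) -> Route:
    """Pick the cheapest backend that answers in seconds."""
    if n_rows < LOCAL_ROW_LIMIT:
        return Route("local", f"{n_rows} rows: in-process, no Lambda fee, no cold start")
    return Route("cloud", f"{n_rows} rows: Lambda fan-out via the monceai SDK")
```

The user never sees this choice; both paths return the same spill range.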
The dispatcher is not theory: the table below shows wall-clock timings from today's add-in, hitting the production endpoint over HTTPS:
| Rows in =POTENTIAL | Where | Wall-clock | AUROC | Optimal accuracy |
|---|---|---|---|---|
| 100 | local | 200 ms | 0.95 | 0.95 |
| 500 | local | 3.2 s | 0.997 | 0.97 |
| 1,000 | cloud | 1.6 s | 0.997 | 0.98 |
| 2,000 | cloud | 2.2 s | 0.995 | 0.97 |
| 5,000 | cloud | 4.5 s | 0.996 | 0.97 |
| 10,000 | cloud | 8.6 s | 0.998 | 0.98 |
The 500 → 1,000 transition is the headline: at the threshold, going to cloud is faster than staying local (3.2 s → 1.6 s). Local mode hits the t4g.micro CPU wall around 500 rows; cloud Lambda fan-out parallelizes across layers. AUROC stays at or above 0.95 everywhere, and at ~0.997 from 500 rows up, so the user gets the same answer either way, just on the cheapest hardware that finishes in time.
The intellectual foundation is the Dana Theorem (2024): any indicator function over a finite
discrete domain can be encoded as a SAT instance in polynomial time, and decision-tree bucketing
reduces this to linear time. The construction is direct: for each non-member f ∉ C,
build a clause that excludes f while preserving every member of C, then conjoin
those clauses across the uncovered non-members. The result is a CNF formula that is the classifier, with
no satisfiability search at inference.
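The construction can be made concrete on a toy bit-vector domain. This is a minimal sketch of the "exclude each non-member" encoding, not the Snake encoder; `build_cnf` and `evaluate` are illustrative names:

```python
# One clause per non-member f: "at least one bit differs from f".
# The resulting CNF is satisfied exactly by the members of C.
from itertools import product

def build_cnf(members, width):
    """Literal +i means x_i must be 1, -i means x_i must be 0 (1-indexed)."""
    domain = set(product([0, 1], repeat=width))
    cnf = []
    for f in domain - set(members):
        # Every literal in this clause is false under f, so f is excluded;
        # any assignment differing from f in at least one bit satisfies it.
        clause = [-(i + 1) if bit else (i + 1) for i, bit in enumerate(f)]
        cnf.append(clause)
    return cnf

def evaluate(cnf, x):
    """Direct CNF evaluation -- no satisfiability search at inference."""
    return all(any((lit > 0) == bool(x[abs(lit) - 1]) for lit in clause)
               for clause in cnf)
```

Each member of C differs from every non-member in at least one bit, so each clause preserves all of C while killing exactly one non-member; inference is a linear scan over the clauses.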
Snake regression doesn't return one number. Every prediction is backed by N lookalikes, each with
their own target value. That's a distribution. Collapsing it to a mean throws away the most
useful information: how confident, how wide, how the SAT-routed peers compare to the population
baseline. =PREDICT.CANDLE maps the distribution onto OHLC:
| Candle part | Source | Meaning |
|---|---|---|
| Open | Mean of noise lookalikes (n) | Population baseline. |
| Close | Mean of core lookalikes (c) | SAT-routed peer estimate. |
| High | P95 of all lookalikes | Upper plausible. |
| Low | P05 of all lookalikes | Lower plausible. |
| Color | Close > Open → green; Close < Open → red | Specificity premium direction. |
The output drops straight into Excel's built-in Stock chart with zero glue code. The salesperson reads a quote price the way a trader reads a chart.
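The mapping in the table above is a few lines of arithmetic. A minimal sketch, assuming the core/noise lookalike split arrives as two lists of target values; the nearest-rank percentile and the `candle` name are illustrative:

```python
# Collapse a prediction's lookalike values into OHLC for the Stock chart.
import statistics

def candle(core_vals, noise_vals):
    all_vals = sorted(core_vals + noise_vals)
    # Nearest-rank percentile: crude, but enough for a sketch.
    pct = lambda p: all_vals[min(len(all_vals) - 1, int(p * len(all_vals)))]
    open_ = statistics.mean(noise_vals)   # population baseline
    close = statistics.mean(core_vals)    # SAT-routed peer estimate
    return {
        "open": open_,
        "close": close,
        "high": pct(0.95),                # upper plausible
        "low": pct(0.05),                 # lower plausible
        "color": "green" if close > open_ else "red",
    }
```

For example, core lookalikes [100, 110] against noise lookalikes [90, 95] give a green candle: the SAT-routed peers quote above the population baseline.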
The .xlsx format is a zip of XML parts. Microsoft provides Custom XML Parts as
the official mechanism for embedding arbitrary application data inside a workbook. The plan is to
embed a stripped Snake model into every workbook that has trained one:
```
workbook.xlsx (zip)
├─ /xl/worksheets/sheet1.xml     — the data
├─ /xl/customXml/item1.xml       — model_stripped.json (gzip+b64) + SHA-256 header
└─ /xl/customXml/itemProps1.xml  — namespace urn:monceai:snake:v1
```
Train via the cloud once. Predict locally forever. Email the file — the model travels with it. This is what would make the artifact industrial: a quote-validation workbook trained on a factory's history could be sent to the sales team and used offline. The factory knowledge would ship with the file.
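Because .xlsx is a zip, the planned embedding needs nothing beyond the standard library. A sketch under the layout above; the XML element shape and the `embed_model` name are assumptions, and a real implementation would also register the part in `itemProps1.xml` and the package relationships:

```python
# Gzip + base64 the stripped model, attach a SHA-256 header, and append
# the Custom XML Part to the workbook zip.
import base64, gzip, hashlib, json, zipfile

def embed_model(xlsx_path, model: dict):
    raw = json.dumps(model).encode()
    payload = base64.b64encode(gzip.compress(raw)).decode()
    digest = hashlib.sha256(raw).hexdigest()
    item = (f'<snake xmlns="urn:monceai:snake:v1" sha256="{digest}">'
            f'{payload}</snake>')
    with zipfile.ZipFile(xlsx_path, "a") as z:
        z.writestr("xl/customXml/item1.xml", item)
```

The checksum lets the local predictor refuse a tampered model before decoding it.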
Today (v0.2) the model lives in S3 keyed by model_id; the workbook stores only the key.
The Custom XML Part embedding is on the v0.6 roadmap, behind the per-user S3 prefix work that comes
with the auth migration.
| Capability | What Excel ships in 2026 | What this proposal adds |
|---|---|---|
| Linear extrapolation | =TREND | (unchanged) |
| Time-series forecast | =FORECAST.ETS | (unchanged) |
| Mixed-type prediction | — | =PREDICT |
| Distributional regression | — | =PREDICT.CANDLE |
| Missing-data inference | — | =FILL |
| Per-row reasoning | — | =AUDIT |
| Predictability probe | — | =POTENTIAL |
The argument isn't that Excel needs to become a data-science platform. It's that five formulas, each native to the cell, each five seconds away from a result, are enough to absorb the bulk of "can you predict this column?" decisions that happen in factories, sales offices, and finance teams every day. Excel is where those decisions are already made. The proposal is to give them a better predictor, on the surface where the work happens.