Causal scenario engine for consumer research

Causal scenario modeling on data you already trust.

Simulacra fits a generative causal model to consumer-research data you already fielded. Move a measured variable, read the downstream effects, and see how every dependent variable in your study would respond. No new fieldwork. No LLM personas. No invented questions.

0.005 Categorical MAE on 694 vars, Twin-2K-500
−64% Marginal Error vs sampling-noise baseline
100% Novel rows, zero memorization
+4.9 pp External classifier vs real-vs-real floor

Built for consumer-insights teams, marketers, agencies, and the data-science partners you loop in for vendor review.

SOC 2 Type II · ISO/IEC 27001 · Single-tenant, in-memory · No cross-tenant training

Three use cases, one platform

Boost a cohort, adjust a variable, or reverse-engineer an outcome.

Simulacra's generative causal AI supports cohort boosts, interventions, and scenario models. Generated rows behave the way your respondent population behaves, within your schema.

01: Conditional generation

Reliably boost any cohort.

Generate synthetic observations conditioned on any subset of your variables: a low-incidence cohort, a hard-to-reach segment, or the cut leadership wants but the field came in short on. The generated rows behave the way your respondent population behaves, including the rare and contradictory patterns that thin samples lose.

Seed cell: 38 obs
After boost: 190 rows
Marginal MAE: 0.009
Novelty: 100%

Boosted rows are statistically indistinguishable from the seed respondents: on the Twin-2K-500 benchmark, an external random-forest classifier separates synthetic from real at only +4.9 pp above the real-vs-real noise floor.
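For readers who want to check figures like the marginal MAE and novelty quoted above on their own data, here is a minimal sketch. The definitions are assumptions for illustration, not Simulacra's published methodology: marginal MAE is taken as the mean absolute gap between real and synthetic category shares, and novelty as the share of synthetic rows that do not exactly copy a real row. The function names and the toy data are hypothetical.

```python
from collections import Counter


def categorical_marginal_mae(real_rows, synth_rows, variables):
    """Mean absolute gap between real and synthetic category shares (assumed definition)."""
    gaps = []
    for var in variables:
        real_counts = Counter(row[var] for row in real_rows)
        synth_counts = Counter(row[var] for row in synth_rows)
        # Compare every level seen in either dataset.
        for level in set(real_counts) | set(synth_counts):
            real_share = real_counts[level] / len(real_rows)
            synth_share = synth_counts[level] / len(synth_rows)
            gaps.append(abs(real_share - synth_share))
    return sum(gaps) / len(gaps)


def novelty_rate(real_rows, synth_rows, variables):
    """Share of synthetic rows that do not exactly copy any real row (assumed definition)."""
    seen = {tuple(row[v] for v in variables) for row in real_rows}
    novel = sum(tuple(row[v] for v in variables) not in seen for row in synth_rows)
    return novel / len(synth_rows)


# Toy data, invented for illustration only.
VARIABLES = ["region", "buyer"]
real = [
    {"region": "West", "buyer": "yes"},
    {"region": "West", "buyer": "no"},
    {"region": "East", "buyer": "yes"},
    {"region": "East", "buyer": "yes"},
]
synthetic = [
    {"region": "West", "buyer": "yes"},
    {"region": "West", "buyer": "yes"},
    {"region": "East", "buyer": "no"},
    {"region": "East", "buyer": "no"},
]

mae = categorical_marginal_mae(real, synthetic, VARIABLES)
novelty = novelty_rate(real, synthetic, VARIABLES)
print(f"marginal MAE = {mae:.3f}, novelty = {novelty:.0%}")
```

On the toy data the region shares match exactly while the buyer shares are off by 25 points each way, and half of the synthetic rows repeat a real respondent.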

02: Causal intervention

See the outcome at every do(price).

Move price from $4.99 to $5.49 and regenerate the downstream variables under that intervention: volume, share, top-two-box, segment composition, and channel mix, for example. The generated data gives a coherent prediction of how your wider target population would respond under the intervention.

Volume 100 idx
Share 22.0%
Top-Two-Box 64%

Every dependent variable shifts together; the whole respondent population re-equilibrates under the new condition. Predictions generalize along the price-response surface learned from your study.
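The idea of a do(price) intervention can be illustrated on a toy structural model. The sketch below is not Simulacra's engine: the structural equation, its coefficients, the thresholds, and the `simulate` function are all hypothetical, chosen only to show how fixing one variable and regenerating everything downstream moves several outcomes at once.

```python
import random


def simulate(price, n=5000, seed=7):
    """Regenerate downstream variables with price fixed at a given value, i.e. do(price)."""
    rng = random.Random(seed)  # same seed, so both scenarios reuse the same respondent noise
    buyers = 0
    top_two_box = 0
    for _ in range(n):
        taste = rng.gauss(0.0, 1.0)          # exogenous respondent-level variation
        intent = 6.0 - 0.8 * price + taste   # hypothetical structural equation
        top_two_box += intent >= 2.5         # rating lands in the top two boxes
        buyers += intent >= 2.0              # purchase follows intent
    return {"top_two_box": top_two_box / n, "buyers": buyers}


baseline = simulate(price=4.99)
scenario = simulate(price=5.49)

# Volume is indexed to the baseline scenario (baseline = 100).
volume_index = 100 * scenario["buyers"] / baseline["buyers"]
print(f"top-two-box: {baseline['top_two_box']:.1%} -> {scenario['top_two_box']:.1%}")
print(f"volume index: 100 -> {volume_index:.0f}")
```

Because both outcomes are computed from the same intent equation, raising the price lowers top-two-box and volume together rather than one at a time.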

03: Scenario modeling

Model success in your data.

Give the AI a target, such as top-two-box at 75% in the West, switch rate above benchmark, or Gen Z share up 5 pp, and it returns the combination of upstream conditions most likely to produce that target on your data. Simulacra gives you the most likely path to your desired outcome.

The engine doesn't score conditions one at a time. It runs a many-to-many probability cascade across every variable in your study and returns the causal flow: the joint set of upstream conditions most likely to land the target on your data.
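To make "joint, not one at a time" concrete, here is a deliberately simple sketch. It is an assumption-laden stand-in, not the probability cascade described above: it enumerates every joint setting of the upstream variables and ranks them by the observed hit rate for the target, which is a plain conditional frequency and carries no causal guarantee. The function name, the `min_support` parameter, and the data are hypothetical.

```python
from itertools import product


def best_joint_condition(rows, upstream, target, min_support=2):
    """Score every joint setting of the upstream variables and return the best one."""
    levels = {var: sorted({row[var] for row in rows}) for var in upstream}
    best = None
    for combo in product(*(levels[var] for var in upstream)):
        setting = dict(zip(upstream, combo))
        matches = [r for r in rows if all(r[v] == x for v, x in setting.items())]
        if len(matches) < min_support:
            continue  # too few respondents to score this setting
        hit_rate = sum(target(r) for r in matches) / len(matches)
        if best is None or hit_rate > best[1]:
            best = (setting, hit_rate, len(matches))
    return best


# Toy study, invented for illustration only.
rows = [
    {"price_tier": "low", "channel": "online", "rating": 5},
    {"price_tier": "low", "channel": "online", "rating": 4},
    {"price_tier": "low", "channel": "store", "rating": 3},
    {"price_tier": "low", "channel": "store", "rating": 4},
    {"price_tier": "high", "channel": "online", "rating": 2},
    {"price_tier": "high", "channel": "online", "rating": 4},
    {"price_tier": "high", "channel": "store", "rating": 2},
    {"price_tier": "high", "channel": "store", "rating": 1},
]

# Target: the rating falls in the top two boxes of a 5-point scale.
best = best_joint_condition(rows, ["price_tier", "channel"], lambda r: r["rating"] >= 4)
print(best)
```

Scoring the settings jointly matters here: the low price tier and the online channel each look only moderately good on their own, while the combination of the two is the one that reliably hits the target in the toy data.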

Causal AI ≠ "synthetic respondents"

A different kind of synthetic data, built on different math.

LLM-persona vendors and the CTGAN family of tabular generators are sometimes useful for problems outside quantitative consumer and market research. These tools have distinct failure modes that frequently show up in the validation data, not the marketing.

| Property | Simulacra | LLM personas (digital twins & GPT-based respondents) | CTGAN-family (CTGAN, TVAE, CopulaGAN, and TabDDPM) |
| --- | --- | --- | --- |
| What it learns | The causal flow inside your respondent population's response structure | Generic patterns from pretraining text: not your respondents | Marginals plus low-order joints of one tabular dataset; no causal structure |
| How it answers | do(X) interventional inference on the fitted causal flow | Text completion conditioned on a persona prompt | Conditional or rejection sampling from the learned distribution |
| Effect of a condition | The population re-equilibrates; every downstream variable shifts causally | A prompt biases the next token; no population behind it | Rejection sampling: non-matching draws are discarded; the trained distribution itself stays unchanged |
| Infeasible queries | Refused, with an explicit feasibility report | Answered anyway: confident outputs unsupported by data | Sampled silently from the trained support; no out-of-domain signal |
| Validated fidelity | 53.7% under-dispersed variables on Twin-2K-500; 50% = neutral variance preservation | 93.9% under-dispersed variables on the same benchmark | No Twin-2K result; published tabular benchmarks show CTGAN 32.2× worse than a real-data noise floor, and 99.7× worse on shared-entity structure |
| Min training data | Tens of observations on high-dimensional studies; scales with population diversity, not row count | None from your study! That is the problem... | Thousands of rows for stable training; performance degrades sharply on typical survey n's |
| Cross-tenant data use | Never: Single-tenant, in-memory, deleted on retention promise. | Often: your prompts and pasted study data may be retained or used in training | Vendor-dependent; a shared model fit across customers is common |

LLM digital-twin baseline: Toubia et al., arXiv:2509.19088, 2025. CTGAN-family context: Xu et al., NeurIPS 2019; Sajja, arXiv:2604.13125, 2026.
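The "under-dispersed variables" figure in the fidelity row can be read with a small sketch. The definition used here is an assumption rather than the cited benchmark's exact protocol: a variable counts as under-dispersed when its synthetic variance is below its real variance, so a generator that errs equally in both directions lands near 50%. The function name and all values are hypothetical.

```python
from statistics import pvariance


def under_dispersed_share(real, synthetic):
    """Fraction of variables whose synthetic variance is below the real variance."""
    under = sum(pvariance(synthetic[var]) < pvariance(real[var]) for var in real)
    return under / len(real)


# Toy 5-point-scale answers, invented for illustration only.
real = {
    "q1": [1, 2, 3, 4, 5],
    "q2": [1, 1, 3, 5, 5],
    "q3": [2, 3, 3, 3, 4],
    "q4": [1, 3, 3, 3, 5],
}
balanced = {
    "q1": [2, 3, 3, 3, 4],  # narrower than real
    "q2": [1, 1, 1, 5, 5],  # wider than real
    "q3": [3, 3, 3, 3, 3],  # narrower than real
    "q4": [1, 2, 3, 4, 5],  # wider than real
}
# A generator that regresses every answer to the scale midpoint.
collapsed = {var: [3, 3, 3, 3, 3] for var in real}

balanced_share = under_dispersed_share(real, balanced)
collapsed_share = under_dispersed_share(real, collapsed)
print(f"balanced: {balanced_share:.0%}, collapsed: {collapsed_share:.0%}")
```

A share far above 50% signals a generator that systematically compresses variance toward the average respondent.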

The line we hold

What Simulacra will not do.

Most of the trust gap in synthetic data is created by vendors who won't quantify what their product can't do. We validate our claims and methods against public benchmarks and blind customer data.

We do not invent unasked questions.

Simulacra generates data inside your empirical schema. Every generated value preserves the measured structure of your research, and the AI refuses any combination to which the generative model assigns zero prior predictive probability: no invented questions, no manufactured cohorts.

How the Generative Causal AI works
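A refusal with an explicit report can be pictured with a much simpler check than the one described above. The sketch below only tests whether a requested condition stays inside the measured schema; it does not attempt the zero-probability test, and the `feasibility_report` function, the schema, and the requests are hypothetical.

```python
def feasibility_report(schema, request):
    """List reasons a requested condition cannot be honored; an empty list means feasible."""
    problems = []
    for var, value in request.items():
        if var not in schema:
            problems.append(f"{var!r} was never asked in this study")
        elif value not in schema[var]:
            problems.append(f"{value!r} is not a measured level of {var!r}")
    return problems


# Hypothetical schema: each measured variable and its observed levels.
schema = {
    "region": {"West", "East"},
    "age_band": {"18-24", "25-34", "35+"},
}

ok = feasibility_report(schema, {"region": "West", "age_band": "18-24"})
refused = feasibility_report(schema, {"region": "West", "income": "high"})
print(ok)
print(refused)
```

The second request is refused because income was never fielded, which is the "no invented questions" rule in its simplest form.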

We do not train on one customer's data to inform another.

Every model fit is single-tenant: trained zero-shot on your study, processed in an isolated session, never combined with another customer's data.

Data security, retention, and privacy

We do not publish accuracy claims without the validation paper.

Every benchmark we cite ships with the methodology, the holdout protocol, and the data points behind the curve. Including the gaps.

Browse the validation papers

We do not replace fieldwork that warrants fielding.

New topics, new claims, new audiences, new categories: all still need real respondents. Simulacra extends a study you already trust. It does not pretend to be one you haven't fielded.

The principles behind Simulacra
Founder & CEO

Jason Cohen

Fifteen years building causal AI for sensory perception and consumer preference at the world's largest food & beverage manufacturers, as Founder & CEO of Gastrograph AI (Analytical Flavor Systems) — acquired by NielsenIQ in 2025. Now building Simulacra.

“I built Simulacra because the same teams that used Gastrograph kept asking, in different ways, for a way to run scenarios on the surveys they'd already paid to field…”

More about the team

Validate on your data

Bring a study you already fielded. We'll predict the holdout.

We hold out a portion of your data, train Simulacra on the remainder, generate predictions over the holdout, and send you the scorecard. Standard NDA, no contract required.

How we run blind validations

1. You send a completed quantitative dataset. We sign your NDA.
2. We hold out a portion you specify. You keep the truth.
3. We fit Simulacra on the remainder, generate predictions over the holdout, return the comparison.
4. You decide. No further commitment required.
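The four steps above follow the shape of an ordinary holdout test, sketched below under stated assumptions. The "model" is a placeholder that predicts the most common training answer, standing in for whatever is actually fitted; the `blind_validation` function, the 25% holdout fraction, and the data are hypothetical.

```python
import random
from collections import Counter


def blind_validation(rows, target, holdout_fraction=0.25, seed=0):
    """Hold out part of a study, fit on the rest, and score predictions on the holdout."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * holdout_fraction)
    holdout, train = shuffled[:cut], shuffled[cut:]

    # Placeholder model: always predict the most common training answer.
    prediction = Counter(r[target] for r in train).most_common(1)[0][0]

    # Scorecard: compare predictions with the held-out truth.
    hits = sum(r[target] == prediction for r in holdout)
    return {
        "n_train": len(train),
        "n_holdout": len(holdout),
        "accuracy": hits / len(holdout),
    }


# Toy study, invented for illustration only: 16 buyers and 4 non-buyers.
rows = [{"buyer": "yes"}] * 16 + [{"buyer": "no"}] * 4
scorecard = blind_validation(rows, target="buyer")
print(scorecard)
```

In a blind run the held-out answers stay with the customer, so the final comparison step would be computed on their side or from the truth they choose to share.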