Description
In this episode of The Curiosity Current, Stephanie and Molly sit down with Jason Cohen, Founder and CEO of Simulacra, to explore how causal AI and synthetic data can accelerate insight without removing real consumers or human judgment from the process. Jason traces his thinking back to his early work in sensory science and his decade building Gastrograph AI. Traditional statistical methods measured well, but they struggled to predict what people would actually like. The turning point came when models began forecasting taste and preference with consistency. That shift transformed academic research into a commercial platform and revealed a broader industry issue: many studies quietly fail to deliver statistically valid or decision-ready results. At Simulacra, Jason applies those lessons through causal AI and scenario modeling. Rather than positioning synthetic data as a replacement for research, he frames it as a way to recover value from existing datasets, especially when samples are small, uneven, or incomplete. By conditioning models on outcomes instead of correlations, teams can see what truly changes and which levers they can control. Through practical examples, Jason explains how this approach improves product optimization, reveals insight in undersampled populations, and passes validation by matching causal structure, not surface patterns. The episode closes with a clear message: tools remove technical friction, but clarity, judgment, and better questions now matter more than ever.
Episode Resources
- Jason Cohen on LinkedIn
- Simulacra Synthetic Data Studio Website
- Stephanie Vance on LinkedIn
- Molly Strawn-Carreño on LinkedIn
- The Curiosity Current: A Market Research Podcast on Apple Podcasts
- The Curiosity Current: A Market Research Podcast on Spotify
- The Curiosity Current: A Market Research Podcast on YouTube
Transcript
Jason - 00:00:01:
The market's fragmented. People have become more diverse. There are more products and more competition than have ever existed, and companies are responding to this by doing less research. They want the research to be faster. They want their research to be more efficient. They wanna get more decision value per dollar. So, instead of actively doing greater levels of recruitment, more diverse recruitments, wider product sets, they're trying to shrink down the research.
Molly - 00:00:24:
Hello, fellow insight seekers. I'm your host, Molly, and welcome to The Curiosity Current. We're so glad to have you here.
Stephanie - 00:00:31:
And I'm your host, Stephanie. We're here to dive into the fast-moving waters of market research where curiosity isn't just encouraged, it's essential.
Molly - 00:00:41:
Each episode, we'll explore what's shaping the world of consumer behavior from fresh trends and new tech to the stories behind the data.
Stephanie - 00:00:49:
From bold innovations to the human quirks that move markets, we'll explore how curiosity fuels smarter research and sharper insights.
Molly - 00:00:58:
So, whether you're deep into the data or just here for the fun of discovery, grab your life vest and join us as we ride the curiosity current.
Molly - 00:01:09:
Jason's journey into the world of AI and data science began with his work at Gastrograph AI, where he spent ten years as the founder and CEO, building a SaaS company that leveraged sensory science and topographical data insights for machine learning. After Gastrograph AI was acquired by NielsenIQ BASES, he founded Simulacra to push the boundaries of what causal AI and synthetic data can do in modern market research.
Stephanie - 00:01:34:
At Simulacra, Jason and his team are helping researchers understand cause and effect in consumer behavior more accurately and quickly than ever before, shifting the focus from traditional research models to one that can simulate reality mathematically.
Molly - 00:01:49:
Today, we'll explore how causal AI is revolutionizing research, how synthetic data can unlock hidden insights from existing datasets, and why understanding cause and effect is the key to truly understanding consumer behavior.
Stephanie - 00:02:02:
Jason, welcome to the show. We are thrilled to have you.
Jason - 00:02:06:
Thank you for having me. I'm happy to be here. Wonderful.
Stephanie - 00:02:08:
Well, let's just kick this off. You have had, and I say this a lot, but today, I really mean it. You have had such a fascinating journey from studying politics in China to becoming a tea expert to founding two successful AI companies in the consumer research space. So, to kick us off, I'm curious, was there like a specific moment that you realized that AI and consumer research could truly change the game? How did that spark ignite for you personally?
Jason - 00:02:38:
So, that happened really early. That actually happened before we started the first company. So, I was doing my research at Penn State. I had the Tea Institute at Penn State, and I was attempting to make predictions around what people taste and like and dislike in products. I started with tea and expanded to coffee and then to beer in order to be able to collect enough data. And what I kept finding was that the frequentist statistical hypothesis testing, the off-the-shelf traditional models that have been in place in sensory science programs since sensory science began in the late 1950s, early 1960s, that these hadn't really evolved into predictive methodologies, that these were ‘collect and measure’ methodologies, but these weren't ‘predict into the future’ methodologies. And so we started to think to ourselves, well, why couldn't we take this data and take a model-building approach? What would prevent us from actually attempting to use traditional machine learning models that could actually make predictions about the future? And then it turns out that sensory is really complicated and that it couldn't use traditional machine learning models, right? Traditional machine learning models require some type of ground truth. Like, if I ask you, do you taste lemon? You know, what are my options? I can believe you, or I don't believe you. I have no way of proving if that's true or false. And so as we got our hands and minds around the difficulty in these topics, we started to build these models. And after about two and a half years, three years, these models actually started to work. And when we could actually make a prediction around what someone was gonna taste in a product and whether they were gonna like or dislike it, we realized that didn't belong in the university, and we spun that off. That's the company that became Gastrograph AI. So, yeah, I would say it almost happened backwards: we had this goal in mind, and we looked for tools that would work. And, yeah, we founded the company on that working technology.
Molly - 00:04:19:
It's fascinating that you say that because that's a little bit about the aytm origin story, too. Our founder wanted to actually start a different company and realized that there weren't market research tools capable of doing what he needed them to do. He's like, I'm gonna do this instead. And many years later, here we are. And, just a sidebar, I have matcha tea on a tea warmer right now. So I love tea, but I feel like I'm gonna have to backtrack on, like, my tea usage. I am definitely not a proper tea user. I don't even know if there's a proper matcha process for this.
Jason - 00:04:52:
Matcha's going through a little bit of a boom right now, but if you're buying nice matcha, it's probably real matcha.
Molly - 00:04:58:
So, turning back specifically to that sensory science background that you have, the work with topographical data is super interesting because you're totally right. How do we actually predict this kind of behavior ahead of actually trying to prove this out? So, how did your work in that area at your previous company shape how you think about consumer behavior in market research? And was moving towards causal AI a natural next step, or did something kinda happen along that journey that made you rethink how consumer research should be done?
Jason - 00:05:27:
Yeah. Two things happened. One, we've gone through a technological transformation. We've gone through a full step change with generative AI. Whereas before, Gastrograph, which is a company that I love, I'm very happy that it's found a home at NielsenIQ. But Gastrograph was a proprietary data play. Gastrograph was a very difficult company to run because it ran panels in New York, Romania, Shanghai. It had a team that circled the world recruiting consumers to taste products, and all of that data was in-house proprietary data. We controlled the data flow from end to end, and that allowed us to build up this massive data set, the largest data set of consumer perception and preference that has ever existed. And what we really did is we built a foundation model for flavor before this idea of foundation models existed. Now, the good thing about that is it worked, right? The bad thing about it is it was hugely expensive to do. And you never knew if you had coverage or if you could make predictions about a category, about a demographic, about a new product set until you tried. There was no way to know before, say, surveying Italy, are we going to be able to make accurate predictions about pasta? Are we gonna be able to make accurate predictions about red tomato sauce? Are we gonna be able to make accurate predictions about white tomato sauce? Are we gonna be able to make accurate predictions around vodka sauces or pink sauces? There had to be coverage in the data. There had to be relevant flavors, and we had to capture the right demographic. And from this, what I realized working with these big companies is that this problem is not just a Gastrograph problem; this is an all-of-consumer-research problem. That more than 50% of surveys return no statistically valid results. That you spend X tens of thousands of dollars, and you were hoping that your product was at the top of the stack rank, you were hoping that this marketing worked, you were hoping that you had the greatest purchase intent versus a competitor, and you either get back negative results, with competitors, you know, ahead of you, or you get back no results. You say, okay, well, this data set is, you know, it's an N equals 200 data set. I have no statistically valid findings in here. And so with that, I started to think about, okay, well, what is actually possible to achieve? You know, originally, this idea of synthetic data bluntly didn't work. The last two generations of synthetic data would not work for consumer market research. Right? You had your early statistical copulas and other types of distribution-based generative functions; those were really not of interest to anyone. You then had GANs, generative adversarial networks, generators and discriminators; those worked in very specific instances, but they generally weren't good enough to do direct statistical inference from, you know, CTGAN and TabGAN. And now, finally, you have this newest generation of generative AI for synthetic data generation. And that's really where these two problems came together, where we said, okay, well, actually, if you have existing data, even if you didn't get the results that you want, that snapshot of a population is enough for us to now generate for you, you know, scenario model, the results that you wish you were able to achieve. And so, yeah, it wasn't one moment. It was a confluence of two things.
It was having experienced this problem in the last company, and it was knowing that this problem was much broader than the flavor, aroma, and texture focus that we had there, that inspired me to branch off and start this new company.
Molly - 00:08:32:
Yeah. Because it's also about trying to not just navigate based on assumptions. Like, you can have some data, but if you're just operating based on assumptions, that's not gonna be great either.
Jason - 00:08:44:
Yeah. Well, you guys know as a research company, right? You guys know what we see happening is that the market's fragmented. People have become more diverse. There are more products and more competition than have ever existed. And companies are responding to this by doing less research. They want the research to be faster. They want their research to be more efficient. They wanna get, you know, more decision value per dollar. So, instead of actually doing greater levels of recruitment, more diverse recruitments, wider product sets, they're trying to shrink down the research. And that means that when they look at cross tabs, they don't have the data that they need. And so would I love to change that behavior and say a company should invest more in research overall? Yes. Absolutely. But what we're trying to do at Simulacra is to give them a tool that allows them to continue to get an ROI at whatever level of investment they're willing to make. So, say you run an N equals 200 study in a diverse population with a broad product set using an incomplete block design, right, where you're likely not going to have the cross tabs in order to look into any population of likers or who it was that was responsive to an ad. Now you can use something like Simulacra to run that scenario model to say, you know, I only got 12 people in this data set. I only got 9 people who like this product in this data set. What can they tell me about, you know, the broader population who likes this product?
Stephanie - 00:09:54:
For sure. And, I mean, you know, just even you talking about some of that stuff, I mean, even today, there are many times where I'll be chatting with a customer or client, and they'll be like, well, there are no differences by gender, and I'll just think, like, it's just horribly underpowered. We actually don't know. And it comes down to those trade-offs that people are making all the time around speed, cost, etcetera. Right?
Jason - 00:10:15:
Exactly.
Stephanie - 00:10:16:
I wanted to get a little bit concrete with you because I think for a lot of people, it's not ‘woo-woo’, but it feels outside of their knowledge base. Right? So, you've worked with Fortune 100 CPG companies where millions of dollars ride on getting the why right. Can you kinda walk us through maybe a moment or a project where causal AI identified a critical cause and effect relationship that maybe traditional research, where we're using correlational or even predictive, regression types of analysis, might have missed?
Jason - 00:10:48:
Yeah. It's every project. It's every day, every project.
Stephanie - 00:10:51:
Every project. I love it.
Jason - 00:10:53:
But it's true. You know? When you think about, like, correlative analysis, you think about things like, is there an effect here? Right? We see these two things correlate or covary together. Some companies are making decisions off of that. Some companies are going a little bit further and doing linear regressions or various types of regressions and looking at R-squared. But R-squared, even still, is a measure of goodness of fit, of correlative effect, right, with error. So, you know, the thing that we find is that, if you look, I'll give you three examples from three different things that we do. The first example is in a very simplistic form: if you're doing product development, frequently the regression is gonna tell you to add more sugar. We see this in any basic analysis for almost any type of product, even if the product's not very sweet, right? Add sugar, and it will have a great effect. And most of the time, these companies can't add more sugar. They're not going to. They're not gonna change the nutrition facts. They're not going to release something that's way overly sweetened, right? And so using techniques like Simulacra, you can actually see, well, either I can limit the amount of sugar and optimize around that, or, you know, actually, it's not just a straight line for adding sugar. If I'm going to add sugar, if I'm going to add some level of sweetness, it needs to be counterbalanced by these other flavors, and there are going to be other cascading effects. Right? I can't just add sugar and assume that the flavor profile is gonna balance. There are other balances: the acidity levels are going to change because high concentrations of sugar are gonna make the product more acidic, the bitterness is gonna change, the total intensity of the flavor profile needs to counterbalance the additional sugar. Right? And those are the types of things, that type of cascading effect, that you're not going to get from any type of correlational analysis. The second one: frequently, the major issue, like we were talking about before, is that you don't have the data that you want. You're doing these correlations, or you're doing these regression analyses. But, actually, what you're trying to look into is a population that you didn't sample, that you failed to sample, that didn't show up for the test. So, you know, one of the most common use cases of Simulacra is looking into undersurveyed, undersampled populations, either because of low incidence or just because you got unlucky. So, you know, one of the things that Simulacra can do is it can predict for these undersampled groups. An example of that is, let's say that, for whatever reason, you surveyed men and women, married and unmarried, and you had married men and single men and married women, but you had no single women, or you had very few single women. What could you do? Well, traditional methods can't do anything. You didn't sample them. You don't have any information. Using Simulacra, the AI is going to learn the difference between men and women, married and unmarried, and you're going to be able to predict for that undersampled or out-of-sample population. And that is the type of thing that we can guarantee strong convergence for because it's causal AI, because this is not just a correlative or a regression analysis, right, or a fill-in-the-blank analysis.
This is actually looking not just at those two variables but at however many variables you collected, and whatever the overlap between these populations is, it's going to learn those differences and how these cascading interactions affect each other. How do these behaviors predict other behaviors, or how do these personal identifying attributes affect other attributes?
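As an aside, here is the undersampled-cell idea in concrete terms. This sketch is a toy stand-in, not Simulacra's causal AI: it fits a plain additive model on the three demographic cells that were sampled and extrapolates to the one that wasn't. The variable names, effect sizes, and simulated data are invented purely for illustration.

```python
# Toy illustration (not Simulacra's method): predicting for a demographic cell
# that never showed up in the sample, using the gender and marital-status
# effects learned from the cells that did.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def cell(gender: str, married: int, n: int) -> pd.DataFrame:
    """Simulated respondents for one demographic cell (hypothetical data)."""
    intent = 5.0 + 1.2 * (gender == "F") - 0.8 * married + rng.normal(0, 0.5, n)
    return pd.DataFrame({"gender": gender, "married": married, "intent": intent})

survey = pd.concat(
    [cell("M", 1, 80), cell("M", 0, 70), cell("F", 1, 90)],  # no single women at all
    ignore_index=True,
)

# Additive model: main effects for gender and marital status, no interaction.
X = pd.get_dummies(survey[["gender", "married"]], columns=["gender"])
model = LinearRegression().fit(X, survey["intent"])

# Query the cell that was never sampled: single women.
query = pd.DataFrame({"married": [0], "gender_F": [1], "gender_M": [0]})[X.columns]
print("Predicted intent for single women:", round(model.predict(query)[0], 2))
```

A causal model would go much further than this additive stand-in, but the sketch shows why overlap between the observed groups is what makes the out-of-sample prediction possible at all.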
Stephanie - 00:13:52:
And is the idea too that, like, it can handle any number, or maybe not any number, but just, like, a much heavier number of interactive effects than, say, we can handle in analysis before we immediately become underpowered?
Jason - 00:14:04:
Yeah. Exactly. At max, right now, the largest number of variables ever put into the platform, I think, is somewhere around, like, 600, 700.
Stephanie - 00:14:13:
Oh, wow.
Jason - 00:14:14:
Yeah. The platform runs wonderfully well for 400 to 500 variables. It starts to slow down a little bit from real-time above 500 variables. Most of the time, when companies give us datasets with 500 variables, it's like some type of longitudinal analysis, and half the data's missing anyway. Okay, they changed questions two years ago and never dropped them from the dataset. You know? So, we try to coach companies into doing a little bit of a bigger cleanup, into really thinking about which of these variables we're gonna keep. But we have gotten datasets, massive datasets of, yeah, 400, 500 variables that the platform predicts on quite well.
Molly - 00:14:47:
I wanna pull out a bit and, like, talk a bit more about the concepts behind correlations and data and how causal insights can take it a step further. I mean, the famous example that I feel like we always hear when talking about correlation versus causation is the fact that ice cream sales go up around the same time that drownings go up. I mean, super famous.
Stephanie - 00:15:09:
Or shark attacks. Yeah.
Molly - 00:15:10:
Shark attacks. Right. And it doesn't have anything to do with ice cream luring the sharks to the beach; it's that it's hot out. There's another extraneous variable that impacts both the fact that people are buying more ice cream, and they're at the beach and spending more time in the water. And so I wanna talk about perhaps maybe one of the biggest misconceptions that people may have when they talk about correlation versus causation in market research, and perhaps how this causal AI is changing the way that researchers and businesses can think about these things?
Jason - 00:15:41:
Yeah. It's a good question. I mean, causal AI that allows you to actually look into what the model is predicting and why the model is predicting it is important for a number of reasons. There are ideas around understanding and intuition, around interpretable AI. There are ideas around making decisions based on the levers that you have, if you know that these things affect these other things. Right? These are your inputs, and these are the outputs. And then there are ideas around confounding variables and other factors. And I think that those three things together often get muddled. They get kinda mashed together. And to an extent, I think that all three of them are really important, but generally, when we're working with companies, we try to frame it a little differently. We try to frame things as scenario models, for them to think in terms of scenario models. Because frequently, companies don't know really what's input and what's output, and that might sound strange, but think about it: if you're trying to increase purchase intent, then you have an ongoing marketing budget. And, you know, some products with higher marketing budgets have higher purchase intents than other products. There's autocorrelation. Right? These things feed back. You spend more on marketing in order to raise awareness of this product. You get more purchase intent. And so some of these things are compounded together. There's a bunch of different variables that are going to affect these things, and some of them are slightly cyclical. And so instead of trying to get companies to think fully in terms of causation, we try to get them to think in terms of outcomes. We want a higher purchase intent. So, you put that into the platform as a scenario model, and you say, you know, what changes? And of these things that change, what do I control? What are my levers? So, instead of trying to set up large hierarchical models or trying to automatically separate things into inputs and outputs, all variables in our platform predict all variables. And sometimes those predictions are one-way predictions. Right? This thing happens first. You start the sale, and then your sales from that ongoing O&D sale are an effect of that, but sometimes these things go two ways. You increase the marketing budget. You're going to increase brand awareness and purchase intent. Right? So, that was maybe not the answer that you were expecting. But, yes, the foundations of our AI are causal, but the interaction with the company, the way that we try to help companies gain utility and value from the platform, is really by thinking in terms that, you know, you can condition on any variable. And these could be sets of variables. We want this demographic. We want married women, and we wanna increase purchase intent, right? So, I'm gonna put that into the platform, and all variables are gonna change in response to this. And then me, based on my marketing team, based on my distribution team, based on my branding team, etcetera, we can then look at this and say, well, I can control that. That's gonna have an impact. Right? I can control that, and I can make a decision based on that. But that's really how we try to help companies think through it.
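To make that scenario-modeling workflow concrete, here is a minimal sketch. The conditioning step below is a naive stand-in that simply resamples respondents who already match the scenario, whereas the causal model Jason describes is assumed to handle conditions with little or no direct support; the column names and simulated survey are illustrative only.

```python
# Sketch of "condition on an outcome plus a demographic, then see what shifts."
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
survey = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "married": rng.integers(0, 2, n),
    "ad_recall": rng.integers(0, 2, n),
    "price_sensitivity": rng.normal(3, 1, n),
})
# Purchase intent loosely driven by ad recall and price sensitivity (toy data).
survey["purchase_intent"] = (
    2 + 1.5 * survey["ad_recall"] - 0.5 * survey["price_sensitivity"]
    + 0.3 * survey["female"] + rng.normal(0, 1, n)
)

def scenario(df: pd.DataFrame, conditions: dict, n_draws: int = 1000) -> pd.DataFrame:
    """Naive conditioning: keep respondents matching the scenario, resample them."""
    mask = np.ones(len(df), dtype=bool)
    for col, rule in conditions.items():
        mask &= rule(df[col]) if callable(rule) else (df[col] == rule)
    return df[mask].sample(n_draws, replace=True, random_state=0)

baseline = survey.mean()
# Scenario: married women with high purchase intent. What else moves?
hi_intent = scenario(survey, {
    "female": 1,
    "married": 1,
    "purchase_intent": lambda s: s > s.quantile(0.8),
})
print((hi_intent.mean() - baseline).round(2))
# The variables that shift suggest candidate levers you control (e.g., ad recall
# via media spend) versus covariates you merely observe.
```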
Stephanie - 00:18:34:
I think that makes a ton of sense. I think that the idea of, like, you know, thinking, you know, that everything predicts everything is daunting. But when you think about scenario planning, it becomes highly actionable. Right? And that's what businesses want. Well, it's clear that, you know, this is a super exciting space, and especially right now. I wanna bring up one thing, which is that I think, you know, for researchers who've spent their careers running surveys and focus groups with real people, the idea of synthetic data can feel a little bit suspicious. I think you've explained how it's used and kind of how it works. And so I had a question about that that I don't think we need to necessarily chat through. But I am curious about your approach or philosophy around its validation.
Jason - 00:19:16:
So, we're a weird synthetic data company in that I give a lot of talks about how skeptical I am of synthetic data. I think that any type of unseeded, vaguely prompted synthetic data is likely to be wrong. If you go into an LLM, you know, a chatbot, LLM, Claude, ChatGPT, Gemini, any of them, and you say, “tell me about Americans between the ages of 20 and 30 years old who live in the United States. What do they like about Coca-Cola?” You're gonna get all sorts of answers. You're gonna get it's sweet, it's fizzy. Right? And they have no idea Coca-Cola is composed of 20 plus aromatized citrus oils in the headspace, which is, right, predominantly driving variations on preference. They have no idea of the amount of vanilla, rye, and kola, and other stuff, whether natural or artificial flavors. So, the point is that that data is not in the training set. It's not in the training set for the LLMs, and the data that does exist for these LLMs is not labeled. So, when LLMs are scraping books and they're scraping Reddit, they don't know if that person's a Coca-Cola drinker. They don't know if that person is male or female. They don't know if that person lives in the United States. And you can, in fact, test this. You ask the same question multiple times, you're gonna get slightly different answers, sometimes very different answers. If you wanna try a really fun one, ask in English, what do Hispanic Americans think about Coca-Cola, and then switch to Spanish: what do Hispanic Americans think about Coca-Cola? Totally different answers. It's activating different parts of the network. And there are companies out there right now that are claiming that they're gonna get around this by doing fine-tuning, doing RAG, doing supervised cross-training. There's just not enough data. I mean, whoever has even the most data, even, you know, Gastrograph-level data, these models are now at, you know, billion plus parameters, a couple of billion parameters for some of them. Right? I think one of the Chinese companies just released a trillion-parameter model. We don't have enough data to reweight those models. So in contrast, what we're doing is we don't do anything unseeded. We always take your data, real data, data directly from individuals, and we build a model zero-shot on that data. So, we're not trying to pull publicly available information or relevant information using foundation models trained on Reddit. We're actually using your data in order to train a model. And this comes back now to causal AI, how we know it works. Our internal validation is that we measure the causality of that input data set, that empirical data set. Right? We can run the causal reasoning, causal learning methodology; these are things that, like, Judea Pearl developed. And so we run that. And then on the generated data, we run it again, and they have to match. The causality must be the same in the input data and the generated data. And if it is the same, then we have strong guarantees of convergence that any predictions that we're making, any conditions or scenarios that we're running are accurate.
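A simplified sketch of that validation idea follows: learn a structure from the real seed data, learn it again from the generated data, and require the two to match. The "structure" here is a crude dependence graph derived from partial correlations, used only as a stand-in; the check Jason describes is causal discovery in the Pearl tradition, which this sketch does not implement.

```python
# Crude stand-in for "the causality must be the same in the input data and the
# generated data": compare which edges survive in each dataset's dependence graph.
import numpy as np

def dependence_edges(data: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Boolean edge matrix from partial correlations (off-diagonal precision)."""
    prec = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(prec))
    partial_corr = -prec / np.outer(d, d)
    np.fill_diagonal(partial_corr, 0.0)
    return np.abs(partial_corr) > threshold

rng = np.random.default_rng(7)

def simulate(n: int) -> np.ndarray:
    """Toy data with a known chain structure: x1 -> x2 -> x3, x4 independent."""
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + rng.normal(size=n)
    x3 = 0.6 * x2 + rng.normal(size=n)
    x4 = rng.normal(size=n)
    return np.column_stack([x1, x2, x3, x4])

real = simulate(5000)
synthetic = simulate(5000)   # stand-in for data generated from the learned model

edges_real = dependence_edges(real)
edges_synthetic = dependence_edges(synthetic)
print("Structures match:", np.array_equal(edges_real, edges_synthetic))
```

If the recovered structures disagree, the generated data has matched surface statistics without matching the relationships that drive them, which is exactly the failure mode this kind of check is meant to catch.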
Molly - 00:21:57:
I feel like when I first heard the term synthetic data, I thought it was way more complicated than what a lot of people are thinking that it is, and I had no idea that just asking ChatGPT about a particular audience type was considered, like, early-stage synthetic data. And to your point, how you train the model and how the model gets used can also impact it. Also, like, I have to repeatedly tell Claude to stop being so nice to me because it just tells me every one of my ideas is game-changing. So, there's a lot of influence that users can have, and I feel like there's more of a chance to have user bias also in pulling from synthetic datasets.
Jason - 00:22:40:
There certainly is. Yeah. There's companies out there that are trying to solve it, but I personally don't think that LLMs are the route towards viable synthetic data for consumer and market research. You're going to get a tendency towards the means. Right? The data that you generate is not going to be representative of any one cohort that actually exists. And the real problem is that it looks good. I love reading stuff that Claude or ChatGPT writes. Right? And in areas with verifiable ground truth where it has good training data, it's amazing. I mean, the reason that the number one application that's really taken off with the LLMs is coding is because you can test if the code works. You can run tests, you can run verifications, you can play with the software that it builds yourself, but that's a sandbox. If you think about that as a sandbox, as long as the code compiles, maybe some errors here and there. Right? But you can test if it works. Consumer research, you can't do that. You're running the consumer research because you don't already know the answer.
Stephanie - 00:23:30:
‘Cause you don’t have that already.
Jason - 00:23:31:
Right? And that's the difference.
Molly - 00:23:34:
I wanna talk about a specific use case of perhaps how synthetic data can be utilized in a way to help product developers unlock more insights from existing datasets, as you mentioned. If there's a great foundation and training data, then there's definitely interesting places that it could go. Could you share a time, perhaps, where you were able to extract a new actionable insight from data that was already there or perhaps seemed underwhelming or even incomplete?
Jason - 00:24:03:
I'll share two because there are two really exciting ones. One is just pure product optimization. Right? You have a demo target and the product is underperforming, or you had a competitive set and the product is underperforming, and you plug that into Simulacra, and you say, “Show me what this would look like if our product was the most preferred.” And it takes into account everything. Sometimes it's new demo targets. You need to be targeting younger males or mothers with children. Sometimes it's brand intrinsics. You need to increase the amount of tomato flavor, bacon flavor, right? And it can do this around constraints. So, sometimes there's limitations, the amount of sugar, the amount of salt, and so you can constrain that and continue to get these types of optimizations. But what one company has done, what one of our longest-running customers has done, and we thought this was a genius application, is they built a specialized internal data set. They had their employee panel, which is cheap to run. It's people who are already in the building. And they had an external consumer panel, very expensive to run. Every single product went through the employee panel, and if the employees scored it well enough, it went to the consumer panel, and those scores obviously did not often match. So they said, well, can we use Simulacra so that we only have to taste on the employee panel, and then we can predict what the consumer panel says? And to do this, they started with five products. So, they took five products and paneled them on both panels. They paneled them on the employee panel, and they paneled them on the consumer panel. And then they took a new prototype and paneled it only on the employee panel, and then used Simulacra to say, “What would this look like if we had put this on the consumer panel?” And then they went and validated that. And after that worked, then they started to grow that data set. So, they started with one flavor category. So, they started with, say, tomato soups, and then they extended it to mushroom soups, and they extended it to squash soups, and then they extended it to chili bean soups. And so now they've just been extending this further and further and further. And some of these products have four overlaps, some have five overlaps, and eventually, maybe we won't need additional new overlaps, though it's always good to have coverage of a new flavor category, a new subcategory in this product line. And what I love about that example is that, one, it shows the promise of the time savings. It shows the promise of the cost savings and other types of things, the value of building up proprietary data, in-house data. The second thing that I love about it is that it proves the predictability, the veracity of predicting for an undersampled population. Here's some data from an internal panel. I wish we had this on a consumer panel, but that's too expensive and time-consuming. So, I'm going to use this model to predict what my employee data would look like in terms of consumer data. And the fact that that's been validated, that they validated that first, they validated it on one, then they validated it on the next three or four flavor families, and now it's up and running. You know, I think that it's a complex prediction that has to predict across different demographics, because the demographics between the two panels are not the same, and predict across different preferences, because the preferences of these different demographics are not the same.
And the fact that it can do that accurately, multivariate, I think, is really important. I think that that's a pretty amazing proof point for the platform.
Stephanie - 00:27:05:
For sure. I mean, I couldn't agree more. That's very powerful. And I think another point there is that, you know, an employee panel, of course, has its own biases, but you can know that these are your real employees taking the survey. And unless you're using your CRM database, often you're talking to your customers via traditional panels in market research, and there's a lot of lost faith in those panels right now. So I think, you know, the idea of this is quite appealing.
Jason - 00:27:31:
If you can even get your consumers, I mean, we see that problem with luxury products all the time.
Stephanie - 00:27:36:
Or new brands, right, or smaller brands. Yeah.
Jason - 00:27:49:
Small brands, luxury products. I mean, if you're in the premium or ultra premium, say, alcohol space or wine space, you know, bottles that are 100 plus dollars a bottle, those people, buyers of that often don't show up to consumer panels.
Stephanie - 00:27:53:
Very fair. I remember trying to do a conjoint once on luxury watches. It did not go well.
Molly - 00:27:59:
Yeah. Because no one is giving their feedback for, you know, a dollar fifty about their quarter-million-dollar purchase.
Jason - 00:28:05:
Yeah. Exactly. They bought an Audemars from a Swiss dealer.
Molly - 00:28:09:
Right. But that's what I spend my free time on, of course. I pull up those surveys all the time.
Stephanie - 00:28:13:
Well, we have spent a lot of time talking about the absolute, like, upside of synthetic data, and I think that's important, also the potential. But it is likely not without its limitations. I would love for you to talk to us a little bit about, like, what are some of the challenges or limitations that are encountered when using synthetic data in research? And how do you address those, particularly when working with clients or stakeholders who are new to the concept? Like, how do you give them guardrails, things like that?
Jason - 00:28:42:
Yeah. So, synthetic data has gone through, well, in a way, right, the company is called Simulacra Synthetic Data Studio, and in a way, we almost regret calling ourselves synthetic data because synthetic data has gone through the whole arc in just the last two years. Right? It went from what is synthetic data and why would I trust synthetic data and why would I use that versus real data, to, you know, synthetic data is going to change everything, I don't have to pay panelists anymore, this is amazing, to realizing, okay, well, maybe I do need some panelists, maybe I need to go back to the things I was doing. So, it's gone through this entire cycle once already, and I think it'll continue going through that cycle. And I think the biggest problem is that companies are trying out some of these LLM-based synthetic personas, synthetic users that don't work, that are not representative of any consumer cohort that they actually care about. And so, you know, the thing that bothers me the most is that the term synthetic data has unfortunately become synonymous with LLM-generated data, which is not what we do. Yeah. We've been leaning further and further into talking about the scenario modeling capabilities versus the synthetic data capabilities. And on top of that, you know, to be clear, I don't see that much value in simply generating larger datasets. A lot of these companies claim, you know, give us your N equals 100, N equals 500, N equals 1000 rows, and we'll give you back half a million rows. Like, what's the use? If you didn't get the answers that you wanted, if you didn't get the survey population that you wanted in the original data, what use is blowing the population up? The real use is in that conditional ability, the ability to see into the areas that you could not survey, or that you didn't survey enough of, or where you didn't get the results that you wanted. So, leaning into that and really being able to explain that to customers is a core differentiator for us. And one of the things that we promise is that we will never predict for a population that shouldn't exist or doesn't exist. So, an impossible combination: they're asking us for 16 to 20-year-old males with six children and a million dollars in savings. The platform's gonna say there's no causal evidence there, right? That prediction is impossible. So, we think about it to an extent in terms of confidence. Confidence has a slightly different meaning here, but the idea is, what is the probability of this population? Is there any probability in the datasets that you've given us supporting this population? And so I think of that as an unmitigated good thing, that we're not gonna predict for things that, you know, are totally unsupported. On the other hand, it does make a real difference, because if you ask an LLM to do that, here's some data from the United States, guess what this would look like in Canada? The LLM will likely do that. Right? It would love to be helpful. It loves nothing more than to say, “Well, here's my best guess.” Right?
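Here is an illustrative sketch of that kind of guardrail: before generating for a requested segment, check whether the seed data offers any support for it. The simple count-based test and the column names are assumptions made for the example; the confidence measure Jason describes is causal-evidence based and is not shown here.

```python
# Support check for a requested segment before any generation happens.
import pandas as pd

def segment_support(seed: pd.DataFrame, conditions: dict) -> dict:
    """How much of the seed data is consistent with each condition, and with all jointly."""
    per_condition = {}
    mask = pd.Series(True, index=seed.index)
    for col, rule in conditions.items():
        hit = rule(seed[col])
        per_condition[col] = int(hit.sum())
        mask &= hit
    return {"per_condition": per_condition, "joint": int(mask.sum())}

# Tiny hypothetical seed dataset.
seed = pd.DataFrame({
    "age": [18, 24, 35, 41, 52, 19, 33, 44],
    "children": [0, 1, 2, 3, 2, 0, 1, 6],
    "savings": [2_000, 8_000, 40_000, 1_200_000, 90_000, 500, 15_000, 30_000],
})

# The "impossible combination" from the conversation.
request = {
    "age": lambda s: s.between(16, 20),
    "children": lambda s: s >= 6,
    "savings": lambda s: s >= 1_000_000,
}

support = segment_support(seed, request)
print(support)
if support["joint"] == 0 and min(support["per_condition"].values()) == 0:
    print("No evidence at all for part of this segment: refuse to predict.")
elif support["joint"] == 0:
    print("No joint support: extrapolate only if the model finds causal evidence linking the conditions.")
```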
Molly - 00:31:11:
And it's so confidently wrong sometimes. It just cracks me up. I want ChatGPT's confidence.
Jason - 00:31:17:
Yeah. So, confidently wrong. So, we don't do that. The platform has guardrails, and it's never gonna hallucinate. It's never gonna make a prediction for which it doesn't have supporting evidence. And the difficult part about that is getting companies to be comfortable with the idea that there's all this promise about us being able to predict for out-of-sample, low-sample, low-incidence populations, but if the quality of the data that they collected isn't there, either because of internal reasons or because of whatever vendor they chose or because of whatever their budget was or whatever their field timing was, right, not every question is going to be answerable with their data. And, you know, that's never a fun conversation to have. I don't think it's any different from conversations that you guys have to have. But having this high-powered cutting-edge AI platform and then saying, “your data doesn't support these questions,” that's an area where you have to get companies to be comfortable with the idea that it all comes back to data quality. That seed data set has to be of good quality. It doesn't have to be perfectly representative. Right? If you were targeting 50-50 men, women and you got, you know, 70-30 men, women, we can likely fix that. We can likely use the AI to rebalance that dataset as a core application. Right? But we're not gonna predict for men if you had no men in the dataset.
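For contrast, the conventional way to pull a 70-30 sample back toward a 50-50 target is classic post-stratification weighting, sketched below on simulated data. This is not the AI-based rebalancing Jason describes; it is just the familiar baseline, shown to make the "we can likely fix that" idea tangible.

```python
# Classic post-stratification weights for a gender-imbalanced sample (toy data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 1000
sample = pd.DataFrame({
    "gender": rng.choice(["M", "F"], size=n, p=[0.7, 0.3]),
    "purchase_intent": rng.normal(3.2, 1.0, n),
})
# In this toy data, women score a bit higher, so the imbalance biases the raw mean.
sample.loc[sample["gender"] == "F", "purchase_intent"] += 0.6

target = {"M": 0.5, "F": 0.5}
observed = sample["gender"].value_counts(normalize=True)
weights = sample["gender"].map(lambda g: target[g] / observed[g])

raw_mean = sample["purchase_intent"].mean()
weighted_mean = np.average(sample["purchase_intent"], weights=weights)
print(f"Unweighted intent: {raw_mean:.2f}  Reweighted intent: {weighted_mean:.2f}")
```

Weighting only stretches the respondents you already have; as the conversation notes, it cannot conjure men out of a dataset that contains none.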
Molly - 00:32:28:
You said something in the earlier part of that where you talked about how just a million rows of data is not useful. And I don't think it is. You know, it's great. Congratulations. But I don't think that that's actually useful or actionable. We had a guest on recently, and we spoke about how executives want more information in a smaller package. So, they want three to five actionable bullet points. They don't need the how-the-sausage-is-made presentation. So, just adding "here's more data" is not actually providing any actionable business value.
Jason - 00:33:03:
Yeah. Exactly.
Molly - 00:33:05:
I wanna pivot a bit to talk more about the perhaps still essential, irreplaceable humans in this equation. You know, we've had a lot of talk about humans in the loop, and I think you kind of alluded to this concept when it comes to AI research. What's your take on that? How do AI-driven tools and synthetic data enhance human judgment in the research process, but where are real human beings with real feelings, nuances, and reactions, where are they still essential in the equation?
Jason - 00:33:35:
I would say there's more than one aspect of that. One of the areas where we see probably the least amount of empathy, and where a lot more human effort could go right now, is really thinking through the questions that are being asked. A lot of problems come down to question quality, where you said, “have you consumed this product in the last three months?” or “which of these brands do you consume most often in the last three months?” Most consumers have no idea. If you ask me, like, what rum have I consumed most often in the last three months? I don't know. I like rum. I have some rum. If I'm at a bar or I'm ordering rum in a cocktail, I don't know what they put in it. So, I think that there's a lack of empathy on a lot of these questions, or there's just way too many questions. Right? You're gonna ask someone. Every brand manager, every product manager, you know, has to put their finger in the pie and add two or three questions. You wind up with these surveys with, you know, 45, 50 questions. And people are rapidly clicking through, particularly when it's not an in-person survey. So, I think that there's room for a lot more empathy in really thinking through what the minimum question set is and making sure that the questions make sense and making sure that the questions are something that you could answer if we asked you. And that's an area where I don't know that AI can help. I think it's really good at the ideation, but I don't think it's very good yet at the big picture, particularly when there's a lot of business context that doesn't exist in easy-to-convey forms. Right? If you're gonna be making trade-offs around multiple brands, production sites, futures contracts, marketing campaigns, you know, a CPG company is gonna have, you know, 6 different teams, 7 different functional groups, sometimes more. And, you know, the biggest ones have way more than that. Look at some of the breweries that own their own hop farms. You know? So, how many steps up the ladder do we have to go? Right? There are things that the AI is not aware of because it's not public information, and it might not even really be encoded information in a way that the AI can access it or think to access it. So, I would say the first one is on that input, on designing the questionnaire. The second one is on thinking through the bigger picture. If you're going to try to truncate the questions to just the relevant questions, well, then you have to know what your decision-making levers are. And if there are decisions you know you're not gonna make because they're not in brand, you know, we were working with, this is way back when at Gastrograph, we were working with a children's beverage company, and they wanted us to run optimizations. And we kept saying, you know, should we add any constraints on these optimizations? We can run unconstrained, and it'll find greenfield, you know, blue sky, greenfield optimizations. And sometimes these things are fun. Sometimes they're not gonna be, you know, aligned with the brand, but, you know, you get three optimizations in this run. You know, tell us what you want us to do. And they're like, “No. No. No. We're big picture. We're looking for new, exciting things.” And we're like, okay, unconstrained. So we run it, and, of course, like, two of them come back like apple cider. Like, you gotta spike this with spiced rum and apple cider, turn this into a spiked drink. It'll be perfect. And I'm like, yeah. Okay. Children's beverage, not a fit to brand.
I was like, yep. We told you.
Molly - 00:36:35:
Who's answering this question about adding rum to a child's drink? Unless it was that you didn't know it was for a child's drink.
Jason - 00:36:43:
It's the AI, because we didn't put the constraints on it, because we asked the team what constraints they wanted, and they said no constraints. We want the broadest idea possible. And, of course, it comes back with an idea. Now, the funny thing was that adults were actually doing that. That was already something that adults were doing. They were actually mixing this children's beverage and alcohol at things like concerts and stuff to sneak it in. And this was actually already flagged as a brand risk by the brand. So they were, like, particularly unhappy that the AI was like, this is gonna be great.
Molly - 00:37:15:
That is a hilarious use case. Like, you know, I'm taking my kids trick or treating, and there's leftovers, I'm gonna pour it in and have a little mixed drink here.
Jason - 00:37:25:
Yeah. So, that's an area, thinking through the constraints, thinking through the levers that you have, that the AI is not going to consistently do a great job of. And on the last side, on the human augmentation side, on the human-in-the-loop side, I'm not the first person to say this, but I think the thing that it really comes down to is this sense of taste, this idea of taste. And this has kinda become a little bit of a buzzword right now in AI circles, and people are opining on the future of human work and human jobs, right, in an age of increasing automation. But I think that there's a real point to it. If you think through the questions that you can ask and you think through the decisions that you can make, then it comes down to which of those questions and which of those decisions, right, actually have leverage, actually have a chance of making an impact and fitting the broader picture of where you see things going. And so there's this idea of humans as tastemakers, which is really hard for AI, right? I mean, the reason that companies need Gastrograph and one of the reasons companies need Simulacra is because preference is always changing. We don't dress the same, we don't look the same, we're not buying the same things that we were 5 years ago, let alone 10 or 25 years ago. Are bell-bottoms back? Well, it depends. Who's gonna make a cool pair of bell-bottoms, and who's gonna wear them? A musician, a star wears it on the Met Gala runway, okay, suddenly, now bell-bottoms are back in vogue. Right? So, who's actually forming these inputs that create preference in the broader population? All of these AIs have a lag in their training. There's a little bit of real time. They can do some Internet searches. They can pull some new information, but that's different than the way that we integrate new information into our training, right? An AI has to go out and update itself on very recent events, and you can test this. You can ask the AI about Venezuela.
Stephanie - 00:39:08:
I did that. I was like, no, 2026, please.
Jason - 00:39:11:
Yeah. Exactly. Right? And it doesn't know. And it'll even tell you things like this is highly unlikely, this is highly unusual, this is probably speculative. Like, it doesn't even wanna believe the news reports. And, like, say what you will about our current timeline and everything else. Right? But the idea is that this is an area where humans are able to rapidly integrate information around our environment and update our ideas of what's possible and what's fashionable or in trend, right, much faster than an AI is. And so this idea of having the taste in order to do this, having the sense of things in order to do this, is an area where I don't think that we're going to get automated away. Right? Because that taste is an ever-shifting set of preferences shared by not everyone, but different cohorts, different segments of the population. And that's something that AIs have a lot of difficulty with, the messiness of human preference. And our preferences aren't always sensical: I like dogs more than I like cats, I like cats more than I like gerbils, I like gerbils more than I like dogs. That's a cyclical preference. Right? That shouldn't be possible. Arrow's impossibility theorem: it means there's no unified set of preferences that can satisfy all conditions.
Molly - 00:40:15:
And the fact that trends move so quickly, too. I mean, even before the age of AI and automation and everything, we had a two-week news cycle, and that was how fast you needed to come up with a product, test that product, and bring it to market. And even by then, you're at the tail end of capitalizing on that trend because then it's the next thing.
Jason - 00:40:34:
Yeah. It's only getting faster, micro trends, micro influencers.
Molly - 00:40:38:
Oh, yeah. And, I mean, so many different types of social media influence. I mean, just how fast things can happen. And sometimes I'm learning about things that were, like, so six weeks ago, and I'm, like, I didn't even know that this was a thing. Like, when you're talking about are bell-bottoms back, I'd be like, absolutely not. But that's just, like, the millennial in me.
Stephanie - 00:40:58:
I was like, I'm wearing them right now. They're back, Molly. Okay?
Molly - 00:41:02:
They're back. See, I need to get with it because, like, in true millennial fashion, the height of fashion was skinny jeans, and I will die on that hill. I refuse to buy any other sort of pants because I think that I look upper echelon, but maybe that's my issue. Well, this has been a super interesting conversation, Jason, and we love to wrap up with a quick round of what we call, on the show, Current 101, where we ask all of our guests the same question. So, to you, what is one market research practice that you think it's time for the industry to let go of? What should we stop doing? And what approach do you think is going to be essential for researchers as we move forward? What's something that we should start doing?
Jason - 00:41:47:
Something that would make my life a lot easier is not to do skip-block questions. If you say yes to question 7, then you get questions 9, 10, 11. If you say no to question 7, you get questions 12, 13, 15. Those skip-block questions, those routed questions, are a nightmare, and you frequently don't have enough data to analyze them, and you wind up just filling it in with NAs or zeros or whatever, and you almost never wind up making full use of those questions. I think that's generally an anti-pattern, and you should find a way to create more condensed question sets that are more analyzable. That was pretty personal, obviously.
Stephanie - 00:42:23:
I was gonna say, yeah, I think it makes sense for you, but it's funny because I think in the context of, like, better user experience, we wanna ask some of those questions, but we don't wanna ask the people to whom it doesn't apply. Right? So, I hear that.
Jason - 00:42:35:
Yeah. There's right ways of doing it. There's right ways of doing it. I mean, we definitely handle that data. Comes back to thinking really carefully about which questions you're gonna ask.
Molly - 00:42:44:
I was having flashbacks to my time as a project manager, so that's where that reaction came from.
Jason - 00:42:49:
And one thing that we should start doing: I think we should be casting wider nets on screeners. I think that, where companies have screeners, they're frequently too tight, and they're overused in predicting for populations outside of the screened population. So, you know, a lot of companies that are targeting heavy users are saying, you know, someone in this panel has to buy this product three times a week or consume this product three times a week. That's a lot of consumption. And I think broader, looser screeners, for a more gen pop population, or even for some people who lack brand affinity specific to the brand being recruited for but consume competitors, I think those types of broader-based, more diverse populations lead to better model building and lead to better outcomes.
Stephanie - 00:43:38:
That makes a lot of sense. And we beat that drum quite a bit too, for different reasons, but I do think there's such a tendency to wanna, like, get really narrow in your focus. And I think, you know, people have to think about how much utility you really have when you're being that narrow in the way that you're approaching things. So, Jason, to close this out, you know, you've been at the cutting edge of AI and synthetic data for years. For someone who's just starting out in market research today, or even somebody who's been in the industry for a while but has less exposure, what do you think is the most important lesson that you've learned as the field has evolved? And what advice would you give to help others adapt to the rapid tech and AI-powered changes we're seeing?
Jason - 00:44:21:
I'm gonna make this question kind of narrow in its focus, but I think that the thing that people need to get comfortable with is that software and programming skills, and even things as basic as Excel skills, are no longer the bottleneck. You can now use natural language with things like Claude Code and Codex on the command line, and the command line looks scary, it looks like the Matrix terminal, but I highly recommend that people get really, really comfortable with the idea of using Claude Code or Codex, which basically can do anything that you can do on your computer. And with that, software and data cleaning and analysis skills are no longer the bottleneck. So, if you get back a bad data set, or you get back a really messy data set, or you get back a data set in a format that you can't analyze, if you can't run the correlation analysis or the regression analysis that you wanna run, just take that file, point Claude Code at it, and say, “this is what I'm trying to do. This is what I wanna do.” And in fact, even if you don't know exactly how to do it, you could say, “this is the file. This is what I wanna do. My overall goal is this. Have at it.” And you can talk to it in natural language. And that means that suddenly it's not about you writing the code or your personal capabilities, but your vision for outcomes. That's a real step change. That's something that really changes what it means to be a team member and an employee. The future is here; it's just not evenly distributed. This is one of those moments. This is already what's happening at all of the big tech firms, you know, this is already what we're doing. We don't use it for core algorithm development, that's still, you know, still one step beyond what it's doing. But I'm a data scientist before I'm a software engineer. And six months ago, I was still a better software engineer than Claude Code, and now, probably not. Particularly on things that I don't specialize in: frontend engineering, fixing little UX/UI bugs, writing highly specific CSS classes. Like, I have no skills in that, right? Let me go do what I'm good at, which is the math and the algorithms, alright? And I'll let Claude Code do what it's good at. And that is true for something as simple as coding, but it's true writ large. It doesn't have to end in a software application. We've built tons of automated data cleaning utilities and capabilities internally, but now, often, I let Claude Code orchestrate those things. It skims the data, and, again, it stays on device, so it's safe, and it picks, you know, which of the utilities that we've built and verified as trustable will most rapidly clean the data. And, yeah, it's a life changer. You go from spending hours cleaning data to a couple of minutes.
Stephanie - 00:47:01:
I like that.
Molly - 00:47:02:
Everybody likes that.
Stephanie - 00:47:03:
Awesome. Well, Jason, this has been an absolute pleasure and one that I think Molly and I will continue to think about. Thank you.
Jason - 00:47:10:
Amazing. Thank you, guys.
Molly - 00:47:12:
Thank you so much for joining us. Take care.
Stephanie - 00:47:15:
That's it for today's episode. Thanks for listening.
Molly - 00:47:19:
The Curiosity Current is brought to you by aytm, where curiosity meets cutting-edge research. To learn more about how aytm helps brands stay in tune with their audiences, head over to aytm.com.
Stephanie - 00:47:32:
And don't forget to follow or subscribe to The Curiosity Current on YouTube, Apple Podcasts, Spotify, or wherever you like to listen.
Molly - 00:47:41:
Thanks again for joining us, and remember, always stay curious.
Stephanie - 00:47:46:
Until next time.




















