Written by Marta Handenawer and Pau Aleikum, edited by Jaya Bonelli
It starts with a face. Or maybe a dress. Or the color of a wall in a kitchen that no longer exists. You sit down with someone and ask them to tell you about a memory, and you watch how they look up into the distance, as if the image were floating above them. And then, they tell you about their mother sewing at night. About the smell of oil paint in their grandfather’s workshop. About the feel of a plastic tablecloth that would stick to your arms in summer, a light discomfort that you couldn’t quite shrug off. These are details that live in the folds of the mind, fragile but sharp. Then you type those details into a text-to-image AI model and wait for it to paint them back to you. It’s an act that feels almost magical: the possibility of seeing something that no longer exists. But the first surprise is that the image you get back often feels wrong. Not completely wrong, just… tilted.
PROMPT: Young girl, riding bike, courtyard, Barcelona, 1980
When Maria described the courtyard where she learned to ride a bike, the AI returned a clean, symmetrical patio with polished stone tiles and neatly trimmed hedges. It had the shape of a courtyard, yes, but it had been sanitized, scrubbed free of the rusted gate, the faded laundry lines, the chipped paint she swore was a sickly green. The software had replaced her working-class reality with something closer to a glossy real-estate listing. Similarly, Mari Loli, remembering her mother’s kitchen in the 1970s, described a metal pot with dents and a lid that never fit. But the AI gave her a shiny stainless-steel pot, the kind you’d find in a lifestyle magazine, and a backsplash that looked imported from Italy. There was no oil stain above the stove, no pile of mismatched cups. It had quietly decided that memory is better when it’s well-polished.
PROMPT: woman in the kitchen, holding a dented metal pot, oversized lid, Barcelona, 1970s
Perhaps now you start to see the pattern: the bias isn’t just about skin tone or gender; it’s about class, about what counts as “proper” and “beautiful” in the collective archive that the AI pulls from. Objects get upgraded. Clothes get ironed. Skin gets smoothed. And when you push for imperfection, the generator often translates it into the wrong kind of imperfection, chaos that feels staged, like a fashion shoot trying to look “authentically messy.” You ask for the old sewing machine Maria’s mother used, with its chipped enamel and squeaky pedal, and you get something retro-chic, framed by soft sunlight, ready for Instagram. It’s not that the machine can’t technically create the right image, it’s that it doesn’t know which version of reality to privilege when there’s a gap. And in that gap, the defaults rush in.
Now, if this used to be wholly true for all text-to-image (T2I) generation models, in today’s state-of-the-art generators these kinds of “gaps” happen a lot less often, especially when the model is paired with a reasoning layer. (Though this raises its own question: at what point do “added” imperfections tip over into stereotyping?) Take Nano Banana, for example. Give it a time and a place where things ought to be slightly imperfect, and there will be a certain level of detail - of wear, of rundown-ness, of patina - that feels closer to what we have in mind. This is a result of, and a testament to, these models getting “better”: being trained on more data, with more parameters, more effectively, with more refined training methods, especially in the case of reasoning models. But in most cases (for instance: Kling 01, Seedream, Flux), the defaults of perfected aestheticization remain.
PROMPT: blurry image of an old stone farmhouse, two story house, a stone staircase, interior view, a little girl on the staircase looking at a bat, Spain, 2000
The defaults come from the training data, and the training data comes from the visual history we’ve put online. Which means that the world the AI knows is already tilted toward the most photographed, most shared, most “representable” fragments of life. A bucket in a farmer’s yard is more likely to be the kind of bucket a tourist took a picture of than the one that was actually there. A market stall in Lagos gets rendered with colors dialed up to “vibrant,” even if the real market was in the dusty lull of late afternoon.
Once you notice this, you begin to suspect that the AI is not really returning your memory per se, it’s returning what the internet thinks your memory should look like. And because its training set overrepresents certain aesthetics - for instance, middle-class European interiors, tourist-friendly exteriors, images fitting the visual language of advertising - it tends to push every image in that direction. There’s a soft erasure in this, a quiet redrawing of the past into something aspirational. It’s the same thing that happens when a TV show aims to recreate a historical period, but then accidentally gives everyone a perfect set of pearly whites. The result is believable until you remember how it actually felt to live there - and you feel the absence like a missing tooth.
Even when we try to fight the defaults by adding qualifiers - “poor kitchen,” “worn-out walls,” “crowded street” - we hit another limit: these terms are themselves shaped by external gazes. “Poor” in the training data might mean a romanticized hut with warm lighting; “crowded” might mean a picturesque market rather than the suffocating press of bodies. You can keep refining the prompt until you’re blue in the face, but the generator is still working with a map on which your street doesn’t exist.
This is why these experiments are not just technical exercises - they’re a way to expose the visual politics embedded in machine vision. The AI doesn’t only misrepresent the past because it’s clumsy; it misrepresents it because its source material is already selective, shaped by who holds the camera and whose images are valued enough to circulate. In that sense, using AI to reconstruct a memory is like asking a stranger to paint your portrait using only second-hand gossip. They might get the hair color right, but the expressions, the demeanor, the things that make you you, are almost guaranteed to be off.
There’s a strange intimacy in this failure. When the image comes back wrong, you become more aware of what you remember, of the small things you didn’t think to mention until you see them missing. When Emerund recalled a birthday party, the mismatch made her remember that, in fact, the cake had been lopsided - because someone had bumped into the table. The AI gave her a perfect cake, dusted with icing sugar in an artful way. Looking at it, she laughed, then frowned. It was as if her own story had been dressed up for company. And maybe that’s the most unsettling thing about AI-generated memories: they’re not lies, exactly, but they’re not yours either. And there is value in this mismatch. The distance between what you remember and what the AI offers can act like a mirror, showing you the assumptions that live in the machine and in the culture that trained it. The wrong image sharpens the real one in your mind. It forces you to articulate the difference. And in that act, we believe the memory becomes more yours than before.
That being said, we began to wonder: is there a way to diminish the distorting effect of this “mirror”? Can’t we aim for an AI model that returns an image more deeply informed by the cultural and social context it’s working in?
Now, as we’ve mentioned above, one way that this has been addressed is through adding a reasoning layer to text-to-image models. It’s a strategy that’s been shown to improve the image outputs - and, as models become more capable, it will probably appear more and more. But we have two key reservations about it: first off, it’s costly, both to train and to use. We wouldn’t be able to build our own custom reasoning + text-to-image model from scratch. Secondly, although the images from, say, Nano Banana or ChatGPT’s image generator tend towards a more nuanced view of the surface of things - i.e., they’re able to render imperfection in somewhat fitting detail - they’re almost too detailed, too accurate. We wanted Synthetic Memories to lead to a certain emotional truth - and, unless blessed with an eidetic memory, we only remember certain, isolated details, not each and every one. The Nano Banana outputs are almost eerie in their carefully edited detail, like a perfectly minute reconstruction - surgically accurate, to the detriment of the arbitrary, subjective inexactitude of human memory.
This is what led us to fine-tune our general Synthetic Memories model. Throughout the past few years of the project, we’ve worked across a diversity of settings - from nursing homes in Barcelona to migrant communities in Brazil, street markets in Chile, and living rooms in Tokyo. Over time, we’ve built a large archive of memories, a collection of images and words - but not only that. Across these different contexts, we also came to gather the visual texture of these memories, the little details and nuances: the chipped walls, the uneven tiles, the way sunlight falls differently in a kitchen in Valparaíso than in Barcelona. We had already experimented with fine-tuning the model - for instance, when working with people from Barcelona, we re-trained it on images from the city, to try to capture something of its je-ne-sais-quoi - but that was tied to one specific context and city. What we wanted now, more broadly, was a way to tweak our global Synthetic Memories model to account not for a specific place and time, but for the trail of details and imperfections that is common to all human environments and activities.
Aiming for this more universal form of nuanced irregularity became a months-long process: we hand-picked some of our generated images and iterated over them at length, revising them, arguing over little details and tweaking them until we had something we could work with. In the end, we chose 50 of these images and used them to train our custom fine-tuned version of Stable Diffusion. In doing so, we taught the model an “inclination” towards realistic imperfection. Now, with the fine-tuned version, we are able to get those irregularities we missed in the beginning. When Maria’s courtyard goes through our custom model, the gate stays rusted, the laundry line is still faded, and the green paint remains sickly green.
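For readers curious about the mechanics, here is a minimal sketch of what using such a fine-tune can look like, assuming the adjusted weights are packaged as a LoRA adapter loaded on top of a standard Stable Diffusion checkpoint with the Hugging Face diffusers library - the model ID, weights folder and sampling settings below are illustrative placeholders, not our actual setup:

```python
# Minimal sketch: load a LoRA-style fine-tune on top of a base Stable Diffusion
# checkpoint and generate a "memory" image with it.
# The model ID, LoRA folder and sampling settings are illustrative placeholders.
import torch
from diffusers import StableDiffusionPipeline

# Base text-to-image checkpoint (any Stable Diffusion checkpoint loads the same way).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# The small set of weights learned from the curated archive of ~50 images
# (here, a hypothetical local folder of LoRA weights).
pipe.load_lora_weights("./synthetic-memories-lora")

prompt = (
    "woman in the kitchen, holding a dented metal pot, oversized lid, "
    "Barcelona, 1970s"
)
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.0).images[0]
image.save("memory.png")
```

The point is less the specific tooling than the principle: a small, carefully chosen archive is enough to nudge which version of reality the model reaches for by default.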
It’s not quite that the machine suddenly developed a sense of taste or common sense - rather, we fed it a different archive to learn from, one built not from stock photography and picture-perfect tourist snapshots, but from the actual visual residue of remembered lives. The affective, specific, picture-imperfect details that real people really do remember.
PROMPT: Little girl turning a corn tortilla in a frying pan in Mexico in the 1990s in a yellowish kitchen with a hot light
Beneath this process lies a very simple, but subtle and somewhat quiet fact: the problem we encountered when trying to recreate a memory with Gen-AI is not due to an inherent flaw in the technology. Rather, the issue is that the baseline AI model has been trained on a world that was curated for a generic, abstracted idea of a viewer. That’s why the images it outputs are so polished: the baseline model renders an idealized world because it assumes that’s what the user wants. By fine-tuning the model, therefore, we’re trying to shift the world the AI is trained on - gearing it more towards someone who will hold a subjective memory of it, with all its irregularities.
This is where the decision to fine-tune a model stops being a mere technical step and becomes a political and human one. Fine-tuning is what determines the reality the model works in. It’s a very loaded decision: whose reality should be represented? Representation is power - by choosing to center a certain perspective, you give credit to the world view it accounts for; you’re telling the machine “this is what you should know, this is important”. So, by training the model on images that don’t look like they belong in a décor magazine, we effectively taught it that a dented pot or a creased skirt are not mistakes, and that their “perfect”, polished counterparts - a shiny silver pot, a perfectly ironed skirt - are not inherently better.
PROMPT: “family watching TV, two grandparents, two parents and two little kids, small living room, TV embedded on a dark wooden cabinet, carpet and bright colour sofa, small window, curtains closed, night time, 2004, Hiroshima”
This led us to think about the question of “situated AI” in more general terms. It’s a real problem - by virtue of how their training data is sourced, selected and structured, the AI models we use every day are biased to over-represent certain conceptions of the world and under-represent others. The issue, at heart, is therefore not about improving the “intrinsic” algorithms in AI technologies, but about finding better archives and data to train them on. These are machines that can only imagine and represent the world they’ve been shown - so, when they depict a limited range of reality, it’s not that they’re “underperforming”: it’s that the view of the world they’ve had access to, the view they’ve been trained on, is limited.
PROMPT: “blurry image of a mom dropping snacks from a balcony, POV from under the balcony, Turin, Italy, 1980”
This opens a Pandora’s box of data representativeness, and the ethical questions it brings forth. Whose memories are we willing to let disappear into the defaults? And what do we lose about ourselves when the rough edges of our past get polished away by someone else’s idea of how life should look? That’s why we need to build a stronger theoretical framework around situated AI - so that we can answer these questions in an informed, intentional way. If you’re interested in learning more about how we thought about this question, check out the second article in this series - coming soon!