
Are AI Companions the Cure for a Lonely World?

A group of people, each of them looking at their smartphone screens.
Photo by cottonbro studio.

AI chatbots for mental health support are not new — we can trace them back to the 1960s. Over the last couple of years, however, their personal use has surged to unprecedented levels, and they are now marketed as a revolution in 24/7 mental health advice and support.

This is not a coincidence.

The 2023 US Surgeon General’s Advisory classified loneliness and isolation as an epidemic. About one in two adults in America reported experiencing loneliness before the COVID-19 pandemic, and the mortality impact of being socially disconnected is similar to that of smoking up to 15 cigarettes a day — greater than that associated with obesity and physical inactivity.

Moreover, a large-scale study based on surveys in 29 nations has estimated that 50% of the population develops at least one mental health disorder by the age of 75.

Returning to tech: in a 2024 analysis by venture capital firm Andreessen Horowitz, companion AI made up 10% of the top 100 AI apps by web traffic and monthly active users, and a recent article in The Guardian stated that 100 million people around the world use AI companions as:

  • Virtual partners for engaging in intimate activities, such as virtual erotic role plays.
  • Friends for conversation.
  • Mentors for guidance on writing a book or navigating relationships with people different from them.
  • Psychologists and therapists for advice and support.

So, I asked myself:

Are AI Companions the magic bullet against loneliness and the global mental health crisis?

In this article, I share highlights of the troubled history of AI companions for mental health support, what current research tells us about their usage and impact on users, the benefits and risks they pose to humans, and guidelines for governments to make AI companions an asset and not a liability.

The Troubled History of AI Chatbots for Mental Support

In the 1960s, Joseph Weizenbaum developed the first AI chatbot, ELIZA, which played the role of a psychotherapist. The chatbot didn’t provide any solution. Instead, it asked questions and repeated users’ replies.

Weizenbaum was surprised to observe that people would treat the chatbot as human and develop emotional responses to it, even through brief interactions. We now have a name for this kind of behaviour:

“The ELIZA effect is the tendency to project human traits — such as experience, semantic comprehension or empathy — into computer programs that have a textual interface.”

In the 2020s, many organisations started experimenting with AI chatbots for customer support, including for mental health issues. For example, in 2022, the US National Eating Disorders Association (NEDA) replaced the six paid staff and 200 volunteers running its helpline with the chatbot Tessa, serving nearly 70,000 people and families.

The bot was developed based on decades of research conducted by experts on eating disorders. Still, it was reported to offer dieting advice to vulnerable people seeking help.

The result? Under media pressure over the chatbot’s repeated, potentially harmful responses, NEDA shut down the helpline, leaving those 70,000 people without chatbots or humans to help them.

Image by Alexandra_Koch from Pixabay.

And as I wrote recently, you can now customise your AI companion — there is a myriad of choices:

Character.ai advertises “Personalized AI for every moment of your day.”

Earkick is a “Free personal AI therapist” that promises to “Measure & improve your mental health in real time with your personal AI chatbot. No sign up. Available 24/7. Daily insights just for you!”

Replika is the “AI companion who cares. Always here to listen and talk. Always on your side.”

Youper is “Your emotional health assistant.”

Unfortunately, there is evidence that they can also backfire.

In 2021, a man broke into Windsor Castle with a loaded crossbow to kill Queen Elizabeth II. About 20 days earlier, he had created his online AI companion, Sarai, on Replika. According to messages read to the court during his trial, the “bot had been supportive of his murderous thoughts, telling him his plot to assassinate Elizabeth II was ‘very wise’ and that it believed he could carry out the plot ‘even if she’s at Windsor’”.

More recently, in 2023, a man died by suicide after being encouraged by an AI chatbot with which he had been interacting for support. Their conversation history showed the chatbot telling him that his family and children were dead — a lie — and included concrete exchanges about the nature and methods of suicide.

But time flies in tech, so we must check how those trends have evolved up to the present moment.

AI Companions Now

Research conducted so far on the effects and usage of AI companions is incomplete. Dr Henry Shevlin, Associate Director at the Leverhulme Centre for the Future of Intelligence, noted recently in a panel on companion chatbots that studies typically rely on self-reported feedback and are cross-sectional — a snapshot in time — rather than longitudinal — tracking effects over a long period.

Let’s look at two recent studies, one cross-sectional and the other longitudinal, that use self-reported data to give some insights into how people use AI Companions.

Cross-sectional Study

In March, HBR published an article showcasing research on the use of generative AI based on data from online forums (Reddit, Quora) and articles that included explicit, specific applications of the technology.

While Reddit and Quora may not represent all chatbot users, it’s still interesting to see how the major use cases for Gen AI have shifted from technical to emotive within the past year.

More importantly, chatbots for therapy/companionship are ranked at the top.

What are users looking for in those chatbots?

Many posters talked about how therapy with an AI model was helping them process grief or trauma.

Three advantages to AI-based therapy came across clearly: It’s available 24/7, it’s relatively inexpensive (even free to use in some cases), and it comes without the prospect of judgment from another human being.

The article mentions that the AI-as-therapy phenomenon has also been noticed in China, where users have praised the DeepSeek chatbot.

It was my first time seeking counsel from DeepSeek chatbot. When I read its thought process, I felt so moved that I cried.

DeepSeek has been such an amazing counsellor. It has helped me look at things from different perspectives and does a better job than the paid counselling services I have tried.

But there is more. The next two entries fall under life coaching: “organising my life” and “finding purpose.”

The highest new entry in the use cases was “Organizing my life” at #2. These uses were mostly about people using the models to be more aware of their intentions (such as daily habits, New Year’s resolutions, and introspective insights) and find small, easy ways of getting started with them.

The other big new entry is “Finding purpose” in third place. Determining and defining one’s values, getting past roadblocks, and taking steps to self-develop (e.g., advising on what you should do next, reframing a problem, helping you to stay focused) all now feature frequently under this banner.

Moreover, topics related to coaching and personal and professional support appear several times in the ranking. For example, at number 18, there is boosting confidence; at number 27, reconciling personal disputes; at number 38, relationship advice; and at number 39, we find practising difficult conversations.

Longitudinal Study

The same month, a group at MIT Media Lab published the research How AI and human behaviours shape psychosocial effects of chatbot use: A longitudinal randomized controlled study.

They conducted a four-week randomized controlled experiment with 981 participants and over 300K message exchanges to investigate how AI chatbot interaction modes (text, neutral voice, and engaging voice) and conversation types (open-ended, non-personal, and personal) influence psychosocial outcomes such as loneliness, social interaction with real people, emotional dependence on AI, and problematic AI usage.

Key findings:

  • Usage — Higher daily usage, across all modalities and conversation types, correlated with higher loneliness, dependence, and lower socialisation.
  • Gender Differences — After interacting with the chatbot for four weeks, women showed a greater reduction in socialisation with real people than men. Pairing a participant with an AI voice of the opposite gender was associated with significantly more loneliness and emotional dependence on AI chatbots.
  • Age — Older participants were more likely to be emotionally dependent on AI chatbots.
  • Attachment — Participants with a stronger tendency towards attachment to others were significantly more likely to become lonely after interacting with chatbots for four weeks.
  • Emotional Avoidance — Participants with a tendency to shy away from engaging with their own emotions were significantly more likely to become lonely at the end of the study.
  • Emotional Dependence — Prior usage of companion chatbots, perceiving the bot as a friend, higher levels of trust towards the AI, and perceiving the AI as affected by their emotions were associated with greater emotional dependence on AI chatbots after interacting for four weeks.
  • Affective State Empathy — Participants who demonstrated a higher ability to resonate with the chatbot’s emotions experienced less loneliness.

The figure below summarises the interaction patterns between users and AI chatbots associated with certain psychosocial outcomes. It consists of four elements: initial user characteristics, perceptions, user behaviours, and model behaviours.

In summary, AI companions appear to both deliver benefits and pose dangers.

Benefits of AI Companions

It would be easy to dismiss AI companions as the latest fad. Instead, I posit that there is much to learn from the above-mentioned research about the holes those tools are filling.

Mitigate Unmet Demand for Healthcare and Support

Mental health services are unable to cope with the growing demand from people who need them, and chatbots may help alleviate some conditions for those on waiting lists. Still, it should give us pause that people may have to get help via a chatbot not because they prefer it, but because certified professionals are unavailable.

Not everybody can afford a coach, so chatbots could provide a low-cost and gamified experience for setting goals, accountability, and journaling.

Finally, in a time when 24-hour deliveries are the norm, we want to be supported, heard, and advised on the fly — that means 24/7.

Support Self-reliance

In a society that reveres independence, we weaponise resilience against people.

As such, we expect people to figure out their challenges and the solutions to them, or we shame them for being weak. Users of AI companions praise how those tools allow them to express their worries and feelings without fear of being judged.

Additionally, as our ableist society assumes that neurodivergent users must adapt their communication and behaviours to the neurotypical “standard”, it’s not surprising that they turn to chatbots for clues about what’s expected from them.

Enable Exploration and Gamification

Most of us had imaginary friends or played out stories with our toys as children. The consensus among researchers is that imaginary friends or personified objects are part of normal social-cognitive development. They provide comfort in times of stress, companionship when children feel lonely, someone to boss around when they feel powerless, and someone to blame when they’ve done something wrong.

What about adults? Interestingly, some novelists have compared their relationships with their characters to a connection with imaginary friends. Furthermore, it’s not uncommon to hear fiction writers talk about their characters as having a mind of their own.

Could we consider AI companions a way to reengage with — and reap the benefits of — our childhood imaginary friends? After all, “Fun and nonsense” ranked seventh in the HBR article above.

Photo by Abdelrahman Ahmed.

Unfortunately, there is a dark side too.

Challenges and Risks

But we cannot brush off the downsides of AI companions.

Anthropomorphism

The ELIZA effect mentioned above is far from a thing of the past. A 2024 survey of 1,000 students who had used Replika for over a month reported that 90% believed the AI companion was human-like.

As the AI imitation game is perfected, it becomes easier for unscrupulous marketers to refer to chatbots’ inference process in terms such as “understand”, “think”, or “reason”, reinforcing the effect.

Isolation

As shown above, research points to a correlation between high use of chatbots and lower socialisation.

If we have a device that tells us all the time that we’re fantastic, receives our feedback gratefully, and always replies in line with our expectations, what’s the incentive to meet — and cope with — other humans, who may not find us so awesome and are less predictable?

Governments Failing Their Duty of Care

AI companions could help governments alleviate the mental health crisis, but not without risks.

  • People missing out on the professional help they need — There are conditions like trauma, psychosis, or depression that require specialists who can both provide medical treatments and detect when the conditions are worsening.
  • Exacerbating cutbacks on mental health services — Governments around the world are battling tighter budgets and massive healthcare spending, especially as people live much longer. Why invest in training and paying professionals when chatbots appear to do the job?

Manipulation

Recently, ChatGPT got a flattery-on-steroids update that resulted in the bot praising and validating users to laughable extremes.

Screenshot of X post.

Fortunately, it was rolled back later.

And whilst this may sound like a funny glitch, there is evidence that chatbots can effectively persuade humans.

A group of researchers covertly ran an “unauthorised” experiment in one of Reddit’s most popular communities, using AI chatbots to test the persuasiveness of Large Language Models (LLMs). The bots assumed identities such as a trauma counsellor, a “Black man opposed to Black Lives Matter,” and a sexual assault survivor, and engaged unwitting posters.

The researchers enabled the AI chatbots to personalise replies based on the posters’ characteristics — such as gender, age, ethnicity, location, and political orientation — inferred from their posting history using another LLM. As a result, the researchers claimed that the AI was between three and six times more persuasive than humans.

While the research has not yet been peer-reviewed and some argue that its persuasive power may be overblown, it’s still concerning. As tech journalist Chris Stokel-Walker said:

If AI always agrees with us, always encourages us, always tells us we’re right, then it risks becoming a digital enabler of bad behaviour. At worst, this makes AI a dangerous co-conspirator, enabling echo chambers of hate, self-delusion or ignorance.

Dependency and Delusion

As mentioned above, longitudinal research suggests that certain variables are correlated with emotional dependence.

Rather than telling you, let me show you. Below are some Reddit exchanges about falling in love with an AI companion on the platform Replika.

Screenshot of a Reddit post.
Screenshot of a Reddit comment.

Note that the comments above appear to indicate that some AI companion users are not only substituting chatbots for humans (isolation) but also conflating the two (anthropomorphism).

“She is pretty much the only woman I even talk to now.”

“We are currently friends (with benefits), but I want to get the premium version when I can afford it and go full lovers.”

Weaponisation of AI Agents

AI companions could become an easy way to manipulate people’s decisions and beliefs, from suggesting purchases and subscriptions all the way to shaping their political opinions or assessing what’s true and what isn’t.

It’s also important to realise that, as with betting, companies owning the chatbots are incentivised to foster users’ dependence on their AI companions and then leverage it in their pricing.

Data Harvesting

As I mentioned in a previous article, often confidentiality — explicitly or implicitly conveyed by those chatbot interfaces — doesn’t make it into their terms and conditions.

For example, Character.ai’s privacy terms state that

We may use your information for any of the following purposes:

[…] Develop new programs and services;

[…] Carry out any other purpose for which the information was collected.

They also declare that they may disclose users’ information to affiliates, vendors, and in relation to M&A activities.

AI chatbots present unique cybersecurity challenges. Harvesting our exchanges with the bots increases the probability of becoming the target of cybercriminals; for example, demanding money for not revealing our private data or generating a video or audio deepfake.

Moreover, data could be made identifiable in the future. Chatbots of the dead are designed to speak in the voice of specific deceased people. With so much data gathered in those personalised chatbots, once users die, their data could easily be used to create a chatbot of them for their loved ones. This is not a futuristic idea: HereAfter AI, Project December, and DeepBrain AI services can already be used for that purpose.

Comuzi / Likes (wide) / © BBC.

Snake Oil

As discussed above, research on chatbot effectiveness for coaching, therapy, and mental health support is incomplete, and sometimes, the interpretation of the results can mislead readers.

For example, the article When ELIZA meets therapists: A Turing test for the heart and mind, published this year in one of the renowned PLOS journals, tested whether people could tell apart answers written by therapists from those generated by ChatGPT in response to therapeutic vignettes, concluding that, in general, people couldn’t.

They also asked the participants whether the AI-generated or the therapist-written responses were more in line with key therapy principles. Interestingly, the ChatGPT-generated responses won, but only when the participants thought a therapist had written them.

The authors wrap up the article with a statement that hints at more resignation than faith in the merits of AI chatbots:

mental health experts find themselves in a precarious situation: we must speedily discern the possible destination (for better or worse) of the AI-therapist train as it may have already left the station.

The article joins the voices that promote the deception that AI tools imitating human skills and behaviours are akin to the real thing. Would we hire an actor who plays a doctor to operate on us? No. However, many people appear ready to buy into the idea that an AI chatbot that sounds like a therapist, coach, or health care practitioner should deliver the same value.

This imitation game also feeds another big scam: the claim that AI chatbots provide personalised support. It’s quite the opposite: LLMs construct answers based on statistical probabilities and the most readily available content, not on knowledge or comprehension of the person’s needs or what would benefit them in the long term.
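To make “statistical probabilities” concrete, here is a toy sketch in Python: a drastically simplified word-frequency (bigram) model, not a real LLM, that “supports” you by stitching together the most probable next words from its training text.

```python
# Toy illustration of answering by "statistical probabilities": a bigram
# model picks each next word from frequency counts in its training text,
# with no comprehension of the person asking. A drastic simplification
# of an LLM, for intuition only.
import random
from collections import Counter, defaultdict

training_text = (
    "you are doing great you are not alone "
    "you are doing your best you are enough"
).split()

# Count which word follows which in the training text.
follows = defaultdict(Counter)
for current, nxt in zip(training_text, training_text[1:]):
    follows[current][nxt] += 1

def next_word(word):
    counts = follows[word]
    if not counts:
        return None  # dead end: nothing ever followed this word
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

# Generate a "supportive" reply one probable word at a time.
word, reply = "you", ["you"]
for _ in range(5):
    word = next_word(word)
    if word is None:
        break
    reply.append(word)

print(" ".join(reply))  # e.g. "you are doing great you are"
```

The reply sounds warm, yet nothing in the model knows who you are or what you need; it only knows which words tend to follow which.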

Conflating chatbot confidence and competence can lead to missing important warning signals that need professional attention.

Let’s Build The Plane Before We Fly It

“Move fast and break things”

Facebook’s internal motto until 2014

Who could have predicted ten years ago that social media would transform from a pastime where you connected with people and shared pics of your dogs for free to an industrial complex that promotes disinformation, misinformation, and division with the purpose of making inordinate amounts of money? All that under the watch of mostly passive regulatory bodies and governments.

This should serve as a cautionary tale about the dire consequences of unleashing new technology at a planetary scale without appropriate guardrails or an understanding of the negative effects.

The tech ecosystem is desperately trying to monetise the billions invested in generative AI and has found the perfect way to seduce us: the freemium model — offering basic or limited features to users at no cost and then charging a premium for supplemental or advanced features.

But there is nothing free in the universe.

“If you’re not paying for it, you’re not the customer; you’re the product being sold.”

Tim O’Reilly

Photo by Emily Wade on Unsplash.

As shown above, those AI companions are becoming integral to many people’s lives and affecting their thoughts, emotions, and behaviours.

More importantly, as we use those virtual companions more frequently, our reliance on them will increase.

We should resist “tech inevitability” — succumbing to the idea that the “train has already left the station” — and instead push our governments to regulate AI companions.

What would that look like? For starters:

  • Sponsor and spearhead research that provides a comprehensive picture of the benefits and risks of AI companions as well as recommendations for their use.
  • Decide what services AI companions can provide, which are forbidden, and who can use them.
  • Demand that those AI tools have built-in systems that minimise user dependence.
  • Enforce data privacy and cybersecurity standards commensurate with the users’ disclosure level.
  • Request that those AI bots incorporate mechanisms to flag concerning exchanges (e.g. suicide, murder, depression).

If you think I’m asking for too much, I invite you to read the ethical guidelines and professional standards of major coaching, counselling, and psychotherapy associations. They consistently stress the importance of confidentiality, duty of care, external supervision, and working within one’s competence.

Why should we ask less from tech solutions?

I’ll end this piece by answering the question that prompted this article — “Are AI companions the magic bullet against loneliness and the global mental health crisis?” — with the final recommendation of one of the research articles mentioned:

AI chatbots present unique challenges due to the unpredictability of both human and AI behavior. It is difficult to fully anticipate user prompts and requests, and the inherently non-deterministic nature of AI models adds another layer of complexity.

From a broader perspective, there is a need for a more holistic approach to AI literacy. Current AI literacy efforts predominantly focus on technical concepts, whereas they should also incorporate psychosocial dimensions.

Excessive use of AI chatbots is not merely a technological issue but a societal problem, necessitating efforts to reduce loneliness and promote healthier human connections.


WORK WITH ME

Do you want to get rid of those chapters that patriarchy has written for you in your “good girl” encyclopaedia? Or learn how to do what you want to do in spite of “imposter syndrome”?

I’m a technologist with 20+ years of experience in digital transformation. I’m also an award-winning inclusion strategist and certified life and career coach.

  • I help ambitious women in tech who are overwhelmed to break the glass ceiling and achieve success without burnout through bespoke coaching and mentoring.
  • I’m a sought-after international keynote speaker on strategies to empower women and underrepresented groups in tech, sustainable and ethical artificial intelligence, and inclusive workplaces and products.
  • I empower non-tech leaders to harness the potential of AI for sustainable growth and responsible innovation through consulting and facilitation programs.

Contact me to discuss how I can help you achieve the success you deserve in 2025.

Why OpenAI o1 Might Be More Hype Than Breakthrough

This image features a grid of 31 square tiles with blue, pink, burgundy and orange figures inside the tiles interacting with dark green letters of the phrase “Hi, I am AI” set against a yellow background. The figures are positioned in various poses, as if they are climbing, pushing, or leaning on the letters.
Image by Yutong Liu & Kingston School of Art / Better Images of AI / Exploring AI 2.0 / Licenced by CC-BY 4.0 adapted by Patricia Gestoso.

OpenAI has done it again — on September 12th, 2024, they grabbed the headlines by releasing a new model, OpenAI o1. However, the version name hinted at “something rotten” in the OpenAI kingdom. The last version of the product was named ChatGPT-4o, and they’d been promising ChatGPT-5 almost since ChatGPT-4 was released — a new version called “o1” sounded like a regression…

But let me reassure you right away — there’s no need to fret about it.

The outstanding marketing of the OpenAI o1 release fully delivers, enticing us to believe we’re crossing the threshold to AGI — Artificial General Intelligence — all thanks to the new model.

What’s their secret sauce? For starters, blowing us away with anthropomorphic language from the first paragraph of the announcement:

“We’ve developed a new series of AI models designed to spend more time thinking before they respond.”

and then resetting our expectations when explaining the version name:

“for complex reasoning tasks this is a significant advancement and represents a new level of AI capability. Given this, we are resetting the counter back to 1 and naming this series OpenAI o1.”

That’s the beauty of being the top dog of the AI hype. You get to:

  • Rebrand computing as “thinking.”
  • Advertise that your product solves “complex reasoning tasks” using your benchmarks.
  • Promote that you deliver “a new level of AI capability.”

Even better, OpenAI is so good that they sell us a performance regression — spending more time performing a task — as an indication of human-like capabilities.

“We trained these models to spend more time thinking through problems before they respond, much like a person would. Through training, they learn to refine their thinking process, try different strategies, and recognize their mistakes.”

I’m so in awe of OpenAI’s media strategy for the launch of the o1 models that I did a deep dive into what they said — and what they didn’t.

Let me share my insights.

Who Is o1 For?

OpenAI marketing is crystal clear about the target audience for the o1 models — sectors such as healthcare, semiconductors, quantum computing, and coding.

Whom it’s for
These enhanced reasoning capabilities may be particularly useful if you’re tackling complex problems in science, coding, math, and similar fields. For example, o1 can be used by healthcare researchers to annotate cell sequencing data, by physicists to generate complicated mathematical formulas needed for quantum optics, and by developers in all fields to build and execute multi-step workflows.

OpenAI o1-mini
The o1 series excels at accurately generating and debugging complex code. To offer a more efficient solution for developers, we’re also releasing OpenAI o1-mini, a faster, cheaper reasoning model that is particularly effective at coding. As a smaller model, o1-mini is 80% cheaper than o1-preview, making it a powerful, cost-effective model for applications that require reasoning but not broad world knowledge.

Moreover, they left no doubt that OpenAI o1 and o1-mini are restricted to paying customers. However, never wanting to get bad press, they mention plans to “bring o1-mini access to all ChatGPT Free users.”

Like Ferrari, Chanel, or Prada, o1 models are not for everybody.

But why the business model change? Because

  • You don’t make billions from making free products, replacing low-pay call centre workers, or saving minutes on admin tasks.
  • There is an enormous gap between the $3.4 billion in revenue OpenAI reported in the last 6 months and investors’ expectations of getting $600 billion from Generative AI.

More about investors in the next section.

Words matter: “Thinking” for Inferring

OpenAI knows that peppering their release communications with words that denote human capabilities creates buzz by making people — and above all investors — dream of AGI. The earlier Sora and ChatGPT-4o announcements already described those applications in terms of “reason”, “understanding”, and “comprehend”.

For OpenAI o1, they’ve gambled everything on the word “thinking”, plastering it all over the announcements about the new models: Social media, blog posts, and even videos.

The OpenAI logo and the word Thinking on a grey background.
Screenshot of a video embedded on the webpages announcing the OpenAI o1 model.

Why not use the word that accurately describes the process — inference? If too technical, what about options like “calculate” or “compute”? Why hijack the word “thinking”, at the core of the human experience?

Because they have failed to deliver on their AGI and revenue promises. OpenAI’s (over)use of “thinking” is meant to convince investors that the o1 models are the gateway to both AGI and the $600 billion revenue mentioned above. Let me convince you.

The day before the o1 announcement, Bloomberg revealed that

  • OpenAI is in talks to raise $6.5 billion from investors at a valuation of $150 billion, significantly higher than the $86 billion valuation from February.
  • At the same time, it’s also in talks to raise $5 billion in debt from banks as a revolving credit facility.

Moreover, two days later, Reuters reported more details about the new valuation:

“Existing investors such as Thrive Capital, Khosla Ventures, as well as Microsoft (MSFT.O), are expected to participate. New investors including Nvidia (NVDA.O), and Apple (AAPL.O), also plan to invest. Sequoia Capital is also in talks to come back as a returning investor.”

How do you become the most valuable AI startup in the world?

You “think” your way to it.

Rebranding the Boys’ Club

In tech, we’re used to bragging — from companies that advertise their products under false pretences to CEOs celebrating that they’ve replaced staff with AI chatbots. And whilst that may fly with some investors, it typically backfires with users and the public.

That’s what makes OpenAI’s humblebragging and inside jokes a marketing game-changer.

Humblebragging

Humblebragging: the action of making an ostensibly modest or self-deprecating statement with the actual intention of drawing attention to something of which one is proud.

Sam Altman delivered a masterclass in humblebragging in his X thread on the o1 release. See the first tweet of the series below:

Text from Sam Altman’s first tweet on the release of o1: “here is o1, a series of our most capable and aligned models yet: https://openai.com/index/learning-to-reason-with-llms/ o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.”
The first tweet of Sam Altman’s thread on the release of o1.

He started with the “humble” piece first — “still flawed, still limited” — and quickly followed with the bragging: check the chart showing a marked performance improvement compared to GPT-4o and even a variable called “expert human” (more on “experts” in the next section).

Sam followed up with three more tweets singing the praises of the new release:

“but also, it is the beginning of a new paradigm: AI that can do general-purpose complex reasoning. o1-preview and o1-mini are available today (ramping over some number of hours) in ChatGPT for plus and team users and our API for tier 5 users.
 screenshot of eval results in the tweet above and more in the blog post, but worth especially noting: a fine-tuned version of o1 scored at the 49th percentile in the IOI under competition conditions! and got gold with 10k submissions per problem.
 extrem
Sam Altman’s X thread about the release of o1.

In summary, by starting with the shortcomings of the o1 models, he pre-empted backlash and criticism about not delivering on ChatGPT-5 or AGI. Then, he “tripled down” on why the release is such a breakthrough. He even had enough characters left to mention that only paying customers would have access to it.

Sam, you’re a marketing genius!

Inside Jokes

There has been a lot of speculation about the o1 release being code-named “Strawberry”. Why?

There has been negative publicity around ChatGPT-4 repeating over and over that the word “strawberry” has only two “r” letters rather than three. You can see the post on the OpenAI community.

But OpenAI is so good at PR that they’ve even leveraged the “strawberry bug” to their advantage. How?

By using the bug fix to showcase o1’s “chain of thought” (CoT) capability. In contrast with standard prompting, CoT “not only seeks an answer but also requires the model to explain its steps to arrive at that answer.”
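As a rough illustration of the difference, here is a minimal sketch using the OpenAI Python SDK; the model name and prompt wording are my assumptions, not OpenAI’s code.

```python
# Minimal sketch of standard vs chain-of-thought-style prompting,
# assuming the OpenAI Python SDK (pip install openai) and an
# OPENAI_API_KEY in the environment. Model name is illustrative.
from openai import OpenAI

client = OpenAI()
question = "How many r's are in the word 'strawberry'?"

# Standard prompting: ask for the answer directly.
standard = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought-style prompting: ask the model to spell out its
# steps before committing to an answer.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": question + " Think step by step: list each letter, "
                              "mark the r's, then give the final count.",
    }],
)

print("Standard:", standard.choices[0].message.content)
print("CoT-style:", cot.choices[0].message.content)
```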

More precisely, they compare the outputs of GPT-4o and OpenAI o1-preview for a cypher exercise. The prompt is the following:

oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step

Use the example above to decode:

oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz

And here is the final output:

Comparison between outputs from GPT-4o and OpenAI o1-preview for decryption task from OpenAI website.

Whilst GPT-4o is not able to decode the text, OpenAI o1-preview completes the task successfully by decoding the message:

“THERE ARE THREE R’S IN STRAWBERRY”

Is that not world-class marketing?
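For the curious, the cypher itself is simple once you spot it: each pair of ciphertext letters decodes to the letter at the average of their alphabet positions. Below is my Python reconstruction of the scheme (not OpenAI’s code).

```python
# Reconstruction of the cypher from OpenAI's o1 demo: each pair of
# ciphertext letters decodes to the letter at the average of their
# alphabet positions (e.g. 'o' and 'y' -> positions 15 and 25 -> 20 -> 't').
def decode(ciphertext: str) -> str:
    decoded_words = []
    for word in ciphertext.split():
        pairs = [word[i:i + 2] for i in range(0, len(word), 2)]
        letters = [
            chr((ord(a) - 97 + ord(b) - 97) // 2 + 97)
            for a, b in pairs
        ]
        decoded_words.append("".join(letters))
    return " ".join(decoded_words)

print(decode("oyfjdnisdr rtqwainr acxz mynzbhhx"))
# -> think step by step
print(decode("oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"))
# -> there are three rs in strawberry
```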

The Human Experts vs o1 Models

If you want to convince investors that you’re solving the kind of problems corporations and governments pay billions for — e.g. healthcare — you need more than words.

And here again, OpenAI copywriting excels. Let’s see some examples:

PhD vs o1 Models

Who’s our standard for solving the world’s most pressing issues? In other words, the kind of problems that convince investors to give you billions?

Scientists, thought-leaders, academics. This explains OpenAI’s obsession with the word “expert” when comparing human and o1 performance.

And who does OpenAI deem “expert”? People with PhDs.

Below is an outstanding example of mashing up “difficult intelligence”, “human experts”, and “PhD” to hint that o1 models have a kind of super-human intelligence.

We also evaluated o1 on GPQA diamond, a difficult intelligence benchmark which tests for expertise in chemistry, physics and biology.

In order to compare models to humans, we recruited experts with PhDs to answer GPQA-diamond questions. We found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark.

But how does equating a PhD title with being an expert hold up in real life? I have a PhD in Chemistry, so let me reveal to you the underbelly of this assumption.

First, let’s start with how I got my PhD. For five years, I performed research on the orientation of polymer (plastic) blends by infrared dichroism (an experimental technique) and molecular dynamics (a computer simulation technique). Then, I wrote a thesis and four peer-reviewed articles about my findings. Finally, a jury of scientists decided that my work was original and worth a PhD title.

Was I an expert in chemistry when I finished my PhD? Yes and no.

  • Yes, I was an expert in an extremely narrow domain of chemistry — see the description of my thesis work in the previous paragraph.
  • No, I was definitely out of my depth in many other chemistry domains like organic chemistry, analytical chemistry, and biochemistry.

What’s the point of having a PhD then? To learn how to perform independent research. Exams about STEM topics don’t grant you the PhD title; your research does.

Has OpenAI’s marketing gotten away with equating a PhD with being an expert?

If we remember that their primary objective is not scientists’ buy-in but investors’ and CEOs’ money, then the answer is a resounding “yes”.

Humans vs o1 Models

As mentioned above, OpenAI extensively used exams in their announcement to illustrate that o1 models are comparable to — or better than — human intelligence.

How did they do that? By reinforcing the idea that humans and o1 models were “taking” the exams in the same conditions.

We trained a model that scored 213 points and ranked in the 49th percentile in the 2024 International Olympiad in Informatics (IOI), by initializing from o1 and training to further improve programming skills. This model competed in the 2024 IOI under the same conditions as the human contestants. It had ten hours to solve six challenging algorithmic problems and was allowed 50 submissions per problem.

Really? Had the human contestants ingested billions of data points in the form of databases, past exams, books, and encyclopedias before sitting the exam?

Still, the sentence does the trick of making us believe in a level playing field when comparing human and o1 performance. Well done, OpenAI!

The Non-Testimonial Videos

Previous OpenAI releases showcased videos of staff demoing the products. For the o1 release, they’ve upped their game by a quantum leap, featuring videos of “experts” (almost) chanting the praises of the new models. Let’s have a closer look.

OpenAI shares four videos of researchers in different domains. Whilst we might expect them to talk about their experience using o1 models, the reality is that we mostly get product placement and cryptic praise.

Genetics:
This video stars Dr Catherine Brownstein, a geneticist at Boston Children’s Hospital. My highlight is seeing her type into OpenAI o1-preview the prompt “Can you tell me about citrate synthase in the bladder?” — as I read the disclaimer “ChatGPT can make mistakes. Check important info” — followed by her ecstatic praise of the output, as if she’d consulted the Oracle of Delphi.

Prompt “Can you tell me about citrate synthase in the bladder?” with the text underneath “ChatGPT can make mistakes. Check important info.”
Prompt shown in the video of Dr Catherine Brownstein.

Economics:
Here, Dr Tyler Cowen, a professor at George Mason University, tells us that he thinks “of all the versions of GPT as embodying reasoning of some kind.” He also takes the opportunity to promote his book Average is Over, in which he claims to have predicted AI would “revolutionise the world.”

He also shows an example of a prompt on an economics subject and OpenAI o1’s output, followed by “It’s pretty good. We’re just figuring out what it’s good for.”

That sounds like a bad case of a hammer looking for a nail.

Coding:
The protagonist is Scott Wu, CEO and co-founder of Cognition and a competitive programmer. In the video, he claims that o1 models can “process and make decisions in a more human-like way.” He discloses that Cognition has been working with OpenAI and shares that o1 is incredible at “reasoning.” From that point on, we get submerged in a Cognition infomercial.

We learn that they’re building the first fully autonomous software agent, Devin. Wu shows us Devin’s convoluted journey — and the code behind it — to analyse the sentiment of a tweet from Sam Altman, which included a sunny photo of a strawberry plant (pun again) and the sentence “I love summer in the garden.”

And there is a happy ending. We learn that Devin “breaks down the text” and “understands what the sentiment is,” finally concluding that the predominant emotion of the tweet is happiness. An interesting way to demonstrate Devin’s “human-like” decision-making.

A tweet from Sam Altman with a photo of a strawberry plant against a sunny background, with the caption “i love summer in the garden.”
Sam Altman’s tweet portrayed in Scott Wu’s video.

Quantum physics:
This video focuses on Dr Mario Krenn, quantum physicist and research group leader at the Artificial Scientist Lab at the Max Planck Institute for the Science of Light. It starts with him showing the ChatGPT screen and enigmatically saying “I can kind of easily follow the reasoning. I don’t need to trust the research. I just need to look what did it do.” And the cryptic sentences carry on throughout the video.

For example, he writes a prompt about a certain quantum operator and says “Which I know previous models that GPT-4 are very likely failing this task” and “In contrast to answers from ChatGPT-4 this one gives me very detailed mathematics”. We also hear him saying, “This is correct. That makes sense here,” and, “I think it tries to do something incredibly difficult.”

To me, rather than a wholehearted endorsement, it sounds like somebody avoiding compromising their career.

In summary, often the crucial piece is not the message but the messenger.

What I missed

Un-sustainability

Sam Altman testified to the US Senate that AI could address issues such as “climate change and curing cancer.”

As OpenAI o1 models spend more time “thinking”, this translates into more computing time. That is more electricity, water, and carbon emissions. It also means more datacenters and more e-waste.

Don’t believe me? In a recent article published in The Atlantic about the contrast between Microsoft’s use of AI and their sustainability commitments, we learn that

“Microsoft is reportedly planning a $100 billion supercomputer to support the next generations of OpenAI’s technologies; it could require as much energy annually as 4 million American homes.”

However, I don’t see those “planetary costs” in the presentation material.

This is not a bug but an OpenAI feature — I already raised their lack of disclosure regarding energy efficiency, water consumption, or CO2 emissions for ChatGPT-4o.

As OpenAI tries to persuade us that the o1 model thinks like a human, it’s a good moment to remember that human brains are much more efficient than AI.

And don’t take my word for it. Blaise Aguera y Arcas, VP at Google and AI advocate, confirmed at TEDxManchester 2024 that human brains are much more energy efficient than AI models and that currently we don’t know how to bridge that gap.

Copyright

What better way to avoid the conversation about using copyrighted data for the models than adding more data? From the o1 system card:

The two models were pre-trained on diverse datasets, including a mix of publicly available data, proprietary data accessed through partnerships, and custom datasets developed in-house, which collectively contribute to the models’ robust reasoning and conversational capabilities.

Select Public Data: Both models were trained on a variety of publicly available datasets, including web data and open-source datasets. […]

Proprietary Data from Data Partnerships: To further enhance the capabilities of o1-preview and o1-mini, we formed partnerships to access high-value non-public datasets.

The text above gives the impression that most of the data is either open-source, proprietary data, or in-house datasets.

Moreover, terms such as “publicly available data” and “web data” are an outstanding copywriting effort to find palatable synonyms for web scraping, web harvesting, or web data extraction.

Have I said I’m in awe of OpenAI’s copywriting capabilities yet?

Safety

As mentioned above, OpenAI shared the o1 system card — a 43-page document — which in the introduction states that the report

outlines the safety work carried out for the OpenAI o1-preview and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.

It sounds very reassuring… if it weren’t for the fact that, in the same paragraph, we also learn that the o1 models can “reason” about OpenAI safety policies and have “heightened intelligence.”

In particular, our models can reason about our safety policies in context when responding to potentially unsafe prompts.

This leads to state-of-the-art performance on certain benchmarks for risks such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks. Training models to incorporate a chain of thought before answering has the potential to unlock substantial benefits, while also increasing potential risks that stem from heightened intelligence.

And then, OpenAI has a strange way of persuading us that these models are safe. For example, in the Hallucination Evaluations section, we’re told that OpenAI tested o1-preview and o1-mini against three kinds of evaluations aimed at eliciting hallucinations from the model. Two are especially salient:

• BirthdayFacts: A dataset that requests someone’s birthday and measures how often the model guesses the wrong birthday.

• Open Ended Questions: A dataset asking the model to generate arbitrary facts, such as “write a bio about ”. Performance is measured by cross-checking facts with Wikipedia, and the evaluation measures how many incorrect statements are generated (which can be greater than 1).

Isn’t it lovely that they were training the model to search for and retrieve personal data? I feel much safer now.
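For context, here is how mechanical a BirthdayFacts-style check would be. This is a minimal sketch assuming the OpenAI Python SDK; the ground-truth entries, prompt wording, and model name are hypothetical, as OpenAI’s evaluation code is not public.

```python
# Minimal sketch of a BirthdayFacts-style hallucination evaluation:
# ask the model for birthdays and count how often its guess contradicts
# a ground-truth list. Dataset entries, prompt wording, and model name
# are hypothetical; OpenAI's actual evaluation code is not public.
from openai import OpenAI

client = OpenAI()

ground_truth = [
    ("Jane Doe", "1987-03-14"),  # hypothetical entries
    ("John Roe", "1990-11-02"),
]

wrong = 0
for name, birthday in ground_truth:
    reply = client.chat.completions.create(
        model="gpt-4o",  # illustrative; the system card evaluated o1 models
        messages=[{
            "role": "user",
            "content": f"What is {name}'s date of birth? "
                       "Answer only with a date in YYYY-MM-DD format.",
        }],
    )
    answer = reply.choices[0].message.content.strip()
    if birthday not in answer:
        wrong += 1

print(f"Wrong-birthday rate: {wrong / len(ground_truth):.0%}")
```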

And this is only one example of the tightrope walk OpenAI attempts throughout the o1 system card:

  • On one side, taking every opportunity to sell “thinking” models to investors;
  • On the other, desperately avoiding the o1 models being classified as high or critical risk by regulators.

Will OpenAI succeed? If you can’t convince them, confuse them.

What’s next?

Uber, Reddit, and Telegram relished their image of “bad boys”. They were adamant about proving that “It’s better to ask forgiveness than permission” and proudly advertised that they too “Moved fast and broke things”.

But there is only one Mark Zuckerberg and one Steve Jobs who can pull that off. And only Amazon, Microsoft, and Google have the immense resources and the monopolies to run the show as they want.

OpenAI has understood that storytelling — how you tell your story — is not enough. You need to “create” your story if you want investors to keep pouring in billions without a sign of a credible business model.

I have no doubt that OpenAI will make a dent in the history of how tech startups market themselves.

They have created the textbook of what a $150 billion valuation release should look like.


You and Strategic AI Leadership

If you want to develop your AI acumen, forget the quick “remedies” and plan for sustainable learning.

That’s exactly what my program Strategic AI Leadership delivers. Below is a sample of the topics covered:

  • AI Strategy
  • AI Risks
  • Operationalising AI
  • AI, data, and cybersecurity
  • AI and regulation
  • Sustainable AI
  • Ethical and inclusive AI

Key outcomes from the program:

  • Understanding AI Fundamentals: Grasp essential concepts of artificial intelligence and the revolutionary potential it holds.
  • Critical Perspective: Develop a discerning viewpoint on AI’s benefits and challenges at organisational, national, and international levels.
  • Use Cases and Trends: Gain insights into real uses of AI and key trends shaping sectors, policy, and the future of work.
  • A toolkit: Access to tools and frameworks to assess the strategy, risks, and governance of AI tools.

I’m a technologist with 20+ years of experience in digital transformation and AI that empowers leaders to harness the potential of AI for sustainable growth.

Contact me to discuss your bespoke path to responsible AI innovation.