Self-Reported Impact of Generative AI on Critical Thinking
A survey by Microsoft and Carnegie Mellon University collected self-reports from knowledge workers who use Generative AI at least once a week on how their critical thinking habits have shifted.
Caveats
- This was a survey, so the results depend heavily on participants' judgment of their own behaviour.
- There is no unanimously agreed definition of critical thinking; the authors chose the Bloom et al. framework, which breaks it down into six categories: knowledge, comprehension, application, analysis, synthesis, and evaluation.
- Microsoft, with a significant stake in OpenAI, has a conflict of interest in this study.
Methodology
The respondents came from the Prolific platform; they were quite well balanced in gender, and over two-thirds were under 35. They were given brief training on how to identify critical thinking, then asked free-text questions, multiple-choice questions, and rating questions (on a 1-5 scale). Some participants were excluded for low-effort free-text responses, and some answers to individual questions were discarded as duplicates.
Findings
Workers used critical thinking when:
- Forming a goal before interacting with GenAI
- Forming the query
- Inspecting the response and ensuring its quality and factuality
- Integrating the response - both in content and tone
Users were more likely to think critically when they had higher confidence in doing the job themselves, higher confidence in evaluating the response, lower confidence in GenAI's ability to do the job, and higher propensity to reflect on their work. Overall trust in GenAI had no significant effect.
They cited work quality, the risk of negative outcomes, and upskilling as motivators for critical thinking, while citing confidence in GenAI, time pressure, tasks outside their job scope, and limits to their own abilities as inhibitors.
Workers who were more confident in GenAI and less confident in themselves reported larger decreases in cognitive load when using it. Their propensity to reflect on their work had no significant effect on this difference.
Australian Treasury Trial Of Copilot
Australia's Department of the Treasury trialled Copilot and published an evaluation.
Caveats
- The scope of this experiment was limited to the use of Copilot within a single Department.
- Participants were volunteers, who likely skewed enthusiastic about the experiment.
Methodology
A total of 218 employees volunteered and were surveyed at the end of the 14-week trial.
Findings
Volunteers reported that Copilot was less helpful, in both quantity and quality of output, than they had expected. The Treasury concluded that expectations had probably been over-optimistic and had neglected employee training requirements, but that savings of even 13 minutes per employee per week would yield a positive return on investment. Copilot was found to be most useful for basic administrative tasks such as information gathering, meeting summarization, and drafting content. Automatically generated meeting minutes helped with inclusion for employees who work part time, are neurodivergent, or suffer from chronic conditions.
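To make the return-on-investment claim concrete, here is a back-of-envelope sketch in Python. Only the 13-minutes-per-week figure comes from the evaluation; the licence price, hourly labour cost, and number of working weeks are illustrative assumptions, not numbers from the trial.

```python
# Back-of-envelope check of the Treasury's break-even claim.
# Only the 13 minutes/week figure comes from the evaluation; the licence
# price, hourly labour cost, and working weeks are illustrative assumptions.
LICENCE_COST_PER_YEAR = 30 * 12   # assumed: ~$30 per seat per month
HOURLY_LABOUR_COST = 60.0         # assumed: fully loaded cost of one employee-hour
WORKING_WEEKS_PER_YEAR = 46       # assumed: after leave and public holidays

minutes_saved_per_week = 13       # figure cited in the Treasury evaluation

hours_saved_per_year = minutes_saved_per_week / 60 * WORKING_WEEKS_PER_YEAR
value_of_time_saved = hours_saved_per_year * HOURLY_LABOUR_COST
break_even_minutes = LICENCE_COST_PER_YEAR / HOURLY_LABOUR_COST * 60 / WORKING_WEEKS_PER_YEAR

print(f"Hours saved per year:    {hours_saved_per_year:.1f}")
print(f"Value of time saved:     ${value_of_time_saved:,.0f}")
print(f"Annual licence cost:     ${LICENCE_COST_PER_YEAR:,}")
print(f"Break-even minutes/week: {break_even_minutes:.1f}")
```

Under these assumed figures the break-even point is roughly 8 minutes of saved time per employee per week, which the cited 13-minute figure clears comfortably; plugging in real licence and salary numbers would move that threshold.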
That said, the Treasury was strongly sceptical about the security of AI-as-a-service, suggesting that on-premises AI infrastructure would better suit a government's needs.
BBC Research On Representation Of News By Chatbots
An experiment by the BBC measured how accurately chatbots represented its news.
Caveats
- The BBC believes chatbots repurpose content from news publishers without consent, so it is not a disinterested party.
Methodology
ChatGPT, Copilot, Gemini, and Perplexity were granted access to the BBC news website. The chatbots were then asked 100 questions about the news, and their responses were evaluated by BBC journalists.
Findings
- 51% of answers contained significant issues of some form
- 91% of answers contained some issues
- 19% of answers contained factual errors, including hallucinated statements, numbers, and dates
- 13% of quotes attributed to the BBC were either altered or entirely made up
- There were 23 instances of a chatbot presenting commentators' or debaters' opinions as fact