Self-Reported Impact of Generative AI on Critical Thinking
A survey by Microsoft and Carnegie Mellon University collected self-reports from knowledge workers who use Generative AI at least once a week on how their critical thinking habits have shifted.
Caveats
- This was a survey, so the results depend heavily on participants' judgment of their own behaviour.
- There is no unanimously agreed definition of critical thinking; the authors chose the Bloom et al. framework, which breaks it down into six categories: knowledge, comprehension, application, analysis, synthesis, and evaluation.
- Microsoft, with a significant stake in OpenAI, has a conflict of interest in this study.
Methodology
The respondents came from the Prolific platform; they were quite well balanced in gender, and over two-thirds were under 35. They were given brief training on how to identify critical thinking, then asked free-text questions, multiple-choice questions, and rating questions (on a 1-5 scale). Some participants were excluded for low-effort free-text responses, and some answers to individual questions were discarded as duplicates.
Findings
Workers used critical thinking when:
- Forming a goal before interacting with GenAI
- Forming the query
- Inspecting the response and ensuring its quality and factuality
- Integrating the response - both in content and tone
Users were more likely to think critically when they had higher confidence in doing the job themselves, higher confidence in evaluating the response, lower confidence in GenAI's ability to do the job, and higher propensity to reflect on their work. Overall trust in GenAI had no significant effect.
They cited work quality, the risk of negative outcomes, and upskilling as motivators for critical thinking, while citing confidence in GenAI, time pressure, tasks outside their job scope, and limits to their own abilities as inhibitors.
Workers who were more confident in GenAI and less confident in themselves reported larger decreases in cognitive load when using it. Their propensity to reflect on their work had no significant effect on this difference.
Australian Treasury Trial Of Copilot
Australia's Department of the Treasury trialled Copilot and published an evaluation.
Caveats
- The scope of this experiment was limited to the use of Copilot within a single Department.
- Participants were volunteers, who likely skewed enthusiastic about the experiment.
Methodology
A total of 218 employees volunteered and were surveyed at the end of the 14-week trial.
Findings
Volunteers reported that Copilot was less helpful, in both quantity and quality of output, than they had expected. The Treasury concluded that expectations had probably been over-optimistic and had neglected employee training requirements, but that savings of even 13 minutes per employee per week would yield a positive return on investment. Copilot was found to be most useful for basic administrative tasks such as information gathering, meeting summarization, and drafting content. Automatically generated meeting minutes helped with inclusion for employees who work part time, are neurodivergent, or suffer from chronic conditions.
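To make the return-on-investment claim concrete, here is a back-of-envelope sketch in Python. Only the 13-minutes-per-week figure comes from the evaluation; the licence price, hourly labour cost, and number of working weeks are illustrative assumptions, not numbers from the trial.

```python
# Back-of-envelope check of the Treasury's break-even claim.
# Only the 13 minutes/week figure comes from the evaluation; the licence
# price, hourly labour cost, and working weeks are illustrative assumptions.
LICENCE_COST_PER_YEAR = 30 * 12   # assumed: ~$30 per seat per month
HOURLY_LABOUR_COST = 60.0         # assumed: fully loaded cost of one employee-hour
WORKING_WEEKS_PER_YEAR = 46       # assumed: after leave and public holidays

minutes_saved_per_week = 13       # figure cited in the Treasury evaluation

hours_saved_per_year = minutes_saved_per_week / 60 * WORKING_WEEKS_PER_YEAR
value_of_time_saved = hours_saved_per_year * HOURLY_LABOUR_COST
break_even_minutes = LICENCE_COST_PER_YEAR / HOURLY_LABOUR_COST * 60 / WORKING_WEEKS_PER_YEAR

print(f"Hours saved per year:    {hours_saved_per_year:.1f}")
print(f"Value of time saved:     ${value_of_time_saved:,.0f}")
print(f"Annual licence cost:     ${LICENCE_COST_PER_YEAR:,}")
print(f"Break-even minutes/week: {break_even_minutes:.1f}")
```

Under these assumed figures the break-even point is roughly 8 minutes of saved time per employee per week, which the cited 13-minute figure clears comfortably; plugging in real licence and salary numbers would move that threshold.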
That said, the Treasury was strongly sceptical about the security of AI-as-a-service, suggesting that on-premises AI infrastructure would better suit a government's needs.
BBC Research On Representation Of News By Chatbots
An experiment by the BBC measured how accurately chatbots represented its news.
Caveats
- The BBC believes chatbots repurpose content from news publishers without consent, so it is not a disinterested party.
Methodology
ChatGPT, Copilot, Gemini, and Perplexity were granted access to the BBC news website. The chatbots were then asked 100 questions about the news, and their responses were evaluated by BBC journalists.
Findings
- 51% of answers contained significant issues of some form
- 91% of answers contained some issues
- 19% of answers contained factual errors, including hallucinated statements, numbers, and dates
- 13% of quotes attributed to the BBC were either altered or entirely made up
- There were 23 instances of a chatbot presenting commentators' or debaters' opinions as fact