Trends in AI for student assessment – A roller coaster ride

The greatest challenges facing universities and educators in using generative AI for student assessment are the sheer speed at which the higher education sector is evolving and the exponential increase in the capabilities of AI tools, a University World News-ABET webinar heard this week.

“I don’t recommend individual AI tools because they change so quickly and the capabilities move so fast,” said panellist Dr Nigel Francis, winner of prestigious awards for teaching excellence and digital education lead for the school of biosciences at Cardiff University in the United Kingdom.

“So using AI in assessment is really about playing around with offerings and seeing what works for you and what produces the best outcomes in your particular context.

“In terms of incorporating AI into assessment, it’s about educating students on how to use AI tools in a transparent, ethical manner, so that AI supports student learning rather than allowing them to offload part of the learning to the tool.”

At some higher education institutions, how students use AI is up to each faculty member, said Dr Gloria Rogers, senior adjunct director for professional offerings at ABET, the United States-based but globally operating non-profit quality assurance and accreditation agency in the STEM fields, and an assessment and data analyst at Indiana State University.

So if a student is taking four or five courses, they may have four or five different kinds of requirements on the use of AI. “Things are moving so rapidly. We’re chasing AI, that’s for sure,” Rogers said. A next step will likely be artificial general intelligence, which aims to create AI with human-like cognitive abilities.

The webinar

“Trends in AI for student assessment – Learn from experts” was held on 21 January and is the first in a series of University World News webinars being held in partnership with ABET.

An audience poll asked whether participants had used a large language model (LLM), such as GPT, Claude or Llama, for their work. More than half (52%) of respondents selected “It’s a lot of my workflow”, while 30% said they had used an LLM once or twice and 18% said no. So LLMs are very much part of academic life now.

Challenges in using AI for assessment

The webinar audience was also asked about the main challenges they face in using generative AI for assessment. Interestingly, only 6% said they did not face significant challenges.

But most people do. More than half (53%) identified one challenge as being ‘verifying the accuracy and validity of AI-generated results’, and 49% said they lacked training or expertise in using generative AI tools. ‘Difficulty integrating AI tools within current assessment systems’ was cited by 45%, while 41% pointed to the challenge of addressing ethical concerns.

Of less concern to the webinar participants were ‘ensuring fairness and reducing bias in AI-based assessments’ (30%), ‘protecting student data privacy and security’ (25%) and ‘resistance to adopting AI-driven assessment’ (19%).

“AI isn’t a shortcut. We still need to have students with the base level of understanding and the ability to critique knowledge that they gain in their discipline, and to then take an AI output and use it in an effective manner,” said Francis.

“Where we’re struggling is that there are a large number of institutions around the world that do not have clear AI assessment guidelines. They do not tell students what is acceptable and what’s not acceptable. Students are quite fearful sometimes, because they don’t know where that line is.” They worry about being pulled up for unfair practice or academic integrity issues.

“We really need to be quite clear now about the way that AI can be used within assessments, and that’s going to change between different assessments.”

For example, there is in-class assessment, where AI cannot be used at all. There is assessment where an educator advises students to use AI in particular ways. There are also assessments where AI can be embedded and used in any way.

A cautionary tale, said Francis, is that a lot of AI tools have premium versions which you pay for, with outputs of far higher quality than from the free versions. “If you are going to use AI in any assessment, you need to make sure you are providing access to a premium version of that tool for all students. Otherwise, you’re going to disadvantage those that can’t afford to pay for it.

“An evolving question over the next few years will be how we best embed AI and allow AI to be used to enhance the student learning journey as opposed to replacing it.”

For now, said Francis, people must appreciate that AI tools are statistical models without understanding, which have been trained on the internet. “Wikipedia has been scraped, Reddit has been scraped, a lot of copyrighted material from publishing has been scraped. The quality of the data that these models have been trained on is very variable.”

Also, AI still presents a very white, Westernised, male-dominated perspective on a lot of matters, because it reflects the large body of literature that it has been trained on. “If you prompt GPT at a basic level, it can be incredibly biased in its output. But there are ways that you can prompt it more effectively, to address certain cultural or gender biases within an answer.”
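
As a purely illustrative sketch of the kind of prompt refinement Francis describes (the wording below is invented for this article, not taken from the webinar), the same request can be phrased naively or with an explicit counter-bias instruction:

```python
# Illustrative only: a naive prompt versus one that explicitly asks the
# model to counter cultural and gender bias in its answer.
naive_prompt = "List the most influential scientists of the 20th century."

refined_prompt = (
    "List the most influential scientists of the 20th century. "
    "Draw on a range of countries, cultures and genders, and briefly "
    "note any selection bias that remains in your answer."
)
```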

Rogers has sometimes been disappointed by the output of AI tools, which undermines confidence in their ability to produce valid and reliable responses. “To use generative AI without applying human intelligence to it is going to be really problematic,” she told the webinar. As with the internet, it is important to think critically when using AI.

Generative AI is becoming part of everyday life, said Francis. “From an educator perspective, it is really important to teach students to be transparent about the usage of AI, because that’s how to make sure it doesn’t become this taboo underground technology that has to be hidden from academics.

“Using AI should be about openly embracing the technology and learning to work with it, and just being really open and honest about how you’ve used it to do things.”

Process versus product assessment

Francis is interested in process versus product (or outcome) assessment. Does the end product matter, or is the process by which students reach the outcome becoming more important to assess? This can lead to discussion on competency-based assessment.

“The process has always been important because that’s how students are able to demonstrate that they have cognitively engaged with the material. What’s changed dramatically in the last couple of years is that we cannot, with any certainty, guarantee how much of a product is AI generated and how much is human generated,” he told the webinar.

“As technology evolves, it’s going to become even harder to determine where AI starts and where human starts. With process versus product assessment, it’s about trying to ensure that students are able to evidence the process they’ve gone through in order to validate information.

“We all know that these AI models are not perfect. They make mistakes, they hallucinate. I want to make sure students are able to critically analyse AI outputs.”

Francis uses ‘process trails’: students are asked to fill in a form detailing how they have used an AI tool. They are asked, for example, what tool they used, what prompts they entered and how they engaged with the resulting output.

“What we’re seeing is that the more students engage with this process trail, the better the end product is; there’s a correlation between the two,” he continued. “It’s a win-win all around in terms of introducing an element of robustness to AI, students using it ethically and appropriately within an assessment, and then the final product being of higher quality.”
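
As a minimal sketch, the kind of information a process trail gathers could be captured in a simple record like the one below; the field names are illustrative assumptions, not the actual Cardiff University form:

```python
# A minimal sketch of a 'process trail' declaration; the fields here are
# illustrative assumptions, not the actual Cardiff University form.
from dataclasses import dataclass

@dataclass
class ProcessTrailEntry:
    tool: str           # e.g. "ChatGPT"
    prompts: list[str]  # the prompts the student entered
    engagement: str     # how the output was checked, critiqued or revised

entry = ProcessTrailEntry(
    tool="ChatGPT",
    prompts=["Summarise the role of ATP in muscle contraction"],
    engagement="Cross-checked against lecture notes and rewrote in my own words",
)
```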

AI in programme-level assessment

Gloria Rogers spoke about how academics are using AI in programme-level assessment, which also starts with student assessment in the classroom.

Academics, over and above the roles of content expert and facilitator of learning, have a role as a member of a learning community. “By the end of the programme, can students demonstrate their particular knowledge and skills? If not, then we need to do something to improve the programme.”

Rogers calls this high-stakes assessment. “Because at the programme level, with the data that we gather, we’re going to identify strengths and weaknesses in student learning. As a programme, we have to make decisions about what to do to improve student learning over time.”

The quality of the output is very important, but generative AI makes mistakes and ‘hallucinates’, sometimes simply making things up. Using AI in her own work, Rogers has also found that AI does not work well for some processes.

“Some of that is honing my own skills for developing prompts. But I’ve done qualitative analysis by hand, and then used it in ChatGPT, and the results are very different. It’s really important that we take this seriously because somebody is going to act on the results. That’s a big concern. And people are absolutely using ChatGPT and other AI tools.”

Francis stressed the need for different types of assessments. “We need a rich assessment diet, with different types of assessment modalities to ensure that we are hitting learning outcomes from as many different angles as possible, to ensure that students can evidence what they’re doing.”

The question of competencies

“I’m a big believer that assessment is about what a student can do and not what they can remember or know. Almost all of us have a smartphone in our pocket with access to the internet. If we don’t remember something, we can look it up very easily,” Francis said. What is done with that knowledge is what must be assessed.

Rogers stressed the importance of students knowing exactly what is expected of them around competencies. She also raised the question of how to define and evidence broad skills such as critical thinking.

“ChatGPT can be very helpful with what I call the AI filter – moving from artificial intelligence to the programme-level output, which is human intelligence.”

Rogers helps faculty to think about what kinds of questions they should ask. What indicators of performance show that a student has obtained a competency? Are the performance indicators appropriate for the programme?

“ChatGPT does a great job of producing performance indicators. You tell the AI how many you want, and what the competency is, and it’ll generate a bunch of them.

“Then the question becomes, are the indicators appropriate for your programme? Is there only one performance in each indicator? Are the indicators at the right domain level for your programme? What key performance indicators are missing? Are the indicators discipline-specific?”

Another fruitful area of assessment using AI is rubrics, the scoring guides used to evaluate the quality of a student’s work. “ChatGPT does a pretty good job of creating rubrics, if you give it the performance indicators that make up that rubric. I’d give it three-and-a-half stars out of five.”
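
For readers who script this workflow rather than use the chat interface, a hedged sketch using the official openai Python package might look like the following; the model name, indicators and prompt wording are all assumptions for illustration:

```python
# Sketch: asking an LLM to draft a rubric from given performance
# indicators, for human review afterwards. Requires the official
# 'openai' package and an OPENAI_API_KEY environment variable;
# the model name and example indicators are illustrative.
from openai import OpenAI

client = OpenAI()

indicators = [
    "Identifies the assumptions underlying an argument",
    "Evaluates the quality of evidence supporting a claim",
    "Draws a conclusion consistent with the evidence presented",
]

prompt = (
    "Create a four-level analytic rubric (emerging, developing, "
    "proficient, exemplary) for these performance indicators:\n- "
    + "\n- ".join(indicators)
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative choice of model
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # a draft, not a finished rubric
```

As Rogers stresses, the output is a draft: the human review questions above, such as whether there is one performance per indicator and whether the indicators fit the discipline, still apply.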

What university academics and staff need to understand is that ChatGPT is not going to replace effort. Rather, said Rogers, ChatGPT reduces the amount of ‘busy-work’ for faculty, who can instead spend their time critically analysing its output to make it appropriate to their programme. “We’re asking faculty to spend their time more wisely.”

Will generative AI replace human educators?

No, said Nigel Francis. “The reason I say that is based on experience of teaching through the pandemic. Students were craving interaction with human educators. The vast majority of institutions moved everything online, and it wasn’t a great learning experience for students.”

Students are clearly going to be using AI when they graduate, and so it is important that they learn how to use it effectively.

“But I think there will always be an element of humanity in education, because the tools at this point in time do not have human-level intelligence or creativity or empathy. They can’t counsel the student who comes to my office in tears. There’s a very long way to go before AI can replace a human educator.”

Rogers pointed out that the quality of what comes out of ChatGPT depends largely on the quality of the human input it receives. “There are times when people say, well, I’m discouraged about what I’m getting from AI. But the more you talk with them, the more you recognise that much of the problem is about the quality of the prompt that is put into the AI tool.”

Helping faculty to understand how to write prompts, and what spreadsheets, variables and analyses they need, is therefore important. Several libraries of prompts have been built to guide people in writing them. “We’re going to see more of those kinds of tools as well,” Rogers said. “There is a great deal of potential in working with academics in this way.”
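
A prompt library can be as simple as a set of named templates with placeholders that staff fill in; the sketch below is a generic illustration, not one of the libraries Rogers mentioned:

```python
# A toy prompt library: named templates with placeholders, so staff can
# reuse well-tested wording instead of starting from a blank box.
PROMPT_LIBRARY = {
    "performance_indicators": (
        "Generate {n} measurable performance indicators for the "
        "competency '{competency}'. Each indicator should describe "
        "exactly one observable performance."
    ),
    "rubric": (
        "Create a {levels}-level analytic rubric for these performance "
        "indicators:\n{indicators}"
    ),
}

def render(name, **fields):
    """Fill in a named template and return the finished prompt text."""
    return PROMPT_LIBRARY[name].format(**fields)

print(render("performance_indicators", n=5, competency="critical thinking"))
```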

Francis predicted that a lot of universities are going to have to appoint specialists “to come up with policies and to take the onus off people like myself, who are enthusiastic amateurs who want to play with these tools and explore what they can do”.

The University of Sydney is designing assessments that can be used alongside AI tools to support learning, rather than trying to restrict their use. “That’s a fantastic position to take, and that’s ultimately where lots of places will go,” said Francis. “We’re going to have to constantly update and try to keep abreast as best we can.”

Rogers said that for a college or university, a first priority is identifying how generative AI is currently being used by students and staff. A second priority is for institutions to insist on transparency in the use of AI.

“What are we really afraid of? Are we afraid that students are going to graduate not knowing much because they get everything on ChatGPT? Are we afraid that faculty aren’t going to work anymore, they’ll just let some AI system generate their syllabus and grade papers? We’ve got to think about how we can best utilise the good things about AI.”

Bloom’s Taxonomy and AI

One participant asked which levels of learning outcomes, in the revised Bloom’s Taxonomy, would be assessed effectively with AI.

Bloom’s starts at the knowledge level of recall and goes up to the ‘create’ level, said Rogers. “You can use multiple-choice and true-false tests to measure whether or not a student knows something and whether they understand it. When you come to analysis, it gets more complex.

“I’m not sure that you can assess creativity using AI, because AI tools don’t know anything about creativity in terms of knowing it when they see it, right?” The kind of assessment, and whether it is quantitative or qualitative, also makes a difference.

This opens up an interesting question about what we mean by understanding and knowledge, Francis agreed, “because in my view, an AI tool doesn’t have understanding. It predicts the next token that it’s going to use. It doesn’t really understand it.”
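
A toy illustration of what “predicting the next token” means (the candidate tokens and scores below are made up): the model assigns a score to every candidate token, turns the scores into probabilities, and emits a likely one, with no model of meaning behind the choice.

```python
# Toy sketch of next-token prediction: made-up scores (logits) for three
# candidate tokens are converted to probabilities with softmax.
import math

candidates = {"understanding": 2.1, "statistics": 3.4, "banana": -1.0}

total = sum(math.exp(score) for score in candidates.values())
probs = {tok: math.exp(score) / total for tok, score in candidates.items()}

for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{token}: {p:.2f}")
# The most probable token wins; at no point does the model "understand" it.
```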

AI could probably be used up to about the fourth level of Bloom’s, which is ‘analyse’, Francis said. “It’s certainly very good at analysing data. But once you get to the evaluation and creativity levels, AI really struggles. If you look at the outputs when you ask AI to create a new idea, a lot of the ideas are either unfeasible or they are just plain wrong. It might get better.”

Another problem with using AI to assess higher-order thinking skills is that it “dramatically overestimates” student performance compared to a human marker.

“There’s going to be a place for creating quick, generic style feedback to students before they get more detailed feedback from the academic, possibly. But at this stage, I certainly wouldn’t be relying on an AI tool to provide the feedback that you give to a student.”

Generative AI, Francis concluded, “is very much a black box. We can’t see inside that box, and we don’t know how these models are doing what they do. I don’t think some of the people who created them really understand how they’re doing some of the things that they’re doing now. We need to have much tighter regulations around how AI can be developed and used.”

The webinar highlighted how rapidly generative AI is evolving, and how much of what lies ahead remains unknown, said Brendan O’Malley. He wrapped up with a cheeky audience question: “Do you think that one day AI could substitute for a missing panellist?”

Source: Karen MacGregor