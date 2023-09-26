An interesting article in VentureBeat, “Why exams intended for humans might not be good benchmarks for LLMs like GPT-4”, touches on an issue I think about every time I read something like “ChatGPT passes the exam for so and so”, a claim that has become the subject of much ill-informed discussion.

We’re surprised and perhaps fearful when an algorithm breezes through an exam we would need to spend a lot of time preparing for. That’s understandable. But if we look at how an algorithm learns, it’s not so surprising: first of all, we are talking about algorithms trained with an enormous amount of information, practically everything that’s online, with a few obvious exceptions. Their developers make a special effort to separate the data they use for training from the data they use for subsequent testing. Nevertheless, the amount of data used in training is so enormous that it is very difficult to ensure that the examples later used to test the model are not somehow included in the training data. This creates a problem commonly known as training data contamination: since the algorithm’s memory is, in principle, very large and perfect (it is digital), questions that were included in its training data are ones it always answers well, but it would be a mistake to expect the same performance on questions that were not included there and that it can only answer derivatively.
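To make the idea of contamination more concrete, here is a minimal sketch of one common way to look for it: flagging test questions that share a long run of consecutive words with any training document. The function names (`ngrams`, `contamination_rate`), the default window of eight words and the sample texts are my own assumptions for illustration; this is not how any particular developer audits their models, and real checks have to cope with paraphrases, translations and corpora far too large to hold in memory.

```python
import re


def ngrams(text, n=8):
    """Return the set of word n-grams in `text` (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(test_items, training_docs, n=8):
    """Return (share of test items sharing an n-gram with the training data, flagged items)."""
    if not test_items:
        return 0.0, []
    # Collect every n-gram seen anywhere in the (hypothetical) training corpus.
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    # A test item is flagged if at least one of its n-grams also appears in training.
    flagged = [item for item in test_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(test_items), flagged


# Hypothetical data: one exam question was quoted in a forum post that got scraped.
training_docs = [
    "Forum post: which nerve innervates the deltoid muscle of the shoulder? Answer: axillary.",
]
test_items = [
    "Which nerve innervates the deltoid muscle of the shoulder?",
    "Describe how you would manage a patient presenting with sudden chest pain.",
]
rate, flagged = contamination_rate(test_items, training_docs, n=5)
print(rate, flagged)
```

A high rate on a benchmark would suggest that a good score reflects recall rather than reasoning, which is exactly the worry described above.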

The problem here is that the human brain works differently from an algorithm, and with obvious limitations: we don’t have unlimited memory, and our ability to relate information to real-world situations is built up over time and through experience. In fact, it is not even clear (or rather, it is very unclear) whether the exams and tests designed to assess our knowledge are appropriate: an exam based on a huge syllabus is useless for qualifying a judge or a notary, because rote memorization, which is extensively tested by having examinees regurgitate topics, is far less important than their relational ability, which is harder to test conventionally. In practice, our memory has its own algorithm: we remember what is most recent, what we encounter most frequently, or what we ascribe most importance to (Recency, Frequency, Value: RFV). Hence, long-established exams such as the SAT, GMAT, GRE or, in the case of Spain, the MIR (for medical doctors seeking specialization) are good at assessing our ability to memorize things, but are no indicator of future professional competence.

Confronting an algorithm with poorly designed tests for humans produces unsurprising conclusions: if we store a bunch of answers in a database, an algorithm is perfectly capable of returning them with a simple search for terms, given the time to do so. This is elementary: it stores and retrieves. Were the exam to require more deductive, relational or other types of skills, then we would be right to be surprised if the algorithm aced it, but this is not usually the case in the exams we are referring to, because they are based on an educational model that prioritizes the ability to memorize information.
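As a toy illustration of that store-and-retrieve behavior, the sketch below keeps question and answer pairs and returns the stored answer whose question shares the most terms with the query. The `AnswerStore` class and the sample questions are hypothetical and exist only to illustrate the analogy; this is not a description of how a language model such as GPT-4 is implemented.

```python
import re


def tokenize(text):
    """Split text into a set of lowercase terms, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class AnswerStore:
    """A deliberately naive 'memory': it stores answers and looks them up by term overlap."""

    def __init__(self):
        self._entries = []  # list of (question_terms, answer) pairs

    def store(self, question, answer):
        self._entries.append((tokenize(question), answer))

    def retrieve(self, query):
        """Return the answer whose stored question shares the most terms with the query."""
        terms = tokenize(query)
        best_answer, best_overlap = None, 0
        for question_terms, answer in self._entries:
            overlap = len(terms & question_terms)
            if overlap > best_overlap:
                best_answer, best_overlap = answer, overlap
        return best_answer  # None when no stored question shares any term


store = AnswerStore()
store.store("What is the capital of France?", "Paris")
store.store("Which nerve innervates the deltoid?", "The axillary nerve")
print(store.retrieve("capital of France"))
```

A lookup like this will ace any exam whose questions it has already stored, and it has nothing to offer on a question that requires relating facts it has never seen together.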

In short, algorithms are always going to be much better at “augmenting” humans, thanks to their infallible memory or a huge repository of data, than at replacing us in the really important tasks of a job that requires a minimum of intelligence. But above all, we should avoid concluding that because an algorithm can pass exams for doctors or lawyers, we will soon be treated for medical conditions or defended in court by robots.

What the ability of algorithms to pass exams should be telling us as a society is that maybe the time has come to stop assessing candidates on their ability to memorize information and instead explore new approaches, and then see how ChatGPT gets on with such models. This would allow us to understand what we are good at as humans, what makes good professionals — which is rarely the ability to parrot reams of information — and how we can, thanks to that knowledge, train better algorithms.

Sadly, a combination of a closed mindset, vested interests, and outdated social attitudes means that it is highly unlikely that we will ever know the full potential not just of algorithms, but of the human mind.

