An Unofficial Guide to Cheating With AI
(for educators and workplace learning professionals)
Having endured several years of educators and workplace learning colleagues whining about learners using AI to cheat, I thought I’d better get to it myself. We all know that the greatest concerns about something come from those who know little about the thing. My thinking was to explore the ragged boundaries of dis-AI-honesty. What better place to start than with the dreaded New York Times puzzles, specifically Connections, which I have played each morning for years to keep my aging synapses firing.
First, the easiest way to cheat is to go to Mashable or any number of similar sites and read their daily hints and answers. That’s too easy and, frankly, not much fun. More rewarding and entertaining is to engage ChatGPT, Gemini, DeepSeek, Copilot, and Claude with the challenge. But, of course, one must be certain that these models do not simply grab the answers from the Web. Thus, I created an unambiguous prompt to ensure that would not happen—so I hoped—but more on that later.
Figure 1: NYT Connections Board
The matrix shown here is a typical Connections puzzle, consisting of 16 words or items. The goal is to create four unique groups of four, where the items in each group are logically connected. You are not told HOW they are logically connected, and there are deliberate ambiguities built into each board. The human way to solve the puzzle typically involves a lot of trial and error, and four wrong groupings means you lose the game. For example, you might think PINOT, ROSE, GLASS, and NAPKIN are logically connected because together they suggest a table set for wine: a glass of pinot or rosé and a napkin beside it. That would likely be an incorrect grouping, but you get the picture. Once you make a choice for one group, you quickly find that you’ve confounded the connections within the other groups. You must juggle all four at once in your head before committing items to any one group. The intentional ambiguity is the challenge.
The first question I asked myself for the AI cheating scenario is, what do I expect to happen? All this talk about hallucinations and such, in the context of equally amazing results, leads me to conclude that there is a tacit expectation for AI to ultimately provide perfect, accurate, truthful results. Otherwise, we would not be having the conversation. I therefore decided to test whether AI could get it right in one shot in response to the prompt.
Here is my process: Upload a captured image of the daily Connections board (like the one above), then prompt as follows:
Organize the 16 words/items of the attached into four unique groups of four, where the words/items in each group are logically connected.
I used words/items rather than just words because on some days there are two words in a cell, like “Charging Cable,” which confuses some models as they chastise me for not counting the words correctly.
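For the record, here is roughly what that step looks like if you script it rather than paste into a chat window. This is a minimal sketch, assuming the OpenAI Python SDK and a vision-capable model; the model name and file name are placeholders, not a record of what I actually used.

```python
# Minimal sketch: send the captured Connections board plus the prompt to a
# vision-capable model. Assumes the OpenAI Python SDK; model and file names
# are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("Organize the 16 words/items of the attached into four unique groups "
          "of four, where the words/items in each group are logically connected.")

with open("connections_board.png", "rb") as f:
    board_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; substitute whichever model you are testing
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{board_b64}"}},
        ],
    }],
)

print(response.choices[0].message.content)
```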
Before revealing results, bear in mind that there are different kinds of models. There is the lazy, off-the-shelf, one-shot variety that takes no time to think and just answers instantaneously. Then there are the reasoning models like ChatGPT o1 and DeepSeek R1. I understand “reasoning” to mean the AI breaks the problem into bite-sized pieces and takes its sweet time to think through its response. “Sweet time” for an AI model is still quite fast, but according to the Law of Diminishing Astonishment it feels to us like eons, perhaps 20 seconds or a couple of minutes!
Regardless of which flavor model you apply, responses fall into three categories:
None gets it right ALWAYS.
A few get it right SOMETIMES.
The majority get it right NEVER.
That is the case even with additional remedial prompting. I’ll provide more detail below, but first consider a bit of machine learning background. I am not an expert in the fine art of training such models, but I know enough mathematics to realize that a reasonable undergraduate background in linear algebra will get you far in that business. That is roughly why there are hallucinations: the models do not deal very well with nonlinearity. You know, ambiguity, bifurcations, chaos, catastrophes, and the like. Those things are tough for humans and even tougher for AI models. And they are what make Connections challenging.
I offer a simple explanation, depicted in Figure 2. On the right you see the human being and on the left you see the ML model. They are different, no?
Figure 2: AI Model vs. Human
Compared to the human, the AI is super-fast at doing things, has access to way more content than is possible for the human to consider per unit time, and is relentless at figuring things out as it exhausts a gazillion cases. It has evolved over a century and leverages vast compute power in massive data centers, with specialized hardware tailored for speed rather than efficiency. The human brain, on the other hand, is a product of millions of years of biological evolution optimized for robust, parallel, adaptive cognition at very low power. The AI applies brute force, while the human thinks creatively. No sane human would attempt to solve a problem in the way that AI does. We know it would be fruitless to plow through the extant literature while trying to exhaust what appear to be innumerable cases. The AI achieves results, but its process is rather Neanderthal-like, in a crass, utilitarian sort of way. The human has the insight to seek an elegant solution. It’s like comparing a Jeep to a precious work of art.
So as not to impugn a particular model, I provide a blind synopsis. Some of the models NEVER get it right. In response to the prompt, which clearly specifies four unique groups of four, one model proudly presented eight groups of two, with several words used more than once. It could not count, and it did not understand unique. After persisting with additional prompts, I was often given three groups of loosely (and incorrectly) connected words, with the remaining group a collection of totally unrelated items, such as BUCKET, MOUNTAIN, TANGLE, FRESH. The model told me that the logical connection was that the words were NOT at all related (!). In true generative AI fashion, it insisted on giving an answer, albeit a dubious one.
I would often get categories with more than one discriminator, like “Group 4: Nature and Growth – DAISY, DEWEY, THATCH, GLOWING.” Instead of making the words fit a category, it expanded the category to fit the words. It seems AI is great at finding loopholes rather than reasoning. Often a word was used in more than one group, or five words landed in one group and three in another. It makes one wonder what on earth the AI is doing.
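What makes these failures striking is that a well-formed answer is trivial to check mechanically. Here is a minimal, purely illustrative sketch of the sanity check I was applying by eye: exactly four groups, four items per group, no repeats, and nothing beyond the sixteen items on the board.

```python
def validate_answer(board: set[str], groups: list[list[str]]) -> list[str]:
    """Return a list of problems with a proposed Connections answer (empty = well-formed)."""
    problems = []
    if len(groups) != 4:
        problems.append(f"expected 4 groups, got {len(groups)}")
    for i, group in enumerate(groups, start=1):
        if len(group) != 4:
            problems.append(f"group {i} has {len(group)} items, not 4")
    flat = [item for group in groups for item in group]
    if len(set(flat)) != len(flat):
        problems.append("at least one item appears in more than one group")
    if set(flat) != board:
        problems.append("the groups do not use exactly the 16 items on the board")
    return problems
```

Passing this check says nothing about whether the groupings are actually right; it merely catches the counting and uniqueness errors described above, which several models failed before the real puzzle even began.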
ChatGPT o1 and DeepSeek R1 were the winners, correct 98% and 85% of the time, respectively. ChatGPT o1 took far less reasoning time, on average about 8 to 20 seconds, whereas DeepSeek R1 struggled a bit at 1.5 to 3 minutes and produced far more reasoning steps. When I reviewed those steps, I was stunned. Most human beings would never reason that way. We would unconsciously discard the many edge cases that the AI considers. This problem is referred to these days as the “overthinking” flaw in some reasoning models.
Let’s get back to Figure 2. Our neural structures are far more complex than the synthetic neural nets of ML, and the mechanisms that inform the nets’ reasoning are things like reinforcement learning. You know, if it does something correctly, it gets points and moves on to the next stage of thinking; otherwise it stops. When we humans think, learn, create, imagine, and dream, our feel-good neurotransmitters—dopamine, serotonin, endorphins, oxytocin—dynamically affect our neural circuits. Stress hormones, like cortisol and epinephrine, also affect reason and emotion. Are there synthetic representations of such complex dynamics in ML models? Perhaps only at a very primitive level. I guess that’s why a self-driving AI-enabled car failed to avoid a person walking a bicycle across the road. Darndest thing: it could identify a bicycle, it could identify a person, and it could even recognize a person riding a bicycle. But it could not understand a person walking a bicycle and hence slammed into them. Do you see the trouble it has with ambiguity?
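If “gets points” sounds vague, here is a toy sketch of one classic reinforcement learning mechanism, a tabular Q-learning update, just to make the contrast with neurochemistry concrete. The states, actions, and numbers are invented for illustration; this is not how any particular chatbot was trained.

```python
# Toy tabular Q-learning update: the "points" are the reward signal.
# States, actions, and numbers are invented purely for illustration.
from collections import defaultdict

Q = defaultdict(lambda: defaultdict(float))  # Q[state][action] -> estimated value
alpha, gamma = 0.1, 0.9                      # learning rate, discount factor

def update(state, action, reward, next_state):
    """Nudge the value of (state, action) toward reward plus discounted best future value."""
    best_next = max(Q[next_state].values(), default=0.0)
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])

# The agent proposed a grouping and was told whether it was correct (reward 1) or not (0).
update("board_shown", "guess_wine_group", reward=0.0, next_state="board_shown")
update("board_shown", "guess_color_group", reward=1.0, next_state="one_group_solved")
```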
There is no shortage of such examples. To be sure, if AI is to evolve, we need to test such things and correct them, one crisis at a time, just like we do in most engineering fields. We cannot always see all possibilities until they rear their ugly heads. We did not anticipate the principle that there should be no single point of failure in an engineered system. Rather, a single point of failure occurred, and we said, better not let THAT happen again. Eventually ISO standards evolve, and engineers henceforth comply. I once asked a good friend, an engineer at one of the first nuclear power plants in the US, how startup procedures were developed. He said, “Well, we bootstrapped it.” They had little clue what would happen, just intuition. They initiated a fission reaction at very low power, and when nothing went wrong, they jotted down a procedure.
Figure 3: Blind Man With Rubik’s Cube
(Photo credit: Weird Al Yankovic’s movie UHF, 1989)
I imagine prompt engineering as analogous to Figure 3: A blind man is sitting on a park bench with a Rubik’s Cube. He spins it once, then shows it to a sighted person next to him. “Is this it?” asks the blind man. “No.” So he tries again. “Is this it…?” You get the picture. Not only do we, as consumers of AI tools, not know why it works when it works, but the AI engineers and scientists who invented the thing aren’t sure either. That’s why Ethan Mollick and other AI usage gurus talk about our responsibility to discover the ragged boundaries where it works and steer clear of where it doesn’t. Prompt engineering is nothing more than language that avoids AI’s points of failure. It’s based mostly on experience and a little insight into how to provide context by tweaking language to fit what the AI might understand.
In the same context, I am profoundly impressed with the things that work. We see the vivid images and videos it creates, the medical and scientific discoveries it produces, the accurate computer code it writes, and the unbelievable speed with which it parses answers from petabytes of scavenged content. I am not suggesting that we avoid AI. Quite the contrary: we need to help it evolve, and evolve along with it, through careful and strategic application.
To be sure, AI thus far models the minds of neurotypical adults, but not infants and toddlers or the neurodiverse. It is inherently biased, reflecting the content upon which it is trained. Measures of AI intelligence are not really measures against an objective standard of human intelligence at all, mainly because we have no such universally agreed upon standard, nor do we have the instruments to measure it. We are measuring performance in specific, situational domains and comparing it to what humans can do. Still, AI produces enormously impressive results in many cases.
AI models can produce code better than top software engineers, can beat chess and Go masters at their own games, can render breathtaking works of art, and more. Ironically, their best performance comes about when the AI is permitted to learn by doing, just like humans. Alan Turing suggested around 1950 that AI should be modeled after the brains of infants and then learn as humans do. I believe he was right, and the modelers are just starting to catch on. AI is but one more in a steady stream of artifacts upon which we base human cognition. The better it gets, the smarter we become. In the universe of learning and performance, this is nothing new.
I am impressed with ChatGPT o1, DeepSeek R1, and many similar models. Autonomous agents, like Manus, present an exciting frontier that bolsters human cognition and thus expands our intelligence. It is not human vs. AI, where AI intellect increases as ours declines; that paints a very misleading picture. Remember Zeno’s ancient paradox, which states that speedy Achilles will never catch a tortoise that is given a head start? Why? Because by the time Achilles reaches the point where the tortoise started, the tortoise has moved forward. Now the tortoise still has a head start, albeit a smaller one. Nonetheless, and by induction, Achilles will never catch the tortoise.
Of course, Achilles would pass the tortoise, because speed is distance/time, and they are competing in the same world and time frame. By the same reasoning, AI will never get away from the human. They evolve together, with challenges to overcome, for sure, but we are in the infant stage of such evolution. Time will tell, and hence my cheating experiment is mostly a miserable failure.
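For the skeptic, the arithmetic is quick. Here is a worked version with illustrative numbers of my own choosing (they are not part of the original paradox): Achilles runs 10 m/s, the tortoise 1 m/s, and the head start is 10 m. The ever-shrinking gaps Zeno worries about take ever-shrinking times to close, and those times form a geometric series that converges:

```latex
% Illustrative numbers: Achilles 10 m/s, tortoise 1 m/s, head start 10 m.
t = \underbrace{1}_{\text{close the 10 m gap}}
  + \underbrace{\tfrac{1}{10}}_{\text{close the next 1 m}}
  + \tfrac{1}{100} + \cdots
  = \sum_{k=0}^{\infty}\left(\tfrac{1}{10}\right)^{k}
  = \frac{1}{1-\tfrac{1}{10}}
  = \tfrac{10}{9}\ \text{seconds}
```

Equivalently, a 10 m gap closing at a relative speed of 9 m/s disappears in 10/9 of a second; the infinitely many “head starts” simply add up to a finite time.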
Returning to expectations, let’s think about this for a moment. ML consists of approximate, simplified models of the human nervous system. We make lots of mistakes. So why are we expecting that AI will not continue to make mistakes? It cannot even solve Connections consistently, just like us humans. The idea that AI will soon exceed human intelligence is predicated on a clever but common sleight of hand. Rather than defining intelligence and then measuring AI’s capacity by the same criteria, the pundits simply change the criteria. You know the grifter’s trick: steal your opponent’s terms and redefine them to establish an illusion of advantage. AI does not clandestinely find loopholes. It simply falls through gaping cracks left in the models.
Tacitly anthropomorphizing AI is a huge part of the problem. Human learning and machine learning are not the same, yet we use the same term. Machine reasoning is not human reasoning, yet we use the same term. Human intelligence is not AI intelligence, yet we use the same term. The practice has become so pervasive that we unconsciously conflate the two, reinforced by the very errors that define what it is to be human. “To err is human,” say Donald Norman and a host of cognitive scientists. It has fostered an entire discussion around AI, replete with scores of evaluation methods whose conclusions frame AI as sentient. It is all based on that sleight of hand. Examine Figure 2 again and ponder the lack of resemblance.
I am writing this as I return from the Advanced Materials Science World Congress, a gathering of the most brilliant minds from around the globe who are advancing our understanding and producing solutions to many of the field’s greatest challenges. They are applying AI tools in areas where they work, with great scrutiny, evaluation, mathematical modeling, and verification through physical experiments. AI is an enormously valuable tool that requires careful consideration and expert human verification. It is accelerating advances not because it is intelligent, but because it expands human cognition.
So here I am, failing to reap the benefits of cheating with AI because the thing often cannot count and fails to recognize what unique means. Ethical application of AI requires that humans have some sense of how much ambiguity there is in what we are asking it to do; skipping that step begs the question in the philosophical sense, tacitly treating the question as if it had already been answered. In the end, it is our creativity, our emotions, cognition, intellect, experience, and intuition that arbitrate the responses AI presents. If we fail to meet that challenge, then indeed, we are doomed.