@0x56 @paxterrarum Because these systems don't actually understand anything -- they just predict the next word given the surrounding context -- they can't really do math. They're egregiously bad at it, and the longer and more complex the problem, the worse they get. They also struggle to deduce new information and are easily tripped up by complex instructions or anything that requires planning.
My worry is how tool-assisted generation may factor into these gaps, though.
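To be concrete about what "just predict the next word" means, here's a toy sketch of greedy decoding -- I'm using the Hugging Face transformers library and the small gpt2 checkpoint purely as an illustration, not claiming that's what any particular product runs:

```python
# Toy sketch of greedy next-token prediction: score every vocab token,
# take the single most likely one, append it, repeat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "127 + 484 ="
ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(8):                      # generate a handful of tokens
        logits = model(ids).logits          # scores for every vocabulary token
        next_id = logits[0, -1].argmax()    # greedily pick the most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(ids[0]))  # whatever text usually follows -- not arithmetic
```

There's no calculator anywhere in that loop, which is exactly why tool-assisted generation (handing the arithmetic off to something that can actually compute) is the interesting wrinkle.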
@lenaoflune @paxterrarum - I would have expected basic math to be doable, though perhaps nothing past basic algebra. The fact that it can get me 90% of the way to a completed unit test means it can handle some logic.
It's still a smoky box to me.
@0x56 @paxterrarum One of the most recent studies shows that as you increase the number of digits in a basic arithmetic problem, accuracy falls rapidly off a cliff.
What the LLMs are really good at is replicating things they've seen before. Since unit tests aren't usually that complex, and lots of people discuss them in the corpus (e.g., StackOverflow, Google Groups), it's "easy" for the model to create them because the word co-occurrences are frequent enough to bubble up from the perceptron.
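The study isn't named upthread, but the kind of measurement it describes is easy to sketch: random additions with more and more digits per operand, scored by exact match. The `ask_model` function here is a placeholder for whatever completion API you happen to be testing -- my assumption, not anything from the study:

```python
# Rough sketch of a digit-scaling test: accuracy on N-digit addition, per N.
import random

def ask_model(prompt: str) -> str:
    # Placeholder: swap in your actual LLM API call here.
    raise NotImplementedError("plug in a real completion call")

def accuracy_at(num_digits: int, trials: int = 100) -> float:
    correct = 0
    for _ in range(trials):
        lo, hi = 10 ** (num_digits - 1), 10 ** num_digits - 1
        a, b = random.randint(lo, hi), random.randint(lo, hi)
        answer = ask_model(f"What is {a} + {b}? Reply with only the number.")
        correct += answer.strip() == str(a + b)
    return correct / trials

for d in range(1, 13):
    print(d, accuracy_at(d))  # the claim is this falls off a cliff as d grows
```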
@0x56 @paxterrarum Also, much like when I talk about Cassandra, I always feel like I'm discussing D&D or MMO systems when talking about LLMs 😅