Today's Large Language Models Are Essentially BS Machines

57 points by tomlin 1 year ago | 46 comments
  • 1vuio0pswjnm7 1 year ago
    It's silly to flag this submission. Back in April, researchers at Stanford reported that less than half of the results from AI-powered search corresponded to verifiable facts. What do we call the remaining portion? "BS" seems reasonable.

    https://aiindex.stanford.edu/report/

    "As internet pioneer and Google researcher Vint Cerf said Monday, AI is "like a salad shooter," scattering facts all over the kitchen but not truly knowing what it's producing. "We are a long way away from the self-awareness we want," he said in a talk at the TechSurge Summit."

    https://www.cnet.com/tech/computing/bing-ai-bungles-search-r...

    • ftxbro 1 year ago
      probably it's flagged because it's so obviously true
      • j16sdiz 1 year ago
        Does HN have a way to "unflag" a submission?

        Or, what does a flag in HN actually do?

      • httpz 1 year ago
        To be honest, I hated writing essays in English classes because I felt like I was forced to write BS to fill up the space when my argument could be summed up in several bullet points.

        Since I'm not a student anymore, I can just give ChatGPT a few bullet points and ask it to write a paragraph for me. As an engineer who doesn't like writing "fluff", it's great I can now outsource the BS part of writing.

        • jjice 1 year ago
          As an engineer I'd hope you wouldn't have to write fluff. Brevity (while retaining full content) should be praised.

          I'm interested in what parts of your job require the fluff. Is it communication with non-engineering teams?

          • httpz 1 year ago
            Not necessarily technical writing, but more like emails to ask for something, emails to decline something, or even a birthday card.

            It's also great for writing a professional sounding complaint letter to your utility company.

            • pclmulqdq 1 year ago
              A lot of people seem to think that details that other people think are relevant are actually "fluff."
            • DennisP 1 year ago
              Yep, it's great for work emails. Incoming too, since LLMs can summarize a long email into bullet points.

              The future is people typing bullet points, expanding into polished prose for transmission, and compressing down to bullet points on the other end.
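
              A minimal sketch of that round trip, assuming the OpenAI Python client (the model name and prompts are illustrative assumptions, not a recommendation):

                # Round trip: bullets -> polished email -> bullets.
                # Assumes OPENAI_API_KEY is set in the environment.
                from openai import OpenAI

                client = OpenAI()

                def ask(prompt):
                    resp = client.chat.completions.create(
                        model="gpt-4",
                        messages=[{"role": "user", "content": prompt}],
                    )
                    return resp.choices[0].message.content

                bullets = "- ship date slips to Friday\n- blocked on vendor firmware"
                email = ask("Expand these bullets into a polite work email:\n" + bullets)
                summary = ask("Summarize this email as terse bullet points:\n" + email)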

            • GoodJokes 1 year ago
              [dead]
            • mlsu 1 year ago
              So what?

              Today, ChatGPT helped me write a driver.

              The driver either compiles, or it doesn't; it compiled. The driver either reads a value from a register, or it doesn't; it read. The driver either causes the chip to physically move electrons in the real world in the way that I want it to, or it doesn't.

              The real world does not distinguish between bullshit or not. Things either work or they do not. They either are one way, or they are another way. ChatGPT produces things that work in reality. We humans live in reality. Reality is what matters.

              I notice a thread through all of the breathless panicking about LLMs: it does not correspond to REALITY. It's a panic about a fiction. The fiction that the content of text is reality itself. The fiction that the LLM can somehow recursively improve itself. The fiction that the map is the territory.

              • dekhn 1 year ago
                During the big GPT-4 news cycle I think a bunch of folks posted claims that were outrageously good- "language model passes medical exams better than humans", etc. When I looked into them, in nearly all cases, the claims were boosted far beyond the reality. And the reality seemed much more consistent with a fairly banal interpretation: LLMs produce realistic looking text but have no real ability to distinguish truth from fabrication (which is a step beyond bullshit!).

                The one example that still interests me is math problem solving. Can next-token predictors really solve generalized math problems as well as children? https://arxiv.org/abs/2110.14168

                • haimez 1 year ago
                  LLMs are spitting out responses based on their inputs. It is (or was) shockingly effective, but there is no generalized math processing going on. That’s not what LLMs are; that’s not how they work.
                  • dekhn 1 year ago
                    And yet, trained on a large corpus of correct math statements, they produce responses that are more often right than wrong (I am taking this as true; it might not be), which simply raises more questions about the nature of math.
                    • haimez 1 year ago
                      …or the nature of the question and corpus?
                • ggm 1 year ago
                  To me, this is the quintessential risk: it's plausible enough to fool somebody who has the authority to act but lacks the competency to recognize that the information is low grade. Boom! "Oh man... but the computer said it was ok."
                  • thghtihadanacct 1 year ago
                    We already barely question the BS spilled by politicians and corporations. Now they have a scapegoat that won't ever die.
                    • j16sdiz 1 year ago
                      People have already accepted that they can't do anything to stop the bullshitting.

                      It's not only in America, and not only in government or large corporations. It's everywhere.

                    • boringg 1 year ago
                      100% going to happen in the near future if it hasn't already happened.
                      • HillRat 1 year ago
                        Those attorneys who blindly used ChatGPT to generate a brief regarding MC99 liability (a field they had no experience in) are a good example, I think. Of course, in that case opposing counsel started looking at the cites and quickly had questions for them...
                        • chefandy 1 year ago
                          It's really incredible how plausible-sounding ChatGPT's legal BS is. Completely hallucinated cases, arguments, citations (properly formatted for real reporters!), people, ideas... but if you just skim it, there wouldn't be any immediate way for a layman to tell it was total bullshit, and I'll bet an overworked attorney could be taken off guard, too. Obviously it won't get you too far in actual litigation, and I really feel for the clients of any attorney who pulls such shit.
                    • austinkhale 1 year ago
                      I like to think of all responses from LLM's like the top-rated post on Stack Overflow or a top five blog post from a Google search. It's helpful information that _may_ be correct but needs to be verified. A lot of the time, it's spot on. Some percentage of the time, it's straight up incorrect. You have to be willing to compare various sources of data and find what's accurate. It's a nice, easy-to-use starting point, essentially.
                      • dmezzetti 1 year ago
                          While there is truth here, they can be quite effective as logic engines rather than fact engines. One of the most popular LLM use cases is retrieval augmented generation (RAG), where the LLM is limited to a provided context.

                        Do you need 7B/13B/33B/77B parameters to do this? That is a question up for debate and something I'm exploring with the concept of micro/nano models (https://neuml.hashnode.dev/train-a-language-model-from-scrat...). There is the sense that today's LLMs could be overkill for a problem such as RAG.
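
                          A toy sketch of the RAG pattern (retrieval here is naive word overlap standing in for a real vector index; the model name and prompt are assumptions):

                            # RAG in miniature: retrieve the most relevant passage,
                            # then constrain the model to answer only from it.
                            from openai import OpenAI

                            client = OpenAI()

                            docs = [
                                "Raise the Hammer covers urban issues in Hamilton, Ontario.",
                                "GSM8K is a dataset of grade school math word problems.",
                            ]

                            def retrieve(question):
                                # Naive scoring; a real system would use embeddings.
                                words = set(question.lower().split())
                                return max(docs, key=lambda d: len(words & set(d.lower().split())))

                            def answer(question):
                                prompt = ("Answer only from this context; say 'unknown' otherwise.\n"
                                          "Context: " + retrieve(question) + "\nQuestion: " + question)
                                resp = client.chat.completions.create(
                                    model="gpt-4", messages=[{"role": "user", "content": prompt}])
                                return resp.choices[0].message.content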

                        • danenania 1 year ago
                          Using LLMs to write code, particularly in a statically typed language, is a good way to get a sense for how accurate they are, since most mistakes/hallucinations are readily apparent.

                          I've been using GPT-4 to write code almost daily for months now, and I'd estimate that it is maybe 80-90% accurate in general, with the caveat that the quality of the prompt can have a major impact on this. If the prompt is vague, you're unlikely to get good results on the first try. If the prompt is very thorough and precise, and relevant context is included, it can often nail even fairly complex tasks in one shot.

                            Regardless of what the accuracy number is, it strikes me as pretty silly to call them "BS Machines". It's like calling human programmers "bug machines". Yeah, we do produce a lot of bugs, but we somehow seem to get quite a bit of working software out the door.

                          GPT-4 isn't perfect and people should certainly be aware that it makes mistakes and makes things up, but it also produces quite a lot of extremely useful output across many domains. I know it's made me more productive. Honestly, I can't think of any programming language, framework, technique, or product that has increased my productivity so quickly or dramatically in the 17 years I've been programming. Nothing else even comes close. Pretty good for a BS machine.

                          • btown 1 year ago
                            Even if you take the headline at face value (and IMO it's rather unfair)... the incredible saving grace of LLMs is that you have a plurality of BS machines, with different flavors of BS, whose outputs can be wired together.

                              Sure, the first-order output of today's generalist LLMs, emitted one token at a time, does seem to meet diminishing returns on factuality at approximately the level of a college freshman pulling an all-nighter. Not a great standard, that. But if you took an entire class of those tired freshmen, gave their outputs to an independent group of tired freshmen unfamiliar with the material, and told the second group to identify, in a structured manner, commonalities and discrepancies, topics they'd look up in an encyclopedia, things they'd escalate to a human expert, and so on... all of a sudden, you can start to build structured knowledge about the topic, and an understanding of what is and isn't likely to be a hallucination.

                            One might argue that the right kind of model architecture and RLHF could bake this into the LLM itself - but you don't need to wait for that research to be brought into production to create a self-correcting system-of-systems today.
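
                              A rough sketch of that system-of-systems idea, assuming the OpenAI client, with two models standing in for the two groups of freshmen (names and prompts are illustrative):

                                # Draft with independent "students", then have a second
                                # pass flag disagreements for escalation instead of
                                # trusting any single output.
                                from openai import OpenAI

                                client = OpenAI()

                                def ask(model, prompt):
                                    resp = client.chat.completions.create(
                                        model=model,
                                        messages=[{"role": "user", "content": prompt}])
                                    return resp.choices[0].message.content

                                question = "Who is Ryan McGreal?"
                                drafts = [ask("gpt-4", question), ask("gpt-3.5-turbo", question)]

                                review = ask("gpt-4",
                                    "Two answers to the same question follow. List factual "
                                    "claims where they disagree and which should be checked "
                                    "by a human expert:\n\n" + "\n---\n".join(drafts))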

                            • lemmsjid 1 year ago
                                I noticed he mentioned asking Bing, so I directly asked ChatGPT “Who is Ryan McGreal?” and it wrote, “Ryan McGreal is a writer, urbanist, and the editor of Raise the Hammer, an online publication that focuses on urban issues, transportation, and sustainability. He has been actively involved in advocating for better urban planning and transportation policies in Hamilton, Ontario, Canada. Please note that my knowledge is based on information available up to September 2021, so there may have been developments or changes in his work or activities since then.”

                                That appears more or less correct.

                                So to give ChatGPT an opportunity to hallucinate similarly to the article, I followed up with, “Did he write for the NYT?” and it replied, “I do not have any information indicating that Ryan McGreal has written for The New York Times (NYT). His work primarily focuses on urban issues and transportation, as mentioned earlier, and he is associated with Raise the Hammer, a local publication in Hamilton, Ontario, Canada. It’s possible that he may have contributed to other publications, but I do not have specific information regarding his contributions to The New York Times.”

                                While I have seen ChatGPT make stuff up, I do think it’s useful to compare specific results across LLMs before using particular examples to make holistic statements.

                              • j16sdiz 1 year ago
                                  Try it another way: ask about something big (with lots of text, in this case the NYT) first.

                                  Ask in this order:

                                  1) What is the NYT (New York Times)?

                                  2) Who is Ryan McGreal?

                                  3) Did he write for the NYT?

                                  This builds up more context for hallucinating; see the sketch below.
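
                                  A quick sketch of that ordering experiment, assuming the OpenAI client (the point is just that the full history is resent on every turn):

                                    # Build up context before the leading question; the
                                    # whole history goes back to the model each turn.
                                    from openai import OpenAI

                                    client = OpenAI()
                                    history = []

                                    def ask(prompt):
                                        history.append({"role": "user", "content": prompt})
                                        resp = client.chat.completions.create(
                                            model="gpt-4", messages=history)
                                        reply = resp.choices[0].message.content
                                        history.append({"role": "assistant", "content": reply})
                                        return reply

                                    ask("What is the NYT (New York Times)?")
                                    ask("Who is Ryan McGreal?")
                                    print(ask("Did he write for the NYT?"))  # watch for a hallucination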

                              • ariym 1 year ago
                                  They're not focused on being informationally accurate; they're optimized to be articulate.
                                • borissk 1 year ago
                                    Any time I asked ChatGPT or another GPT a question regarding science (I haven't asked any questions on other topics), I got a mostly correct answer back. And I've asked a few hundred by this point. This includes state-of-the-art research covered in just one or a few articles.

                                    So I'm curious why my personal experience doesn't match all the complaints about hallucinations.

                                  • chefandy 1 year ago
                                    I think the usefulness is pretty domain-specific. Every time I've given it a citation for a not-super-famous court opinion, it very confidently told me about a plausible-sounding case that never happened between people and companies that never existed.
                                  • coliveira 1 year ago
                                      I think that an AI-powered world will create a population that doesn't know how to distinguish truth from lies. People already believe that AI has some powerful hidden knowledge they need to tap, even when the AI model is spilling garbage. In the future, they will also be incapable of separating what AI models tell them from reality.
                                    • mikeg8 1 year ago
                                        This is currently happening with people googling their way onto pseudo-science webpages/blogs and interpreting that content as “fact”.
                                      • borissk 1 year ago
                                          People already can hardly distinguish truth from lies. One Donald Trump lies constantly. The Brexit referendum in the UK was driven by a ton of lies, and many people still believe them.
                                      • mcint 1 year ago
                                          Most people, most of the time, are just BS machines. Obligatory caveat -- but it's also a question of standards and presupposed purpose. Many dreams of what AI can be, can do, and can provide sound similar in the hoped-for futures they enable. That does not mean that the particular next-step goals of the designers and implementers of different systems will achieve the same ends.

                                          These ones are premised on regurgitating inputs -- on being able to imitate more than one observer's interpretation of truth at a time. The more, the better.

                                        • slavetologic 1 year ago
                                            These models will be astounding in five years. Any hot take like this is clickbait. And it's never from the people actually pushing the models forward. Always onlookers.
                                          • naniwaduni 1 year ago
                                            I'll believe this when the models stop looking like machine translation did six years ago.
                                          • joshspankit 1 year ago
                                            Counterpoint:

                                            Humans have been incentivized to essentially be BS machines.

                                              From low-quality blog posts to the highest-grossing marketing and everything in between (including many published books and scientific papers): BS is low-effort and makes enough money to give a decent ROI.

                                            Of course an AI trained on a large human corpus is going to produce BS. It’s just doing what it learned.

                                            • yawnxyz 1 year ago
                                                I'm surprised it doesn't touch on "creativity", which is a form of BS. So is being able to summarize or extract from books and papers.

                                              Unless it's mechanical work, it requires some form of BS, and that's why we've traditionally been so much better at this than machines. We've never been able to create "BS machines" before, so this completely shifts the paradigm.

                                              • ryan_mc_g_real 1 year ago
                                                I would argue that creativity involves generating new ideas through a combination of divergent thinking (to imagine new associations between unrelated things) and convergent thinking (to bring a relational model from one domain into another), and is orthogonal to Frankfurt’s conception of BS as defined by indifference to the objective truth or falsity of a fact claim.
                                              • ftxbro 1 year ago
                                                imagine posting such a hot take in september 2023
                                                • ggm 1 year ago
                                                  They stole it from Abraham Lincoln's first tweet
                                                  • thghtihadanacct 1 year ago
                                                      truth is truth whenever it's brought to view
                                                  • borissk 1 year ago
                                                      Any idea why this was flagged?
                                                    • bpiche 1 year ago
                                                      Yeah good question, was looking forward to a healthy discussion
                                                      • hooch 1 year ago
                                                        Agreed, it’s a properly reasoned essay.
                                                      • nilslindemann 1 year ago
                                                        Flagging this is clearly wrong.
                                                        • DienLe94 1 year ago
                                                          "The internet is simply too slow to be useful".
                                                          • Pxtl 1 year ago
                                                              Artificial Cliff Clavin. Mansplaining as a Service. Truthiness on tap.
                                                            • ryan_mc_g_real 1 year ago
                                                              [dead]