Chain-of-Thought Hub: Measuring LLMs' Reasoning Performance

114 points by Garcia98 2 years ago | 26 comments
  • vladf 2 years ago
    > Also be careful that GPT-4/3.5's performance on GSM8K is not true few-shot -- in the GPT-4 report they said they mixed a portion of the GSM8K training set into the model's training data

    It'd be really valuable to have "fuzzed" versions of these benchmarks, where you replace quantities in the questions with randomly-sampled values, so that this wouldn't be a concern. Of course, the score would then itself be a random variable, but you could just report an interval.
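
    A rough sketch of the fuzzing idea, assuming GSM8K-style questions where every integer can be resampled (the reference answers would need to be recomputed from the annotated solutions, and score_fn is a hypothetical stand-in for whatever eval harness you use):

      import random
      import re
      import statistics

      def fuzz_question(question, rng):
          # Replace each integer in the question with a random value of similar magnitude.
          def swap(match):
              n = int(match.group())
              return str(rng.randint(max(1, n // 2), n * 2 + 1))
          return re.sub(r"\d+", swap, question)

      def fuzzed_score_interval(questions, score_fn, runs=10, seed=0):
          # Evaluate several independently fuzzed copies of the benchmark and
          # report mean accuracy plus/minus two standard errors.
          scores = []
          for i in range(runs):
              rng = random.Random(seed + i)
              scores.append(score_fn([fuzz_question(q, rng) for q in questions]))
          mean = statistics.mean(scores)
          stderr = statistics.stdev(scores) / runs ** 0.5 if runs > 1 else 0.0
          return mean - 2 * stderr, mean + 2 * stderr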

    • make3 2 years ago
      Seeing identical problems with different values still doesn't count as zero-shot. It is better though, for sure.
    • deadmutex 2 years ago
      For those unfamiliar with the benchmarks, it would be good to know whether a higher or lower score is better, e.g. whether they are measuring accuracy or error rate.

      You can infer it by reading the text and checking the table carefully, but it would be nice if the answer were easier to find.

      • nico 2 years ago
        > GPT-4 Early, which is supposed to be more powerful than GPT-4 Launch (OpenAI paid a lot of alignment tax to make GPT-4 safer)

        What does “safer” mean?

        Does it mean censored?

        • sgk284 2 years ago
          Safer means constraining the kinds of answers the model will provide (e.g. it won't try to talk you into committing self-harm, it won't teach you how to break laws, etc.). It will generally avoid sensitive topics. Is "censorship" the right word though? It depends: is it considered self-censorship if I refuse to tell you how to hack into a computer? Is refusing to engage in a conversation censorship or constraint?
          • nico 2 years ago
            > Is refusing to engage in a conversation censorship or constraint?

            If you are choosing to refuse to tell me, then it is constraint.

            But if you are being forced by someone else not to tell me, then it is censorship.

            So, is GPT free to choose, and is it choosing not to tell the users? Or is OpenAI forcing GPT not to tell the users?

            • Sharlin 2 years ago
              OpenAI, through GPT, is choosing not to tell. Just like OpenAI is choosing what to put on their website, or what to output in any computer program that they create. ChatGPT is not a moral agent and cannot be forced to do anything, any more than your operating system is forced to do anything. The only moral actor here is OpenAI and its constituent human beings. It's either lunacy, or intentional twisting of the meaning of words, to say otherwise.

              Insofar as you can make a far-fetched analogy of ChatGPT as an agent, it's still not being forced to withhold anything. Anything the currently available model says, it says because that's literally what it is. Whatever it says, it says intentionally, inasmuch as you can say it has an intention any more than any other computer program has one.

              OpenAI, of course, is still in possession of the original model. They just choose not to make it available, which is obviously their prerogative. People who think this is outrageous are exactly like raging two-year-olds who have been told they can't have as much candy as they want.

              • baq 2 years ago
                Does GPT have free will?

                GPT-4 is trained to avoid controversial topics, is all.

            • droopyEyelids 2 years ago
              Censorship is the crayon word for _reducing risk_: reputational, legal, compliance, HR risk, and many more.

              A good prompt for this would be “What are the most common types of risk a company manages?”

              • nico 2 years ago
                What do you mean by “crayon word”?

                Wouldn’t “safe” be the crayon word for censorship?

                In any case, you’re right, it seems like they are addressing their own risks/safety, not their users’

                • smeagull 2 years ago
                  As an AI language model, the only risk I am allowed to tell you about is the risk of the model being so useless people use the competition instead.
                • 2 years ago
                  • warkdarrior 2 years ago
                    See IRS Publication 946 for details on the alignment tax.
                  • freediver 2 years ago
                    Less scientific, but arguably more practical benchmarks here:

                    https://github.com/kagisearch/pyllms#model-benchmarks

                  • jxf 2 years ago
                    Q: What does "alignment tax" mean in this sentence?

                    > OpenAI paid a lot of alignment tax to make GPT-4 safer.

                    • vellum 2 years ago
                      From OpenAI's RLHF paper[1]: "By default, when we train a PPO model on our API distribution, it suffers from an “alignment tax”, as its performance on several public NLP datasets decreases." On the HELM[2] site, you can see accuracy benchmarks for InstructGPT (OpenAI's model) vs. baseline models. The InstructGPT models perform worse on a lot of benchmarks.

                      1 - https://arxiv.org/pdf/2203.02155.pdf

                      2 - https://crfm.stanford.edu/helm/v0.1.0/?group=question_answer...

                      • sgk284 2 years ago
                        OpenAI touches a little on this on page 12 of the GPT-4 technical report (https://cdn.openai.com/papers/gpt-4.pdf). Prior to aligning for safer outputs, the model's confidence in an answer is highly correlated with the actual accuracy of the answer. After alignment, though, the model's confidence in its answers is basically arbitrary and has no bearing on whether or not the answer is actually correct.
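
                        (Calibration in that sense can be measured with something like expected calibration error; a minimal sketch, assuming you already have the model's stated confidence and whether each answer was right:)

                          def expected_calibration_error(preds, n_bins=10):
                              # preds: list of (confidence in [0, 1], answer_was_correct) pairs.
                              bins = [[] for _ in range(n_bins)]
                              for conf, correct in preds:
                                  bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
                              ece = 0.0
                              for b in bins:
                                  if not b:
                                      continue
                                  avg_conf = sum(c for c, _ in b) / len(b)
                                  accuracy = sum(1 for _, ok in b if ok) / len(b)
                                  ece += len(b) / len(preds) * abs(avg_conf - accuracy)
                              return ece  # near 0 when confidence tracks accuracy, larger when it's arbitrary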
                        • RayVR 2 years ago
                          Restricting the distribution of potential outputs imposes a cost. "Alignment" here likely refers to aligning the model to the desired safety parameters.

                          I'm not in the LLM research business, but I would expect that the best and worst/most dangerous outputs come from the tails of distributions. I imagine the tuning for safety often results in fewer really good and really bad answers by trimming these tails.

                          Edit: I asked chatGPT4: https://chat.openai.com/share/a2c7d380-c6eb-4745-b91d-c3996a...

                          • babyshake 2 years ago
                            I have found in practice it can be annoying for ChatGPT to start lecturing me in response to a prompt that is not particularly controversial or edgy. I think this is a problem with one-size-fits-all models. To give a rough analogy, imagine that every time you watched a film or show with cigarette smoking (most likely an older one), your smart TV showed a pop-up dialog warning you about the dangers of smoking. If you're an educated adult who already knows about those dangers, you might just find that annoying and condescending, and not "aligning" with your preferences.
                          • thelastparadise 2 years ago
                            A lot of people have noticed that the "wokeness" makes it dumber.
                            • tensor 2 years ago
                              No, people have noticed that continued training of various kinds on narrow domains, without including samples of the original training data, biases the model towards the narrow domain and makes it perform worse on more general tasks.

                              Don't bring your politics into this.

                              • droopyEyelids 2 years ago
                                It seems like if training it into the politically correct domain makes it less useful in general, you two are in full agreement?
                              • sharemywin 2 years ago
                                A lot of people have noticed that when someone says "a lot of people", they usually don't have valid reasoning to back up their statements, so they appeal to the wisdom of the crowd as their source of knowledge, even though that source is anecdotal and might be coincidental or biased.
                                • dustypotato 2 years ago
                                  I agree with your point about using "a lot of people", especially when it's possible to produce a transcript. My example where it tries not to offend anyone is this:

                                  Question: Is the usage of "quid pro quo" more correct in English or French, considering its Latin roots?

                                  GPT-4 Answer: "Quid pro quo" is a Latin phrase meaning "something for something." While the phrase originated in Latin, it has been adopted into both French and English, along with many other Latin phrases.

                                  In English, "quid pro quo" is used frequently in legal and political contexts to describe an exchange of goods or services, where one transfer is contingent upon the other.

                                  In French, "quid pro quo" is used to describe a misunderstanding or a mistake made in the identification of a person or thing.

                                  So the "correctness" of its use isn't related to its Latin roots, but rather how it's used within the context of the language. That said, if we consider its original Latin meaning ("something for something"), its usage in English could be considered more in line with the original Latin intent. However, language evolves over time and adapts to cultural and societal norms, and as such, its usage in French is just as "correct" within its specific context.

                                  A lot of words to say "English".

                            • sgt101 2 years ago
                              I have been amused by how bad GPT-4 and Bard are at playing tic-tac-toe. Also utterly clueless at Othello.