29 October 2023

Weekly Notes: Oct 29, 2023

Thoughts & notes of what I read - Oct 29, 2023

I recently downloaded Omnivore and I know it sounds crazy - but it has increased my quality of reading. I think my reading intake has slightly increased - but the real improvement has been that I’m able to engage with what I’m reading better (via the highlighting and notes tools).

Plus, I have a soft spot for anything open source.

Here are some things I read this week & notes to myself.

  1. Code Smells in Pull Requests - Saw this on Paige Bailey’s Twitter. First off, I had no idea what a code smell was.

    If you’re like me and have no idea what a code smell is - basically, they’re visual indicators of inefficient, bloated, or messy code. This is a useful overview of some common code smells.

    tldr - based on the study, both accepted and rejected PRs suffer from code smells. I’m curious what this looks like when you compare OSS to corporate repos. I’d assume corporate repos tend to have less code smell due to readability reviews, etc.?
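A tiny, hypothetical example of one classic smell - duplicated code (the names and logic here are invented for illustration):

```python
# Duplicated-code smell: the same cleanup logic is copy-pasted into two
# functions, so a bug fix applied to one can silently miss the other.

def parse_users(text):
    return [line.strip().split(",") for line in text.splitlines() if line.strip()]

def parse_orders(text):
    return [line.strip().split(",") for line in text.splitlines() if line.strip()]

# The usual fix is to extract the shared logic into one place:
def parse_rows(text):
    return [line.strip().split(",") for line in text.splitlines() if line.strip()]
```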

  2. HumanEval vs MBPP vs a dozen other benchmarks

    Evaluating LLM quality has always been notoriously difficult. And, in most cases, it seems as though projects like Chatbot Arena are effectively the gold standard, albeit manual.

    When it comes to LLM code generation, however, it seems to be a bit simpler. Basically - 1) does the code compile? and 2) does it pass a few of the tests that we created?

    What was particularly interesting to me, however, was how many different types of benchmarks there are - and the lengths folks go to to prevent leakage into LLMs (which would render the benchmarks useless).

    HumanEval & MBPP

    HumanEval is a hand-written dataset, designed to ensure that there isn’t any leakage into new LLMs. “It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.” (source).

    HumanEval’s dataset contains “a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem.”

    On the other hand, MBPP is crowd sourced and contains a “task description, code solution and 3 automated test cases” (source).

    Ultimately, “HumanEval tests the model’s ability to complete code based on docstrings and MBPP tests the model’s ability to write code based on a description.”
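To make the distinction concrete, a hypothetical HumanEval-style task might look like the following - the model sees the signature and docstring and must complete the body (the function name and tests here are invented for illustration):

```python
def add_positive(nums):
    """Return the sum of the strictly positive numbers in nums."""
    # --- everything below is the model-generated completion ---
    return sum(n for n in nums if n > 0)

# Hidden unit tests then judge the completion (HumanEval averages
# 7.7 tests per problem; this sketch uses two):
assert add_positive([1, -2, 3]) == 4
assert add_positive([-1, -5]) == 0
```

An MBPP-style prompt, by contrast, would just be the plain-language description - “write a function that returns the sum of the positive numbers in a list” - plus its three test cases.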


    Correctness of a solution is measured using the pass@k metric. At a high level, pass@k works by doing the following (thank you, Bard):

    1. Generate k samples for a given problem.
    2. Run the samples through a set of unit tests.
    3. Count the number of samples that pass all of the unit tests.
    4. Calculate the probability that at least one of the k samples is correct.
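The four steps above can be sketched as follows. This is the naive estimator (count a problem as solved if any of the k samples passes), not the bias-corrected pass@k formula, and the problem/test representation is invented for illustration:

```python
def pass_at_k(problems, k):
    """Fraction of problems where at least one of k generated samples
    passes all of that problem's unit tests.

    Each problem is a (generate, tests) pair: `generate` returns one
    model sample, `tests` is a list of predicates over that sample.
    """
    solved = 0
    for generate, tests in problems:
        samples = [generate() for _ in range(k)]               # 1. generate k samples
        results = [all(t(s) for t in tests) for s in samples]  # 2-3. run the unit tests
        if any(results):                                       # 4. at least one is correct
            solved += 1
    return solved / len(problems)
```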
  3. You Can’t Sell Trees No One Cuts Down by Matt Levine

    Matt Levine is going to be a mainstay on this blog - absolutely love his newsletter.

    This article is a particularly fascinating look into policy and human incentives. And maybe more selfishly, as a PM, it’s a lesson in being really thoughtful about what you’re building to avoid unintended consequences.

    Carbon Credits

    “In particular, a classic form of carbon credits comes from designating some forest and promising not to cut down the trees in that forest. That naturally leads to dubious accounting regimes: The cheapest way to generate those credits is by promising not to cut down trees that you wouldn’t have cut down anyway.”

    The way they do this is to compare against a reference group - how do you prove that you didn’t cut down a forest? Compare it to a similar forest: if yours has more trees by year’s end, you’ve saved some portion of trees.

    Well, if you really want to make $$$ - just burn the reference forest down? Not really great for the environment, but $$$?
