Laird Stewart's blog

There's no free lunch in software

2026-05-22T00:00:00Z

Scott Sumner wrote a good post earlier this month on the usefulness of tautologies which got me thinking about tautologies in software development. One which came to mind is

"To reduce intrinsic complexity of a program one must reduce the complexity of the problem it's solving"

I liken this to the no free lunch theorem (hence the title of this post) because when designing ML algorithms, you can only improve performance by limiting the set of problems under consideration (i.e., making assumptions about the problem at hand). Similarly, in software, you can only simplify code by making simplifying assumptions about the problem you're solving.

So what can we learn from this tautology? If you believe that software developers should strive for simplicity (as John Ousterhout argues in "A Philosophy of Software Design"), then you must also believe software developers should focus on making assumptions.

What are assumptions to a software engineer? Assertions, preconditions, and API contracts! We should therefore strive to write methods that looks like

/**
 * Assumes that bar() was called first.
 */
public void foo(Key key)
{
    Preconditions.checkArgument(key != null);
    Preconditions.checkArgument(cache.contains(key));
    Preconditions.checkState(isCacheInitialized());
    ...
}

with the understanding that it will simplify the final program. I've recently found myself writing more and more methods that look like this. When I started out, I would have looked at comments like

"this method assumes that it will only be invoked once"

"this only works assuming no more data will be added to this collection"

as "cop-outs" or lazy coding, when in reality they are important design decisions. Every simplification is the flip-side of an assumption, and it's better to document (or code!) assumptions directly than leave them implicit.

April 2026 Roundup

2026-04-01T00:00:00Z

A demand-side analysis of AI's impact on the job market

This is the best article I've read since starting this newsletter. If you click on one link, make it this one.

As people get richer, they don’t just want more commodities. They want things that aren’t commodities in the standard sense of the word. The social aspects of products such as the relationships, the status, and exclusivity—what Rene Girard called the mimetic properties of desire—become much more relevant once people’s basic needs are satisfied. And the demand for these properties will bring the human element back into the production process, and with it, the jobs.

When AI automates commodity production, prices in that sector fall. That raises real income. If the goods and services people want more of as they get richer lie disproportionately in the relational sector, demand shifts in that direction. Baumol’s cost disease then amplifies the result: if the relational sector remains harder to automate, it becomes relatively more expensive and absorbs a growing share of total expenditure.

What will be scarce?, Alex Imas, 4/14/26

If you collaborate using Git, here are some fun commands to run

git log --format=format: --name-only --since="1 year ago" | sort | uniq -c | sort -nr | head -20
git shortlog -sn --no-merges
git log --format='%ad' --date=format:'%Y-%m' | sort | uniq -c

The Git Commands I Run Before Reading Any Code, Ally Piechowski, 4/8/26

One Dimensional chess

1D-Chess, Rowan Monk

Frontier AI labs are profitable on the margin

I had a discussion this week with co-workers about future AI cost for consumers. They argued that the model companies aren't profitable, and that they will inevitably raise prices to recoup their investments. This is a suspect claim, but apparently popular among otherwise well-informed people, so I thought I'd address it here. Two important pieces of information

Holding capability constant, inference cost has fallen 10x-100x year over year
While providers are operating at net losses, they are profitable on the margin

Micro 101 tells us that the firms will serve their models at the point where marginal cost equals marginal revenue. Importantly, upfront costs don't affect this equilibrium. Even if OpenAI were a monopoly and faced no competition, so long as the elasticity of demand is not 0, if marginal costs decrease, part of those savings will be passed along to consumers.

LLM inference prices have fallen rapidly but unequally across tasks, EpochAI, 3/12/25
No, it doesn't cost Anthropic $5k per Claude Code user, Martin Alderson, 3/9/26

March 2026 Roundup

2026-03-01T00:00:00Z

Grade inflation hurts students in the long run

Being assigned a higher average grade inflating teacher reduces a student's future test scores, the likelihood of graduating from high school, college enrollment, and ultimately earnings. ...The cumulative impact is economically significant: a teacher with one standard deviation higher average grade inflation reduces the present discounted value of lifetime earnings of their students by $213,872 per year.

Easy A's, Less Pay: The Long-Term Effects of Grade Inflation, NBER, 3/2026

Related, one proposed solution to grade inflation: capping the number of A grades awardable, penalizes students for taking difficult courses.

The real problem is not inflation per se. It’s that students are penalized for taking harder courses with stronger peers. A grade cap leaves that distortion intact—and can even amplify it.
...
The underlying issue is informational. A grade tries to capture two things—student ability and course difficulty—with a single number. Gans and Kominers show that in general this is impossible: if some students take math and earn B’s while others take political science and earn A’s, there is no way, from grades alone, to tell whether the difference reflects ability or course difficulty.

Grade Caps are Not a Good Solution to Grade Inflation, Marginal Revolution, 3/30/2026

Malus provides "clean room as a service" (i.e., the removal of external dependencies from a software project)

Our proprietary AI robots independently recreate any open source project from scratch. The result? Legally distinct code with corporate-friendly licensing. No attribution. No copyleft. No problems.

This is a consequence of zero-cost software I hadn't considered. An important question is whether the negative effects of this on the open source ecosystem will be outweighed by the increase in AI-assisted contributions. Also raises some interesting legal questions.
https://malus.sh

Robin Hanson's "Grabby Aliens" model (2020) suggests that if humans stay put on Earth, we won't discover aliens for 500 million years (median estimate) but once we do, it will only be a few years before they arrive on our doorstep. Fun idea.

I think this factor is under-estimated when discussing the Fermi Paradox. If most of the planets in the universe are too far away for us to see alien life, then if we see it at all we’ll be seeing their space ships as they come to us. But we won’t even see them launch to us, even with perfect telescopes staring out into the galaxy, until they’re almost here. In practice this means that, in the grand scheme of human history, the phase between becoming aware of aliens and meeting them is vanishingly short.

Notes on the Fermi Paradox, Casey Handmer, 3/3/2026
How Far to Grabby Aliens?, Robin Hanson, 12/2020

Trained to play tennis, humanoid robots can hold a cooperative rally.

This is the most recent jaw-dropping robotics demo I've seen. The footwork is particularly impressive.
LATENT: Learning Athletic Humanoid Tennis Skills from Imperfect Human Motion Data, 3/13/26

When it comes to AI's economic impact, always keep your eye on production .

The real issue is not “Who will get the profits from AI?”; the most interesting question is whether AI will lead to the production of 130 million household servant robots, or the production of another 2000 mega-yachts. When examining issues of inequality, it often makes more sense to focus on the structure of output, not the distribution of income.
...
I often see discussions of AI that makes a similar error, failing to understand that the essential question is output, not distribution. Many worriers about AI don’t seem to understand that these two scenarios are almost identical:

1. What if AI replaces all jobs?
2. What if America becomes so rich that we can all live as billionaires?

Imagine 130,000,000 Washing Machines, Scott Sumner, 01/01/2026

Brain-drain to the U.S. has significantly slowed Canada's economic growth

From 2014 to 2024, Canada’s real GDP per capita adjusted for purchasing power parity grew by just 3.2 percent in total, an anemic 0.4 percent per year on average, and the third lowest among 38 advanced nations. Over the same period, the United States posted 20.2 percent total growth (1.9 percent annually), and the OECD average reached 15.3 percent (1.4 percent annually). The measurement shortcomings cannot explain five-to six-fold differences in growth rates.
...
The analysis estimates that a substantial share of Canadians who would rank among top earners in Canada have emigrated to the United States—roughly 40 percent of potential top 1 percent earners and 30 to 50 percent of the next nine percentiles. Canadian-born individuals in the United States are more educated than native-born Americans, earn substantially more, and cluster disproportionately in top income deciles.
...
Canada is effectively exporting its inequality to the U.S. The brain drain simultaneously lowers our average income while raising American income, accounting for a significant share of the persistent GDP gap.

Why Canada's GDP per capita crisis is real: DeepDive, The Hub, 03/20/2026

Motivating the Particle Filter

2026-02-26T00:00:00Z

I’ve spent a while searching for resources to provide new hires on particle filters. There are two main categories: theoretical introductions (e.g., A Tutorial on Particle Filters for Online Nonelinear/Non-Gaussian Bayesian Tracking and conceptual blog posts (e.g., Emma Benjaminson’s Series). I’ve yet to find a resource with the following characteristics

Accessible to a Math/CS undergrad
Follows a single example, from Bayesian Inference to a Kalman Filter
Has visualizations of multiple-dimensions and hidden variables
Demonstrates the problems Kalman and Histogram filters encounter

I hope to fill that gap. These notes grew out of a recruiting talk I gave at UIUC and are intended as a conceptual primer for one of the theoretical introductions. A familiarity with calculus, statistics, and linear algebra is useful.

Outline

I’ll follow a single, unifying example: Imagine we’re in a submarine equipped with active sonar. Like a bat, we can emit a sound and listen for its echo. Given the echo’s elapsed time and speed of sound, we can calculate our distance to things. Unfortunately, we don’t have a perfect knowledge of the speed of sound underwater, as it varies depending on temperature, salinity, and depth. We are tasked to find another submarine which is somewhere nearby. We’ll start with the simplest case: Both submarines are stationary and we want to estimate only the distance to the other submarine.

Each act builds on the previous by adding one layer of complexity. A new technique (and limitations of the old one if applicable) will be discussed at each stage

Act	Added complexity	Topic
1	Single measurement, stationary submarine (Baseline)	Bayesian Inference
2	Multiple measurements	Recursive Bayesian Inference
3	Moving submarine	Kalman Filter
4	Non-Gaussian prior	Histogram Filter
5	Tracking more dimensions than range (e.g., lat, lon)	Particle Filter

Act 0 provides a recap of Bayes’ rule. Act 6 describes resampling and perturbations: two heuristics for better performance with less computation.

Act 0: Math Recap

The easiest way to remind yourself of Bayes’ theorem is to re-arrange the law of conditional probability (the comma means “and”, and the bar means “given”):

$P(A, B) = P(A|B)P(B) = P(B|A)P(A)$

$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$

Bayesian inference is a technique which repeatedly uses Bayes’ theorem to understand some outcome/event of interest $(A)$ as we observe events/collect data which tells us something about it $(\textrm{e.g., } B)$ . For example, “given that a card is red, what is the probability it is a heart?”

$P(A)$ is called the prior. This is the initial degree of belief in the hypothesis $A$ .
$P(A|B)$ is called the posterior. It is the degree of belief in $A$ after incorporating the news of $B$ .
$P(B|A)$ is the likelihood. It is the probability of the data given the hypothesis
$P(B)$ is called the evidence, or marginal likelihood. It is the probability of the data under all hypotheses.

The crux of Bayesian inference is that once we have $P(A|B)$ , if we observe another event, $C$ (which is independent of $B$ ) we can “update” our belief about $A$ by making $P(A|B)$ our prior and starting again to find a new posterior, $P(A|B,C)$ . Note that we don’t have to start from scratch each time we get additional evidence, only compute Bayes’ theorem one more time. All of our knowledge thus far about $A$ is contained in the posterior.

What do I mean by “degree of belief”? If you haven’t come across the distinction between Frequentest and Bayesian statistics, a frequentest will take a weighted coin, and will say “flip it 1 million times, and the ratio of $\frac{\text{\#heads}}{\text{\#flips}}$ is the probability it lands heads. A Bayesian will take a weighted coin and say probability is the”degree of belief” I hold that it will land heads. i.e., if I had to place a bet on it, what “probability” would make for a fair betting line?

We can derive Bayes’ theorem for probability density functions similarly:

$f_{X,Y}(x,y)=f_{X|Y=y}(x)f_Y(y)=f_{Y|X=x}(y)f_X(x)$

$f_{X|Y=y}(x)=\frac{f_{Y|X=x}(y)f_X(x)}{f_Y(y)}$

Remember the evidence, $f_Y(y)$ , is the probability density of the data under all hypothesis (possible values of $x$ ). In the continuous case you would write this as an integral

$f_Y(y)=\int f_{Y|X=x}(y)f_X(x)dx$

Act 1: One Measurement of A Stationary Submarine

To our submarine example, let’s make the following simplifications:

The problem is one-dimensional
Both submarines are stationary
The sonar readings are centered on the true distance with some noise. To be precise, they follow the normal distribution with $\mu=\text{true distance}$ , and $\sigma=100$ . A positive value means the object is in front, negative means it is behind.

   🛥️ ········))       🛥️
<───┼───────────────────┼───> x
    0                  ❓

Given the following measurement

Measurement	Range
#1	4781 meters

what is the probability distribution of position of the other sub? Let’s approach this as a Bayesian. Call the distance to the other submarine $x$ and the sonar reading $y$ . Here’s Bayes’ theorem again:

$f_{X|Y=y}(x)=\frac{f_{Y|X=x}(y)f_X(x)}{f_Y(y)}=\frac{f_{Y|X=x}(y)f_X(x)}{\int f_{Y|X=x}(y)f_X(x)dx}$

$\substack{\text{p-distribution of the distance x} \\ \text{given the sensor reading y}} = \frac{\substack{{\text{p-distribution of the sensor }}\\ {\text{reading y given distance x}}} \times \substack{\text{prior p-distribution}\\{\text{ of the distance x}}}}{\substack{\text{p-distribution of the sensor}\\{\text{reading y across all possible x}}}}$

Here, $X$ represents the “state” of the other submarine (also called our “hypothesis”) and $Y$ represents the Evidence/Data we have about this state. The core problem is that we have a probability density over possible observations, $f(Y|X)$ , but we want a pdf over state, $X$ . Bayes’ theorem is what achieves this.

We need two things to calculate our posterior. First, the prior $f_X(x)$ . This represents our belief about the other sub’s position before receiving any measurement. Since we don’t have any a-priori knowledge, we can use a Gaussian distribution with extremely high variance (i.e., very flat) which loosely says “it could be anywhere”.

$f_X(x)=\mathcal{N}(\mu=0, \sigma^2=10^8)=\frac{1}{\sqrt{10^8\times2\pi}}\exp\left(-\frac{1}{2}\left(\frac{x^2}{10^8}\right)\right)$

Aside: however wide, this prior is not uniform: it has slightly more probability around 0 than at 1000m. We can’t make our prior’s variance infinite as that is ill defined, but we could have used an uninformative, improper prior. “Uninformative” meaning it contains no information (uniform everywhere) and “improper” because it doesn’t integrate to 1. While I won’t go into more detail here, I point it out because sometimes people say you can “initialize the particle filter using the first measurement” (i.e., use the normalized likelihood function of the first measurement as your prior), but really what they’re doing is applying Bayes’ rule with an uninformative prior.

Second, we need our likelihood function, $f_{Y|X=x}(y)$ . This is defined as the probability of measuring $y$ given the state, $x$ . The problem statement directly gives this to us: “The sonar readings are centered on the true distance with standard deviation of 100”. Here, the “true distance” is $x$

$f_{Y|X=x}(y)=\mathcal{N}(\mu=x, \sigma^2=10^4)=\frac{1}{\sqrt{10^4\times2\pi}}\exp\left(-\frac{1}{2}\frac{(y-x)^2}{10^4}\right)$

Now, we can plug these into Bayes’ theorem and solve for the posterior $f_{X|Y=y}(x)$ . Fortunately, the product of two Gaussian Distributions is also Gaussian (proof) with mean and variance

$\sigma=\sqrt{\frac{\sigma_1^2\sigma_2^2}{\sigma_1^2+\sigma_2^2}},\quad\mu=\frac{\mu_1\sigma_2^2+\mu_2\sigma_1^2}{\sigma_1^2+\sigma_2^2}$

The denominator (evidence), $f_Y(y)$ , is a scalar value (remember, $y$ is given). Therefore, the posterior is also a Gaussian. I won’t go through the entire derivation here, but understand the solution is analytical

Aside: an analytical solution is derived by moving around variables with pencil and paper (e.g., anything you did in a high school algebra class). It is exact. Numerical solutions, on the other hand, are calculated via an algorithm and are approximate. They typically incur some discretization or rounding error which approaches zero in the limit of infinite memory or compute time.

I’ll note a few things. First, the prior is hard to see because it is pretty close to a flat line right around 0. Second, the mean of the posterior is very slightly less than the measurement. This is because our prior is centered at zero, so it will still “pull” the posterior towards it, no matter how flat it is. Third, the equation for the standard deviation, $\sigma$ , above guarantees that the variance of the posterior is less than or equal to the variance of prior and likelihood Gaussians.

Act 2: Multiple Measurements of A Stationary Submarine

Now, we get a second measurement,

   🛥️ ···))·····))     🛥️
<───┼───────────────────┼───> x
    0                  ❓

Measurement	Range
#2	4952 meters

And we’d like to update our probability distribution of the other submarine’s position, $f_X(x)$ . Recalling the math recap from the beginning, we can use the posterior from Act 1 as our new prior, rinse and repeat.

$\text{Act 2 posterior} = \frac{\substack{{\text{measurement \#2}}\\ {\text{likelihood}}} \times \text{Act 1 posterior}}{\substack{\text{probability of measurement \#2 }\\{\text{across all possible x}}}}$

Again, the variance of the posterior is smaller than the two inputs. Since the variance of the prior and likelihood are roughly the same, the mean of the posterior is half way in between the two.

One of the powers of Bayesian inference is that it allows subjective priors. By that, I mean your domain knowledge and lived experience may give you a hunch that the other sub likes to hang out in some area and is therefore probably some distance away. Bayesian inference allows you to turn this “hunch” into a prior. This is possible because (in theory) no matter what prior you choose (so long as it is not zero where it matters) with enough measurement updates eventually your posterior will converge on the true distribution. Therefore we can leverage subjective “hunches” while maintaining some mathematical guarantees.

Act 3: Multiple Measurements of a Moving Submarine

Now, the other submarine is moving, but we don’t know how fast. We take sequential sensor measurements. Our goal is to estimate (at the time of the last measurement) the velocity and range of the other submarine. We’ll make the following assumptions

   🛥️ ···))······))    🛥️💨
<───┼───────────────────┼───> x
    0                  ❓

Everything is still in one dimension
The other sub is moving at a constant velocity
Our prior on the other submarine’s velocity is $\mathcal{N}(\mu=0, \sigma^2=5)$ where a negative value means it is getting closer.

The state of the other submarine is now a random vector $X$ :

$x= \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}= \begin{bmatrix} \text{position} \\ \text{velocity} \end{bmatrix}$

And its probability distribution is now multivariate Gaussian $X\sim\mathcal{N}(\mu,\Sigma)$ . Our prior is

$\mu =\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \Sigma=\begin{bmatrix} 10^8 & 0 \\ 0 & 5 \end{bmatrix}$

Again, we will employ Bayesian inference. The problem is now two dimensions, but conceptually the technique is the same. We will treat the first measurement just as we did in Act 2. However, after calculating the posterior $f_X(x)$ of the random vector, instead of immediately turning it into our next prior, we first must update it with time. To do this, we need something called a “motion model” or “state update equation”. Given a state at $t-1$ , and our assumption of constant velocity, the state at $t$ is given by

$\begin{aligned} \textrm{pos}_t&=\textrm{pos}_{t-1}+\text{vel}_{t-1}\Delta t \\ \textrm{vel}_t&=\textrm{vel}_{t-1} \end{aligned}$

where $\Delta t$ is the time in between each sensor reading. This is a linear transformation we can write in matrix form:

$\begin{bmatrix} \text{pos} \\ \text{vel} \end{bmatrix}_t = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \text{pos} \\ \text{vel} \end{bmatrix}_{t-1} = A \begin{bmatrix} \text{pos} \\ \text{vel} \end{bmatrix}_{t-1}$

(I’m sticking with convention here to call this matrix $A$ ).

Given a state vector we can transform it with time, but how do we transform a continuous distribution? Let’s change our framing slightly; instead of thinking about probability distributions, let’s consider the random variable $X=\{\text{pos},\text{vel}\}$ which this distribution describes. Again, we are fortunate that our variable is Gaussian. Thinking back to an intro stats class, you may remember that multiplying a random variable by a constant $b$ scales its mean by $b$ and its variance by $b^2$ . This extends to random vectors. Given any linear transformation $Y=BX$ ,

$E[Y]=B E[X]$

$Cov(Y)=B Cov(X)B^T$

Because the Gaussian distribution is defined by this mean and variance, the resulting distribution of our random variable is also Gaussian with

$\mathcal{N}(\mu,\Sigma)\to \mathcal{N}(A\mu,A\Sigma A^T)$

So long as our motion model is a linear transformation and our prior and measurements are still Gaussian, the posterior will remain Gaussian, and therefore the entire process can be solved analytically. Say we receive the following three measurements:

Measurement	Time	Range
#1	0 seconds	7507 meters
#2	60 seconds	7671 meters
#3	200 seconds	8099 meters

To process the first, we apply Bayes’ theorem just as we did in Act 2. I’ve plotted top-down heatmaps of our probability distributions. Getting oriented, an informationless prior would look like a uniform, light orange background. Our prior is a bit better than informationless: we know the velocity is probably between $\pm 4$ m/s.

Our sensor still provides only range information. Therefore it appears as a 1D Gaussian smeared uniformly across the velocity axis.

Aside: I haven’t plotted the actual prior here – it would appear as a single color. I’ve emphasized things for visual effect.

Notice how the posterior mean is slightly lower than 7507. This is because our prior is not truly “informationless” it is just a very flat Gaussian centered at 0, so it will pull the posterior slightly towards 0.

Aside: the first posterior’s covariance matrix has 0s in its off diagonals. This is a fancy way of pointing out that it isn’t slanted. After we apply the motion model, these off-diagonal elements become non-zero and it becomes slanted. It is this covariance matrix that carries the “information” about the relationship between the possible positions and velocities: i.e., “If the sub has a positive velocity it is probably further away now”.

Let’s zoom out for one moment. Our goal is to motivate the particle filter. Until this point we haven’t defined “particle” or “filter”. The approach I’ve just described is is called “recursive Bayesian estimation” or a “Bayesian Filter”. The first phrase describes exactly what we’ve done: recursively apply Bayes’ theorem to estimate the state of the submarine. As for the latter, “filtering” simply means we are estimating the state of the submarine at the time of the last measurement as opposed to during the whole encounter. At this point you know literally 1/2 of “particle filter”, and our conceptual understanding is roughly half way there as well.

In particular, a Bayesian filter where the prior and measurements are Gaussian and the state update is linear is called a “Kalman Filter”.

Notice one more thing about our results. Our final posterior tells us that the other submarine’s velocity is between +2 and +4. This is remarkable because we never received any information about its velocity! We call velocity a “hidden state” because it is never observed. One powerful trait of Kalman Filters is that they can provide estimates of states which are never observed.

Aside: the Kalman filter can also represent Gaussian noise in the state update. For example, imagine the velocity of the submarine depended on the temperature/salinity of the water. We could model the effect of this unknown environment as Gaussian noise (just like we did with the measurements).

Aside: the “extended Kalman filter” (EKF) can use any differentiable function for the state-transition model (rather than linear). This involves constructing a Jacobian (matrix of partial derivatives) to update the covariance with. The EKF is the de facto technique of GPS systems.

Act 4: Histogram Filter

Now imagine that before receiving the first sensor report, we receive an intelligence that the other submarine is between 4,600 and 6,000 meters away.

   🛥️ ···))······))       🛥️
<───┼────────────────┼──────────┼───> x
0                  4,600  ❓  6,000

And then we receive the following three measurements:

Measurement	Time	Range
#1	0 seconds	4781 meters
#2	60 seconds	5063 meters
#3	200 seconds	5510 meters

We can incorporate the intelligence report into our Bayesian framework as a uniform prior:

$f_X(x) = \frac{1}{\text{1,400}} \quad \text{for } x \in [\text{4,600}, \text{6,000}]$

The problem with a uniform prior is that we can no longer solve the problem analytically. What can we do instead? One option divide the x-dimension into 200 “bins” then calculate the prior and likelihood for each bin. Then, using the same linear motion update

For each bin in step $(t-1)$ ’s posterior, transform its state $x_1$ into $x_2$ using $A$ and add that probability to step $t$ ’s prior’s bin at $x_2$ .

Aside: alternatively we could have used a Gaussian prior with ( $\mu=5,300, \sigma=700)$ , called it close enough, and used a Kalman Filter. This sounds hacky, but it’s often often a good option.

Let’s take one more step back. What we’ve just constructed is called a “Histogram Filter”. Our solution is no longer analytical – that is we’ve approximated our continuous distributions with a finite grid. This sacrifices some precision, but allows us to consider non-Gaussian priors.

In this Act we used the same motion model $A$ as in Act 3. However, now that we’ve abandoned the requirement of an analytical solution, our motion model no longer needs to be linear either. So long as we can come up with a way to “update” the probability in the posterior forward in time, we can do so however we wish. For example, say we know that every 10 minutes the other submarine turns around. We could have a literal if-else statement in our motion update that if $t\%(10*60)==0$ , negate the velocity. This is impossible with a Kalman Filter.

Finally, note that while our measurements have remained Gaussian they need not have. Once we start solving things numerically, we can drop any requirements about functions being Gaussian.

Act 5: Particle Filter

Now let’s consider a higher-dimensional problem: both submarines move in three dimensions. We would like to track both position $\{x,y,z\}$ and velocity $\{v_x,v_y,v_z\}$ . Assume our measurements are 3d Gaussians on $\{x,y,z\}$ and our prior is a uniform distribution over a cube in $\{x,y,z\}$ .

If, like before, we split each dimension into 200 bins, we now have $200^6=64\;\text{Trillion}$ bins. Using one float (32 bits) to store each bin’s value would require $32\times 64\times 10^{12} \;\text{bits}\approx250\;\text{Terabytes}$ . This is an example of “the curse of dimensionality”.

Let’s think about what’s happening here. In our 2-dimensional example, after 3 measurements we are almost certain the other submarine has a positive velocity. We nonetheless multiply the prior (0) by the likelihood (0) to get the posterior (0) for every bin with negative velocity. This is a tremendous waste of compute. There are some tricks like having a dynamic discretization: have very wide bins where there is no probability and very small grids where the probability is non-zero. Possible, but adds quite a bit of complexity.

Note that the Kalman filter does not suffer from the curse of dimensionality. Since it only needs to keep track of the mean and covariance matrix, which, in 6 dimensions only require 42 floats.

Aside: the idea of assuming a Gaussian distribution to avoid the curse of dimensionality is not unique to the Kalman Filter. In machine learning, Quadratic Discriminant Analysis and Naive Bayes leverage this idea.

The particle filter solves this problem. Instead of moving our probability between bins, leaving many bins with zero probability, why not move the bins themselves? Call these moving bins “particles”. Each particle is comprised of its state $\{\text{position},\text{velocity}\}$ and its probability which we call its “weight”.

To construct our set of particles, sample from the first prior distribution. There is no longer a need to discretize our likelihood functions as we can evaluate them directly at the state of each particle. After applying the likelihood function, we update each particle using our state-transition model $A$ . Like the histogram filter, we need not use a linear state-transition model or Gaussian prior/likelihoods. Note that the superscripts denote the particle index, not an exponent. Typically, the subscript is reserved for the time index.

$\text{particle 1 (state}\ x^1):\quad x^1_t = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} x^1_{t-1}$

$\text{particle 2 (state}\ x^2):\quad x^2_t = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} x^2_{t-1}$

$\vdots$

At each step the prior is the set of original particles and their weights and the posterior is the set of particles with updated weights. Each particle’s weight is updated according to Bayes’ theorem.

$\text{particle 1 (state}\ x^1):\quad \text{new weight} \propto \frac{\substack{{\text{likelihood of sensor }}\\ {\text{reading y given }x^1}} \times \text{old weight}} {\text{evidence}} \propto \text{likelihood}\times\text{old weight}$

$\text{particle 2 (state}\ x^2):\quad \text{new weight} \propto \frac{\substack{{\text{likelihood of sensor }}\\ {\text{reading y given }x^2}} \times \text{old weight}} {\text{evidence}} \propto \text{likelihood}\times\text{old weight}$

$\vdots$

I’ve written “ $\propto$ ” (proportional to) instead of “ $=$ ” because there is one more step to calculate the new weight. The particles comprise a probability distribution, so the sum of their weights should sum to 1. Therefore, we divide each weight by their sum. The effect of this normalization is the same regardless of any constant factor, so we can ignore the evidence constant.

Aside: people often get lazy and say “the particle’s likelihood”. This is short for “the likelihood of the measurement conditioned on the particle’s state”. Particles don’t have likelihoods.

For consistency, I’ll show the same 2D example as from Act 5. It’s important to understand that this technique can scale to higher dimensions, but things are simpler to visualize in 2D.

Note that in these plots I’ve set a floor on the opacity of each particle to help visualize things. Had I let the opacity go to zero, the very opaque particles would all disappear.

I’ll pause here and say that a particle filter with infinite particles and a histogram filter with infinite bins, will converge on the same analytical posterior provided by a Kalman filter. Of course, we’re doing all this to avoid trillions of cells let alone infinite ones, but it’s good to know things will converge to the correct answer. Everything from here on is a trick to use less compute, not a mathematical requirement.

Aside: what I’ve described here is called a “bootstrap filter”. It is a particular kind of particle filter where certain assumptions are made so that $\text{new weight}\propto \text{likelihood}\times\text{old weight}$ . This isn’t always the case. I’ve omitted the mathematical details since I don’t think they are necessary for a conceptual understanding. You can read more on Wikepdia here.

Act 6: Resampling and Perturbations

My critique of the histogram filter was that it wasted effort computing bins which had zero probability. We could make the same critique here: most of the particles have near-zero weight but we keep them around.

The solution is to “resample” the particles. Imagine we have 100 particles. Originally each has weight $1/100$ . After performing the measurement update, half have weight zero and half have weight $1/50$ . To resample we throw away the particles with 0 weight and duplicate each of the survivors. We’re left with a total weight of 1 and our surviving particles are all in the “region of interest”. I won’t go into more detail about resampling, there is no shortage of explainers on the internet which focus on it.

A common technique which accompanies resampling is to add “perturbations” to the resampled particles. That is, add a bit of Gaussian noise to each perturbed particle’s state. This is because we only have finite particles, so we would like to “smear out” the survivors to more evenly cover the space.

Aside: it’s at this point that things start becoming more of an art than a science. As you progress developing a particle filter, more and more decisions will start to fall in the former category

Here are the results of adding resampling and perturbing to our particle filter:

Fair warning: particle filters fall apart in high dimensions. There just aren’t enough particles to cover the space (curse of dimensionality). Things fare better if things are roughly Gaussian.

February 2026 Roundup

2026-02-01T00:00:00Z

Plenty was written on the Pentagon's falling out with Anthropic this week. This was one of the more clear-headed articles.

But the deeper problem isn't who's right in this negotiation; it's that the negotiation is happening at all. The terms governing how the military uses the most transformative technology of the century are being set through bilateral haggling between a defense secretary and a startup CEO, with no democratic input and no durable constraints. Congress should be setting these rules. And it should do so in a hurry.

Lawfare, Alan Z. Rozenshtein

A good visualization of grid-scale energy being brought online in 2026.

U.S. Energy Information Administration

A two-part series considering the policy implications of recursively improving AI systems.

There is one assumption I’ll ask you to make with me, which is that substantial automation of AI research is a near-term possibility. This requires believing a few things. First, that AI research and engineering is substantively composed of work like: finding optimizations in various complex software systems; designing and testing experiments for AI model training and posttraining; and creating software interfaces to expose AI model capabilities to users. Second, that a great deal of this work is essentially reducible to the engineering of software. Third, that AI models, while not yet geniuses, are reaching quite high levels of human competence. Fourth, that frontier lab leadership and staff are serious when they describe AI research automation as a near-term goal, and that frontier lab research staff are telling the truth when they say that AI is already writing a large fraction of their code.

I agree with Dean that policymakers should seriously discuss recursive self-improvement. However, his last two premises are weak. "Competence" is too vague to be useful. Models have exceeded human "competence" on narrowly defined math and coding tasks for 12 months. It shouldn't be a surprise they write most code at these companies. However, without metacognitive monitoring and self-directed planning, these look more like another step in the evolution of compilers, IDEs, and build tools than a drop-in labor replacement. I'm not saying that won't happen, just that he's missing a premise.

On Recursive Self-Improvement (Part 1), Dean Ball

The vegetables in VeggieTales are not Christian.

Substack, Justin Kuiper

Scott Alexander's rebuttal to the "stochastic parrot" argument

Scott argues that the "LLMs are just next token predictors" argument is confused. It can't say anything about a model's intelligence or world model. After all, humans are "just next sensory input predictors".

Recently, they [Anthropic] explored how Claude predicts where a line break will be in a page of text. Since line break is a token, this is literally a next-token prediction task. The answer was: the AI represents various features of the line breaking process as one-dimensional helical manifolds in a six-dimensional space, then rotates the manifolds in some way that corresponds to multiplying or comparing the numbers that they’re representing ... Next-token prediction created this system, but the system itself can involve arbitrary choices about how to represent and manipulate data.

Next-Token Predictor Is An AI's Job, Not Its Species, Scott Alexander

Film students can't sit through films.

At Indiana University, where Erpelding worked until 2024, professors could track whether students watched films on the campus’s internal streaming platform. Fewer than 50 percent would even start the movies, he said, and only about 20 percent made it to the end

A lot of discussion around this article focused on attention span. I think it's as much a sign of decreasing effort/standards. If I had to guess, I'd say fewer than 50 percent of students at IU read the assigned books in English classes.

The Atlantic, Rose Horowitch

Bacteria as a treatment for cancer. An idea I hadn't heard before.

Ewingella americana exhibited remarkably potent cytotoxic activity with selective tumor-targeting ability characteristic of facultative anaerobic bacteria. Mechanistic investigations revealed that E. americana functions through a dual-action mechanism: direct tumor cell killing and robust activation of host immunity, leading to enhanced T cell, neutrophil, and B cell-mediated tumor attack. Treatment with E. americana significantly outperformed standard therapies, including anti-PD-L1 antibody and doxorubicin, in tumor regression studies.

These are mice studies, so take with a grain of salt.

Discovery and characterization of antitumor gut microbiota from amphibians and reptiles, Iwata et al.

Luxury apartment construction is bringing down rent.

Building more housing brings down rent prices (not surprising). More important, rent prices come down for Class-C housing even when most of those additions are in luxury apartments. Austin, Phoenix, and Denver have had an average growth in new units of 6.8%, 4.9%, and 4.3% respectively over the past 5 years.

Luxury Apartments Are Bringing Rent Down in Some Big Cities, Bloomberg

January 2026 Roundup

2026-01-01T00:00:00Z

Fewer questions were asked last month on Stack Overflow than during its first month of operation.

Stack Exchange, Anonymous Post

Every contribution to Claude Code in December by its creator was via Claude Code.

A year ago, Claude struggled to generate bash commands without escaping issues. It worked for seconds or minutes at a time. We saw early signs that it may become broadly useful for coding one day.

Fast forward to today. In the last thirty days, I landed 259 PRs -- 497 commits, 40k lines added, 38k lines removed. Every single line was written by Claude Code + Opus 4.5. Claude consistently runs for minutes, hours, and days at a time (using Stop hooks)

X.com, Boris Cherny

India received $140B in personal remittances in 2024. That's more than the profit of their largest 70 publicly traded companies combined ($120B). The U.S. accounts for roughly 30%.

Personal remittances, received (current US$) - India, World Bank
List of largest companies in India, Wikipedia
Remittances to India, Wikipedia

A good blog post on software work estimation

The common view is that a manager proposes some technical project, the team gets together to figure out how long it would take to build, and then the manager makes staffing and planning decisions with that information. In fact, it’s the reverse: a manager comes to the team with an estimate already in hand, and then the team must figure out what kind of technical project might be possible within that estimate.

How I estimate work as a staff software engineer, Sean Goedecke

Central Stockholm renters face a 20-year wait due to rent control

If you’re looking for a standard rental contract in Stockholm, you’ll have to be prepared to wait. Apartments are allocated through a waitlist, and in 2025, new tenants in the city center had waited an average of 21 years. Rents are regulated, and can be far below half of market rents at the most attractive addresses.

the up•date, Stefan Schubert

Gemini's performance on the "Needle in a Haystack" test blows competition out of the water.

They've got some secret sauce, maybe sub-quadratic attention, maybe RL improvements. Unfortunately, we won't know anytime soon. google seemingly solved efficient attention, Celeste

A 2014 interview with Walter Isaacson at Khan Academy

I'm a fan of Isaacson's books, but had never heard him speak. He's quite charismatic. He thinks one driver of creativity which we've lost is the connection between the arts/humanities and science/engineering (the classic Steve Jobs mantra). Similarly, he thinks that cities which lack creative types and a culture of humanities will fail to produce innovation in the long run. He was more bullish on SF than Palo Alto/Mountain View. Granted, this was 10 years ago. SF is back on the rise, but confounded by AI. Suggest starting at 24:00

Yeah, and I'm trying to say if you look at Ben Franklin, the most important scientist of his period. Even though you probably don't... You know, we think of him as a doddering dude playing his kite in the rain. Single fluid theory of electricity that comes from his electricity experiments is up there in that century, you know, with Newton, even. I mean, he's the best experimental scientist of his time. Jefferson would have thought you were a Philistine if you didn't study botany and everything else. Nowadays, people like a Ben Franklin don't do electricity experiments.

Walter Isaacson - President and CEO of the Aspen Institute, Khan Academy

SBF and Log Dollar Utility

2025-12-30T00:00:00Z

It would have served Sam Bankman-Fried to separate utility from dollars.

I was given How Not to Be Wrong for Christmas and am enjoying it so far. I've watched plenty of Numberphile and Veritasium and have read Stephen Pinker's similar book, Rationality, yet it introduces many topics I haven't come across before. One of which is St. Petersburg paradox:

I put $2 in a pot then flip a coin
If tails, the game ends and you win the money. If heads, I double the money
I flip the coin again. Repeat step 2

Clearly you should want to play this game, as you win no matter what. The question is, how much should you be willing to stake for the opportunity to play? $10? $20? If you do the math you'll find that the expected value (EV) of this game is infinite — the series diverges. Therefore in theory you should be willing to stake any finite amount to play the game. This feels wrong. I wouldn't stake my entire net worth, let alone one paycheck to play. The mathematics is sound, so the paradox is of human behavior. Perhaps humans underestimate small probabilities (I've never seen a coin land 10 heads in a row) or perhaps the question is nonsense (no house could ever fund the other side of the bet).

The traditional solution is to say the premise of the EV calculation is wrong: the player should calculate the EV of their utility not of the raw dollar amount. Further, their utility should be the logarithm of the dollar amount. I had never heard of doing such a thing: every example in school, every analysis of a casino/board game, every quant interview question calculates EV directly with dollars. However, on consideration, it doesn't seem outlandish. To rappers, money is logarithmic; lines are as often about figures as they are absolute dollars. Further, for most millionaires it would hurt more to go broke than it would feel good making another million. With this conception of utility, the expected value is no longer infinite.

How does this relate to SBF? In Sam's appearance on Conversations with Tyler, he was posed a similar question: if you could push a button with a 51% chance of doubling the number of sentient beings in the universe (e.g., double the number of earths) and a 49% chance of killing all living beings, would you press it? He answered that he would press it repeatedly. Most commentators flawed Sam for lacking empathy or being hyper-rational in his answer. From what I know from Going Infinite the former is fair. But I don't think the latter is. If Sam calculated his EV using the logarithm of the number of living beings (which I'm starting to think is correct), he wouldn't have arrived at the same conclusion and his answer would have been more wonky/"hyper-rational".

Taking this a step further, could solve some of the problems/paradoxes that arise with longtermist utilitarian ideas. If you're not familiar, some people argue that we should not (or should only barely) discount the utility of future sentient being's lives with time. Therefore since there could be many more living people in the future we should prioritize protecting the future above all else. Therefore, even if you choose not to discount with time, you could still avoid some of the problems with expected values and infinity if you consider utility to be the logarithm of the number of beings. Surprisingly, in all I've read about utilitarian philosophy I've never heard this idea, though I'm sure I'm not the first.

For another time, but you could obviously extend this to SBF's Alameda Research. I'm curious to what extent, if at all, hedge funds separate EV/utility from dollars.

December 2025 Roundup

2025-12-01T00:00:00Z

School cellphone bans likely have a small, positive impact on test performance

A recent difference in difference study finds Florida's 2023 secondary school cell phone ban increased test scores by 0.6 percentiles; a smaller effect than I expected.

The Impact of Cellphone Bans in Schools on Student Outcomes: Evidence from Florida, National Bureau of Economic Research

The UK without London is poorer than Mississippi

Tyler Cowen's alternative to Fischer Random Chess

In his conversation with Kenneth Rogoff, Tyler Cowen suggests an alternative to Fischer Random Chess: select starting positions from a database of equal positions 2 or 4 moves in. In Fischer Random, the back rank is randomized emphasizing over-the-board play over preparation. Cowen's alternative achieves the same goal, but results in more standard positions which observers can empathize with. An interesting idea I hadn't heard before.

Kenneth Rogoff on Monetary Moves, Fiscal Gambits, and Classical Chess (Ep. 241) , Conversations with Tyler

St. Petersburg Paradox

I put $2 in a pot then flip a coin
If tails, the game ends and you win the money. If heads, I double the money
I flip the coin again. Repeat step 2

Clearly you should want to play this game, as you win no matter what. The question is, how much should you be willing to stake for the opportunity to play? Blog post with the answer and related thoughts here.

How Not to Be Wrong, Jordan Ellenberg

I like how Casey Handmer and Terraform think about resumes

I could look at two identical candidates with similar career paths, and have no way of knowing that one of them built a jet engine in their living room. Their resumes are very similar and have no ability to choose between them. But a single photo of them in front of their own jet engine would tell me 95% of what I need to know to make a job offer. It would pretty much instantly put them on the top of the screening pile too

For Terraform, send us your one pagers. Ideally they will contain photos of awesome hardware you personally created, together with a brief and informative summary of how the project relates to your desired role with us

Maximizing Resume SNR, Casey Handmer

An argument to emphasize culture over process and incentives in software engineering teams

Culture matters, Dan Luu

Gregory Clark's The Son Also Rises gets repetitive but is worth the read.

Clark's central insight is that if you track the prevalence of a surname in an elite institution, it remains over- or under-represented much longer than traditional social mobility rates would suggest. The majority of the book is spent showing this holds regardless of time-period or location.

To explain this, Clark models status as latent variable. This variable is "indistinguishable" from genetics, depending only on that of the parents and randomness. The variable determines, again with random noise, observable outcomes like income or educational attainment. Traditional methods measure observable variables regressing to the mean, and therefore overestimate the pace of social mobility.

Clark concludes that status is mostly inherited, government efforts to increase social mobility have largely failed and as long as marriage is assortative (high-status people marry other high-status people) social mobility will remain slow.

The Son Also Rises, Gregory Clark

Bryan Caplan's The Case Against Education argues higher-ed is mostly signaling.

When we look at countries around the world, a year of education appears to raise an individual’s income by 8 to 11 percent. By contrast, increasing education across a country’s population by an average of one year per person raises the national income by only 1 to 3 percent. In other words, education enriches individuals much more than it enriches nations.

Correcting for underlying cognitive ability, Caplan estimates 60% or more of the education-wage premium is the sheepskin effect. In particular, he argues education signals intelligence, conscientiousness, and conformity. Firms pay this premium, so an open question on my mind is why probationary periods aren't more common.

The Case Against Education, Bryan Caplan
The World Might be Better off Without College for Everyone, Bryan Caplan

November 2025 Roundup

2025-11-01T00:00:00Z

An interesting multi-part retrospective on 2010s culture from a little-subscribed Substack

Before dropping out to found Theranos, Elizabeth Holmes had only two semesters’ worth of chemical engineering, but that was irrelevant, as the stamp of Stanford became the only relevant factor in her ridiculous narrative of entrepreneurship ... Unlike Holmes, Javice collected her degree, but classmates told Fortune she more or less spent her time collecting stamps, as she finagled her way into a prestigious incubator for entrepreneurship that the university hosted

There was a fairly recent tweet expressing outrage that Columbia University was revoking degrees to sanction student activists, and it said something like, “How can they revoke degrees? Degrees are earned.” While the work of attending classes and completing assignments and passing exams does seem to show the inherent value of a degree, the autodidact gets no respect for a reason. The conferral of a diploma, the recognition by a university and preferably a university with a good name, is what makes that work “real”

The Long 2010s, Part 1: Activity Without Action. Substack

Trader Joe's played a major role in popularizing wine (particularly Californian wine) in America. In 1970, Trader Joe's was the largest wine retailer in California.

Buyer for Trader Joe's stores helped popularize wine. Los Angeles Times
The Complete History & Strategy of Trader Joe's. Acquired

Prediction Error Minimization Theory suggests the brain works by refining a world model to minimize the difference between its prediction and what happens next. I had never heard of this; it's supervised learning for cognitive science, which seems too simplistic.

Is prediction error minimization all there is to the mind?. The Brains Blog

One economist calculates U.S. academic achievement declines will result in 6% lower GDP by the end of the century.

An 8 per cent lifetime 'tax' is coming for students. Washington Post

Last month, I shared an Atlantic article about the De Beers diamond cartel. They just ran a paid post in the NYT. Like their previous campaigns, it doesn't advertise any product; this time they mythologize the story of a single diamond discovery.

How Diamonds are Born. New York Times (Paid Post)

The first AI image that made my jaw drop. Courtesy of Nano Banana Pro. Past models made photorealistic people look airbrushed and glossy; we've crossed the uncanny valley completely.

Riley Goodside. X

Taiwan's central bank plays a heavy-handed role in devaluing its currency to subsidise exports. Good news for Joey!

Critics say the central bank prioritises export growth with single-minded fervour, an approach which harms the country in several ways. First, keeping the currency weak subsidies exporters at the expense of importers. In Taiwan, where the vast majority of both food and fuel (for vehicles and power plants) is imported, this acts as a transfer from poor households to the owners and employees of exporting firms. Taiwanese workers have good reason to feel aggrieved. Labour productivity has doubled since 1998, yet unlike in most rich countries or even in wage-suppressed China, pay has not risen in tandem. Taiwanese unit labor costs, a measure of what workers earn per unit of output, have fallen by 25% over the same period.

A Taiwanese Big Mac, it turns out, costs 56% less than an American one. America is a fraction wealthier than Taiwan, but that affects things only on the margins. Adjusting for this, we calculate that the Taiwan dollar is 55% undervalued, the most of all 53 currencies we track.

Formosan flu. Economist

I just started reading Gregory Clark's The Son Also Rises. He tracks social mobility using the relative prevalence of surnames in elite institutions. The rates he calculates are lower than most other estimates.

Thus the representation of surnames among both attorneys and physicians in Sweden suggests a similar pattern: social mobility in Sweden is much slower than conventional estimates suggest, even for very recent generations. A second surprising finding from the surname distribution of Swedish physicians is that not only are true social mobility rates slower than conventionally estimated, but they are no faster now than they were in the early twentieth century. The enlargement of the political franchise and the institutions of the extensive welfare state of modern Sweden, including free university education and maintenance subsidies to students, have done nothing to increase rates of social mobility.

The Son Also Rises. Princeton University Press

Online Master’s Programs are Worthless

2025-10-11T00:00:00Z

When I say “Online Master’s” I mean fully remote programs for professionals from accredited institutions. For example, JHU Engineering for Professionals, where I’ve taken courses. These programs provide value via

Instruction (materials and guidance provided by the professor)
Signaling (proxy for intelligence/conscientiousness – for career advancement)

The introduction of reasoning models represents a drastic decrease in the value of both factors. First, equivalent instruction is possible via the combination of publicly available syllabi and recorded lectures in concert with LLM tutors.

	Online MS	Internet	Reasoning Model
Syllabus/Homework	X	X
Lectures	X	X
Tutoring/Feedback	X		X

Second, the elephant in the room, reasoning models can get As in most classes. I’ve tested these models on physics, math, and CS problems and suspect this is the case for all STEM subjects. If a student wishes to do so, they can get an A with almost no effort and without learning anything. If they haven’t already, employers and PhD admissions committees will soon realize that online Master’s degrees are only worth the student’s word.

I predict online Master’s programs will lose 90% of their value in the next year as LLM tutors improve and employers wake up. Assuming these programs won’t suddenly fall 90% in price, I expect enrollment to fall.

What next?

I won’t be taking any more classes at JHU. I’m toying with the idea of finding a group of coworkers or friends who want to find a textbook or recorded online course to study together. If I were in charge of my company’s HR department, I would stop compensating employees for online coursework. I would also stop counting online Master’s programs awarded after 2025 as “years of experience” when calculating salary scales. Finally, if I were a director making hiring decisions, I would weigh online Master’s coursework equally to a prospective hire claiming to have self-taught a course via a recorded lecture series and LLM tutor.

Given the following observations:

What previously costed $2,500 (assuming instruction is 1/2 the value of these courses) can now be offered for the price of an LLM subscription.
Demand for post-graduate education will increase as students drop out of (or never begin) these programs

There’s an opportunity here for a startup (or open source project). I’m imagining a service that organizes publicly released lectures and syllabi and matches students with others in their city to form study groups. What remains is recovering the lost signaling value. Perhaps peer reviews or in-person final exams at testing centers.

October 2025 Roundup

2025-10-01T00:00:00Z

The De Beers diamond cartel ran an advertising campaign to dissuade people from selling their diamonds

"It is conservatively estimated that the public holds more than 500 million carats of gem diamonds, which is more than fifty times the number of gem diamonds produced by the diamond cartel in any given year. Since the quantity of diamonds needed for engagement rings and other jewelry each year is satisfied by the production from the world's mines, this half-billion-carat supply of diamonds must be prevented from ever being put on the market. The moment a significant portion of the public begins selling diamonds from this inventory, the price of diamonds cannot be sustained. For the diamond invention to survive, the public must be inhibited from ever parting with its diamonds."

Have You Ever Tried to Sell a Diamond? The Atlantic

Takeaways from Acquired's third part on Google

In 2000, 15% of Google's data center infrastructure was dedicated to their language model PHIL, used for search and AdSense
Google has, in total, roughly as many TPUs as NVIDIA ships GPUs per year
Google could have bought Tesla for 6 billion in 2013
Zuckerberg tried to buy DeepMind for $800 million in 2014, twice what Google ultimately paid
The AlphaGo documentary recommended by Ben is worth watching

Google: The AI Company. Acquired

The manufacture of paper grocery bags is loosely 7.9x worse for the environment than plastic bags

Environmental impacts of different types of grocery bags. Our World in Data

The fastest buildout of nuclear reactors took place in France in the 70s and 80s

“During the 1980s, France increased the number of reactors in commercial operation from 15 to 55. Even China, with the world’s most streamlined regulatory process and developed industrial base, has failed to match this record”

“In the 1970s, France was building nuclear reactors at one-third to a half of the pre-interest costs that new reactors in the US had risen to amid toughening environmental and safety regulations.”

Liberte, egalite, radioactivite. Works in Progress

Paul Erdos' mathematical output heavily relied on stimulants

“Erdos first did mathematics at the age of three, but for the last twenty-five years of his life, since the death of his mother, he put in nineteen-hour days, keeping himself fortified with 10 to 20 milligrams of Benzedrine or Ritalin, strong espresso, and caffeine tablets. 'A mathematician,' Erdös was fond of saying, 'is a machine for turning coffee into theorems.'"

On Talent. Felix Stocker

In relative terms, data center freshwater consumption is unconcerning

0.19% of America's freshwater consumption in 2023 went towards powering and cooling data centers. For reference, that is 10% of the freshwater lost due to indoor household leaks. If forecasts are correct and data center electricity usage triples by 2030, their additional water consumption will be equivalent to the US producing 5% more steel. At the personal level, each day the average American consumes (indirectly) the equivalent of 800,000 chatbot prompts in water. When putting together this roundup, I realized that I've actually met Andy before. He's the director of EA DC.

The AI Water Issue is Fake. Andy Masley

Takeaways from Andrej Karpathy on Dwarkesh Podcast

Andrej believes that LLM's encyclopedic knowledge is holding them back. He believes removing much of this encyclopedic knowledge, leaving behind a "cognitive core" will improve performance. Tool use (e.g., connecting to Google) would make up for lost factual knowledge.
Andrej discounts the possibility of a discrete intelligence explosion. He views the impacts of AI as an extension of ongoing computer automation.
He's starting an AI education company. Incidentally, I wrote a blog post making the case for something exactly like this.

Andrej Karpathy — AGI is still a decade away

The simplest way to induce a psychotic break in an LLM may be to ask it for the seahorse emoji

Multi-scale emergence can be quantified using the entropy of a system's transition probability matrices

A very interesting non-expert summary of a paper published this month (I haven't had the chance to read the technical paper). At a high level, you can describe any system as a Markov Chain and represent its different scales as groupings of its events. Then, take the transition probability matrix of each grouping and score them using an entropy-based measure to quantify their "irreducible causal contribution".

I Figured Out How to Engineer Emergence. Eric Hoel

Coal-fired plants cause 5x more excess deaths per year than total excess deaths from Chernobyl

23k excess deaths per year in the U.S. from coal plants, 4k total excess deaths caused by Chernobyl

Mortality risk from United States coal electricity generation. Science
Deaths due to the Chernobyl disaster. Wikipedia
Power to Save the World. Gwyneth Cravens

The US and Canada accidentally kick-started India's nuclear weapons program

The US and Canada provided India with a heavy-water research reactor through the Atoms for Peace program. India then built a reprocessing facility and repurposed the reactor to extract Plutonium-239 for its first nuclear test. India argued that its test was a "Peaceful Nuclear Explosion" and did not violate the stipulation of the deal that the reactor should only be used for peaceful purposes.

The U.S., Canada, and the Indian Nuclear Program, 1968-1974. National Security Archive
Power to Save the World. Gwyneth Cravens

Pew Research survey finds increasing distaste for sports betting across America

10% more Americans see sports betting as bad for society than in 2022. This trend is growing across every demographic. 43% say legal sports betting is bad for society, while only 7% say it is good. An equal percentage of college and non-college graduates bet on sports (22%). Sports betting is more common among young people and those with higher incomes.

Americans increasingly see legal sports betting as a bad thing for society and sports. Pew Research

September 2025 Roundup

2025-09-01T00:00:00Z

Gemini has met Larry Page’s 2000 definition of artificial intelligence

“Artificial intelligence would be the ultimate version of Google. If we had the ultimate search engine it would understand everything on the web it would understand exactly what you wanted and it would give you the right thing. That’s obviously artificial intelligence; be able to answer any question basically because almost everything is on the web right and so we’re nowhere near doing that now. However we can get incrementally closer to that and that’s basically what we work on and that’s tremendously interesting from an intellectual standpoint … so I expect to be doing that for a while.”

Larry Page and Sergey Brin interview on Starting Google (2000)

At the time of Gmail’s launch in 2004, Bill Gates couldn’t imagine needing more than 1G of email storage

“How could you need more than a gig? What’ve you got in there? Movies? Power-Point presentations?”

Gemini 2.5 Pro can one-shot linear programming word problems

Given a linear programming problem with four constraints, Gemini correctly translates the constraints to a Python program using scipy’s linprog library. Two years ago, hand-massaging LLMs to do named entity recognition and formulation of constraints was an active area of research. NL4OPT was a competition to do just this. I came across this while looking into creating my own NL4OPT tool. I posed the problem to Gemini expecting it to fail. Needless to say, I was surprised. Tool use is the past, present, and future. Its interface is Python.

Above-market government salaries can compromise economic productivity

“In many countries, public employees enjoy considerable job security and generous compensation schemes; as a result, many talented workers choose to work for the public sector, which deprives the private sector of productive potential employees. This, in turn, reduces firms’ incentives to create jobs, increases unemployment, and lowers GDP…. [Calibrating the model to Greece] we find that a 10% drop in public sector wages results in a 3.8% increase in private sector’s productivity, a 7.3% drop in unemployment, and a 1.3% increase in GDP.”

The Unintended Consequences of Meritocratic Government Hiring — Geromichalos, Kospentaris
I previously had a vague conception that part of Singapore’s success could be attributed to its high-paid bureaucrats. I’ll have to revisit this. What’s missing in this analysis is the flip side: does having brilliant bureaucrats afford a point of GDP in efficiency? Related, I read an interesting Substack post about Singaporean culture and why they don’t have entrepreneurs. It mentions the loss of talent to government. See also India, Greece, Brazil: How High Government Pay Wastes Talent and Drains Productivity — Marginal Revolution

A larger fraction of Europeans die from preventable heat death than the fraction of Americans who die from firearms

“Most of this death is preventable. The technology that prevents it is air conditioning. Barreca et al. (2016) find that heat deaths in America declined by about 75% after 1960, and that ‘the diffusion of residential air conditioning explains essentially the entire decline in hot day–related fatalities’”

“Europe’s crusade against air conditioning is insane” — Noahpinion.

GPT 5 created this slop website. GPT-5: It Just Does Stuff — Ethan Mollick.

“Conversations with Tyler” is my new favorite interview style podcast

Lex is out of his element in economics/politics. Dwarkesh is energetic, but asks too many niche, drawn-out questions (cynically, to show off). Tyler asks sharp, one sentence questions, guiding the conversation while letting the guest talk. In his episode with John Arnold, Tyler brought up one interesting question about solar: How do we prepare for volcanic events occurring every ~150 years? This is too long a horizon for the market to solve. How can the government ensure some base load of natural gas or nuclear is always available over 100s of years? Tyler also sees NIMBYism as nuclear’s greatest obstacle.

Aircraft carriers may already be a relic

“Similarly, anybody who’s read even news articles about hypersonic weapons should decide that buying more aircraft carriers is not a good thing. But we do need some of those resources shifted to this new defense ecosystem that’s very experimental, that’s building swarming weapons.” — Christopher Kirchhoff

It seems he’s saying “if you’ve read the classified briefs it’d be even more obvious”
Christopher Kirchhoff on Military Innovation and the Future of War (Ep. 225) — Conversations With Tyler

️️10% of 25 to 54-year-old men are not seeking employment

Age-graded classrooms are the worst form of schooling … except for all the others

Age-graded classrooms work because they optimize student motivation. Peer pressure is a great motivator; so while often inefficient, there is a reason lock-stepping students onto the same track as others has become the standard. The author estimates that around 5% of students are intrinsically motivated enough to teach themselves (he calls them no-structure learners). This is why considering selection bias is so important. I was recently excited to learn about Math Academy. I fear now their results may boil down to selection bias.

“Here’s something you have to remember. It’s easy to cherry-pick in education. If you want to start a school to prove that penguin-based learning is the future, that penguin meditation and penguin-themed classrooms are superior to the stuffy, traditional, obsolete schools we have now, you can. It’s simple. Find a way to only accept no-structure and very low-structure learners. Then start your school. Do your penguin meditation, make sure there’s a basic structure for learning core academic skills, and you’re set. The results will be great, you can publish articles about the success of your method, if you’re lucky you’ll get some of that sweet sweet philanthropy money”

“Your Review: School” — Astral Codex Ten

A great comparison of US renewable energy supply

Renewable Energy Technical Potential and Supply Curves for the Contiguous United States: 2024 Edition
Note the y-axes are not the same. For solar (B) this is the nameplate capacity. It is common to assume a capacity factor (% of time the sun is shining) of 20% and therefore 5x the required wattage is required to be built in nameplate capacity (along with batteries, transmission, etc.). For reference, the US total primary energy consumption is 94 quadrillion Btu, equal to 3,100 GW on average. I’ve placed vertical lines at this point. Eyeballing this, solar and land-based wind look good while offshore wind is a no-go. I don’t know enough about the types of geothermal to comment there.

Reflections on Two Years Of Software Engineering

2025-08-27T00:00:00Z

This advice is general, but probably biased by my background with legacy Java applications.

First, work through MIT’s “The Missing Semester of Your CS Education”

Useful concepts

Chesterton’s Fence: A hiker comes across a small wooden fence on a beautiful hillside which anyone could step over. He removes it to restore the place’s beauty; it wouldn’t stop trespassers anyway. Once he has left, over the next months the cows from the neighboring farm eat all its grass and trample it into a muddy slope. When building software, if you encounter something that seems poorly designed, figure out why it was designed that way before refactoring or removing it.
Pareto Principle (power law): Most real-world relationships are non-linear. E.g., 80% of revenue comes from 20% of customers.

One example of the Pareto Principle is methods which speed up typing:

Touch typing
Vim key bindings
Dvoark layout
Ergonomic keyboard

Touch typing will get you 80% of the way to a “pro” typist while Vim keybindings will get you another 15%. Changing keyboard layouts could help a little beyond that. Although touch typing provides 5x more value than Vim keybindings, it’s about the same difficulty to learn. My suggestion: if you can’t already touch type, learn now. If you can, learn vim keybindings. Don’t worry about keyboard layouts, taking “The Missing Semester”, reading books about software development, and networking are all better uses of your time.

Minimize configuration. “Every line of configuration is a liability” - ThePrimagen. It’s tempting to spend a lot of time configuring your IDE, terminal emulator, and shell. First, remember Chesterton’s Fence – the creators of dev tools designed their tool’s features and configuration intentionally (often for their own use!). Another drawback is that you will inevitably need to switch computers, help colleagues, or ssh into a server. These are all more difficult if you can’t use default configurations. My suggestion: when starting with a new tool, use its vanilla configuration for the first month (or more). Learn its core features before configuring settings or installing plugins.

Learn the command line versions of tools

They are more feature-rich
They can be scripted
Amortize costs of learning the CLI (terminal emulator, shell) and supporting tools (e.g., tmux, man) across multiple applications.

Try the simple/open source tool first. More often than not it has all the functionality you’ll need and is easier to learn. If you discover a functionality it doesn’t have, you’ll be more prepared to choose a paid offering. For example, VisualVM has suited all my Java profiling needs so far.

I read a few books on software engineering. It was a good use of my time. Ranked (power law applies here):

A Philosophy of Software Design – John Ousterhout
Code that Fits in Your Head – Mark Seemann
Working Effectively with Legacy Code – Michael Feathers
Design Patterns – Gamma, Helm, Johnson, Vlissides
Refactoring for Software Design Smells – Suryanarayana, Samarthyam, Sharma
Clean Code – Robert Martin

And finally, a few tips I try to follow (but beware pithy rules like these)

Make simplicity the number one priority
Use pure methods and immutable data whenever possible
Favor composition over inheritance
Favor strong typing

Notes on Grad-B and Curvature Drifts in Tokamaks

2025-07-13T00:00:00Z

Motivation

If held in a plasma of high enough temperature and density, light ions will eventually fuse. Magnetic confinement is an approach to controlled fusion which achieves these conditions using the magnetic force. This force acts perpendicular to ions’ velocity and the magnetic field, causing them to travel in helices around field lines. Tokamaks bend these field lines into a torus, trapping ions like beads on a bracelet. Once this trap is set, ions are injected and heated with electromagnetic radiation to achieve fusion conditions. Unfortunately, the centers of ion orbits drift away from their original field lines causing them to escape confinement. The addition of a poloidal (short way around the torus) component to the magnetic field negates the effect of this drift. This post derives a representative magnetic field using a toroidal/poloidal coordinate system and motivates the additional poloidal component by simulating the trajectory of a single particle.

Toroidal Coordinate System

It’s much simpler to formulate a tokamak’s magnetic field using a toroidal/poloidal coordinate system because its geometry mirrors the problem’s, encapsulating its complexity. This system describes locations relative to a “central circle” of radius $R_0$ using three coordinates: $r$ , $\theta$ , and $\zeta$ . $\zeta$ measures the toroidal angle (long way around), $\theta$ the poloidal, and $r$ the distance from the central circle.

Figure 1. Toroidal Coordinate System Top View

Figure 2. Toroidal Coordinate System Side View

The translation to and from Cartesian coordinates is given by $\begin{aligned} &x = (R_0+r\cos\theta)\cos\zeta\quad&r = \sqrt{(\sqrt{x^2+y^2}-R_0)^2+z^2}\\ &y = (R_0+r\cos\theta)\sin\zeta\quad&\theta = \arctan2(z, \sqrt{x^2+y^2}-R_0)\\ &z = r\sin\theta\quad&\zeta = \arctan2(y, x)\\ \end{aligned}$

Where $\arctan2(y, x)$ is the C/Python function which returns the angle between the positive x-axis and the point $(x, y)$ in the plane. Notice that $\sqrt{x^2+y^2}-R_0$ is the signed distance to the central circle in the xy-plane where a negative value indicates the point is within the circle.

The unit vectors in this system (blue vectors in the figures), expressed in Cartesian coordinates, are $\hat{r}= \left( \begin{array}{c} \cos\theta \cos \zeta \\ \cos \theta \sin \zeta \\ \sin \theta \\ \end{array} \right),\; \hat{\theta}= \left( \begin{array}{c} -\sin\theta\cos\zeta \\ -\sin\theta\sin\zeta \\ \cos\theta \\ \end{array} \right),\; \hat{\zeta}= \left( \begin{array}{c} -\sin\zeta \\ \cos\zeta \\ 0 \\ \end{array} \right)$

Therefore, to translate a field vector with base ( $r$ , $\theta$ , $\zeta$ ) and components ( $a$ , $b$ , $c$ ) to Cartesian coordinates, the base can be found using the translation above, and the vector using the scaled unit vectors: $\begin{aligned} \mathbf{p}&=a\hat{r}+b\hat{\theta}+c\hat{\zeta}\\ \mathbf{p}&=(a\cos\theta\cos\zeta-b\sin\theta\cos\zeta-c\sin\zeta)\hat{x}\\ &+(a\cos\theta\sin\zeta-b\sin\theta\sin\zeta+c\cos\zeta)\hat{y}\\ &+(a\sin\theta+b\cos\theta)\hat{z} \end{aligned}$

Note On Handedness

Notes on Bayes’ rule

2025-05-14T00:00:00Z

Consider a rectangular dartboard with two circles $A$ and $B$ drawn on it. A monkey is throwing darts at the board, and they fall randomly within it. The probability that the dart falls within the circle labeled $A$ is its area divided by the area of the rectangle, $20/100=0.2$ . Now imagine you’re facing the other way and the money throws the dart. The bartender tells you it landed in $B$ . What is the probability it also landed in $A$ (i.e., in the purple intersection)? It’s hopefully intuitive this is the area of the purple section divided by the red section, $4/12=0.33$ .

Bayes’ rule formalizes this: $P(A|B)=\frac{P(B|A)P(A)}{P(B)}\tag{1}$

Where $P(A)$ is the probability of event $A$ occurring and $P(A|B)$ is the probability of $A$ occurring given that $B$ occurred. Plugging in the probabilities from the dartboard example, we get the result we expect: $P(A|B)=\frac{(4/20)\times (20/100)}{(12/100)}=4/12=0.33\tag{2}$

One way to derive Bayes rule is by starting with the definition of joint probability: $P(A\cap B)=P(A|B)P(B)=P(B|A)P(A)\tag{3}$

If you aren’t convinced of this fact, consider the dartboard again. You can think of $P(A|B)$ as what % of the red circle is covered by the purple section. Or, in other words, the ratio of the area of the purple section to that of the red section. If we then multiply that by the area of the red section, we’re left with the area of the purple section. You can use the same logic to convince yourself of the third term in (3). Finding Bayes’ rule is as simple as dividing (3) by $P(B)$ .

Let’s take a step back, and think more about (1). $P(A|B)$ is a probability and therefore bounded between 0 and 1. How does this fraction ensure this is the case? It’s clear that it won’t be negative as the three components are all positive. But if $P(B)\to 0$ , why wouldn’t this fraction grow to infinity? The reason is that $P(B|A)P(A)=P(A\cap B)$ which is strictly less than or equal to $P(B)$ .

In (1), $P(A)$ is called the prior and $P(A|B)$ the posterior since $P(A)$ was our belief about the probability of event $A$ occurring before observing the event $B$ , and $P(A|B)$ is our belief about its probability after (prior to) observing $B$ .

Resources

An Introduction to Bayesian Thinking: Free online textbook from a Coursera course. Good for someone who has taken an undergraduate stats course

Notes on Likelihood

2025-04-16T00:00:00Z

Take a Gaussian probability distribution $f(x; \mu, \sigma)=\frac{1}{\sigma\sqrt{2\pi}}\exp(-\frac{1}{2}(\frac{x-\mu}{\sigma})^2)$

This is a function of

x

and is parameterized by

\mu

and

\sigma

. If we instead consider the expression on the right as a function of

\mu

and

\sigma

, parameterized by

x

, we have the likelihood function

L(\mu, \sigma; x)=\frac{1}{\sigma\sqrt{2\pi}}\exp(-\frac{1}{2}(\frac{x-\mu}{\sigma})^2)

This is the definition of the likelihood function.

I’ll emphasize that the expression on the right hasn’t changed. A few things fall out of this 1) Previously the integral of $f$ (holding $\mu$ , $\sigma$ constant) over $x$ was 1. Now if we integrate $L$ over $\mu$ and $\sigma$ (holding x constant) there is no reason to expect the integral is still 1. 2) $L$ evaluated at $(\mu=a,\sigma=b;x=5)$ is a likelihood. If $f$ happened to be parameterized by $\mu=a,\sigma=b$ and evaluated at $x=5$ the result would be the same (since the expression is the same). Therefore, the value of $f$ evaluated at $x=5$ is the likelihood of its parameters parameterized by $x=5$ . That’s a long-winded way to say $f(x)$ is a likelihood. (Remember it is the integral of $f$ which is a probability: $\int_a^bfdx$ ).

So what does it mean? Look at the line on this surface for $\mu=5.5$ . Notice that near $\sigma=0$ it has a low likelihood. As $\sigma$ increases, its likelihood increases for a bit and then declines again. Consider three points along that line ( $\mu$ , $\sigma$ ) = (5.5, 0.2), (5.5, 1), (5.5, 2). How did we calculate the value of $L$ at each of these points? We just plugged in these $(\mu, \sigma)$ pairs along with $x=5$ into the expression above. Let’s visualize our doing that, but letting x vary again (notice these curves are PDFs now).

The value of our likelihood surface for each combination $(\mu, \sigma)$ is the intersection of the red line with each of these PDFs. Orange performs the best. Even if we centered the green PDF on $x=5$ it would still not have as high a value as the orange curve. So we can think about the values of the likelihood function like this: “how well does this pair of parameters $(\mu,\sigma)$ fit the data point $x=5$ ?”

That brings us to Wikipedia’s definition: “A likelihood function (often simply called the likelihood) measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model.” Hopefully this is clear now.

One last thing. Wikipedia’s article continues “It is constructed from the joint probability distribution of the random variable that (presumably) generated the observations”. When I said earlier that the definition of the Likelihood was simply the PDF with its parameters considered as variables that was only partially true. It is actually the joint PDF. In this case we just considered a “joint” PDF with only one part. $L(\mu, \sigma; x_1,x_2,...x_n)\triangleq \prod_{i=1}^nf(x_i;\mu,\sigma)=\prod_{i=1}^n\frac{1}{\sigma\sqrt{2\pi}}\exp(-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2)$ The likelihood function is useful because it helps us guess the parameters of a distribution given some samples from it. Say we have IID samples $x_1,x_2,\cdots,x_n$ , from a known distribution with unknown parameters, and we want to find the parameters which maximize the probability of having drawn this sample. The parameters which maximize the likelihood function are those parameters! If this isn’t clear, think back to the one dimensional case and the previous figure.

Here is a simple example of this process (called a maximum likelihood estimate) for three data points assumed to be drawn from a Gaussian distribution

Beware of Pithy Rules About Software

2025-01-05T00:00:00Z

The root cause of a lot of bad code is the overuse of common rules (e.g., DRY, YAGNI, “keep methods under X lines”). These are well-intentioned and distill years of programming expertise. However, even in a world where they’re right 90% of the time, the 10% of times they’re wrong translates to technical debt which compounds quickly. This isn’t to say their creators are at fault; These rules are often introduced with disclaimers about overuse. The problem is that these catchy sayings don’t have room for disclaimers. After learning a new rule, new developers may only remember the acronym and not any of its nuances. Senior developers aren’t innocent either; During code review, it’s easier to cite a rule than explain the underlying issue with the code. Teachers don’t want to overwhelm students, so they may give a blanket statement like “don’t use static methods”. There is a direct relationship between the length of a piece of advice and its value. Longer answers can provide context, edge cases, and describe more sophisticated concepts. It shouldn’t be a surprise that short, general statements can lead us astray.

“Methods should be shorter than X lines”

The professor of my software design course at Georgia Tech suggested methods should be under five lines, and Clean Code claims “The first rule of functions is that they should be small. The second rule of functions is that they should be smaller than that.” In my mind, this rule is just a heuristic to remind us that methods should generally only have a single “concept”. I argue that because the former is easier to remember it gets used (and abused) more frequently. See It’s probably time to stop recommending Clean Code for an example of how this can go too far.

“Don’t repeat yourself”

In my first internship, I inherited a thousand-line R script with code literally copy and pasted half a dozen times with different hard-coded parameters. DRY is a powerful concept when you first learn it, and I’ve met a number of data scientists who could have learned it sooner. However, entry level software engineers understand copy and pasting code is a red flag. However, it can cause more harm than good in the form of premature optimization and over-abstraction.

“Comments are a failure to express yourself through code”

Early on at my first job, I came across this saying in Clean Code. It seemed reasonable, and my team already used comments sparingly, so I accepted it at face value. Over my first year, I wrote nearly no comments. I went so far as to leave PR comments to explain confusing bits because I internalized that putting it in the code was a failure. I’ve since grown a much more favorable outlook on comments. The ultimate irony is that, upon returning to the chapter on comments in Clean Code, I agree almost entirely with it. But because I only remembered this one saying, my outlook on the topic was quite warped.