Saturday reads

Man, what a week, eh? Not much to add for now.

Knowledge-Creating LLMs

The thesis is that it will soon become economically viable for large labs to keep the frontier knowledge they discover on their own gated behind an API, and perhaps not to release the LLMs capable of finding those solutions in the first place.

One part of this post that I particularly liked is the section about mapping problems to canonical problems, which are themselves "easily", or at least straightforwardly, solvable. I had thought about this before, and for a while I've known that what I really enjoy is exactly finding that mapping in a problem. I'm sad to realize that it will soon be automated, too.
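
Just to make the idea concrete for myself (a toy example of mine, not from the post): a lot of hard-looking problems collapse once you notice the canonical problem underneath, e.g. "does any pair in a list sum to a target?" mapping onto plain set membership.

```python
# Toy illustration of mapping a problem onto a canonical one:
# "does any pair of numbers sum to `target`?" reduces to set membership,
# turning an O(n^2) pair search into a single O(n) pass.

def has_pair_with_sum(numbers: list[int], target: int) -> bool:
    seen: set[int] = set()
    for x in numbers:
        if target - x in seen:  # the canonical question: have we seen the complement?
            return True
        seen.add(x)
    return False

assert has_pair_with_sum([3, 9, 14, 5], 17)      # 3 + 14
assert not has_pair_with_sum([1, 2, 4], 100)
```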

Fraud Investigation is Believing Your Lying Eyes

patio11 has commentary on fraud in Minnesota's child care industry. Money quote:

There is a genuine difference in the culture and epistemology of the financial industry versus the government of the United States here. In the financial industry, we keep blacklists and getting a second chance after obvious misbehavior is intentionally non-trivial. This runs against deeply felt values of civil servants. An accusation is not a conviction, and absent clear authority to impose consequences in a new program, an actor *convicted at enormous societal cost* emerges to a new program officer as tabula rasa, equal in moral worth to any randomly chosen citizen.

I will not argue that Mastercard has better moral intuitions than the Founding Fathers. I would, however, happily suggest that the government not assume that the Constitution contains emanating penumbras obligating it to be repeatedly taken advantage of by the same people in the same fashion. We are not forbidden object permanence.

LLMs could be, but shouldn't be compilers

Alperen Keles argues that LLMs, no matter how good, will never be compilers: a compiler's job is to turn a well-defined instruction into a deterministic-ish translation to a lower-level language, while LLMs/agents effectively make implementation and design decisions for us. He suggests dealing with this by moving to writing specs (which are higher level than code).
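
A toy sketch of what "the spec is the artifact, the code is generated" could look like (my own illustration, not Keles's): a property-based test pins down the required behavior while leaving the implementation, hand-written or LLM-written, free to vary.

```python
# A property-based spec (using the Hypothesis library) for a sort routine.
# The spec fixes *what* must hold; any implementation that satisfies it,
# whether written by a human or an LLM, is acceptable.
from collections import Counter
from hypothesis import given, strategies as st

def my_sort(xs: list[int]) -> list[int]:
    return sorted(xs)  # placeholder implementation; the spec below is the contract

@given(st.lists(st.integers()))
def test_sort_spec(xs):
    out = my_sort(xs)
    assert all(a <= b for a, b in zip(out, out[1:]))  # output is ordered
    assert Counter(out) == Counter(xs)                # output is a permutation of the input

test_sort_spec()  # Hypothesis runs the property over generated inputs
```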

How to stop being boring

Am I telling on myself by stopping to read this article? Basically, the author argues that what makes someone interesting is precisely what we try to hide in order to “blend into the background” (which is kind of a platitude put this way, but she adds more color to it), and she suggests actively spelling out what we’ve learned to hide over time and progressively being more open about it.

Quantifying infrastructure noise in agentic coding evals

An interesting account of agentic benchmarks, with unexpected (to me, someone who doesn’t work on this professionally) lines like: “We have observed anecdotally, for instance, that pass rates fluctuate with time of day, likely because API latency varies with traffic patterns and incidents.”
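
To see why that kind of noise matters, here’s a back-of-the-envelope sketch (mine, not from the post) of how wide the uncertainty on a reported pass rate already is when each task is run only once.

```python
# Rough uncertainty on a benchmark pass rate: with n tasks each run once,
# the standard error of the observed rate p is sqrt(p * (1 - p) / n).
import math

def pass_rate_ci(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    p = passed / total
    se = math.sqrt(p * (1 - p) / total)
    return p - z * se, p + z * se

# e.g. 140 / 200 tasks passing -> roughly a 64%-76% interval, easily wide
# enough to hide effects like the time-of-day fluctuations mentioned above.
print(pass_rate_ci(140, 200))
```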

I don’t particularly care about public benchmarks when picking my tools; I prefer to test them myself, and honestly I don’t feel that much of a difference in most cases when comparing models of a similar generation (5.3 codex vs opus 4.6 as of today, for example). Maybe that’s a skill issue on my part, but anyway.

Now, what is relevant to me (and possibly to everyone who uses these tools): these things are RL’d on environments, and thus their training depends on those environments more broadly than just the harness. Something like Claude Code working better on my Mac Mini M2 than my MacBook Pro M4, if the Mini is closer to Anthropic’s training environment and thus better suited to it. Maybe we’ll soon see models tailored to the environments they’re expected to run in? Claude Code t4g.large edition? Or, rather, scope these harnesses to a standard 2 vCPU + 8 GB RAM + 20 GB disk format and just spawn them virtually on the host? Who knows.
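
Purely as a sketch of that last idea (image name and numbers are mine, nothing any lab has announced): pinning the agent sandbox to a fixed resource profile is something plain Docker flags can already do today.

```python
# Sketch: launch an agent sandbox pinned to a fixed "standard" profile
# (2 vCPU, 8 GB RAM) so local runs resemble a hypothetical training environment.
# The image name and mount layout are hypothetical placeholders.
import subprocess

STANDARD_PROFILE = ["--cpus=2", "--memory=8g"]  # disk quotas depend on the storage driver

def run_agent_sandbox(image: str = "agent-harness:standard", workdir: str = "/work") -> None:
    subprocess.run(
        ["docker", "run", "--rm", *STANDARD_PROFILE,
         "-v", f"{workdir}:/repo", image],
        check=True,
    )

if __name__ == "__main__":
    run_agent_sandbox()
```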
