Because The Air Is Free -- Pablo Duboue's Blog


The pressing need for LLM error characterization (instead of just LLM proliferation)

Mon, 19 Aug 2024 01:18:08 -0700

Tags: technical


For the last year and a half, the focus in AI has been on LLMs, as they succeed at tasks that are hard for humans. When a human can accomplish some of the feats that LLMs can, it signals a level of proficiency that usually translates to many other tasks.

Proficiency in a task goes hand in hand with the type of errors committed when performing it. In a past life, I was a late cofounder of an elementary math education startup. We worked on understanding the erroneous logic behind children's miscalculations in, say, fraction addition. With this understanding we could then explain to the students the error in their logic and steer them towards the correct solution. This was possible because humans err in similar ways (unless a kid is an unusually creative failure innovator). The way an intelligence fails constitutes its error profile.
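For instance, the single most common fraction-addition misconception can be captured in a few lines of Python (a toy sketch of my own for this post, not our actual diagnostic code):

```python
# Toy sketch of a very human error profile in fraction addition:
# adding numerators and denominators separately, so 1/2 + 1/3 = 2/5.
from fractions import Fraction

def buggy_add(a_num, a_den, b_num, b_den):
    """The typical student misconception: add tops and bottoms."""
    return a_num + b_num, a_den + b_den

correct = Fraction(1, 2) + Fraction(1, 3)        # 5/6
wrong_num, wrong_den = buggy_add(1, 2, 1, 3)     # (2, 5)
print(f"correct: {correct}, misconception: {wrong_num}/{wrong_den}")
```

Because the error is systematic, a tutor can recognize it from a single wrong answer and address the underlying logic instead of just marking the answer as incorrect.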

While LLMs show human-comparable proficiency in some tasks, their error profile is very non-human. I have argued in an earlier post that the most exciting and novel insight I gained while building the Jeopardy! Watson system was peeking into its non-human error profile. The same can be said of my work with LLMs so far. It is exciting to live in a time when non-human intelligences (as primitive as these machines are) start to appear.

Aside from the excitement, this non-human error profile means our intuitions about LLM behaviour are uncalibrated. There are clear engineering drawbacks to this lack of understanding of LLM error profiles: it makes it very hard to use them effectively. Engineering specs for IC components include behaviour diagrams; a tool is useful not only when it succeeds but also when its problems can be kept under control.

Yet building a comprehensive error profile for LLMs remains elusive (but not impossible; I cite my favourite paper on the topic at the end). Their parameter space is gargantuan. They are very prone to butterfly effects, where small changes in the wording (or just in the temperature parameter) turn into completely different outputs. And when the outputs are wrong, they defy human logic.
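As a minimal illustration of that sensitivity (a sketch assuming the Hugging Face transformers library, with GPT-2 standing in for any larger model), the same prompt sampled at two temperatures can produce wildly different completions:

```python
# Sketch of the "butterfly effect": identical prompt and seed, different
# temperature, very different completions.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
prompt = "The main cause of the error was"

for temperature in (0.2, 1.2):
    set_seed(42)  # same seed, so only the temperature differs
    out = generator(prompt, max_new_tokens=20, do_sample=True,
                    temperature=temperature, num_return_sequences=3)
    print(f"temperature={temperature}")
    for o in out:
        print("  ", o["generated_text"].replace("\n", " "))
```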

This problem is exacerbated by the fact that LLMs themselves are a moving target. The error profiles of, say, different versions of GPT-4 are different enough that people claim "it's getting worse" (maybe it is not getting worse, maybe it is getting different enough that it leads to worse outcomes in some tasks, enough for some people to notice and complain). This is compounded by vendors retiring LLMs very quickly, making investments in understanding and containing particular error profiles moot. Instead, the narrative is to move to a newer offering that will have less intrinsic error. This game of three-card monte goes against good engineering practices but makes short-term marketing sense: the product cannot be faulted because it is always "new and improved" (customers will be patient with that game for only a very limited time).

To wrap up, in their article in Communications of the ACM (June 2011), "Computer Science Can Use More Science", Profs. Morrison and Snodgrass argued that:

Software developers should use empirical methods to analyze their designs to predict how working systems will behave.

and that

Just because we design and build computational systems does not mean we understand them; special care and much more work is needed to correctly characterize computation as a scientific endeavor.

Characterization of the error profile of LLMs is scientifically possible. For a great example, take a look at the TGEA paper: "An Error-Annotated Dataset and Benchmark Tasks for Text Generation from Pre-trained Language Models". The researchers generated more than 40,000 texts using GPT-2 and then annotated them by hand with error categories derived from analysing the outputs together with linguists. I am looking forward to seeing more work on characterizing error profiles so we can properly use LLMs as robust building blocks in Gen AI applications.
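A minimal sketch of that generate-then-annotate workflow might look as follows (this assumes the Hugging Face transformers library; the prompts, error categories, and record layout are my own placeholders, not those of the TGEA paper):

```python
# Sketch: generate model outputs to be annotated by hand with error
# categories, then tally the annotations into an error profile.
# Prompts and category names are illustrative placeholders.
from collections import Counter
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompts = ["The capital of France is", "Adding two fractions requires"]

samples = []
for prompt in prompts:
    out = generator(prompt, max_new_tokens=40, do_sample=True)
    samples.append({"prompt": prompt,
                    "text": out[0]["generated_text"],
                    "error": None})  # to be filled in by a human annotator

# After manual annotation (e.g. "factual", "incoherent", "repetition",
# "none"), the error profile is simply a tally of the categories.
profile = Counter(s["error"] or "unannotated" for s in samples)
print(profile)
```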

TL;DR: less LLM hype, more science will lead to better engineering.

How to age in technical roles as an individual contributor

Mon, 19 Feb 2024 09:35:15 -0800

Tags: personal, technical, work


A natural progression for people in technical roles is to grow into management roles. As organizations seek to better compensate senior personnel, promoting them into management enables a multiplier effect through the work of their reports. And, as the programmer base doubles every five years, managers are always in demand. Other paths individual contributors (ICs) evolve into include technical sales (which is very lucrative), co-founder, and instructor.

The management and IC paths are not comparable. Managers do something different from ICs, which in itself is very hard and not necessarily that appreciated by technical contributors. Understanding the inner state of other humans is a task where experience helps a lot. Experience as a technical IC does not help that much directly in our blazingly fast world of technology.

Even if there is pressure to move into management or other roles, it is possible to stay technical and find satisfaction and fulfillment working as an IC. To me, it hinges on cherishing the process of creating great things together: loving what has been built and helping others (team members, final users). It comes from the concept of willingness to serve.

This is a rather long post with related thoughts on the matter, joining personal introspection with some pieces of advice. It is divided into three sections: technical skills, soft skills, and life lessons.

An insider critique of the Jeopardy! QA System

Wed, 27 Jul 2022 15:43:52 -0700

Tags: retrospective, business, technical


Last year was the 10-year anniversary of the IBM/Jeopardy! show and this year marks 12 years since I left the company and immigrated to Canada.

Now that the dust has clearly settled on the show, it is a good time to share my thoughts on the matter. The show gathered a lot of press at the time of airing and has experienced continuous academic backlash ever since. Here is a piece from the perspective of an individual contributor to the project.

This post echoes some of the criticism but highlights many positive lessons that might have been overlooked. Stick with it to the end if you want to get the full perspective.

On Ephemeral Software Development

Sat, 12 Feb 2022 16:58:26 -0800

Tags: philosophical, technical


Programming, to me, was about building something: a program or a system of programs that does something useful. It echoed the construction of physical artifacts. Engineered artifacts have a lifespan; they are constructed to last a certain amount of time: years, decades, centuries. But it is a construction activity.

Since I started programming, the number of programmers has doubled about five times. This reality, plus the philosophy of some companies to "move fast and break things" and the advent of Software-as-a-Service, has brought about a new state of affairs. It seems to me that the best way to intellectually comprehend what computer programming is these days is to consider it as some sort of performance. Drawing a parallel with art: before, we made a series of paintings, retouching or painting from scratch, but they were artifacts. Each version of a piece of software was an artifact. Continuing with the metaphor, these days programming seems closer to an art performance. There is rehearsing at the dev and QA desks, then the software is deployed to an audience that experiences it for a very short period of time before a new performance arrives. There is no artifact anymore. Just performances.

What this means day to day is that there is very little interest in supporting old software. Libraries are in a continuous state of flux and nothing runs unless it is continuously supported. Lack of support for old users is the new norm. Creation of new functionality is preferred over fixing things or supporting existing users. The software grows fast in an almost organic manner. Continuous rewriting is the permanent revolution of the software world.

Is this good or bad? I believe we have to live in the times as they are; it doesn't really matter whether things are better or worse if we cannot change them. Given my background, I'd rather work on software projects that have more of an artifact-construction mentality than a performative one. But the current reality implies that learning technologies and programming languages is an activity that needs to be exercised immediately. It is similar to growing produce: either you harvest it or it rots. This sense of urgency in the learn, code, deploy cycle might be the reason that Google-programming is such a successful strategy.

On intuitions, Latin translations and prior distributions

Tue, 31 Aug 2021 07:14:58 -0400

Tags: philosophical, political, academic


Back in my teens, I attended a 7-year secondary school ("high school"?) with a strong focus on the humanities. From the time I was 11 until I was 16, I had 5 hours a week of Latin instruction plus a similar amount of time spent on assignments doing translations at home. One of those years, the professor said "you can ask me anything, including why we are studying Latin". I duly obliged. He surprised us by saying that teaching us how to do Latin translations was the humanities equivalent of teaching us math: the way to teach structured thinking using words. Over time I started to see his point: translating languages with a strong case system was akin to solving a system of equations.

Fast forward a few years and I was finishing my undergraduate diploma thesis on Spanish parsing using Lexical Functional Grammars (LFG), where this whole "system of equations" really came to fruition. LFG poses a two-level process for parsing: the first level is a shallow process and the second is (of course) a system of equations. But that still didn't quite capture my experience translating never-ending diatribes against political adversaries. The equations were there, yes, but the solution was driven by intuitions about the roles and meanings of the words. Around that time, I went to graduate school.

At Columbia, I got into statistical NLP and the intuitions I mentioned became clear as probability distributions, particularly as priors over the attachment of different words (is this word modifying the verb or the noun? well, if you have a large corpus of parsed text, you might find it usually modifies nouns and seldom verbs, so chances are in this case it also modifies the noun). The beauty of doing it by hand is connecting your intuitions (freshly forming, in the case of my 12-year-old self) with those of somebody who has been dead for 20 centuries. Intuitions come from experience, and the experience being shared goes beyond the text itself. But I digress; the point here is that, as humans, we need intuitions to do the translation, and in statistical systems these are provided by statistics computed over the data.
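A toy sketch of what such a prior looks like in practice (the corpus and its counts are made up for illustration):

```python
# Toy sketch of an attachment prior: from a (hypothetical) parsed corpus,
# count how often each preposition attaches to a noun vs. a verb and use
# the relative frequencies as a prior for new, ambiguous sentences.
from collections import Counter, defaultdict

# Each record is (preposition, attachment site) from an imaginary treebank.
parsed_corpus = [
    ("with", "noun"), ("with", "noun"), ("with", "verb"),
    ("of", "noun"), ("of", "noun"), ("of", "noun"),
]

counts = defaultdict(Counter)
for prep, site in parsed_corpus:
    counts[prep][site] += 1

def attachment_prior(prep):
    """P(attachment site | preposition), estimated by relative frequency."""
    total = sum(counts[prep].values())
    return {site: n / total for site, n in counts[prep].items()}

print(attachment_prior("with"))  # roughly {'noun': 0.67, 'verb': 0.33}
```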

In his book "Thinking, Fast and Slow", Daniel Kahneman tells the story of firefighters trying to extinguish a fire in a kitchen, when the commander realizes something is wrong and gets everybody out before the floor collapses. Kahneman cites it as an example of intuition taking control. In our modern society, however, intuitions are somewhat shunned, because they are close to impossible to teach. "Do it until you develop enough intuitions to get better at it" does not sound like an actionable concept. In my book, I argue that feature engineering operations are intuition-driven (otherwise they get folded into the actual machine learning algorithm): intuitions both in the realm of general operations and hyper-specific intuitions regarding the domain and dataset being worked on. Some of the criticism the book has received resembles my earlier comment, and I feel the pain that teaching intuitions is hard.

It might all start with valuing intuitions more. North American society is not necessarily very keen on them. Latin Americans seem more connected to their intuitive selves in general, or at least "this feels right" can be considered a reasonable reason in my culture of origin. In the meantime, if you or your child are considering taking a Latin class, go for it. It teaches structured thinking and intuitions!

Theory-first vs. Reality-first people and the EM algorithm

Mon, 23 Aug 2021 00:50:34 -0700

Tags: philosophical, political, academic


As the years pass by, I am starting to realize that people's minds, when approaching a problem, seem to have a bias towards either theory-first or reality-first. Theory-first people are more mathematically driven and equate the properties of things with the things themselves. If the model is appealing or beautiful enough, they will fight reality to change it and bring it closer to the model. The manifestation of some religious, economic or political beliefs sometimes falls into this category.

On the other hand, we have people who are more observational and describe reality in all its infinite glory. While I see myself more within this camp, I'd say it might lead to some sort of navel-gazing. As the full reality is too complex for our limited cognition to understand, it is difficult to reach actionable conclusions from such a mare magnum of data.

This model-driven gross oversimplification of people's cognition is, however, quite useful: at the personal level to understand our own biases, at the interpersonal level to understand the biases of our collaborators, and even at the community level. I'd argue that Leo Breiman's "Statistical Modeling: The Two Cultures" hinges precisely on this point. (Now, you can be an empiricist who reaches actionable conclusions if you are not constrained by the limits of your own cognition and resort to computers to do the trick, but then the belief in the computers themselves is theory-driven, oops.)

But continuing with my musings mixing computational decision-making and politics, there is a very popular algorithm in statistics / machine learning that intermixes these two views: the expectation-maximization algorithm (EM for short). In this algorithm, we alternate between two steps, each improving the model being built. In the general case, EM allows us to solve equations with two sets of interlocking unknowns, and it does so by solving each set using the values for the other set from the previous step. This algorithm can be proven to find local maxima of the likelihood.

In the case of clustering values, however, an interpretation of EM is that the E step "believes" the model (the cluster centers computed so far, reclustering the data using them) and the M step "believes" the data (recomputing the cluster centers from the reclustered data). For the current discussion, the point is that without both types of mindsets, progress cannot be achieved. The theory-driven people will push for changes to reality, while the empiricists will force updates to the theory. The parallelism with EM might be far-fetched but I find it quite satisfying.
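A minimal sketch of that alternation on one-dimensional data (hard assignments, so really the k-means special case of EM) could look like this:

```python
# Minimal EM-style clustering sketch: the E step "believes" the current
# centers and reassigns the data; the M step "believes" the data and
# recomputes the centers.
import random

def em_cluster(data, k=2, iterations=10, seed=0):
    random.seed(seed)
    centers = random.sample(data, k)
    for _ in range(iterations):
        # E step: trust the model (current centers) to recluster the data.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # M step: trust the data to update the model (recompute centers).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
centers, clusters = em_cluster(data)
print(sorted(centers))  # two centers, roughly 1.0 and 5.0
```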
