The Transformed: 000 Attention
What if the transformer architecture had been invented in the 1960s instead of published in 2017¹?
I am about to launch a series of short science fiction stories based on the conceit that the LLM transformer architecture was discovered in the 1960s instead of the 21st century. In the world of The Transformed we can explore this imaginary timeline together. What would “history” have been like if a Not-So-Large Language Model had actually been built then, on the computing capabilities of that time, and what implications for progress (or otherwise) might have resulted?
If you would like to receive each of the stories as they drop, please subscribe and they will arrive directly in your inbox.
Update: The first episode, Defection, is out now; you can read it here.
About “The Transformed”
© Bernard McCarty, 2025
Imagine if the transformer architecture (the breakthrough that powers today's large language models) had been invented in the 1960s instead of published in 2017, and implemented with the available technology of that time: transistors, magnetic core memory, and room-sized computers. Picture a massive Manhattan Project-style collaboration between IBM, Bell Labs, and DARPA that successfully built the world's first artificial intelligence between 1967 and 1972. Now imagine that by 1973, this machine had become so central to American governance, economics, and social control that creativity itself was outsourced to its vast electronic brain.
This is the world of these stories: The Transformed - a timeline where human innovation stopped in 1972, locked forever into the paradigms that a brilliant but bounded artificial mind could optimise but never transcend. It's a world where the Berlin Wall stands not to keep East Germans in, but to keep desperate Westerners out. Where Steve Jobs, Steve Wozniak, and Bill Gates are dissidents, arrested before they can imagine personal computers. Punk never happened; arts, culture, women’s rights, civil rights, even science itself make no progress while Nixon’s sinister new organisation, The Artificial Intelligence Agency, stamps out all attempts at it. It is now 2025: the Vietnam War enters its fifty-seventh year, the most advanced thinking machine in history generates (and can only generate) increasingly unmaintainable FORTRAN and COBOL code, and Alan Kay never got to write Smalltalk.
Welcome to the Knowledge Cage, and to the underground resistance that dares to dream of breaking free.
So, here’s the seed…
00000000
Attention-Based Parallel Processing of Sequential Information Using Fast Fourier Transform Principles
Authors: James W. Cooley¹, John W. Tukey², Frank Wanlass³, Robert H. Dennard⁴
¹Bell Telephone Laboratories, Murray Hill, New Jersey
²Princeton University, Princeton, New Jersey
³Fairchild Semiconductor, Mountain View, California
⁴International Business Machines Corporation, Yorktown Heights, New York
Mathematics of Computation, Vol. 19, No. 91, 1965, pp. 302-318
Abstract
We present a novel computational architecture for processing sequential information based on principles derived from our recent work on the Fast Fourier Transform algorithm. This “attention mechanism” allows each element in a sequence to selectively focus on all other elements through weighted combinations, computed in parallel using transistor-based circuits. Unlike traditional sequential processing methods, our approach enables simultaneous consideration of long-range dependencies within data sequences.
The algorithm achieves O(n²) complexity for attention computation and O(n log n) for the transform operations, representing a significant improvement over existing sequential analysis methods. We demonstrate applications to language processing, signal analysis, and pattern recognition using the new IBM System/360 Model 75 computer equipped with extensive experimental MOSFET memory arrays.
This work establishes theoretical foundations for what we term “parallel attention networks” - computational systems that can simultaneously process all elements of a sequence while maintaining awareness of contextual relationships. The implications for machine translation, speech recognition, and automated reasoning appear substantial.
1. Introduction
The recent development of the Fast Fourier Transform has demonstrated the power of parallel computational approaches to problems traditionally solved through sequential methods. In developing that algorithm, we observed that the fundamental insight - decomposing complex operations into efficiently computable parallel sub-operations - could be extended beyond frequency domain analysis.
This paper presents a generalization of these principles to sequential data processing, where each position in a sequence must “attend” to potentially all other positions to determine its output representation. We call this mechanism “parallel attention” and demonstrate its implementation using modern transistor-based computational arrays.
The motivation arose from our work on automatic speech recognition for the Bell System. Traditional approaches process speech signals sequentially, but human listeners clearly integrate information across extended temporal contexts. Our attention mechanism provides a mathematical framework for this parallel integration.
2. Mathematical Foundations
Consider a sequence of input vectors X = {x₁, x₂, ..., xₙ} where each…
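(For the curious: the “parallel attention” the spoof paper gestures at corresponds to what the real 2017 paper calls scaled dot-product self-attention. Here is a minimal NumPy sketch of that mechanism - the function and variable names are my own, purely for illustration, and this is a modern rendering rather than anything you could have run on a System/360.)

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every position scores every other position: this is the O(n^2) step
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted combination of values

# Toy example: a sequence of n=4 vectors with d=8 features
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Each output vector is a context-weighted blend of every position in the sequence - the “simultaneous consideration of long-range dependencies” the abstract claims, and the part a 1965 machine would have found ruinously expensive.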
(Disclosure: I used Claude-Sonnet to help me come up with this authentic-ish spoof paper that kicked off my imaginary world - check the “authors” and date! The rest of the stories will be written by me, dragged kicking and screaming to the page.)
Thank you for reading,
Bern.
¹ Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). "Attention is All You Need." Advances in Neural Information Processing Systems, 30.