The term “artificial intelligence” was coined seventy years ago, defined as computer-related research based on the assumption that any feature of human intelligence can be simulated in a machine. The seven decades that followed have been characterized by exaggerated promises and subsequent disappointments, surprising developments and the re-emergence of discredited methods, and widespread excitement and anxiety fed by a gullible press and popular fiction.
A few new ideas that emerged over the dozen years before the term was invented anticipated what would come next in the evolution of thinking about “thinking machines” and of attempts to replicate human intelligence in computers.
The emergence of modern thinking machines (1943-1949)
John Mauchly and J. Presper Eckert of the Moore School at the University of Pennsylvania submitted a proposal in April 1943 to the U.S. Army’s Ballistics Research Laboratory for building an “electronic calculator.” The result was the ENIAC, the first electronic general-purpose computer, unveiled to the public in February 1946.
Working as a consultant to the ENIAC project, mathematician John von Neumann distributed (June 1945) a report defining the stored-program computer architecture, which has served to this day as the basic design for all modern computers. Using terms taken from biology, von Neumann described the various parts of the computer as “organs,” the building blocks of computer logic as “neurons,” and the internal storage unit as “memory.” The ENIAC and other early computers were popularly called “thinking machines.”
In 1949, Edmund Berkeley published Giant Brains: Or Machines That Think, in which he wrote: “Recently there has been a good deal of news about strange giant machines that can handle information with vast speed and skill…. These machines are similar to what a brain would be if it were made of hardware and wire instead of flesh and nerves… A machine can handle information; it can calculate, conclude, and choose; it can perform reasonable operations with information. A machine, therefore, can think.”
The emergence of brain-inspired artificial neural networks (1943-1949)
Neurophysiologist Warren S. McCulloch and logician Walter Pitts published (June 1943) “A Logical Calculus of the Ideas Immanent in Nervous Activity,” in which they discussed networks of idealized and simplified neurons and how they might perform simple logical functions. The paper, with its mathematical description of the functioning of nerve cells, became the inspiration for the development of computer-based “artificial neural networks” and their popular description as “mimicking the brain.”
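In modern terms, such an idealized neuron is easy to sketch: a unit that “fires” when the weighted sum of its binary inputs reaches a threshold, which is enough to implement simple logic gates. The few lines of Python below are an illustration of the idea, not McCulloch and Pitts’ original notation.

```python
# A minimal sketch (not McCulloch and Pitts' original notation) of an
# idealized binary neuron: it fires (outputs 1) when the weighted sum
# of its inputs reaches a threshold.
def mcculloch_pitts_neuron(inputs, weights, threshold):
    return 1 if sum(i * w for i, w in zip(inputs, weights)) >= threshold else 0

# With suitable weights and thresholds, such a unit behaves like a logic gate.
AND = lambda a, b: mcculloch_pitts_neuron([a, b], [1, 1], threshold=2)
OR  = lambda a, b: mcculloch_pitts_neuron([a, b], [1, 1], threshold=1)

assert AND(1, 1) == 1 and AND(1, 0) == 0
assert OR(0, 1) == 1 and OR(0, 0) == 0
```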
In Organization of Behavior: A Neuropsychological Theory (1949), psychologist Donald Hebb further postulated how neural networks could learn. Hebb’s theory is often summarized as "neurons that fire together wire together," describing how synapses—the connections between neurons—strengthen or weaken over time. It paved the way for the development of computer algorithms that were presumed to emulate the cognitive processes of the human brain. The manipulation of “weights”—numerical values representing the strength of the connection between two nodes in an artificial neural network—became the main preoccupation of researchers working in the approach to AI called “Connectionism.”
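Hebb’s rule can likewise be sketched in a few lines: when two connected units are active at the same time, the weight between them grows. The learning rate and activity values in the illustration below are arbitrary.

```python
# A minimal sketch of Hebbian learning: when two connected units are
# active at the same time, the weight between them is strengthened.
# The learning rate and activity values here are illustrative only.
def hebbian_update(weight, pre_activity, post_activity, learning_rate=0.1):
    return weight + learning_rate * pre_activity * post_activity

w = 0.0
for pre, post in [(1, 1), (1, 1), (1, 0), (0, 1), (1, 1)]:
    w = hebbian_update(w, pre, post)
print(w)  # the weight grows only on the trials where both units fired
```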
Defining intelligence as rule-based symbol manipulation (1950-1975)
In “Computing Machinery and Intelligence” (October 1950), Alan Turing proposed “the imitation game.” Avoiding the thorny issue of defining “thinking,” Turing replaced the question "Can machines think?" with the question "Can machines do what we (as thinking entities) can do?" Turing’s imitation game, or the “Turing Test” as it became popularly known, assessed how well a computer program could convincingly imitate human conversation.
In an August 1955 proposal for a summer workshop, the term “artificial intelligence” was coined. The proposal stated that “any feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.” The workshop took place a year later at Dartmouth College and is generally considered the event that gave birth to the artificial intelligence field.
Two participants in the workshop, John McCarthy and Marvin Minsky, later established artificial intelligence research centers at Stanford and MIT, respectively, and received the Turing Award in 1969 (Minsky) and 1971 (McCarthy). Two other participants, Herbert Simon and Allen Newell (who shared the 1975 Turing Award), developed the Logic Theorist in December 1955, an artificial intelligence program that proved 38 of the first 52 theorems in Whitehead and Russell's Principia Mathematica. This program launched “symbolic AI”—defining formal rules for manipulating symbols (e.g., words, numbers) and expressing human reasoning in code, i.e., drawing inferences and arriving at logical conclusions. It became the dominant approach to AI for the next several decades.
John McCarthy developed Lisp (described in an April 1960 paper), which became the major AI programming language for the next thirty years. Around that time, McCarthy also advanced a version of symbolic AI that emphasized a formal, explicit representation of the world and how it works, manipulated by deductive processes. In “Programs with Common Sense” (1959), published in the Proceedings of the Symposium on Mechanization of Thought Processes, he described the Advice Taker, a program for solving problems by manipulating sentences in formal languages, with the ultimate objective of making programs “that learn from their experience as effectively as humans do.”
Minsky and a few of his students at MIT worked (1963-1975) on a handful of narrow problems, developing AI programs that could function in what they called “microworlds.” One example is Terry Winograd’s SHRDLU (1968), a natural language understanding program with which the user could converse in ordinary English, instructing the computer to move or place blocks in different positions in a virtual environment and asking it questions about them.
Moving from microworlds and narrowly defined problems to larger real-world environments and complex problems, however, proved to be very difficult. The assumption that all it would take to “scale up” was faster hardware and larger memories turned out to be too optimistic. A 1973 report on the state of AI in Britain, describing the failure to apply research results to real-world problems, convinced the British government to end support for AI research in all but two universities. Similar disappointment with AI research's failure to deliver on overhyped promises led DARPA in the U.S. to cut its AI funding.
Defining machine learning as computers improving their task-specific performance on their own (1951-1969)
In 1952, IBM engineer Arthur Samuel started developing the first computer checkers-playing program and the first computer program to learn on its own, demonstrating it on television in 1956. While he considered himself “one of the very first to work in the general field later to become known as ‘artificial intelligence,’” he noted that, at the time, IBM “did not talk about artificial intelligence publicly,” so as not to scare customers with speculation about humans losing out to machines.
Indeed, among other activities aimed at dispelling the notion that computers were smarter than humans, IBM sponsored (in part) the 1957 movie Desk Set, featuring a “methods engineer” (Spencer Tracy) who installs the fictional and ominous-looking “electronic brain” EMERAC and a corporate librarian (Katharine Hepburn) telling her anxious colleagues in the research department: “They can’t build a machine to do our job—there are too many cross-references in this place.” By the end of the movie, she proves her point by winning a contest with the computer and the engineer’s heart.
Samuel coined the term “machine learning” in 1959, reporting on programming a computer “so that it will learn to play a better game of checkers than can be played by the person who wrote the program.” In developing the program, Samuel used what he called “rote learning” (the computer memorizing moves and their outcomes) and an early version of “reinforcement learning” (learning from positive and negative feedback), which plays an important role in some of today’s successful AI programs. He described his approach to machine learning as particularly suited for very specific tasks, in distinction to the “Neural-Net approach,” which he thought could lead to the development of general-purpose learning machines.
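Samuel’s actual program scored board positions with a weighted evaluation function; the sketch below is not his code but conveys the flavor of learning from feedback, with positions that led to wins nudged up in value and positions that led to losses nudged down.

```python
# A minimal sketch (not Samuel's actual program) of learning a board game
# from feedback: each position's estimated value is nudged toward the
# outcome of the games in which it appeared.
position_value = {}          # "rote" memory of positions seen so far
LEARNING_RATE = 0.1

def update_from_game(positions_visited, outcome):
    """outcome: +1 for a win, -1 for a loss."""
    for pos in positions_visited:
        old = position_value.get(pos, 0.0)
        position_value[pos] = old + LEARNING_RATE * (outcome - old)

# Illustrative use with made-up position identifiers:
update_from_game(["p1", "p2", "p3"], outcome=+1)
update_from_game(["p2", "p4"], outcome=-1)
print(position_value)  # "p2" reflects both a win and a loss
```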
The “Neural-Net approach,” or artificial neural networks, advanced in the 1950s, but in learning specific tasks rather than in general-purpose learning. In 1951, Marvin Minsky and Dean Edmonds (then graduate students) built SNARC (Stochastic Neural Analog Reinforcement Calculator), the first artificial neural network, using 300 vacuum tubes to support a network of 40 neurons that simulated the brain of a rat learning its way through a maze.
In 1957, psychologist Frank Rosenblatt developed the Perceptron, a single-layer artificial neural network implemented in a purpose-built device designed for image recognition. It was sponsored by the Office of Naval Research and followed the conjectures advanced by McCulloch and Pitts. The New York Times reported that the Perceptron was "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." The New Yorker called it a “remarkable machine… capable of what amounts to thought.” In 1969, MIT’s Minsky and Papert published Perceptrons: An Introduction to Computational Geometry, highlighting the limitations of simple artificial neural networks. Funding for artificial neural network research evaporated, and connectionist AI went into a hibernation lasting more than 15 years.
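Rosenblatt’s perceptron adjusted its weights only when it misclassified a training example. A minimal sketch of that learning rule, with made-up data, is shown below; as Minsky and Papert emphasized, a single such layer can learn OR but cannot learn a function like XOR, whose classes no single straight line can separate.

```python
# A minimal sketch of the perceptron learning rule: weights are adjusted
# only when the unit misclassifies a training example.
def train_perceptron(samples, epochs=10, lr=1.0):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            prediction = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - prediction
            w[0] += lr * error * x1
            w[1] += lr * error * x2
            b += lr * error
    return w, b

# A single-layer perceptron learns OR easily...
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_perceptron(or_data)
print([(1 if w[0]*x1 + w[1]*x2 + b > 0 else 0) for (x1, x2), _ in or_data])
# ...but no choice of w and b can separate the XOR classes with one line,
# the kind of limitation Minsky and Papert highlighted.
```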
Symbolic AI researchers continued to attract funding, mostly from the U.S. government, and continued to generate inflated expectations. In 1965, Herbert Simon predicted that "machines will be capable, within twenty years, of doing any work a man can do." In the same year, I.J. Good wrote that “an ultraintelligent machine could design even better machines; there would then unquestionably be an 'intelligence explosion,' and the intelligence of man would be left far behind... Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control.” Marvin Minsky predicted in 1970 that “in from three to eight years we will have a machine with the general intelligence of an average human being… Once the computers got control, we might never get it back. We would survive at their sufferance. If we’re lucky, they might decide to keep us as pets.”
In response, philosopher Hubert Dreyfus published, starting in 1965, a series of papers and books arguing that the mind is not like a computer and that there were limits beyond which AI would not progress. His MIT colleague, computer scientist Joseph Weizenbaum, developed ELIZA (1966), an interactive program, or chatbot, that carried on a dialogue in English on any topic. Weizenbaum, who wanted to demonstrate the superficiality of communication between man and machine, was surprised by the number of people who attributed human-like feelings to the computer program.
Why do so many people, those who work on AI, those who only read about it in science fiction novels, and those who have only a vague notion of what it is, believe in the possibility of “thinking machines,” the arrival—sooner or later—of human-like computers?
Because of the modern religion or, more accurately, an important dimension of the prevailing belief system of the Western world: that humans are like machines, ergo humans can be replicated in machines, and humans can create these human-like machines. In 1968, digital prophet Stewart Brand provided the best encapsulation of this modern religion when he opened the first edition of the Whole Earth Catalog with “We are as gods and might as well get good at it.”
In 1968, the film 2001: A Space Odyssey featured HAL 9000, a sentient—and murderous—computer.
Knowledge is power: the rise and demise of expert systems (1965-1986)
Expert systems represented a new stage in the evolution of symbolic AI, with a new focus on capturing and programming real-world knowledge, specifically the knowledge of specialized domain experts and their heuristic knowledge (“rules of thumb”). Endowing computers with the essence of how experts made their judgments and decisions was thought to be a better strategy than coming up with formal rules to represent a specific cognitive activity.
Starting in 1965, Stanford’s Ed Feigenbaum (1994 Turing Award) led the development of the first expert system, DENDRAL, automating the decision-making process and problem-solving behavior of organic chemists. From 1973 on, Feigenbaum and others developed MYCIN, designed to assist physicians by recommending treatments for certain infectious diseases. MYCIN could explain its recommendations to its intended users and performed well.
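At their core, expert systems encoded such expertise as collections of if-then rules applied by an inference engine that could also report which rules fired. The sketch below is purely illustrative, with invented rules rather than anything taken from DENDRAL or MYCIN.

```python
# A purely illustrative rule-based sketch in the spirit of expert systems
# (the rules are invented for this example, not taken from MYCIN or DENDRAL):
# an inference engine applies if-then rules to known facts until nothing
# new can be concluded, keeping a trace it can use to explain itself.
rules = [
    ({"fever", "cough"}, "respiratory_infection"),
    ({"respiratory_infection", "bacterial"}, "recommend_antibiotic"),
]

def forward_chain(facts):
    facts = set(facts)
    explanation = []
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                explanation.append(f"{sorted(conditions)} -> {conclusion}")
                changed = True
    return facts, explanation

facts, why = forward_chain({"fever", "cough", "bacterial"})
print(facts)
print(why)   # the trace is a crude analogue of MYCIN-style explanations
```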
Expert systems grew in popularity and by the 1980s it was estimated that two-thirds of Fortune 500 companies applied the technology in daily business activities. But already in 1983, Feigenbaum identified the “key bottleneck” that led to their eventual demise, that of scaling up the knowledge acquisition process: “In the decades to come, we must have more automatic means for replacing what is currently a very tedious, time-consuming, and expensive procedure.”
Ten years later, the challenge of scaling up and automating knowledge acquisition was addressed by the introduction of the World Wide Web, which accelerated the sharing and digitization of knowledge that had begun with the development of the internet in 1969.
Establishing the infrastructure for a worldwide computer network (1969-1997)
The first network connection of what later became known as the internet, between UCLA and SRI, went live on October 29, 1969. Bob Metcalfe at Xerox PARC wrote the memo inventing Ethernet on May 22, 1973, defining what became the dominant technology for local area networks. Ethernet later served as the foundation for the wireless networking standard Wi-Fi, introduced on September 21, 1997. Connecting people in a global network of computers and communications devices reaching anywhere and everywhere not only increased the amount of data, information, and knowledge available but also led to numerous new ways of sharing and using it, most recently with new AI tools.
Defining machine learning as statistical analysis-driven pattern recognition (1988-2012)
Until the late 1980s, the field of machine learning was heavily influenced by the dominant symbolic AI approach, putting its emphasis on symbolic representations of learned knowledge, such as decision trees. This approach to machine learning was concerned with learning from relatively few training cases, similar to how humans learn.
Starting in the late 1980s, however, statistical analysis gradually became an integral part of what was referred to as “machine learning,” specifically the automated identification of patterns and regularities in a given data set. Emphasis shifted to classification (e.g., is this email “spam” or not?) and regression (e.g., forecasting a trend or inferring causal relations) tasks, as opposed to the more complex tasks of reasoning, problem-solving, and language understanding that had played important roles earlier. Most importantly, this approach to getting computers to learn and perform human-like cognitive tasks is based on rigorous mathematical methods that solve specific challenges (e.g., speech recognition) without claiming that human brains use similar methods.
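As a concrete illustration of those two staple tasks, the toy sketch below (made-up numbers, ordinary least squares, and a nearest-mean classifier) fits a trend line to a handful of data points and assigns a new point to one of two classes; it is not tied to any particular historical system.

```python
import numpy as np

# Toy illustration (made-up data) of the two staple tasks of statistical
# machine learning: regression (fit a trend) and classification (assign a label).

# Regression: fit y = a*x + b by ordinary least squares.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
a, b = np.polyfit(x, y, deg=1)
print(f"trend: y ~ {a:.2f}x + {b:.2f}")     # roughly y ~ 2x

# Classification: label a new point by the nearer of two class means
# (a minimal stand-in for "spam vs. not spam" style decisions).
spam     = np.array([[8.0, 1.0], [9.0, 2.0], [7.5, 1.5]])   # e.g., feature counts
not_spam = np.array([[1.0, 6.0], [2.0, 7.0], [1.5, 8.0]])
new_point = np.array([7.0, 2.0])
label = "spam" if (np.linalg.norm(new_point - spam.mean(axis=0))
                   < np.linalg.norm(new_point - not_spam.mean(axis=0))) else "not spam"
print(label)
```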
The availability of long-tested and proven statistical tools (computer-aided since the 1960s) and increased accessibility to (constantly growing) data repositories helped drive and accelerate this shift. Another catalyst was the commercial success of many “data mining” applications.
In 1988, members of the IBM T.J. Watson Research Center published “A Statistical Approach to Language Translation,” reporting on successfully translating between English and French, on the basis of 2.2 million pairs of sentences, mostly from the bilingual proceedings of the Canadian parliament. This project heralded the redefinition of machine learning as statistical analysis of known examples (supervised learning) or finding inherent patterns in the data that are not described or manually “labeled” (unsupervised learning).
Many common pattern recognition algorithms are probabilistic in nature, in that they use statistical inference to find the best description or label for a given instance. In 1988, Judea Pearl (2011 Turing Award) published Probabilistic Reasoning in Intelligent Systems, inventing Bayesian Networks as a rigorous formal method for representing and processing knowledge under uncertainty. Combining elements from statistics, operations research, decision theory, and control theory, Pearl’s work also bridged the symbolic and connectionist approaches to AI and within a few years, both camps adopted a probabilistic approach to AI.
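The flavor of such probabilistic reasoning, reduced to its simplest possible form, is a two-node network in which a disease influences a test result and Bayes' rule updates the belief in the disease after a positive test. The numbers below are invented for illustration.

```python
# A two-node "network" (Disease -> Test) with invented numbers, illustrating
# the kind of belief updating Bayesian networks perform at scale.
p_disease = 0.01                 # prior P(disease)
p_pos_given_disease = 0.95       # P(positive test | disease)
p_pos_given_healthy = 0.05       # P(positive test | no disease)

# Bayes' rule: P(disease | positive) = P(pos|disease)P(disease) / P(pos)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # about 0.161: a positive test raises
                                       # the 1% prior to roughly 16%
```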
Improving the performance of artificial neural networks (1979-2012)
In 1979, Japanese computer scientist Kunihiko Fukushima proposed the neocognitron, a hierarchical, multilayered artificial neural network first used for Japanese handwritten character recognition and other pattern recognition tasks. It was influenced by research on visual processing in the brain conducted in the late 1950s, with the connectivity pattern between artificial neurons thought to resemble the organization of the visual cortex.
The neocognitron served as the inspiration for the development of convolutional neural networks (CNN), which automatically learn which features of the data are important for the task at hand, making them less dependent on human feature selection than other image classification algorithms.
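The operation a CNN stacks and learns is the convolution itself: sliding a small filter over an image and recording how strongly each patch matches it. The sketch below applies one hand-written vertical-edge filter to a tiny made-up image; in a real CNN the filter values are learned rather than fixed.

```python
import numpy as np

# A single convolution step: slide a small filter over an image and record
# how strongly each patch responds. In a CNN the filter values are learned;
# here a fixed vertical-edge filter and a tiny made-up image are used.
def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
vertical_edge = np.array([[-1, 1],
                          [-1, 1]], dtype=float)
print(convolve2d(image, vertical_edge))   # strongest response at the 0->1 edge
```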
The most important breakthrough in this new and improved stage in the evolution of artificial neural networks came in 1986, when Geoffrey Hinton (2018 Turing Award), David Rumelhart, and Ronald Williams published a pair of landmark papers popularizing “backpropagation” and showing its positive impact on the performance of neural networks. The term refers to the way the algorithm propagates measures of the error in the network’s guesses backward through its neurons, starting with those directly connected to the outputs. This allowed networks with intermediate “hidden” neurons between the input and output layers to learn from their mistakes, overcoming the limitations noted by Minsky and Papert in 1969.
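A minimal sketch of the idea, using nothing beyond numpy: a network with one hidden layer learns XOR, the very function a single-layer perceptron cannot represent, by propagating the output error backward to update both layers of weights. The layer size, learning rate, and iteration count are arbitrary illustrative choices.

```python
import numpy as np

# A minimal backpropagation sketch: a one-hidden-layer network learns XOR,
# which a single-layer perceptron cannot represent. All hyperparameters
# (layer size, learning rate, iterations) are arbitrary illustrative choices.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for _ in range(10000):
    # forward pass
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    # backward pass: start from the output error, push it back through the layers
    d_output = (output - y) * output * (1 - output)
    d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)
    W2 -= lr * hidden.T @ d_output; b2 -= lr * d_output.sum(axis=0)
    W1 -= lr * X.T @ d_hidden;      b1 -= lr * d_hidden.sum(axis=0)

print(np.round(output.ravel(), 2))   # should approach [0, 1, 1, 0]
```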
In 1989, Yann LeCun (2018 Turing Award) and other researchers at AT&T Bell Labs successfully applied a backpropagation algorithm to a multi-layer neural network that recognized handwritten ZIP codes. Given the hardware limitations at the time, it took about three days to train the network (still a significant improvement over earlier efforts). Backpropagation fell from favor for a while but got back into the game in a big way when GPUs were added to the neural network mix, vastly improving hardware performance.
Establishing the infrastructure for a worldwide data network (1989-present)
In 1989, Tim Berners-Lee at CERN proposed a new way to organize and access research papers, linking them not through indexing or other hierarchical means but by representing associations in a flexible manner, “something the brain can do easily, spontaneously.” On April 30, 1993, CERN declared the Web protocol and code free to all users, launching a software layer on top of the internet, the Web, which has since served as the foundation for numerous applications for creating, sharing, and using data.
This has generated a tsunami of “big data,” a term that first appeared in computer science literature in October 1997 in the context of computer visualization research. The large data sets involved in digitizing 3D images were also of interest at the time to computer vendors developing and selling graphics-handling workstations. Real-time 3D graphics were becoming increasingly common in arcade, computer and console games, leading to an increased demand for hardware-accelerated 3D graphics. Sony first used the term GPU (for Geometry Processing Unit) when it launched the home video game console PS1 in 1994.
The year before, Nvidia’s three cofounders identified the emerging market for specialized chips that would generate faster and more realistic graphics for video games. But they also believed that these graphics processing units could solve new challenges that general-purpose computer chips could not. The new challenges had mostly to do with the storage, distribution and use of the rapidly growing quantities of data and the digitization of all types of information, whether in the form of text, audio, images, or video. In 1986, 99.2% of all storage capacity in the world was analog, but in 2007, 94% of storage capacity was digital, a complete reversal of roles.
The availability of large data sets led to the development of algorithms specially designed to take advantage of big data. In 2011, big data was an important factor in IBM Watson’s victory over Jeopardy! champions, a victory that had a major impact on the public perception of AI.
The triumph of machine learning as deep learning or the new AI (2012-present)
The “perfect storm” of big data, improved algorithms, and GPUs led to the re-branding of artificial neural networks as “deep learning.” In the 2012 ImageNet competition, which required classifying images into one of a thousand categories, a convolutional neural network supported by GPUs achieved an error rate of 15.3% compared to the 26.2% error rate achieved by the second-best entry. Similar gains have been reported in speech recognition, machine translation, medical diagnosis and game playing, most notably the AlphaGo victories over human Go players starting in 2016. The latter achievement had the greatest impact on public perceptions of what was now increasingly called “AI” rather than deep learning.
Powerful hardware was an important factor in the success of the new AI. A standard computer CPU could do about 10^10 operations per second, but a deep learning algorithm running on specialized hardware (GPU and its variants) could process between 10^14 and 10^17 operations per second.
Improved software has also contributed to the continuing progress of deep learning. Unlike feedforward networks such as CNNs, which treat each input independently of the others, the output of another type of neural network—recurrent neural networks or RNN—depends on the prior elements within a sequence (as in spoken words or written text). RNN rely on their “memory,” carrying information from prior inputs forward to influence the current output.
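A bare-bones recurrent step makes the “memory” concrete: the same weights are applied at every position in the sequence, but the hidden state carries information forward from earlier inputs. The sizes and weights below are random placeholders, not a trained model.

```python
import numpy as np

# A bare-bones recurrent step: the hidden state carries information from
# earlier elements of the sequence into the processing of the current one.
# Sizes and weights are random placeholders, not a trained model.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 5
W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h  = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """The new hidden state depends on the current input AND the previous state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

sequence = rng.normal(size=(4, input_size))   # e.g., 4 word vectors
h = np.zeros(hidden_size)                     # the network's "memory"
for x_t in sequence:
    h = rnn_step(x_t, h)
print(h)   # a summary of the whole sequence, order included
```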
The influential 2017 article “Attention Is All You Need” introduced the transformer architecture, further improving the design of neural networks. While an RNN processes words one after another in sequence (in the case of natural language processing, for example), the key component of the transformer architecture is a self-attention mechanism that also relates words that are far apart from each other, allowing the model to process the full context of a given text or other type of input.
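The self-attention computation at the heart of the transformer fits in a few lines: every position produces a query, a key, and a value, and each output is a weighted mix of all the values, with weights determined by how well that position’s query matches every key, near or far. The projection matrices below are random placeholders rather than trained parameters.

```python
import numpy as np

# Scaled dot-product self-attention, the core of the transformer: each
# position attends to every other position, however far away. Projection
# matrices here are random placeholders rather than trained parameters.
def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # similarity of each query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all positions
    return weights @ V                               # each output mixes all values

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                              # e.g., 6 tokens, 8-dim embeddings
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)        # (6, 8): one vector per token
```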
The invention of the transformer architecture has led to the development of Large Language Models and the public release in November 2022 of ChatGPT. It became the fastest-growing consumer application in history, reaching 100 million users within two months and reigniting, yet again, the seven-decade-long excitement and anxiety about artificial intelligence.
In April 2024, Elon Musk predicted the arrival, “probably in 2025 or 2026,” of artificial general intelligence or AI that is “smarter than the smartest human.”