Facebook’s machine learning pipeline—from research to production—is aimed at an AI future.

Facebook will one day have a conversational agent with human-like intelligence. Siri, Google Now, and Cortana all currently attempt to do this, but go off script and they fail. That's just one reason why Mark Zuckerberg famously built his own AI for home use in 2016; the existing landscape didn't quite meet his needs.

Of course, his company has started to build its AI platform, too—it's called Project M. M will not have human-like intelligence, but it will have intelligence in narrow domains and will learn by observing humans. And M is just one of many research projects and production AI systems being engineered to make AI the next big Facebook platform.

On the road to this human-like intelligence, Facebook will use machine learning (ML), a branch of artificial intelligence (AI), to understand all the content users feed into the company’s infrastructure. Facebook wants to use AI to teach its platform to understand the meaning of posts, stories, comments, images, and videos. Facebook then stores what ML extracts as metadata to improve ad targeting and increase the relevance of users' newsfeed content. The metadata also acts as raw material for creating an advanced conversational agent.

These efforts are not some far-off goal: AI is the next platform for Facebook right now. The company is quietly approaching this initiative with the same urgency as its previous Web-to-mobile pivot. (For perspective, mobile currently accounts for 84 percent of Facebook's revenue.) While you can't currently shout "OK Facebook" or "Hey Facebook" to interact with your favorite social media platform, today plenty of AI powers the way Facebook engages us—whether through images, video, the newsfeed, or its budding chatbots. And if the company's engineering collective has its way, that automation will only increase.

Building an intelligent assistant, in theory

In its early stage, Project M exists as a text-based digital assistant that pairs AI with human trainers to resolve user intent (what the user wants, such as calling an Uber) as it surfaces during a conversation between a user and a Facebook Messenger bot trained using ML. When a human trainer intervenes to resolve the intent, the bot listens and learns, improving the accuracy of its intent predictions the next time.

When met with a question, if the bot calculates a low probability that its response will be accurate, it requests the trainer's help. If it estimates its accuracy as high, the bot responds to the user directly, unnoticed by the trainer.
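
Facebook has not published M's internals, but the decision logic described above can be sketched as a simple confidence threshold. Everything below (the class names, the 0.8 cutoff, and the stand-in trainer) is hypothetical, included only to illustrate the handoff.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a confidence-gated handoff: the bot answers on its own
# when its intent model is confident, and quietly escalates to a human trainer
# when it is not, logging the correction as future training data.

CONFIDENCE_THRESHOLD = 0.8   # invented cutoff for illustration

@dataclass
class IntentModel:
    examples: list = field(default_factory=list)   # (message, intent) pairs for retraining

    def predict(self, message):
        # Stand-in for a trained classifier; returns (intent, confidence).
        if "ride" in message.lower():
            return "request_ride", 0.93
        return "unknown", 0.35

    def record_example(self, message, intent):
        self.examples.append((message, intent))

def ask_trainer(message, suggestion):
    # Stand-in for routing the message to a human trainer's queue.
    print(f"[trainer] '{message}' (bot guessed: {suggestion})")
    return "book_restaurant"

def handle(message, model):
    intent, confidence = model.predict(message)
    if confidence >= CONFIDENCE_THRESHOLD:
        return intent                                # bot replies on its own
    resolved = ask_trainer(message, suggestion=intent)
    model.record_example(message, resolved)          # the bot "listens and learns"
    return resolved

model = IntentModel()
print(handle("Can you get me a ride to the airport?", model))
print(handle("Dinner for two at 8?", model))
```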

This interaction is possible because of Memory Networks, created by the Facebook Artificial Intelligence Research (FAIR) group, which was founded in December 2014. A Memory Network is a neural net with an associated memory on the side. Though not directly modeled on the human brain, the design invites an analogy: the neural net is like the cortex, and the associated memory is like the hippocampus, which consolidates short-term and spatial-navigation memories for transfer into long-term storage. When the information moves to the cortex, or here the neural network, it is transformed into thought and action.
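
To make the architecture a bit more concrete, here is a minimal sketch of the attention-based memory lookup at the core of end-to-end Memory Networks, as described in FAIR's published papers. It is not Facebook's code; the vocabulary, sentences, and random embedding matrices are toy stand-ins, and a real model learns its embeddings through training.

```python
import numpy as np

# Toy illustration of a Memory Network's side memory: facts are embedded into
# memory slots, a question is embedded as a query, and soft attention decides
# which memories are relevant.

rng = np.random.default_rng(0)
vocab = {"sam": 0, "walks": 1, "kitchen": 2, "picks": 3, "apple": 4, "where": 5, "is": 6}
d = 16                                    # embedding dimension
A = rng.normal(size=(len(vocab), d))      # embeds memory sentences (random stand-in)
B = rng.normal(size=(len(vocab), d))      # embeds the question (random stand-in)

def embed(sentence, matrix):
    """Bag-of-words embedding: sum the word vectors of a sentence."""
    return sum(matrix[vocab[w]] for w in sentence.split())

# Facts stored in the side memory, and a question for the network to answer.
memory_sentences = ["sam walks kitchen", "sam picks apple"]
memory = np.stack([embed(s, A) for s in memory_sentences])
query = embed("where is sam", B)

# Soft attention: score each memory slot against the query, normalize with softmax.
scores = memory @ query
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# The retrieved memory is a weighted sum of slots; a real model feeds this,
# plus the query, into further layers that produce the answer.
retrieved = weights @ memory              # vector of length d
print(dict(zip(memory_sentences, weights.round(2))))
```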

Facebook has open-sourced its Memory Networks work, publishing the research for the wider AI community. Artificial Intelligence Research Director Yann LeCun describes Facebook’s intelligent conversational agent of the future as a very advanced version of the Project M that exists today.

“It's basically M, but completely automated and personalized," he said. "So M is your friend, and it's not everybody's M, it's your M, you interacted with it, it's personalized, it knows you, you know it, and the dialogues you can have with it are informative, useful… The personalized assistant that you take everywhere basically helps you with everything. That requires human-level of intelligence, essentially.”

LeCun is a pioneer in AI and ML research. He was recruited to Facebook to build and lead FAIR, which is essentially the first stage in the pipeline between blue-sky research and the artificially intelligent systems that everyone on Facebook uses today.

As the advanced research indicates, the current Project M bots are not LeCun’s end goal. They are a milestone, one of many on the way to the long-term goal of an intelligent conversational agent. LeCun cannot predict when that goal will be reached, and it may not even happen during his professional career. But each interim milestone defines the hardware and software that needs to be built so that a future machine can reason more like a human. Functionality becomes better defined with each iteration.

The obstacles to teaching computers to reason like humans are significant. And with his 30 years of research experience in the field, LeCun believes Facebook can focus on 10 scientific questions to better emulate human-like intelligence. He shared a few of these during our visit.

For instance, at ages three to five months, babies learn the notion of object permanence, a fancy way of saying that a baby knows an object hidden behind another is still there; babies also learn that an unsupported object will fall. AI researchers have not yet built an ML model that understands object permanence.

As another example, today sentences like "the trophy didn't fit in the suitcase because it was too small" pose too much ambiguity for AI systems to understand with high probability. Humans easily disambiguate that the pronoun “it” refers to the suitcase, but computers struggle to resolve the meaning. This is a class of problem called a Winograd Schema. Last summer, in the first annual Winograd Schema Challenge, the best-trained computer scored 58 percent when interpreting 60 sentences. To contextualize that score, humans scored 90 percent and completely random guessing scored 44 percent—computers are currently closer to a guess than they are to humans when it comes to these problems.

“It turns out this ability to predict what's going to happen next is one essential piece of an AI system that we don't know how to build," LeCun says, explaining the general problem of a machine predicting that “it” refers to the suitcase. "How do you train a machine to predict something that is essentially unpredictable? That poses a very concrete mathematical problem, which is, how do you do ML when the thing to predict is not a single thing, but an ensemble of possibilities?”

Hardware as the catalyst

If these problems can be solved and the 10 scientific questions can be answered, then ML models can be built that reason like a human. But new hardware will be needed to run them: very, very large neural networks on a yet-to-be-conceived distributed computational architecture, connected by very high-speed networks and running highly optimized algorithms. On top of that, new specialized supercomputers that are very good at numerical computation will be needed to train these models.

The ML developments of the last decade give credence to the idea of new, specialized hardware as a catalyst. Though the underlying research was sound, few researchers pursued ML at the time; it was believed to be a dead end because generic hardware powerful enough to support the research was not available. In 2011, the 16,000 CPUs housed in Google’s giant data center, used by Google Brain to recognize cats and people by watching YouTube videos, proved that ML worked. But the setup also proved that few research teams outside of Google had the hardware resources to pursue the field.

The breakthrough came in 2011 when Nvidia researcher Bryan Catanzaro teamed with Andrew Ng’s team at Stanford. Together, these researchers proved that 12 Nvidia GPUs could deliver the deep-learning performance of 2,000 CPUs. Commodity GPU hardware accelerated research at NYU, the University of Toronto, the University of Montreal, and the Swiss AI Lab, proving ML’s usefulness and renewing broad interest in the field of research.

Nvidia’s GPUs deliver more power to train and run ML models, but not at the scale LeCun’s ideal personal assistant requires. There is also a discontinuity between running ML models in research labs and running them at Facebook’s scale of 1.7 billion users. Academic feasibility has to be balanced with the feasibility of running the ML model cost-effectively at Facebook’s gargantuan scale for production infrastructure. The company would not share a specific number, but its data could be measured in exabytes.

Though some Facebook users know that the social network uses an algorithm to choose what posts and ads they see in their timeline, few understand that the company has applied ML to many of their interactions with Facebook. For each user, timeline posts, comments, searches, ads, images, and some videos are dynamically ranked using the ML model’s predictions of what the user is most likely to be interested in, click through, and/or comment on.
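
As a simplified, hypothetical illustration of that ranking step: a model's predicted probabilities of engagement can be combined into a single score, and the feed sorted by it. The weights and field names below are invented; Facebook's actual ranking features and formula are not public.

```python
# Hypothetical feed ranking: combine a model's predicted engagement probabilities
# into one score per candidate post, then sort the feed by that score.

def rank_feed(candidates):
    def score(post):
        p = post["predictions"]            # output of an ML inference model
        return 1.0 * p["click"] + 2.0 * p["comment"] + 1.5 * p["like"]
    return sorted(candidates, key=score, reverse=True)

feed = rank_feed([
    {"id": "post-a", "predictions": {"click": 0.40, "comment": 0.02, "like": 0.10}},
    {"id": "post-b", "predictions": {"click": 0.15, "comment": 0.20, "like": 0.30}},
    {"id": "post-c", "predictions": {"click": 0.05, "comment": 0.01, "like": 0.02}},
])
print([post["id"] for post in feed])       # highest predicted engagement first
```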

There are two stages to building ML neural networks like these. In the first stage, the neural network is trained using large labeled sample datasets: inputs paired with desired outputs. In the second stage, when the neural network is deployed, it runs inference, using its previously trained parameters to classify, recognize, and conditionally process unknown inputs such as timeline posts. Training and inference can run on different hardware platforms, each optimized for its stage.
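
A minimal PyTorch sketch of that two-stage split, assuming a toy model and random stand-in data: parameters are learned during training (typically on GPU-heavy hardware), then frozen, saved, and reloaded elsewhere to run inference only.

```python
import torch
from torch import nn

# Stage 1: train a tiny model on labeled samples (random tensors stand in for real data).
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

inputs, labels = torch.randn(256, 10), torch.randint(0, 2, (256,))
for _ in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()                      # backpropagate errors
    optimizer.step()                     # adjust the weights

torch.save(model.state_dict(), "model.pt")   # ship the trained parameters

# Stage 2: inference, possibly on entirely different hardware. No gradients,
# no weight updates: just classify new, unlabeled inputs.
serving_model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
serving_model.load_state_dict(torch.load("model.pt"))
serving_model.eval()
with torch.no_grad():
    prediction = serving_model(torch.randn(1, 10)).argmax(dim=1)
```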

Before AI, how neural networks recognized images

The best starting point to describe the status of Facebook’s AI program comes from 2012, when ML was applied to understanding the content and context of the images in users' posts. Applied computer vision was a widely researched field and an early demonstration of ML in academia. It was one of the signals that convinced Zuckerberg and Facebook CTO Mike Schroepfer (known as "Schrep" in-house) to expand the multi-stage AI pipeline from research to productization, coordinate AI as a company-wide platform, and increase investment in ML. This coincidentally occurred when GPUs dramatically improved the accuracy of image recognition, as shown by the results of the annual Large Scale Visual Recognition Challenge.

When Manohar Paluri joined Facebook’s Applied Computer Vision team in 2012 as an intern, the only image recognition in use was facial recognition. The search team was building a new grammatical structure for Facebook search that could not understand the content in images beyond the tags users may or may not have added. According to Paluri, the Applied Computer Vision team set out to “understand everything we can understand in an image without a specific use case in mind, but to build it in such a way that developers in the product groups can leverage the ML model and find their own answers.”

A neural network is a computing system made up of a number of simple, highly interconnected elements that process information based on their dynamic-state response to external inputs. It is trained to understand application-specific cases by processing large amounts of labeled data. An image of a bird is labeled bird, an image of a car is labeled car… and soon enough a very large sample of labeled images is reduced to pixels and processed. During this training stage, general-purpose ML software such as Torch or TensorFlow is used to train the network to recognize objects in photos.

The input layer, in this case, was a large set of labeled images; the output layer was the label describing the image as car or not car. The hidden layers of processing elements (commonly referred to as neurons) produce intermediate values that the ML software combines, through a learning algorithm, using parameters called weights that associate images of cars with the label. From there, the sample data is reprocessed without the labels to test how accurately the model predicts them. The results are compared, and the errors are fed back into the neural network to adjust the weights using a process called back-propagation. This iterative correction raises the probability of a correct prediction, so the image recognition model becomes more effective in the inference stage when recognizing content in new images.
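
Here is that loop written out from scratch in numpy, as a toy sketch rather than anything resembling Facebook's production training code: a forward pass through one hidden layer, an error measured against the labels, and back-propagation adjusting the weights. The 8-pixel "images" and the car/not-car rule are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))                  # 200 tiny "images", 8 pixels each
y = (X[:, :4].sum(axis=1) > 0).astype(float)   # toy label: 1 = car, 0 = not car

W1, b1 = rng.normal(scale=0.5, size=(8, 16)), np.zeros(16)   # input -> hidden
W2, b2 = rng.normal(scale=0.5, size=(16, 1)), np.zeros(1)    # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for epoch in range(500):
    # Forward pass: pixels -> hidden "neurons" -> probability of "car".
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2).ravel()

    # Back-propagation: push the prediction error back through the layers.
    err = (p - y)[:, None] / len(X)            # gradient of the cross-entropy loss
    grad_W2, grad_b2 = h.T @ err, err.sum(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)           # chain rule through tanh
    grad_W1, grad_b1 = X.T @ dh, dh.sum(axis=0)

    # Adjust each weight a little in the direction that reduces the error.
    for param, grad in ((W1, grad_W1), (b1, grad_b1), (W2, grad_W2), (b2, grad_b2)):
        param -= 1.0 * grad

accuracy = ((p > 0.5) == y).mean()
print(f"training accuracy after 500 iterations: {accuracy:.2f}")
```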

The first version of Paluri’s model labeled Facebook user images with a set of tags such as selfie, food, indoors, outdoors, landscape, etc. This image metadata was integrated as a node into Facebook’s Open Graph. Open Graph is Facebook’s dynamic object storage of everything that is shared on pages, and it has access restrictions according to the user's privacy settings. Users, articles, photos, music, and almost everything is stored as an Open Graph object, linked to other related objects. Paluri’s ML model added metadata that supplemented the poster’s comments and tags and provided understanding when comments were not included.
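
To picture how such metadata might sit alongside a post, here is a purely hypothetical data structure (not Facebook's Open Graph schema, which is not reproduced here) showing predicted tags stored next to the poster's own tags and links to related objects.

```python
from dataclasses import dataclass, field
from typing import List

# Invented illustration: ML-derived tags attached to a photo object as metadata,
# alongside user-supplied tags and links to related graph objects.

@dataclass
class GraphObject:
    object_id: str
    object_type: str                      # "photo", "user", "article", ...
    links: List[str] = field(default_factory=list)           # ids of related objects
    user_tags: List[str] = field(default_factory=list)
    predicted_tags: List[str] = field(default_factory=list)  # ML-derived metadata

photo = GraphObject("photo:123", "photo", links=["user:42"], user_tags=["#beach"])
photo.predicted_tags = ["outdoors", "landscape", "selfie"]    # classifier output

# Search and newsfeed ranking can now match the photo even when the poster
# added no descriptive comment at all.
print(photo)
```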

This additional metadata improved advertising, improved search (because users could find images of their friends on vacation with their wives in Hawaii), and optimized the posting order in the news feed to weigh the importance of posts based on users' interests. That last action resulted in users spending more time reading their timeline.

Since this first image-understanding project, image recognition models at Facebook have been improved beyond recognizing an object in the photo such as a cat. Now image recognition includes:

Classification: the recognition that the object in the image is a cat.

Detection: where the object is (for instance, the cat is left of center).

Segmentation: mapping each object classified in the image down to its individual pixels.

Captioning: a description of what is in the image, such as a cat on the table next to flowers. This capability is named Auto-Alt Text after the alt-text W3C Web attribute used to describe images on Web pages for users with impaired vision.

All of these recognition features are demonstrated in the video below. The failure to recognize the fifth person in the video also demonstrates LeCun’s point that computer understanding of object permanence is still an open problem.
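
For a concrete sense of the first three capabilities, here is a short sketch using an off-the-shelf pretrained Mask R-CNN from torchvision (assuming torchvision 0.13 or later), not Facebook's internal models: a single forward pass yields class labels (classification), bounding boxes (detection), and per-pixel masks (segmentation). Captioning would require a separate sequence model.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Load a pretrained detection/segmentation model and run it on one image.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)            # stand-in for a real RGB photo tensor
with torch.no_grad():
    output = model([image])[0]

print(output["labels"][:5])                # classification: COCO class ids
print(output["boxes"][:5])                 # detection: where each object is
print(output["masks"].shape)               # segmentation: one mask per detection
```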

Since the Applied Computer Vision team's work, image recognition has moved to operations on a self-service platform called Lumos (the team no longer supervises it). Today the ML image recognition training model and other models are distributed throughout Facebook’s product development teams with the FBLearner Flow platform. FBLearner Flow is currently used by more than 40 product development teams within Facebook, including search, ads, and newsfeed, to train models created by FAIR and the Applied Machine Learning teams.

Building models is a specialized field that requires advanced mathematics training in probability, linear algebra, and ML theory—things most software developers have not studied. However, this does not prevent developers from training the models to perform specific functions, such as adding and training a new classifier that recognizes a new object category, like scuba divers, from a sample dataset of labeled scuba diver images. Once trained, the model and its metadata are available to the whole internal Facebook developer community.
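
As a hedged sketch of what "training a new classifier" can look like with standard open source tooling (FBLearner Flow's own APIs are not public), assuming a recent torchvision: reuse a pretrained image network as a frozen feature extractor and train only a small new head on labeled scuba-diver examples. The data tensors below are random stand-ins for a real labeled dataset.

```python
import torch
from torch import nn
from torchvision import models

# Reuse a pretrained network; train only a new two-class head (scuba diver or not).
backbone = models.resnet50(weights="DEFAULT")
for param in backbone.parameters():
    param.requires_grad = False                          # keep pretrained features frozen
backbone.fc = nn.Linear(backbone.fc.in_features, 2)      # new classifier head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-ins for batches from a labeled scuba-diver dataset.
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,))
for _ in range(5):
    optimizer.zero_grad()
    loss = loss_fn(backbone(images), labels)
    loss.backward()
    optimizer.step()
```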

Anecdotally, Facebook users have two present-day cases that prove image recognition works. The first is that violent, hateful, and pornographic images are rarely seen in users’ newsfeeds. In the past, users tagged these images as objectionable, and that info was funneled to the Protect and Care team. Images confirmed as objectionable were deleted by a team member. Then ML models were built to identify and delete these images. In 2015, the ML models examined and eliminated more of these images than people did. Now, the Protect and Care group independently creates new classifiers to identify new types of objectionable material and retrains the models to respond to it automatically.

The other user-facing example is the Memories that appear in the newsfeed—those montages that commonly appear for something like the anniversary of a friendship. The friendship relationships and images that Facebook's machine learning models infer for them tend to be accurate.

Recognizing video content with neural networks

While image recognition is thriving, video content recognition and its implementation are at an earlier stage of development. Greater accuracy in understanding videos is technically possible, but it's not feasible without improvements in infrastructure price-performance, in the algorithms, or both. As with most commercial applications, implementing ML models is a compromise between cost-effectiveness and speed versus the high accuracy demonstrated by researchers.

Still, FAIR and the Applied Computer Vision team have demonstrated video recognition of Facebook Live videos in real time. The video below shows ML segmenting the videos into channels such as fireworks, food, and cats, while also listing the probability that each classification is accurate.

Using Facebook Live, users and celebrities broadcast planned and spontaneous live video streams from their smartphone cameras into followers’ newsfeeds. The demo shows what might be possible when high-accuracy video classification models can process all the incoming video feeds. AI inference could rank the Live video streams, personalizing them for individual users’ newsfeeds and removing the latency of video publishing and distribution. The personalization of real-time reality video could be very compelling, again increasing the time that users spend in the Facebook app.

Video recognition with the same accuracy achieved with images remains an open problem. Research throughout the AI community has not yet found a common set of feature descriptors (essentially, small regions in a frame used to detect an object for classification) that works across a wide range of video types. With video, identification problems include action recognition, saliency (identifying the focus of a human viewer's attention), and the equivalent of image captioning (called video summarization).

Understanding video is important. To accelerate research and development in this area, Facebook works with academic and community researchers, licenses its video recognition software under open source terms, publishes some of its research, and holds workshops. (For instance, the company presented its work on large-scale image and video understanding at the Neural Information Processing Systems [NIPS] conference in Barcelona to stimulate more progress.)

Video recognition ML models have found other applications within Facebook. At Oculus Connect 2016, a prototype of the Santa Cruz VR headset was demonstrated with inside-out tracking built using video ML models. Tracking a user’s movement in reality and mapping the movement into the virtual reality is a very hard problem—especially if you want to do it without using lasers mounted on tripods, like the HTC Vive.

The models have also been applied to optimizing the compression of video posts, increasing replay quality while reducing the bandwidth needed to deliver them.

At the intersection of neural networks and infrastructure

The applications of neural networks in research and production pose different challenges. Running an inference model at super-low latency on tens of thousands of machines to accurately predict which stories a user will click on is different from proving, in published theoretical work, that a user’s response can be accurately predicted.

Academic research papers describe neural networks trained with large datasets that have standardized distributions and are shared by the very open and collaborative machine learning research community. But the gargantuan scale of Facebook’s Open Graph poses a challenge to applying this research. And it's another challenge entirely to achieve the similarly gargantuan scale of infrastructure needed to run inference for 1.7 billion individual users. As Hussein Mehanna, Facebook's director of applied machine learning, puts it, “Change your data sets, and you're almost talking about a completely different program.”

Working in the ads group in 2014, Mehanna produced ML results by predicting which ads any given user would click on. This was not a breakthrough by academic research standards, but running this prediction algorithm at Facebook’s scale was extraordinary.

Facebook’s data distribution was previously unfriendly to neural networks, so the data was preprocessed, which increased the accuracy of the prediction. But prediction accuracy is only part of the problem; making prediction work at scale with a low-latency user experience was a problem at the intersection of ML theory and infrastructure. The neural networks were simplified to one or two layers, and the software stack of the inference model was optimized with native code. Mehanna emphasized the tradeoff between results and the impact on Facebook’s platform: “Just adding another 5 percent of those machines is probably an order that would take Intel several months to fulfill.”
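
As a rough illustration of that kind of simplification (not Facebook's production stack, which the article does not detail): a deliberately tiny one-hidden-layer click-prediction network, traced with TorchScript so the trained model can be served from native code without the Python interpreter. The feature size and architecture are invented for the example.

```python
import torch
from torch import nn

class TinyClickModel(nn.Module):
    """Deliberately small network: one hidden layer, built for low-latency serving."""
    def __init__(self, n_features=64, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)                 # probability the user clicks the ad

model = TinyClickModel().eval()
scripted = torch.jit.trace(model, torch.randn(1, 64))
scripted.save("click_model.pt")            # loadable from C++ (libtorch) for serving

with torch.no_grad():
    p_click = scripted(torch.randn(32, 64))   # score a batch of candidate ads
```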

V1, the first production version of the ML prediction platform, produced better results for the ads group compared to non-ML methods. Mehanna gave the Applied Machine Learning group's accomplishment commercial context: “If you just lift your revenue by 1 percent, 2 percent, 3 percent, you increase your watch time by 1 percent, 2 percent, 3 percent," he said. "That creates a huge impact.”

Perhaps more important than the increase in revenue and newsfeed watch time, V1 proved to the many neural network skeptics in product groups that ML worked. Built as a platform, V1 was put to work in many places across the company in product groups such as newsfeed and search. After that initial push, 15 new models were delivered in the following quarter. Today, one in four Facebook developers in the product groups has used the V1 and successor V2 platform, and over a million models are tested per month.

The V1 platform enabled ML to spread outside of the ads group and was another signal to Zuckerberg and Schrep to increase investment in the AI pipeline. Optimizing the platform for learning increased the speed of the iterations needed to build and train ML models. Referring to the sometimes month-long ML model training runs that preceded the V1 platform, Mehanna said, "There is nothing more draining for a researcher to have an idea, implement it in a day, and then wait a month. That will kill them.”

The optimized inference is independent of the model, so it can be used with a growing number of FAIR and Applied Machine Learning models used by others at Facebook. FAIR and Applied Machine Learning have abstracted the complexity of machine learning into building blocks, in the same way electronic design software abstracts chip design. Modern electronic engineers designing new systems-on-a-chip (SoCs) do not have to understand the transistors, gates, and low-level device characteristics that have been abstracted away by the software tool makers. Instead, chip engineers design SoCs with high-level tools, simulate and test them with other high-level tools, and then submit the design to a separate team for production.

This is how the multi-stage AI pipeline from research to productization works. The Applied Machine Learning group builds models based on proven research to solve a general problem. The models are optimized to run on Facebook’s infrastructure with specialized ML technology and techniques, and then they're abstracted so that product-group developers can use them. Finally, the models are distributed to and applied in product groups with FBLearner Flow.

During our visit, Mehanna spoke frequently about taking research and converting it into these usable recipes. He summed up the impact of the abstracted ML platform across the company with a voice reminiscent of Chef Emeril. “Literally, people just need to turn on the crank and flip a switch," he said. "When they're happy, push a button and—BAM!—it's there. Like, they get it for free.”

Why Facebook AI innovation remains open

Most large companies have at least one vice president of innovation; LinkedIn lists 34 IBM vice presidents with innovation in their title. Facebook does not have one, because innovation is part of its engineering culture overall. Facebook’s innovation model can be distilled to an urgency to iterate and quantitatively demonstrate progress at regular intervals. New development projects can test with live data because a barrier was built to protect the user experience from experiments. The first half of the iconic Zuckerberg quote—“move fast and break things”—remains true. Only, today Facebook breaks far fewer things.

“And for seven years straight, the number one thing that worries me is slowing down,” said VP of Global Engineering and Infrastructure Jay Parikh.

The infrastructure, platform hardware, and platform software let developers move quickly. Facebook Live was released three months after it was prototyped at an internal hackathon. "Moving fast" is being applied to AI as the next platform with the same urgency; it's just being given a longer time horizon. That's because AI as a platform is immature compared to mobile as a platform. The promising research in real-time video content understanding, unsupervised learning, and reinforcement learning has to progress to production, and some open problems need solving. New hardware architectures still need to be designed, proven, and built.

Facebook is a member of a very small industry cadre that includes Google, IBM, and Microsoft. These organizations have deep expertise and have implemented ML at scale. Though these companies have enormous talent and resources, the community needs to collectively grow to speed up progress. All these companies license their software under open source terms, publish their research, speak at conferences, and support and collaborate with the university community. That collaboration is so vital that competitors Facebook and Google have researchers who co-publish papers together.

Openness is useful for engineering and research talent acquisition. Facebook’s platform attracts engineering candidates because they can build ML systems used by a billion people. But openness is even more important to research talent acquisition because published research papers are the currency by which researchers' careers are measured. Engineers cannot do their work quickly unless they can freely communicate with outside peers.

“No entity has a monopoly on good ideas; you have to be part of the larger community," said LeCun, the artificial intelligence research director for Facebook. "What attracts people is wonderful colleagues. The more influential people are in the lab, the more attractive it becomes to others. What's difficult when you start it up is to prime the pump; you have to attract a few people who will become attractors for the younger people. We got over that hump pretty quickly, which is wonderful.”

Facebook infrastructure is built on commodity x86 hardware. Parikh was instrumental in organizing the large infrastructure companies and suppliers such as AT&T, Goldman Sachs, Google, IBM, Intel, and Microsoft into an open source hardware community called the Open Compute Project. That group helped standardize computing and communications hardware that meets the very specific large-scale requirements of platform companies, allowing anyone to reduce data center capital and operating costs.

Last December, Facebook applied the open source hardware model to AI by releasing the specifications of its Big Sur AI compute server. Built from commodity parts and Nvidia GPUs, Big Sur is the first commodity AI compute server designed for large-scale production data center workloads, and it now represents 44 teraflops of ML compute capacity in Facebook's data centers.

Facebook and its open source partners want to influence the development of AI-optimized hardware for running inference on smartphones and in data centers, and to optimize infrastructure for the ML training stage. A novel AI chip that is 50 percent faster is only a partial solution, and perhaps a fleeting one, unless an ecosystem is built around it like those around the x86 and ARM architectures. So although the Facebook, Google, Microsoft, and IBM data centers represent a big business to hardware suppliers, Facebook wants to enable a larger community of successful ML developers to give Intel, Nvidia, and Qualcomm an incentive to optimize hardware for ML.

Joaquin Candela, an Applied Machine Learning engineering director, has a favorite metaphor for the speed of iteration, learning, and innovation being applied to Facebook's AI goals. He compares the current reality and Facebook's goals to the stability of a prop-driven airplane and the instability of an F16 fighter.

"If you cut the engine of a prop-driven plane, it will keep flying, but modern jet planes like an F16 are unstable," she said. "If you cut the engine you can't fly. You need both engines and a control system to turn an unstable system into a stable system. The reason you do it is that the speed of maneuverability is amazingly higher. You can do acrobatic maneuvers. You can do things that you could never do with a stable airplane."

After spending some time with Facebook’s AI engineering leads and management, the F16 metaphor feels apt. These individuals all deeply believe that slowing the pace of innovation and gliding on with today’s Facebook platform would eventually end the company's so far successful 12-year run. Instead, they must recreate Facebook to have human-like intelligence behind it, allowing for a more nimble and ultimately speedier experience. And such lofty goals require maximum thrust in three dimensions: research, production, and hardware infrastructure. "Hey Facebook, what's AI innovation look like?"

Steven Max Patterson (stevep2007 on Twitter) lives in Boston and San Francisco where he follows and writes about trends in AI, software development, computing platforms, mobile, IoT, and augmented and virtual reality. His writing is influenced by his 20 years' experience covering or working in the primordial ooze of tech startups. A journalist for four years with IDG, he has contributed to Ars Technica and publications such as Fast Company, TechCrunch, and Quartz.