NASA's cloud computing odyssey: From Earth to Mars
Rohan Pearce | Computerworld
Credit: NASA/JPL-Caltech/Malin Space Science Systems
When NASA's Curiosity rover landed on Mars at the end of its 563 million kilometre journey, it was a triumph for engineering. And it was also a triumph for IT.
The anxiety of watching the rover's descent wasn't confined to rocket scientists at the Jet Propulsion Laboratory; it was shared by NASA software engineer Khawaja Shams, a member of JPL's Operations Planning Software Lab, who experienced what he describes as the toughest day in his career.
Shams has responsibility for the pipeline that makes sure that the data collected by Curiosity gets back to Earth where it can be used by scientists around the world. And in the lead-up to touchdown, he was responsible for building cloud architecture that could let millions of people round the world observe the historic landing, watching the data and images streamed back by Curiosity in real time.
Data gathered by Curiosity on Mars is sent to one of the orbiters around the planet, then from there to the Deep Space Network: a collection of satellite antennas 70 metres in diameter situated around the world, spaced 120 degrees apart so that the DSN can see in any direction at any time. From there, data is sent to JPL.
Within JPL there are always-on nodes that pre-process the data and upload it to an S3 bucket on Amazon's cloud. While the data is going into S3, NASA is provisioning EC2 nodes that get the data from Amazon's storage service and process it, then re-store the results in S3. From there, scientists around the world can download the data directly.
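The flow described above (upload raw data, have worker nodes process it, store the results back) can be sketched in a few lines of Python. This is only an illustration of the shape of such a pipeline, not JPL's code: the `Bucket` class is an in-memory stand-in for an S3 bucket, and the key prefixes, function names and "processing" steps are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Bucket:
    """In-memory stand-in for an S3 bucket (hypothetical, not the AWS API)."""
    objects: dict = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def get(self, key: str) -> bytes:
        return self.objects[key]

    def keys(self, prefix: str) -> list:
        return sorted(k for k in self.objects if k.startswith(prefix))


def upload_raw(bucket: Bucket, name: str, raw: bytes) -> str:
    """Step 1: an always-on node pre-processes raw data and uploads it."""
    key = f"raw/{name}"
    bucket.put(key, raw.strip())  # placeholder for real pre-processing
    return key


def process_pending(bucket: Bucket) -> list:
    """Step 2: worker nodes fetch raw objects, process them, store results."""
    produced = []
    for key in bucket.keys("raw/"):
        out_key = "processed/" + key[len("raw/"):]
        if out_key in bucket.objects:
            continue  # result already stored, nothing to do
        result = bucket.get(key).upper()  # placeholder for image processing
        bucket.put(out_key, result)
        produced.append(out_key)
    return produced


bucket = Bucket()
upload_raw(bucket, "sol0001.dat", b"  pixels  ")
new_keys = process_pending(bucket)
```

Because the storage layer sits between the two steps, the number of worker nodes can grow or shrink without the uploading side needing to know.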
On the day Curiosity touched down, the system succeeded magnificently by most measures. But getting there was a long journey, and it's the story of how NASA moved from relying on what it could provide on-premise to using the cloud to do the heavy lifting. The implications go far beyond Curiosity: cloud computing is a key factor in letting NASA cope with the onslaught of data its missions, both in the Solar System and on Earth, are increasingly delivering to the world's scientists.
But years before Curiosity tweeted that it had landed safely on the fourth planet from the Sun, JPL's cloud odyssey had a somewhat less auspicious start, at least from Shams' perspective. And it began with procurement gone wrong.
32GB of what?
"In 2008," Shams told Computerworld Australia, "I had to order a set of servers — actually for the Curiosity mission — and I started off by sending an email to my favourite IT person saying, 'Hey I need these machines.' And then I got an email back saying 'Okay, how much RAM do you need?' 'Okay well here's how much RAM I need.' How much CPU do you need? How much hard drive do you need? What operating system?
"It was a lot of emails that transpired in this process and it turned out — I'd asked, I think for 32 gigs of RAM and they thought I needed 32 gigs of hard drive space and we got the wrong order.
"So talking to Tom [Sodastrom, NASA IT CTO at JPL] and Jim [Rinaldi], our CIO, we went over and we did a retrospective on what this procedure was like."
JPL's 'Seven Minutes of Terror' and the cloud
During the opening keynote at Amazon Web Services' re:Invent conference in Las Vegas, Shams took to the stage to recount the toughest night of his career: The landing of NASA's Curiosity rover on Mars, and his role in making sure it would be shared with the world in real time.
"I'm going to take us back today to the night of August 5th, 2012, when 350 million miles away the Curiosity Mars rover is about to complete its journey to the surface of Mars. Engineers at NASA have worked tirelessly for nearly a decade to make this day a possibility. They have executed countless simulations, tested every individual component. But the system as a whole is about to be tested for the first time.
"We can feel success on the horizon, we can taste the victory. But the smallest mistake can take it all away. Tonight, everything must be perfect. Millions of people are watching this around the world, relying on us to safely land Curiosity. We have deployed Web experiences and video streaming solutions on the cloud to share tonight with the whole world. Tonight, everything must be perfect.
"And tonight we either succeed remarkably or fail spectacularly. We eagerly look towards the sky with anticipation because we know that it all comes down to the next seven minutes: The Seven Minutes of Terror.
"Having developed the data processing pipeline for Curiosity, tonight is the toughest test of my career. As soon as Curiosity lands, bits will flow through my pipeline to be processed, stored then distributed as images to the mission operators and scientists, as well as the rest of the world.
"My heart is pounding as I steal a glance at the AWS health dashboard to ensure that all the services that Curiosity relies today on are up and running. EC2 — check. S3 — check. SWF — check. VPC, ELB, autoscaling, Route53, SimpleDB, RDS — check. CloudFront, CloudFormation, CloudWatch - check.
"It's show time.
"The world watches on the video streaming solution deployed on the cloud. A solution that could withstand the failure of over a dozen data centres and still deliver the live video stream from JPL to you. A solution that could scale over a terabit per second, but one that only requires us to provision and pay for exactly as much capacity as we need. A solution that is possible today only due to the invention of cloud computing.
"The excitement builds around the world as Curiosity enters the Martian atmosphere and its temperature rapidly rises to over 1600 degrees. I take a glance out our CloudWatch console to realise that our streaming solution has gone up to over 40Gbps. I calmly launch another CloudFormation stack to increase our capacity and register it to a Route53 domain.
"Curiosity deploys its parachute and Mark II and our streaming solution exceeds 70Gbps. I calmly launch another CloudFormation stack. Curiosity jettisons its parachute and activates its jetpack as it approaches the surface.
"Back on earth we get a surprising call. The main JPL website, still running on traditional infrastructure, is crumbling under the crushing load of millions of excited users. We act quickly and route all JPL website traffic to the CloudFront-based Mars site. Our traffic exceeds over 100Gbps but the cloud hasn't even started to sweat. No. The sweat is only on my forehead as I anxiously await the next bandwidth milestone so I can add more capacity.
"We are a scant 21 metres from the surface of Mars as Polyphony, our data processing pipeline, provisions EC2 in anticipation of the bits [of data] coming to Earth. The Sky Crane lowers the rover down to the surface of Mars. Curiosity has landed on Mars.
"The whole world celebrates with us and we are just as thrilled. We have made every JPLer, no every NASA employee, no every American proud. We have made humanity proud for we have landed a one ton mobile lab on the surface of Mars. Mission accomplished.
"Or is it?
"The landing engineers have been successful tonight, but my test has just begun. The bits flow from Curiosity to the Odyssey orbiter to the Deep Space Network and then finally into JPL. An intricate orchestration process co-ordinated by Simple Workflow magically causes nodes at JPL to upload these bits onto an S3 bucket. EC2 nodes pick up these bits and process them into images.
"These elastically provisioned EC2 nodes will process images rapidly and within seconds of bits arriving to Earth, the first Mars images will be on your iPad, Android and laptop screens around the world. On August 5th we had two successes: we landed Curiosity on Mars and we shared the first pictures from Mars with you in real time. You saw the first pictures from Mars at the same time that we did.
That retrospective on the bungled server order became one of JPL's earliest discussions about cloud computing. "Jim Rinaldi came up on the spot with the vision of—you, Khawaja, should not have to buy a machine, you should have to rent them. And you should be able to get them on demand. That was the vision that he painted," Shams says.
"So we will provision instead of purchase. And that was I think the key vision that enabled us to adopt cloud computing because that's effectively what it allowed us to do.
"It allowed us to come in and instead of me saying, 'Well I need these machines' and then communicating with a bunch of different people down the pipeline and then waiting for the order to be shipped, and then waiting for somebody to install it physically, and then someone to install the operating system, only to find out it was wrong. We just come in and say, 'Well I need five machines on the cloud with this image and if it's wrong, well, okay, that's fine let me. Just make two other clicks and correct that mistake. So that was one of the earliest conversations."
Too... much... data
Cloud was always, in retrospect, going to be a natural fit for JPL. The organisation has around 5000 people and, Shams says, "we are busier than we ever have been before". "We've got missions that are going all over the Solar System, we have landed missions on Mars recently, and we have been to every planet in the Solar System. And recently we've started having much more focus on earth science. And the problem is that with earth science we have the opportunity to get a lot more data."
"So we're really busy, we've got all these missions going all over the Solar System and beyond, and our data centres are getting filled to capacity and our data needs are growing faster than ever," Shams explains.
"In come the earth science missions — and over the next couple of years we're going to be getting two terabytes of data per day from some of these missions. And this is a scale that is orders of magnitude bigger than what we've ever seen before.
"So we're running out of space, we're running out of capacity. We want to be able to use the physical space that we have at our laboratory for people and for science rather than running infrastructure. We're also noticing that cloud vendors are starting to offer these capabilities and infrastructure at a much lower cost. And add to that the elasticity that is available to us in the cloud diminishes our cost even more."
This combination of on-premise infrastructure reaching its limits, an onslaught of data and the limited timeline of some of JPL's missions — some only last for six months — made cloud an inviting option for the organisation. When a mission is underway, JPL "really process the data as much as we can for those six months [for example], and then that infrastructure is going to go to waste after that. So with cloud computing, we're able to say, 'Okay well we're just going to pay for it while we use it and turn it off when we're done.'"
Using cloud for computationally intensive processing mitigates the risks associated with capital investment, Shams adds. Before employing cloud computing, the IT infrastructure for a mission would be purchased a year or more in advance: It would be tested and then put in change control configuration and not used until the actual mission took place.
"Now there's a risk here that if the launch is unsuccessful we have made all this investment and this infrastructure's not going to be used," Shams says.
"The other risk is that we have paid too much for this infrastructure because we bought it a year in advance. So now move forward four years — cloud computing. Let's say there's a hundred machines that we needed. We have the opportunity to bring up the hundred machines [in the cloud], test everything worked, shut them down, don't pay for anything and if the mission is successful — which almost every time it is — we just launch those machines, just as if we left them, and start paying for them immediately."
Getting to the point where Shams could ramp up instances in AWS for the data pipeline from Curiosity to Earth was not straightforward, however.
There are a lot of government regulations NASA has to abide by. The good news, Shams says, is that all the downlink data the space agency collects can be released into the public domain, which was an important factor that let JPL experiment with cloud computing while navigating these regulations.
NASA has worked closely with cloud vendors to find ways of using cloud that don't fall afoul of the law; for example Amazon established its GovCloud in the Oregon Region, which offers the same security as AWS public cloud but is compliant with ITAR — the International Traffic in Arms Regulations — which governs the movement of sensitive data: Data won't be shifted offshore and the facilities are staffed only by "US persons" (a category that includes US citizens and certain permanent US residents, among others).
And while it may have only taken a bungled procurement order and a conversation to set things in motion, making it happen took a lot longer. For example, the first contract that JPL signed with AWS took eight months.
"At that time, cloud vendors weren't ready for the enterprise," Shams says. "They were still built for the start-up with a credit card, or Joe Smith with a credit card. Having to deal with the enterprise was something they were learning first hand."
Dealing with a government agency involves an even steeper learning curve, due to the regulations that must be abided by when dealing with an organisation like NASA. But that eight months wasn't wasted: not only was the contract signed in the end, but Soderstrom conducted a retrospective to identify blockages in the process that could be removed to ensure smoother sailing as NASA continued to embrace cloud.
"The reason why it took us eight months is because there was a long communication chain," Shams says.
The convoluted communications chain involved Shams communicating with NASA's procurement team, which would have to talk to NASA's legal team as well as Amazon's sales team. On top of that, JPL's security team was also an obvious stakeholder.
On the Amazon side, they also had their security team and compliance and legal teams. The upshot was a communication pipeline ripe with potential for blockages and channelled through Shams and the procurement team.
"You can imagine," Shams says, "IT security tells me something, I tell it to the procurement guys, who will then talk to the sales guy, who will then talk to IT security guy, who will then come back to them, then go to procurement who will then go to Khawaja who will then go to security. This is completely inefficient."
In the wake of the lengthy negotiations over the first contract, Tom Soderstrom, NASA IT CTO at JPL, figured out what Shams describes as a "magic formula": Bringing stakeholders on each side face to face to discuss the issues involved.
When Amazon established its GovCloud Region, NASA needed to sign a new contract with AWS. This time, it only took a month. This idea of approaching a cloud vendor as more of a partner has continued, with peers on each side having meetings with each other to foster a collaborative relationship.
Shams believes this approach, of bringing together peers across the customer and the vendor, is applicable for large enterprises beyond NASA. It stops things getting lost in translation as communications go up and down the chain, and removes bottlenecks.
He cites as an example the collaboration between JPL's security team and AWS's. "The IT security team's goal has been identified as, well, you're going to make cloud computing secure or tell me why it can't be done. So it's now part of their pay cheque ... to go figure this out. So they have to go to talk to the IT security team [at AWS].
"The IT security team [at AWS] has to make cloud computing secure anyway at Amazon because that's their bread and butter. Now you've got two teams with the same motive, right, and they're going to talk to each other without any bottlenecks. So I do think for any large enterprise, this is a magic formula: to identify the peers and to talk to them as much as you possibly can."
Cloud means enterprises need to move beyond a routine customer-vendor relationship, requiring a deeper level of collaboration. Shams and Soderstrom sit on cloud vendors' customer advisory boards, letting them have input into the direction of product development.
"We provide insights into how cloud is being used in our organisation and what are the features that are missing, or what are the key enablers that are missing, that will allow us to adopt cloud more effectively," Shams says.
NASA also has relationships with some cloud vendors' internal product teams. "They'll bounce some ideas off of us, and that helps us influence them, but it also helps us understand where things are going and it helps us get guidance to ensure that we're using the best practices."
The cloud changes IT
Shams says that Soderstrom's approach to IT at JPL has been a key factor in the shift to cloud. His approach and that of the Office of the CIO is to treat IT as an enabler of new capabilities, rather than a hindrance. Shams cites the example of cloud security — "IT security would be very cautious of, 'Hey! You mean you're going to put your data on whose servers?!'"
"So what Tom's doing with them, for instance," Shams says, "is he's telling them don't say no — say how? Or say, why not? So rather than 'Hey guys, don't do this', it's more like 'How do I do it more securely, more effectively and if really I can't, if I'm really doing something very stupid, tell me what else I can do to still meet my requirements.'"
JPL is still adjusting to the impact that cloud has on how IT operates, for example the shift from capital expenditure to operational expenditure. At first this shift made some project managers uneasy, Shams says, because of the impression that this would make their budgets more unpredictable. However, "they quickly realised they have more control of the budget now all of a sudden because they can control how much capacity they can invest in."
"Typically if you bought a bunch of hardware before the mission started producing data you might not even realise that you've bought too much or too little," Shams says.
"So now, based on the amount of money you have available and based on the changing requirements, you have the agility to redefine how much infrastructure you're actually going to use and pay for.
"We are noticing so, for instance, for some of the projects I'm working on, when I'm putting the budgets [together] I'm actually putting in operational cost — this is how much money we're intending to spend every month — rather than at the start of the project I'll go buy these 40 terabytes of hard drives and set them up accordingly."
"It's literally: as the project progresses we'll continue to pay a monthly [fee]," Shams says.
Although it was AWS's cloud that did the heavy lifting for Shams' data pipeline, NASA has a multi-cloud approach.
The agency has an internal document that sets out its strategy for cloud computing, as identified by Soderstrom. The idea boils down to 'Use the right cloud for the right job'.
Soderstrom developed a tool called CASM: the Cloud Applicability Suitability Model. It's a simple questionnaire that NASA staff can fill in, answering questions about their project's needs — latency requirements, regulatory requirements and the data-to-compute ratio, for example.
"Based on these questions, we assess whether their data belongs in our private cloud or in the public cloud, and within the public cloud, it's 'Which public cloud?'" Shams says.
"And within the private cloud, does it belong in the supercomputer centre, does it belong in our regular data centre, does it belong in a virtualized environment with VMware, or does it belong in an OpenStack environment... things like that."
"So these questionnaires help us assess where to place the environment," he says.
"But there's no edict from NASA or anybody that says 'use Amazon' or 'use Microsoft' or 'use Google'. It's literally about having the right cloud for the right job, and it's literally on a per application basis that we make this decision."
When Curiosity landed on red soil
When Curiosity ended one phase of its mission by hitting red soil and started the main, most important part of its job, JPL witnessed firsthand how the elasticity of cloud can, in some cases, be a game changer. The Mars Science Laboratory website, which ran on AWS, was a "good foray" into cloud computing when it comes to Web hosting, Shams says, surviving the onslaught of massive amounts of traffic.
In the wake of its success — JPL's regular website went down during the Curiosity landing due to the volume of traffic, so it was redirected to the MSL site, which remained up — websites across NASA are beginning to be migrated to the cloud. Unlike a service such as Netflix — also a heavy user of Amazon's public cloud — which knows that it's going to have massive traffic on a daily basis, NASA's traffic tends to spike and ebb, depending on public excitement about different missions.
"We get a lot of attention and then it will die down, and then we land a rover, get a lot of attention, and it dies down," Shams says. "It's a very elastic environment that's basically built for cloud computing."
With a service like S3, NASA can store data and then not worry about going in and adding more servers as interest ramps up, because it will scale automatically behind the scenes. Shams adds that the storage service also means that backups aren't a concern because data will be automatically replicated across multiple data centres, and daemons will regularly check the integrity of data to make sure nothing has been lost, re-replicating it as needed.
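The idea of checking integrity and re-replicating can be illustrated with checksums. The sketch below is a toy model, not a description of how S3 works internally: the replica list, the function names and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Fingerprint a blob so that corruption can be detected later."""
    return hashlib.sha256(data).hexdigest()


def repair_replicas(replicas: list, expected: str) -> int:
    """Overwrite corrupted copies with an intact one; return how many were fixed."""
    good = next((r for r in replicas if checksum(r) == expected), None)
    if good is None:
        raise ValueError("no intact replica left to copy from")
    repaired = 0
    for i, replica in enumerate(replicas):
        if checksum(replica) != expected:
            replicas[i] = good  # re-replicate from the intact copy
            repaired += 1
    return repaired


original = b"first image from Mars"
expected = checksum(original)

# Three copies, one of which has been silently corrupted.
replicas = [original, b"first image from Mar\x00", original]
fixed = repair_replicas(replicas, expected)
```

Running a check like this on a schedule is what lets a storage service notice a damaged copy before the intact ones are lost as well.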
Another advantage of using cloud for Web hosting is security: holes have to be opened in firewalls to allow page requests in and data out, which can create a vulnerability. If a machine is running on NASA's network, there's the risk that the compromising of a Web server might open up the rest of the organisation's network to attack. With cloud, you can put a Web server in an isolated environment, "so somebody penetrates your website — that's all they've gotten into."
Despite wariness over cloud computing, the security team at JPL has also found other advantages over on-premise hardware. Cloud computing can be used to combat uncontrolled IT sprawl and give security far more oversight. "Cloud computing is way more secure than me setting up a server at a desk under my cubicle," Shams says.
"We will see a major shift toward cloud computing for websites across NASA," Shams says. It's an "ongoing process" that's being endorsed within JPL. It's "being enabled by our Office of the CIO, and they're doing everything they can to make it happen as quickly as possible."
When Curiosity touched down on Mars, there were two successes, not one, Shams said during a presentation at AWS's re:Invent conference. The rover was landed successfully, and NASA was able to share the moment with the rest of the world. And while the magnitude of the latter feat may go unnoticed by some, its implications for IT are significant, to say the least.
"Mission accomplished," Shams told the conference.
Not always smooth sailing
Although Curiosity may have provided a highly visible success story, the path to the cloud has not always been smooth sailing for JPL. One of the early stress-inducing incidents encountered was an apparent attack on their cloud infrastructure.
JPL uses Amazon Web Services' Virtual Private Cloud (VPC) offering, which lets an enterprise cordon off a set of EC2 instances and connect them to its internal network over a VPN, treating them as extensions of internal infrastructure. JPL set up a VPC and started running instances in it, and one morning at 6am, Shams says, they got a phone call saying that a node in the VPC was under attack.
"We all panic and we're looking around to see what might be going on, who might be on to us and who's trying to compromise our system and if there's an internal breach... And three hours later we're still trying to figure out what actually happened," he recalls.
"It turns out that our IT security team, the same people who monitor the alarm that went off, also have a system that does penetration testing of all of our systems. And because our machines were in the VPC, they also went and said 'Oh hey, you're a Web server I'm going to start throwing all this traffic at you and see if you succumb to one of my SQL injections for instance'."
"So it was a self-inflicted alarm," Shams says. However, it was actually "really good", he adds, "because, one, it helped us ensure that our testing infrastructure that we have for internal resources is still working and, two, when things do go wrong in the VPC we still figure [it] out, just like our infrastructure. It was an attestation to the fact that we're able to leverage our internal infrastructure to protect the resources in the cloud, just like they would the resources on-premise."
Another lesson learned the hard way, albeit one that involved fewer panicked phone calls, was the importance of collaboration with cloud vendors: approaching the relationship as more of a partnership than a pure vendor-customer relationship.
"With cloud, there are so many features that are coming in so fast," Shams says, and when JPL started its cloud journey circa 2008-2010, Shams was "in a very exciting development role". "I would be developing services to build on top of cloud capabilities and I'd develop a service, it would have two or three more bugs left, and we'd hear that Amazon was about to release this other service that's going to do everything that we've written, except a lot better. And at that point we would throw away our code and say, 'Okay we'll just use this'."
The lesson, Shams says, was to be open with the vendor about what you're trying to build and what's missing in their services, "so that we can actually stay on the same page as to what might be coming up and what we should build and what we shouldn't build. Kind of understand, principally, what the vendor is interested in building and what they're interested in letting others build on top of that."
Rohan Pearce travelled to Amazon re:Invent as a guest of Amazon.