Your average security operations center is a very busy place. Analysts sit in rows, staring intently at computer monitors. Cybersecurity alerts tick past onscreen—an average of 10,000 each day. Somehow, the analysts must decide, in seconds, which of these are false alarms, and which might be the next Target hack. Which should be ignored, and which should send them running to the phone to wake up the CIO in the middle of the night.
It’s a difficult job.
The alerts are false alarms the vast majority of the time. Cybersecurity tools have been notoriously bad at separating the signal from the noise. That’s no surprise, since the malware used by hackers is constantly mutating and evolving, just like a living thing. The static signatures that antivirus software uses to detect it are outdated almost as soon as they are released.
The problem is that this knowledge can cause a kind of numbness—and make tech teams slow to act when cybersecurity software does uncover a real threat (a problem that may have contributed to the Target debacle).
Luckily, a few government labs are experimenting with a new approach—one that starts with taking the “living” nature of malware a little more seriously. Meet the new generation of biology-inspired cybersecurity.
Sequencing Malware DNA
The big problem with signature-based threat detection is that even tiny mutations in malware can fool it. Hackers can repackage the same code again and again with only a few small tweaks to change its signature. The process can even be automated. This makes hacking computers cheap, fast, and easy—much more so than defending them.
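To see why a tiny tweak is enough, consider a minimal sketch in Python. It assumes, purely for illustration, that a "signature" is a SHA-256 hash of the whole file; real antivirus signatures are more varied than this, and the payload strings and function names below are made up.

```python
import hashlib


def signature(payload: bytes) -> str:
    """Return a SHA-256 digest, standing in for a static signature (an assumption)."""
    return hashlib.sha256(payload).hexdigest()


# Hypothetical payloads: the second differs from the first by one trailing space.
original = b"open_file(); connect_server(); send_data();"
tweaked = b"open_file(); connect_server(); send_data(); "

# The defender's list of known-bad signatures contains only the original.
known_bad = {signature(original)}


def is_flagged(payload: bytes) -> bool:
    """Exact-match lookup: any change to the bytes yields a different digest."""
    return signature(payload) in known_bad


print(is_flagged(original))  # the known sample is caught
print(is_flagged(tweaked))   # the one-byte variant slips through
```

The behavior of the two payloads is identical, but the exact-match lookup treats them as unrelated files.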
Margaret Lospinuso, a researcher at Johns Hopkins University’s Applied Physics Laboratory (JHUAPL), was pondering this problem a few years ago when she had a brainstorm. A computer scientist with a lifelong interest in biology, she was aware that programs for matching DNA sequences often had to ignore small discrepancies like this, too. What if she could create a kind of DNA for malware—and then train a computer to read it?
DNA maps out plans for complex proteins using only four letters. But CodeDNA uses a much longer alphabet to represent computer code. Each chunk of code is assigned a “letter” depending on its function. For example, a letter A might represent code that opens a certain type of file, while a letter B might represent code that opens a server connection. Once a suspicious computer program is translated into this type of “DNA,” Lospinuso’s software can then compare it to the DNA of known malware to see if there are similarities.
It’s a “lossy technique,” says Lospinuso—some of the detail gets scrubbed out in translation. However, that loss of detail makes it easier for CodeDNA to identify similarities between different samples of code, Lospinuso says. “Up close, a stealth bomber and a jumbo jet look pretty different. But in the distance, where details are indistinct, they both just look like planes.”
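The idea can be sketched in a few lines of Python. This is not CodeDNA's actual implementation, which the article does not describe in detail. The alphabet, the operation names, and the use of the standard library's difflib for comparison are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical alphabet: several concrete operations collapse to one letter,
# which is where the "lossy" part comes in.
LETTERS = {
    "open_text_file": "A",
    "open_binary_file": "A",
    "open_connection": "B",
    "write_registry": "C",
    "encrypt_data": "D",
    "send_data": "E",
}


def encode(operations):
    """Translate a list of operation names into a letter string ('X' if unknown)."""
    return "".join(LETTERS.get(op, "X") for op in operations)


def similarity(a, b):
    """Return a 0..1 similarity ratio between two letter strings."""
    return SequenceMatcher(None, a, b).ratio()


# Made-up samples: a known malware family, a tweaked variant, and a benign program.
known = ["open_text_file", "encrypt_data", "open_connection", "send_data", "write_registry"]
variant = ["open_binary_file", "sleep", "encrypt_data", "open_connection", "send_data", "write_registry"]
benign = ["open_text_file", "write_registry"]

print(encode(known), encode(variant), encode(benign))
print(similarity(encode(known), encode(variant)))
print(similarity(encode(known), encode(benign)))
```

Because both file-opening operations map to the same letter, the variant's swap goes unnoticed, and its one inserted step only slightly lowers the score. The benign sample, which shares just two letters with the known family, scores much lower.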
The resulting technique drastically cuts down on the time analysts need to sort and categorize data. According to one commercial cybersecurity analyst, the similarities CodeDNA found in two minutes would have saved him two weeks of hard work. But the biggest advantage of CodeDNA is that it won’t be fooled by small tweaks to existing code. Instead of simply repackaging old malware, hackers have to build new versions from scratch if they want to escape detection. That makes hacking vastly more time-consuming, expensive, and difficult, which is exactly how it should be.
How to Build a Cyber-Protein
Lospinuso’s team built CodeDNA’s software from scratch, too; it’s different from standard DNA-matching software, even though they implement the same basic techniques. Not so with MLSTONES, a technology developed at Pacific Northwest National Laboratory (PNNL). MLSTONES is essentially a tricked-out version of pBLAST, a public-source software program for deciphering protein sequences. Proteins are constructed from combinations of 20 amino acids, giving their “alphabet” more complexity than DNA’s 4-letter one. “That’s ideal for modeling computer code,” said project lead Elena Peterson.
MLSTONES originally had nothing to do with cybersecurity. It started out as an attempt to speed up pBLAST itself using high-performance computing techniques. “Then we started to think: what if the thing we were analyzing wasn’t a protein, but something else?” Peterson said.
The MLSTONES team got a bit of encouragement early on when their algorithm successfully categorized a previously unknown virus that standard anti-virus software couldn’t identify. “When we presented [it] to US-CERT, the United States Computer Emergency Readiness Team, they confirmed it was a previously unidentified variant of a Trojan. They even let us name it,” Peterson said. “That was the tipping point for us to continue our research.”
Peterson says she is proud of how close MLSTONES remains to its bioinformatics roots. The final version of the program still uses the same database search algorithm that is at the heart of pBLAST, but strips out some chemistry and biology bias in the pBLAST software. “If the letter A means something in chemistry, it has to not mean that anymore,” Peterson says. This agnostic approach also makes MLSTONES extremely flexible, so it can be adapted to uses beyond just tracking malware. A version called LINEBACKER, for instance, applies similar techniques to identify abnormal patterns in network traffic, another key indicator of cyber threats.
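What a chemistry-free score might look like can be sketched as follows. Protein search tools typically score alignments with substitution matrices tuned to amino-acid chemistry; the sketch below instead uses a flat match/mismatch score, so no letter carries any built-in meaning. It uses the classic Smith-Waterman local alignment algorithm rather than BLAST's faster heuristic, and it is not MLSTONES code. The function name and scoring values are assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between two sequences.

    Scoring is domain-agnostic: identical symbols earn `match`, different
    symbols cost `mismatch`, and skipping a symbol costs `gap`.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Works on any sequence of comparable symbols: letters, tokens, event codes.
print(local_alignment_score("ADBEC", "ADBEC"))
print(local_alignment_score("ADBEC", "AXDBEC"))
print(local_alignment_score("ADBEC", "ZZZ"))
```

Since the function only tests symbols for equality, the same code could compare lists of network events as easily as strings of letters, which hints at why an agnostic design adapts to new uses.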
A Solution to Mutant Malware
Cyberattacks are growing faster, cheaper, and more sophisticated. But all too often, the software that stops them isn’t. To secure our data and defend our networks, we need security solutions that adapt as fast as threats do, catching mutated malware that most current methods would miss. The biology-based approach of CodeDNA and MLSTONES isn’t just a step in the right direction here; it’s a huge leap. And with luck, these tools will soon be available to protect the networks we all rely upon.
With contribution by Nathalie Lagerfeld of Hippo Reads.