The Panama Papers and How Open Source Technologies Changed the World
Do you have $1 million of spare change in your pocket? Well, if you are like me or billions of other people, then you probably don’t. And let’s not limit ourselves to individuals: many companies don’t have access to that kind of cash either. Aside from the obvious “cash is king” motto, why am I asking you this?
Because we live in a digital age of what some call the information explosion, where the amount of data generated every day was unthinkable a few years ago. I read that in 2015, about 1 trillion photos were taken, and approximately 500 hours of video were uploaded to YouTube every minute. Social media has taken the way we connect to a whole different level, with about 1.65 billion Facebook users. And they are not alone; there are many other communities with over 100 million users.
How many searches does Google serve in a year? It is estimated that it will be about 1.2 trillion searches this year.
We live in a connected world, so much so that Cisco has called 2016 the Zettabyte Era. They estimated that “annual global IP traffic will pass the zettabyte (ZB; 1,000 exabytes [EB]) threshold by the end of 2016, and will reach 2.3 ZB per year by 2020.”
The numbers are astounding. But from all this data, can you extract meaningful information?
Data vs. Information
There are many sources of data, both structured and unstructured, that can be used for analysis. Common sources for companies include their own logs, databases, file servers, and many other repositories. When data is processed, interpreted, organized, structured, and/or presented so as to make it meaningful or useful, it is called information. Information provides context for data.
But getting this information can be challenging. Let me tell you about one in particular.
Mossack Fonseca a.k.a. The Panama Papers
In April 2016, the world was hit with the news by the International Consortium of Investigative Journalists (ICIJ) of a torrential leak of information from a Panamanian law firm that handled the affairs of some of the most influential and powerful people in the world. They used Mossack Fonseca1 for all kinds of purposes, legal or illegal, to help them manage or hide money offshore in tax havens (or from their significant others). For many from the 1%, life will never be the same.
With 11.5 million documents that span over 40 years, this leak contains details of thousands of opaque offshore companies, trusts, and foundations used by the law firm’s clients. But 11.5 million documents is not something that can be taken lightly if you plan to make this data into meaningful information. In total we are talking about close to 2.6 terabytes of information.
If it is hard to imagine how much data that is, you may remember how Steve Jobs pitched the original iPod as “1,000 songs in your pocket.” Well, if Steve Jobs were presenting this leak, he would probably describe 2.6 terabytes as “over half a million songs in your pocket!” That is a lot of bytes!
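A quick back-of-the-envelope check of that comparison. The ~5 MB per song figure is my own assumption (roughly a 4-minute MP3 at a high bitrate), not something from the leak itself:

```python
# Rough sanity check of the "songs in your pocket" comparison.
# Assumption: an average song takes about 5 MB (my figure, not the article's).
LEAK_BYTES = 2_600_000_000_000   # 2.6 terabytes
SONG_BYTES = 5_000_000           # ~5 MB per song

songs = LEAK_BYTES // SONG_BYTES
print(songs)  # 520000 songs, i.e. over half a million
```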
A Word about Security for Dealing with the Risks of Data Leaks
I heard someone say the other day that companies with valuable data can be classified into two groups: those that found out they were hacked, and those that were hacked but don’t know it happened yet. This may be too extreme, but given today’s increasing number of cyber threats and data leaks across multiple business sectors, it is never a bad idea to be on the safe side. You usually think it is not going to happen to you, but you never know!
So as a few pointers, it is recommended to:
- Have isolated and independent environments for Development, QA, Staging, and Production
- Work on your permissions and schema, and don’t assign more permissions than necessary (document-level security is worth reading about)
- Turn off what you are not using (when talking about ports or services running by default)
- Centralize your users directory for security and governance
- Avoid using default usernames and passwords
Dealing with Massive Amounts of Data
You are a journalist, and out of the blue you get 11.5 million documents that can change the world in a very significant way, because they contain information about how the rich avoid taxes and hide their wealth, among other transactions. Well, life is not so easy that a hard drive with once-in-a-lifetime data magically shows up at a journalist’s door, but let’s make this a very lucky journalist for the sake of the argument about analyzing the data.
What do you do? How do you get started? Do you buy a computer with a big hard drive? And the next step? Install Adobe Acrobat Reader to check PDFs? Get an image viewer that can open TIFFs (Tag Image File Format, a format commonly used for transferring scanned documents), and install email clients and tools for multiple database formats?
Well… tough luck, I would say. At a very conservative 10 minutes per document, it would take a person over 210 years to read them all without sleeping or eating. And lawyers don’t usually create short documents, so I would aim for a still conservative half a millennium to really go over them.
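The arithmetic behind that estimate, as a quick sketch:

```python
# How long would one person need to read every leaked document,
# at a conservative 10 minutes per document, around the clock?
DOCS = 11_500_000
MINUTES_PER_DOC = 10

total_minutes = DOCS * MINUTES_PER_DOC
years = total_minutes / (60 * 24 * 365)  # minutes in a year
print(round(years))  # about 219 years
```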
And this is why I truly believe that, sometimes not having a million dollar budget may be a good thing. Because...
Scarcity is the Mother of Invention
Well, maybe that's not Plato's “Necessity is the mother of invention” that you're familiar with. But it suits my purpose here.
Many people and companies don’t have a million dollars to pay for the hardware and commercial licenses required to automate the processing of millions of documents.
Years ago, particularly when we all relied almost exclusively on relational databases, having to manage such amounts of data as the Panama Papers required pretty heavy machinery - or huge and expensive machines - along with wallet-thinning licenses2.
But now people and companies don’t need to spend large sums of money any more to carry out an analysis like this one. In this age of Hadoop, Solr, Elastic Stack, virtualization (or "the cloud"), and more, we can create very scalable discovery and analytics applications that also work with our limited budget.
Complex Data Forensics Made Possible with Open Source
The Panama Papers investigation is an amazing example of storing, searching, and analyzing large amounts of data using open source technologies. Here's a high-level overview of what the ICIJ did:
- They had to reverse engineer the databases from Mossack Fonseca, which they obtained from several private sources, a little bit at a time. This is a typical case of old technology with unpatched software creating security vulnerabilities. It is critical to take security seriously, or you may find yourself in a tough spot.
- Once the ICIJ had access to the documents in their multiple different formats, they had to find a way to extract the information from them. This was tricky because in many cases the documents consisted mostly of unstructured images that needed to be processed by OCR (Optical Character Recognition, which takes an image and converts it into text). For this purpose they used Apache Tika, in conjunction with Tesseract, to extract text and metadata from over a thousand different file formats.
- Once they had extracted the information from the documents, they relied on Apache Solr, an open source NoSQL search engine, to store and index the raw data for search. An index was created so that journalists could search for names, addresses, items related to different countries, and the links between them. On a personal note, Solr is my favorite search engine, considering the capabilities and features that an open source search engine can provide.
- They then used Project Blacklight as a user interface. This is also an open source discovery platform used primarily by librarians, content curators, and others responsible for managing digital collections.
- The ICIJ also created a document processing chain called Extract, which they open sourced as well.
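The extract-then-index steps above can be sketched in miniature. To be clear, this is a toy illustration and not the ICIJ's actual code: a trivial "extractor" stands in for Tika/Tesseract, and a tiny in-memory inverted index stands in for Solr, just to show the shape of the pipeline:

```python
# Toy sketch of an extract-and-index pipeline (not the ICIJ's code).
import re
from collections import defaultdict

def extract_text(doc_bytes: bytes) -> str:
    """Stand-in for Tika/Tesseract: a real pipeline would OCR images here."""
    return doc_bytes.decode("utf-8", errors="ignore")

class TinyIndex:
    """Minimal inverted index: term -> set of document ids, as Solr does at scale."""
    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        result = self.postings[terms[0]].copy()
        for term in terms[1:]:
            result &= self.postings[term]
        return result

index = TinyIndex()
index.add("doc1", extract_text(b"Offshore trust registered in Panama"))
index.add("doc2", extract_text(b"Shell company with offshore accounts"))
print(index.search("offshore panama"))  # {'doc1'}
```

The real system differs in every detail (Solr handles tokenization, ranking, faceting, and scale), but the core idea is the same: extract text once, build an index, and every later search is fast.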
There were other tools used by the ICIJ for visualization and other purposes. While a couple of them were proprietary, the most powerful and important parts required for automated data extraction were built on open source technologies.
Open Source: Going Beyond Cost Savings
Coming back to the benefits of open source, let’s imagine for a second that money did grow on trees and commercial solutions were easily available to everyone. Even then, I think open source has a competitive edge that commercial solutions do not.
Two words: collective intelligence.
Some of the brightest minds on this planet have contributed to creating open source frameworks like Solr, Hadoop, the Elastic Stack, and many more. They are motivated individuals with a lot of experience, and anyone who has a bright idea and is able to implement it can help move a project to new levels. And there is probably a lot less bureaucracy and red tape within such projects.
If there were no open source, it is highly unlikely that an analysis like the Panama Papers would have been carried out, much less made available to the public. Go on, search it yourself!
If you want to read more about the whole process that the International Consortium of Investigative Journalists took to analyze the Panama Papers mainly through the use of open source technologies, I recommend that you read: Wrangling 2.6TB of data: The people and the technology behind the Panama Papers.
Wrapping It All Up
In summary, the world has changed. Long gone are the days when only those with thick wallets could afford the commercial software licenses required to search and analyze data stored on disks so big that a new “something-ta-byte” had to be coined to describe their size.
Now any average Joe with good programming skills and insight can download an open source project, configure it, spin up a few machines in a virtualized environment, move a bunch of data into a blob or a bucket, search, and find undiscovered insights.
It is a new world, a data-rich, analytics-driven world that is finally accessible to us all.
1I am not implying that this law firm was used only for illegal purposes; however, in a great many cases it was. That is what makes the Panama Papers so important for some countries, entities, and people to analyze and understand.
2I am not indicating that we should let go altogether of relational databases, commercial data management systems, or even huge machines. There are cases where we want to work with all of them. But in the age of open source, we now have alternatives.