Using Python's Pandas, NetworkX, and pyvis to understand and visualize companies within a directly connected LinkedIn network.
To understand and visualize the companies within my directly connected network on LinkedIn
- LinkedIn data sources — retrieving LinkedIn Network data from a “Get a copy of your data” CSV export
- Diving into the data — exploring, cleaning, and aggregating the data with Pandas
- Creating the network — creating a network graph using NetworkX
- Visualization — visualizing the network with pyvis
- Improving the output — cleaning up the network graph with additional filtering
Hover over the nodes for more details
Recently, I was exploring my LinkedIn network to see what some of my colleagues from high school and undergrad are currently up to.
As I was scrolling through the connections page, I noticed LinkedIn gives you options to filter and search with ease, but it doesn’t really provide tools to learn about your network as a whole.
So I decided to see if there was an easy way to export my network data to see what I could do with a few hours of exploring the data.
LinkedIn data sources
My first thought was to check out LinkedIn’s Developer API.
Something I do fairly frequently at my current job is integrating various 3rd-party REST APIs into our platform, so I wanted to see all the functionality and possibilities that this API would provide.
After reading through some documentation, I decided this wasn’t a direction I wanted to pursue. Most of their developer products require approval, so I decided to look into other options.
Another thought I had was to write a quick scraping script to pull down the HTML of my connections page and parse out names and companies, but I assumed there had to be a simpler way to get this data.
Finally, after a bit of research, I found that there are various “Get a copy of your data” reports that you can run within LinkedIn. In order to get to these reports, you can do the following:
- On the homepage toolbar, click the Me dropdown
- Under the Account section, click Settings & Privacy
- Click on Get a copy of your data, and you can view the various reports
- Select the reports you’re interested in; for this project, I just checked Connections
After requesting the report, it should only take a few minutes before you get an email saying your report is ready for export.
Diving into the data
To reiterate our goal, we want to get a broad understanding of the companies within the first layer of our network (direct connections). Now, let’s load up Python and learn more about this data in this CSV.
Reading in the data
Once the CSV is downloaded, we can open it up with Pandas and take a look (output will be commented below).
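A minimal sketch of this step, assuming the export has columns such as Company and Position (the column names and the sample rows here are assumptions, so check them against your own file). An in-memory string stands in for the downloaded CSV to keep the example self-contained.

```python
import io

import pandas as pd

# Stand-in for the exported file. The column names are assumed, and the rows
# are made-up placeholders, not real connections.
sample_csv = io.StringIO(
    "First Name,Last Name,Company,Position,Connected On\n"
    "Ada,Example,IBM,Software Engineer,01 Jan 2020\n"
    "Bob,Sample,,Student,15 Feb 2020\n"
    "Cy,Placeholder,Acme Corp,Data Analyst,03 Mar 2020\n"
)

# With the real export, pass the file path instead of the in-memory sample.
df = pd.read_csv(sample_csv)

print(df.shape)              # (number of rows, number of columns)
print(df.columns.tolist())   # column names found in the file
print(df["Company"].head())  # empty company fields are read in as NaN
```

Looking at the shape and the column list first is a cheap way to confirm the file loaded the way you expect before any cleaning.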
I won’t post the names of any individuals or full rows to respect the privacy of my connections, but when I searched through my Connections CSV, I noticed a few initial patterns that would help clean up the data.
Cleaning up the data
At first glance, the first thing I noticed was that some connections don’t list a current company, so let’s get rid of those.
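One way to sketch this, assuming the company lives in a column named Company (an assumption about the export) and using made-up sample rows:

```python
import pandas as pd

# Made-up sample rows; "Company" and "Position" are assumed column names.
df = pd.DataFrame({
    "Company": ["IBM", None, "Acme Corp", "IBM Global Solution Center"],
    "Position": ["Software Engineer", "Student", "Data Analyst", "Consultant"],
})

# Drop connections that do not list a company.
df = df.dropna(subset=["Company"])

# Sorting by company name puts similar spellings next to each other,
# which makes manual inspection easier.
df = df.sort_values("Company").reset_index(drop=True)

print(df["Company"].tolist())
```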
After sorting, another thing I noticed was that some of these company names belong to the same company, but the individuals wrote them differently.
An example of this is 'IBM Global Solution Center' and 'IBM'; for our purposes, these should both be classified as the same company.
Now, this solution is not perfect, but it will help draw out some similar companies. You should still run a manual inspection of the data (the IBM example I gave above is one that doesn’t show up in the fuzzy match results).
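As a rough sketch of what a fuzzy match can look like, here is one option using get_close_matches from Python's standard difflib module. This is an assumption about the approach rather than a definitive implementation; the helper name find_similar, the cutoff of 0.6, and the company names are all illustrative.

```python
from difflib import get_close_matches

# Made-up company names for illustration.
companies = [
    "Acme Corp",
    "Acme Corporation",
    "IBM",
    "IBM Global Solution Center",
    "Globex",
]


def find_similar(names, cutoff=0.6):
    """Map each name to the other names whose similarity ratio is >= cutoff."""
    matches = {}
    for name in names:
        # Exclude the name itself so it does not match trivially.
        others = [n for n in names if n != name]
        close = get_close_matches(name, others, n=5, cutoff=cutoff)
        if close:
            matches[name] = close
    return matches


similar = find_similar(companies)
for name, close in similar.items():
    print(name, "->", close)
```

With this sample, the two Acme spellings are paired, while 'IBM' and 'IBM Global Solution Center' are not, because the short name covers too little of the long one to reach the cutoff. That is why the manual pass is still worth doing.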
Based upon the results, let’s group together some of the companies that had matches.
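A minimal sketch of the grouping, assuming a similar_companies dictionary that maps each variant spelling to the name you want to keep. The entries below are illustrative, not the ones from my data.

```python
import pandas as pd

# Made-up sample data; "Company" is an assumed column name.
df = pd.DataFrame({
    "Company": ["Acme Corp", "Acme Corporation", "IBM", "IBM Global Solution Center"],
})

# Illustrative mapping: variant spelling -> canonical company name.
similar_companies = {
    "Acme Corporation": "Acme Corp",
    "IBM Global Solution Center": "IBM",
}

# Series.replace with a dict swaps whole values that match a key exactly.
df["Company"] = df["Company"].replace(similar_companies)

print(df["Company"].value_counts())
```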
The next thing you may have noticed is that in our similar_companies dictionary, we cleaned up a
To stay aligned with our goal, let’s drop these entries, as well as your current company.
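Dropping entries can be sketched with a simple exclusion list. The labels below are hypothetical placeholders; substitute whatever non-company labels show up in your data, plus your own employer.

```python
import pandas as pd

# Made-up sample data; "Company" is an assumed column name.
df = pd.DataFrame({
    "Company": ["IBM", "Some Non-Company Label", "My Current Company", "Acme Corp"],
})

# Hypothetical placeholders for the entries to exclude.
companies_to_drop = ["Some Non-Company Label", "My Current Company"]

# Keep only the rows whose company is not in the exclusion list.
df = df[~df["Company"].isin(companies_to_drop)].reset_index(drop=True)

print(df["Company"].tolist())
```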
Aggregating the data
Now that our data is cleaned up a bit, let’s aggregate and sum the number of connections for each of the companies.
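One way to sketch the aggregation is a groupby on the company column. The result is stored as df_company_counts; the count column name and the sample rows are assumptions.

```python
import pandas as pd

# Made-up sample data; "Company" is an assumed column name.
df = pd.DataFrame({
    "Company": ["IBM", "Acme Corp", "IBM", "Globex", "Acme Corp", "IBM"],
})

# Count the connections per company, largest first.
df_company_counts = (
    df.groupby("Company")
    .size()
    .reset_index(name="count")
    .sort_values("count", ascending=False)
    .reset_index(drop=True)
)

print(df_company_counts)
```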
Creating the network
We have the numbers we want for each company, so now let’s jump into using NetworkX to recreate a network.
The first step will be to initialize our graph, and add yourself as the central node, as it is your network.
Then, we’ll loop through our df_company_counts DataFrame and add each company as a node.
You’ll notice some HTML tags in the title below; these are just to make it more readable later.
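A minimal sketch of these steps, assuming a df_company_counts DataFrame with Company and count columns. The central node name "me", the size scaling factor, and the exact title layout are illustrative choices.

```python
import networkx as nx
import pandas as pd

# Made-up sample data; column names are assumptions.
df = pd.DataFrame({
    "Company": ["IBM", "IBM", "Acme Corp"],
    "Position": ["Software Engineer", "Consultant", "Data Analyst"],
})
df_company_counts = df.groupby("Company").size().reset_index(name="count")

# Initialize the graph and add yourself as the central node.
g = nx.Graph()
g.add_node("me")

for company, count in zip(df_company_counts["Company"], df_company_counts["count"]):
    count = int(count)  # plain int, so the attribute serializes cleanly later
    positions = sorted(set(df.loc[df["Company"] == company, "Position"]))

    # HTML tags in the title shape the hover text in the rendered graph.
    title = f"<b>{company}</b> ({count})<br>" + "<br>".join(positions)

    # Node size grows with the number of connections at the company.
    g.add_node(company, size=count * 5, title=title)
    g.add_edge("me", company)

print(g.number_of_nodes(), g.number_of_edges())
```

Every company hangs directly off the central node, which gives the star shape you would expect from a first-degree network.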
And just like that, we’ve created our network of connections.
Visualization
Our network graph is created, so let’s get into visualizing the network.
There are a few options for visualizing networks, including matplotlib.pyplot, but I found that pyvis was the easiest to use for several reasons:
- pyvis generates an HTML file
- Customization is made very easy
- The graph is interactive by default
Let’s look into generating this HTML file.
And it’s that simple! We specify a width and height, optional styling attributes, and then we can generate the network graph visual straight from what we created with NetworkX.
Now we can see the network we generated.
You can hover over each node to see the total number of connections who work at the respective company, followed by a list of the positions held by those connections.
As you can see, this is a bit hard to read since there are a lot of nodes. Try to imagine reading this with 1,000+ connections.
Improving the output
There are a few ways that our network could be narrowed down.
Being a Software Developer, the thought that first occurred to me was to try and dial in on tech-related companies through known position titles.
To do this, I thought of a list of buzzwords/common job titles that I’ve seen across LinkedIn, and filtered down the initial DataFrame.
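The filter can be sketched as a case-insensitive keyword match on the position title. The keyword list, the Position column name, and the df_tech variable are illustrative; tune the keywords to your own network.

```python
import re

import pandas as pd

# Made-up sample data; column names are assumptions.
df = pd.DataFrame({
    "Company": ["IBM", "Globex", "Acme Corp", "Initech", "Acme Corp"],
    "Position": [
        "Software Engineer",
        "Account Manager",
        "Senior Developer",
        None,
        "Data Analyst",
    ],
})

# Illustrative buzzwords and common job-title fragments.
keywords = ["engineer", "developer", "data", "software"]

# Build one regex that matches any keyword; escape in case of special characters.
pattern = "|".join(re.escape(k) for k in keywords)

# na=False treats missing positions as non-matches.
mask = df["Position"].str.contains(pattern, case=False, na=False)
df_tech = df[mask].reset_index(drop=True)

print(df_tech)
```

Substring matching is deliberately loose, so a keyword like "data" also catches titles such as "Database Administrator"; tighten the pattern if that produces too many false positives.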
Then, we go through the same process we did in previous sections of generating and displaying the graph.
Again, this is not perfect, but it’s a good starting point.
Now, let’s look at the updated results.
Much better! This is more readable and easier to interact with.
And just like that, we achieved our goal of gaining a broader understanding of the companies in our LinkedIn network.
Possible improvements for those interested
- Scraping the profile location of each of your connections to segment by location
- Compiling a list of companies you’d like to work for/are interested in and creating a filtering system
- Researching salary data for positions and gathering average pay by company