The Story

Applying for an MS in the US was quite a daunting task for me back in 2015. It wasn’t exactly clear how to go about choosing universities. Seniors from college were the main source of insights, and sometimes siblings or relatives. They would suggest applying to a mix of ambitious, moderate, and safe universities, while also keeping tuition fees and opportunities for RA/TA positions in mind. There was, however, limited information on past admits and rejects. I felt that was a critical piece of information, because just by looking at the data you can answer several questions:
  • Will my GRE score be sufficient to get an admit from University X? Have they ever admitted someone with such scores?
  • Is there a university that has traditionally liked picking students from my college?
  • Which universities are best suited for a GPA between, say, 7 and 8?
  • I have a few years of work experience; which colleges give weight to that?
Answering these questions tells you not just which universities to apply to, but also which ones to NOT apply to. More than 100,000 students apply for an MS in the US every year. That is a ton of data points that could be very useful for answering the questions above.
So a friend and I set out to gather data points from across the internet. At first, we compiled a big list of Facebook pages that linked to Google Sheets and Excel files floating around. Next, we created some Google Forms and passed them around to folks currently doing their MS, with a few friends helping to spread the word. After this, we had a decently sized database. But we didn’t want to stop there. A lot of websites out there contained profiles of students, with information on their admits and rejects, ranging from forums where the data was highly unformatted to some well-formatted sources. It was clear that manual effort wasn’t going to scale, so we had to write a lot of scripts to fetch this data.
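To give a flavor of what those scripts looked like: for a well-formatted source, scraping mostly means parsing profile tables out of HTML. The actual sites and their markup aren't described here, so the page structure below is purely hypothetical; a minimal sketch using only the standard library might be:

```python
from html.parser import HTMLParser

# Hypothetical profile table; the real sources had their own markup.
# This sketch assumes rows of <td> cells holding (university, GRE, decision).
SAMPLE_PAGE = """
<table>
  <tr><td>Arizona State University</td><td>318</td><td>Admit</td></tr>
  <tr><td>Arizona State University</td><td>305</td><td>Reject</td></tr>
</table>
"""

class ProfileTableParser(HTMLParser):
    """Collects each <tr> as a list of its <td> texts."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and self._row is not None:
            self._row.append(data.strip())

parser = ProfileTableParser()
parser.feed(SAMPLE_PAGE)
records = [{"university": u, "gre": int(g), "decision": d}
           for u, g, d in parser.rows]
```

A forum post, by contrast, would need messy regex and heuristics rather than a tidy table parser, which is where most of the scripting effort goes.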
Fast forward 3 weeks, and we were sitting on top of 350,000+ admits and rejects. When we started, we were super skeptical of getting anywhere close to this number. Now, we were excited to share this data with the public! :)
But there were some key problems:
  1. Unclean and noisy data!
  2. How do we share this data? On a Google Sheet? Pass around an Excel file?

Problem 1 — Unclean & Noisy Data

The data from all the sources was unclean. There were hundreds of variants of the same university name, e.g. ASU, Asu, arizona state university, Arizona State, etc. Some universities even had multiple abbreviations.
We even had data from the days when the GRE was scored out of 1600 rather than 340. On top of that, grade points came on different scales: 4, 10, and 100.
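Normalizing grade points onto one scale is the easy half of this. As a minimal sketch (the thresholds below are my assumption, not necessarily how we did it; old-format GRE scores are left out because mapping 1600-scale to 340-scale properly needs ETS's official concordance tables):

```python
def normalize_gpa(gpa: float) -> float:
    """Map a GPA reported on a 4-, 10-, or 100-point scale onto 4.0.

    Assumes the scale can be inferred from the value itself: anything
    above 10 is treated as a 100-point score, anything above 4 as a
    10-point score. Ambiguous low values (e.g. 3.5 on a 10-point
    scale) would need the scale recorded explicitly.
    """
    if gpa > 10:
        return round(gpa / 100 * 4, 2)
    if gpa > 4:
        return round(gpa / 10 * 4, 2)
    return gpa
```

So an 8.0/10 becomes 3.2 and an 85/100 becomes 3.4, letting records from different grading systems sit in one comparable column.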
Undergrad college names were a total mess: “BITS Pilani”, “B.I.T.S Pilani”, “Birla Institute of Technology and Science”. Besides, there were different sister campuses, and in India there were also affiliated colleges. There were just too many of them!
We once again had to resort to writing code to deal with the scale. We used multiple techniques (out of scope for this article) that helped us massively clean the data at scale. It was an annoying process at first, but some interesting solutions turned it into a fun task.
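Since the techniques themselves are out of scope here, purely as an illustration of the general idea: one common approach to name deduplication combines a hand-curated alias map with fuzzy matching against a canonical list. The names and aliases below are examples, not our actual mapping:

```python
import difflib
import string

CANONICAL = [
    "Arizona State University",
    "Birla Institute of Technology and Science",
]

# Hand-curated aliases for forms that fuzzy matching alone won't catch.
ALIASES = {
    "asu": "Arizona State University",
    "bits pilani": "Birla Institute of Technology and Science",
}

def _key(name: str) -> str:
    # Lowercase and strip punctuation, so "B.I.T.S Pilani" -> "bits pilani".
    table = str.maketrans("", "", string.punctuation)
    return name.translate(table).lower().strip()

def canonicalize(name: str):
    """Map a raw name to its canonical form, or None if unrecognized."""
    key = _key(name)
    if key in ALIASES:
        return ALIASES[key]
    match = difflib.get_close_matches(
        key, [_key(c) for c in CANONICAL], n=1, cutoff=0.6)
    if match:
        return next(c for c in CANONICAL if _key(c) == match[0])
    return None  # flag for manual review
```

Anything that falls through both layers gets queued for a human look, and each manual decision feeds back into the alias map, so the manual pile shrinks over time.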

Problem 2 — Data Presentation

Now that we had all the data we needed, we weren’t sure how to share it for others to use! We tried putting it all in a Google Sheet, but the data was too big for it: the sheet would often buckle under the load and crash. We decided to take the website route instead, since it gave us a lot more control over the user interface. If you think about it, users are not interested in all 350K+ data points; for any given question a user had, there were usually only a few hundred data points of interest. We wanted to make it a smooth experience, and hence took the help of a UX freelancer. Since we had gone through this problem ourselves, we had a couple of ideas on how we wanted the tool to be.

Our Solution:

We decided on a couple of key principles while building the tool:
  1. An easy-to-remember URL
  2. Clean data with limited redundancy in naming
  3. Quick access to the data (no sign-up required)
  4. Free, with no ads and an unobtrusive interface
  5. Easy filtering of the data to answer questions
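The filtering principle is easy to picture: once the records are clean, each of the questions from the start of this post reduces to a simple predicate over a few fields. A toy sketch (the field names and sample records are invented for illustration, not the tool's actual schema):

```python
# Toy records standing in for the cleaned dataset.
RECORDS = [
    {"university": "Arizona State University", "gre": 318, "gpa": 3.4, "decision": "Admit"},
    {"university": "Arizona State University", "gre": 305, "gpa": 3.0, "decision": "Reject"},
    {"university": "Arizona State University", "gre": 312, "gpa": 3.1, "decision": "Admit"},
]

def at_least(records, **minimums):
    """Return records whose numeric fields meet simple minimum thresholds."""
    return [r for r in records
            if all(r.get(field, 0) >= minimum
                   for field, minimum in minimums.items())]

# "Has this university ever admitted someone with a GRE of 310 or below?"
low_gre_admits = [r for r in RECORDS
                  if r["decision"] == "Admit" and r["gre"] <= 310]
```

In the toy data above the answer is no, which is exactly the kind of signal that tells an applicant whether a university is ambitious, moderate, or safe for their profile.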
Our final solution is :)
