Thursday, October 30, 2014

An offline StackOverflow clone

My current organization operates in private networks (no connectivity with the internet AT ALL). Beyond the regular arguments of "I don't have Facebook!" / "I can't read the news every 5 minutes!", there are some other, more serious problems: developing software is really hard with no access to the internet. Just think about it: when you write code, how many times a day do you search for info or problem solutions on the web? My guess is A LOT.

Developers in my organization are struggling with this issue, and I've seen the pain in their eyes when they are forced to go look for an unoccupied internet computer. They have one internet computer per team, at best, and these, too, lose connectivity from time to time.

This issue is a major productivity killer, and no one had seriously addressed it before. So, a few months ago I woke up in the morning and thought to myself: "Why shouldn't I bring the internet to them?". I figured that the most cost-effective thing to do was to bring some kind of StackOverflow clone into the network. It is a single source of data that every developer uses a lot.

Luckily, as it turns out, StackOverflow publishes its data as XMLs every 3 months! It was a real pain getting the data in (it's 14GB compressed), but I finally got it into the network.
Now I had the data, but I was still missing a GUI to display it. I can't just download StackOverflow's site.

So I built a nice little GUI using Play Framework 2, AngularJS and Twitter Bootstrap. That took about a week (maybe some day I will publish it, although it's not hard to build yourself).

I still had to find a solution for the data storage. The final architecture of the app was a web interface talking directly to an Elasticsearch node holding all the data. Getting the data in was not too fun - I wrote some Python scripts that took the XMLs, transformed them to JSONs (since Elasticsearch is a JSON document store), and sent them (using cURL) to the Elasticsearch node. The uploading process took a while because of the large data volumes.
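To give a rough idea, here is a minimal sketch of that kind of transformation (not my actual scripts). It assumes the dump's Posts.xml layout (one <row> element per post, with the fields stored as XML attributes) and writes a file in Elasticsearch's bulk-API format that can then be uploaded with cURL; the index and type names are just placeholders.

```python
# Sketch only: stream Posts.xml from the dump and emit an Elasticsearch
# bulk-API file. Assumes one <row .../> element per post, with the post
# fields (Id, Title, Body, Tags, Score, ...) as XML attributes.
import json
import xml.etree.ElementTree as ET

def posts_to_bulk(xml_path, out_path, index="stackoverflow", doc_type="post"):
    """Convert Posts.xml into a newline-delimited bulk upload file."""
    with open(out_path, "w") as out:
        # iterparse keeps memory usage flat even for multi-GB dump files
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag != "row":
                continue
            doc = dict(elem.attrib)  # the post fields, as strings
            action = {"index": {"_index": index, "_type": doc_type,
                                "_id": doc.get("Id")}}
            out.write(json.dumps(action) + "\n")
            out.write(json.dumps(doc) + "\n")
            elem.clear()  # free the element we just processed

if __name__ == "__main__":
    posts_to_bulk("Posts.xml", "posts.bulk")
    # The resulting file can then be uploaded with cURL, e.g.:
    #   curl -XPOST localhost:9200/_bulk --data-binary @posts.bulk
```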

The application has been running for several months now, and my organization has slightly happier developers (~600). :)

The project (named XXXOverflow - XXX being the name of the organization) apparently inspired some other developers, who suggested all kinds of interesting ideas for the application. In the future we plan to expand the application's search sources and make it a highly customized little Google for the devs in my organization.

Other organizations in the same position (disconnected from the internet) have asked me for the app's code and for help getting it running in their own networks.

Some technical notes
Elasticsearch is an open source search engine. I used it to store the data for the app. ES is really great and it made my life so much easier. Its default search algorithm searches through 70GB of textual data in a split second, which is pretty amazing to me. However, when I tried to customize the ranking algorithm using their Query DSL, it really slowed down the search (I used ES 1.0.0).
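For illustration, this is roughly the kind of ranking tweak I mean: a function_score query that boosts posts by their vote Score. It is not the exact query from the app - the index, type, and field names are placeholders, it assumes Score is mapped as a numeric field, and the field_value_factor function may require a newer ES release than 1.0.0.

```python
# Illustrative Query DSL ranking customization (placeholder names, not the
# query used in the app). Boosts results by their vote count so highly
# scored posts rank above barely-relevant ones.
import json
import urllib.request

query = {
    "query": {
        "function_score": {
            "query": {"match": {"Body": "python list comprehension"}},
            # Scale the relevance score by log(1 + Score); assumes the
            # Score field is indexed as a number.
            "field_value_factor": {"field": "Score", "modifier": "log1p"},
        }
    },
    "size": 10,
}

req = urllib.request.Request(
    "http://localhost:9200/stackoverflow/post/_search",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for hit in json.loads(resp.read())["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("Title"))
```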
I also had (and still have) issues with failing shards. It's probably the amount of data, but occasional searches just bring down shards - and not infrequently. I hope these issues will be addressed in future releases.