Thursday, October 30, 2014

An offline StackOverflow clone

My current organization operates in private networks (no connectivity with the internet AT ALL). Beyond the regular arguments of "I don't have Facebook!" / "I can't read the news every 5 minutes!", there are some other, more serious problems: developing software is really hard with no access to the internet. Just think about it: when you write code, how many times a day do you search for info or problem solutions on the web? My guess is A LOT.

Developers in my organization are struggling with this issue, and I've seen the pain in their eyes when they are forced to look for an unoccupied internet computer. They have one internet computer per team, at best, and even these lose connectivity from time to time.

This issue is a major productivity killer that no one had seriously addressed before. So, a few months ago I woke up in the morning and thought to myself: "Why shouldn't I bring the internet to them?". I figured that the most cost-effective thing to do was to bring some kind of clone of StackOverflow into the network. It is a single source of data that every developer uses a lot.

Luckily, as it turns out, StackOverflow publishes its data as XML dumps every 3 months! It was a real pain getting the data in (it's 14GB compressed), but I finally got it into the network.
Now I had the data, but I was still missing a GUI to display it. I couldn't just download StackOverflow's site.

So I built a nice little GUI using Play Framework 2, AngularJS and Twitter Bootstrap. That took about a week (maybe some day I will publish it, although it's not hard to build yourself).

I still had to find a solution for the data storage. The final architecture of the app was a web interface talking directly to an Elasticsearch node holding all the data. Getting the data in was not too fun - I wrote some Python scripts that took the XMLs, transformed them to JSONs (since Elasticsearch is a JSON document store), and sent them (using cURL) to the Elasticsearch node. The uploading process took a while because of the large data volumes.
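The XML-to-JSON step described above can be sketched roughly as follows. This is not the original script, only a minimal illustration under a few assumptions: the dump file stores each post as a `<row .../>` element whose attributes (Id, Title, Body and so on) hold the data, the index name "stackoverflow" and type name "post" are made up for the example, and the output is formatted for Elasticsearch's bulk API rather than sent one document at a time.

```python
import io
import json
import xml.etree.ElementTree as ET


def rows_to_bulk_lines(xml_source, index="stackoverflow", doc_type="post"):
    """Yield bulk-API lines (action line, then document line) for each <row>.

    The index and doc_type names are hypothetical placeholders.
    """
    # iterparse streams the file, so a multi-GB dump never has to fit in memory.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "row":
            continue
        doc = dict(elem.attrib)  # every attribute becomes a JSON field (as a string)
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc["Id"]}}
        yield json.dumps(action)
        yield json.dumps(doc)
        elem.clear()  # drop the element's contents once it has been converted


if __name__ == "__main__":
    sample = io.BytesIO(
        b'<posts>'
        b'<row Id="1" PostTypeId="1" Title="Hello" Body="How do I ...?" />'
        b'<row Id="2" PostTypeId="2" ParentId="1" Body="Like this." />'
        b'</posts>'
    )
    for line in rows_to_bulk_lines(sample):
        print(line)
```

The printed lines can be written to a file in chunks and posted to the node's `_bulk` endpoint with cURL, which usually needs far fewer HTTP requests than uploading documents one by one. Note that all values stay strings here; numeric fields would need an explicit mapping or a conversion step.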

The application has now been running for several months, and my organization has slightly happier developers (~600). :)

The project (named XXXOverflow - XXX being the name of the organization) apparently inspired some other developers, who suggested all kinds of interesting ideas for the application. In the future we plan to add more search sources to the application and make it a highly customized little Google for the devs in my organization.

Other organizations, which are in the same position (disconnected from the internet), have asked me to give them the code of the app and help them implement it in their own networks.

Some technical notes
Elasticsearch is an open source search engine. I used it to store the data for the app. ES is really great and it made my life so much easier. Its default search algorithm searches through 70GB of textual data in a split second, which is pretty amazing to me. However, when I tried to customize the ranking algorithm using its Query DSL, the search slowed down considerably (I used ES 1.0.0).
Also, I had (and still have) issues with failing shards. It's probably the amount of data, but occasionally a search just brings down shards, and this happens fairly often. I hope these issues will be addressed in future releases.
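For readers unfamiliar with the Query DSL, a ranking customization of the kind mentioned above might look something like the sketch below. It is only an illustration, not the query used in the app: it assumes a numeric field named ViewCount and a text field named Title, and it uses the `function_score` query with `field_value_factor`, which as far as I know is only available in releases later than ES 1.0.0 (on 1.0.0 a `script_score` function would play the same role).

```python
import json


def boosted_query(text, field="ViewCount", factor=1.0):
    """Build a search body that multiplies the text score by log(1 + factor * field).

    The field names are assumptions; the popularity field must be mapped as a number.
    """
    return {
        "query": {
            "function_score": {
                # The ordinary full-text match that produces the base score.
                "query": {"match": {"Title": text}},
                # Scale the score by a dampened function of the popularity field.
                "field_value_factor": {
                    "field": field,
                    "modifier": "log1p",
                    "factor": factor,
                },
                "boost_mode": "multiply",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(boosted_query("null pointer exception"), indent=2))
```

A scoring function like this has to be evaluated for every matching document, which is one plausible reason such queries cost more than the default ranking on a large index.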

2 comments:

  1. Hello Jenia,
    As I'm in a very similar situation to you, my team did a very similar thing, but we are using StackDump (http://stackapps.com/questions/3610/stackdump-an-offline-browser-for-stackexchange-sites).

    We were happy with how this simplified things for us, but encountered an issue, mainly with the search engine. StackDump uses Apache Solr, and as you can see here https://bitbucket.org/samuel.lai/stackdump/issue/13/stackdump-search-engine-tuning we are not the only ones who feel the search capabilities are sub-par.

    I was wondering if you can shed some light as to why you chose to implement using elasticsearch and not take StackDump for a spin (assuming you were aware of it). Also, if you made some headway with your implementation, that would also be interesting.

    Thanks, Y

    1. Hi Yossi,

      Actually, I was not aware of StackDump! :)
      This seems very nice.
      This solution is not good enough for me, though, because I want more control over the application. For example, we are now enhancing the app to support Q&A by developers in the organization (not only reading Q&A). This would be a pain to do with StackDump.
      Elasticsearch seemed like a good solution as it has a strong query API. I tried to play around with it by giving higher scores to questions with more views, answers and such. I really did get more relevant results, but it came with too big a performance penalty, so I dropped it for a while. It could probably be solved if I spread the data across several nodes.

      Thanks for the comment :)
