Search for Code in Pagure

I was trying to get into code search in Pagure, and what I landed on turned out to be really interesting and amazing. If you want a code search mechanism on your website, you need to look into something called indexing.

Think of how search works on e-commerce sites like Amazon, or on Google; with Google it is web scraping followed by indexing of the results. The point is the response time: when you search for something, you get results in a fraction of a second.

Now imagine going through such a huge database in that little time. However much computing power you have, what you really need is a clever way to manage the data. I was watching a CS50 video in which Mark Zuckerberg talked about how he managed his DB: his first architectural decision was to have a separate MySQL instance for each school, which reduced the time taken to search and to form relations.

That was a really clever move.

While I was searching for ways to add a code search feature to Pagure, I landed on a Python-based library called Whoosh. It blew me away with the way it does its searches and maintains its database. I also looked through a lot of tutorials to understand how indexing works.

I landed on Building Search Engines using Python and liked the way the author explained things like N-grams, edge N-grams, and how different files store the indexed words along with their frequency and the path to the documents. I am yet to compare git grep vs. Whoosh.
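
To make the two terms concrete, here is a minimal sketch of N-grams and edge N-grams in plain Python. The function names are my own for illustration and are not taken from Whoosh or from the tutorial.

```python
def ngrams(word, n):
    """Return every contiguous substring of length n."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def edge_ngrams(word, min_len=1):
    """Return the prefixes of the word, shortest first."""
    return [word[:i] for i in range(min_len, len(word) + 1)]


print(ngrams("pagure", 3))       # ['pag', 'agu', 'gur', 'ure']
print(edge_ngrams("pagure", 2))  # ['pa', 'pag', 'pagu', 'pagur', 'pagure']
```

Indexing edge N-grams is what lets a search box match "pag" against "pagure" while the user is still typing.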

While going through Whoosh I saw that it has performance issues, and I started thinking that if search is not fast enough then there is no point in having it. I looked into HyperKitty and figured out that they had been using Whoosh before; I assumed that they either suffered from performance issues too, or moved because Haystack is available for Django. As the name suggests, you can use it to find the needle in the haystack.

Yeah, you are right: I started looking for a Haystack equivalent for Flask and found Flask-whoosh. Again, the drawback was that it is meant for searching through databases and not files, whereas my application needs to search through files on the system.

Then came Xapian. There are a lot of core concepts involved in using or writing utilities with Xapian. I went through the Xapian documentation; it covers a lot of concepts and gives examples of them, but the bottleneck still persists when it comes to file searching and performance. I found a nice application, Building Document Search, which gives me some hope, but a lot of work is still required there.

At a really high level, the whole concept comes down to two things:

  1. Indexing
  2. Search

Indexing

Indexing means going through each file or record and building something called an index, which holds the search words with the stop words filtered out. The result is a new database that stores the frequency and location of each word. This is the most time-consuming step.

Search

Searching consists of forming a query, looking it up in the database built during indexing, and returning the documents in which the word or phrase is found.
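
The two steps can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Whoosh or Xapian implement it; the stop word list, the file names and their contents are all made up.

```python
import re
from collections import defaultdict

# Illustrative stop word list; real engines ship much longer ones.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "to", "and"}


def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(documents):
    """Indexing: map each word to {path: [positions among kept words]}."""
    index = defaultdict(dict)
    for path, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(path, []).append(position)
    return index


def search(index, query):
    """Search: return paths containing every query word, most hits first."""
    words = tokenize(query)
    if not words:
        return []
    paths = set(index.get(words[0], {}))
    for word in words[1:]:
        paths &= set(index.get(word, {}))

    def hits(path):
        return sum(len(index[w][path]) for w in words)

    return sorted(paths, key=lambda p: (-hits(p), p))


# Hypothetical files standing in for a repository.
docs = {
    "pagure/app.py": "def search_code(query): return the index lookup for query",
    "pagure/forms.py": "class ProjectForm: the private checkbox",
    "README": "Pagure is a git forge; search the code with an index",
}
index = build_index(docs)
print(search(index, "query"))             # ['pagure/app.py']
print(search(index, "private checkbox"))  # ['pagure/forms.py']
```

The expensive part is `build_index`, which reads every file once; `search` only touches the entries for the query words, which is why the lookup stays fast as the number of files grows.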

If you need to see a demo.

Till then, Happy Coding and Bingo!

Setting Postgres For Pagure

I normally use SQLite for development because of how easy it is to see your database file, browse through it, and edit it. That said, SQLite is good for development but not for production; one of the foremost reasons is that it does not cope well with queries coming from multiple threads at once.

The other disadvantage is that SQLite, by default, doesn't give a damn if you have dangling foreign key references. I ran into this problem recently. The way we identify a forked project in Pagure is by parent_id: if a project has a parent_id it's a fork, and if it doesn't, it's not a fork.

This worked quite well until we recently found a flaw: what if the main project is deleted? The expected behavior is that the fork should stay accessible, but because of the parent_id dependency the fork became inaccessible. This happened because when you delete the main repo, the FK reference in the fork gets modified and becomes NULL.

This creates an anomaly, because the project is no longer a fork; it is now a main repo and is treated like one, which leads to a lot of repo path chaos. Postgres comes into the picture because I was able to have a dangling FK reference in SQLite, but when I tried to do the same thing in Postgres it threw an integrity error.
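
The difference is easy to reproduce with the sqlite3 module from the standard library. The sketch below is not Pagure's schema: the table is a made-up two-column stand-in, and SQLite's own `PRAGMA foreign_keys = ON` is used to imitate the strictness Postgres applies out of the box.

```python
import sqlite3


def make_db(enforce_fk):
    """Create an in-memory DB with a main project and one fork of it."""
    conn = sqlite3.connect(":memory:")
    if enforce_fk:
        # SQLite only checks foreign keys when this pragma is switched on.
        conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE projects ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT,"
        " parent_id INTEGER REFERENCES projects(id))"
    )
    conn.execute("INSERT INTO projects VALUES (1, 'main', NULL)")
    conn.execute("INSERT INTO projects VALUES (2, 'fork', 1)")
    return conn


def delete_parent(conn):
    """Try to delete the main project and report what happened."""
    try:
        conn.execute("DELETE FROM projects WHERE id = 1")
        return "deleted"
    except sqlite3.IntegrityError:
        return "integrity error"


lenient = make_db(enforce_fk=False)
print(delete_parent(lenient))  # deleted
# The fork still points at project 1, which no longer exists.
print(lenient.execute("SELECT parent_id FROM projects WHERE id = 2").fetchone())

strict = make_db(enforce_fk=True)
print(delete_parent(strict))   # integrity error
```

Developing against the strict behavior surfaces this class of bug before it reaches a production database.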

Pagure uses SQLAlchemy as the ORM, so I just need to set up Postgres on my system and provide the URL in pagure/default_config.py, and the ORM magic makes all the queries just work.
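
As a rough sketch, the change amounts to one line in the config. I am assuming here that the setting is called DB_URL; the user name, password and host are placeholders to replace with your own.

```python
# Assumption: the setting is named DB_URL. All values are placeholders.
# SQLAlchemy URL format: dialect://user:password@host/database
DB_URL = "postgresql://pagure_user:secret@localhost/pagure"
```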

Setting up Postgres is really easy thanks to the amazing documentation on the Fedora wiki. The only slightly tricky part is that you need to be superuser before you switch to the postgres user: first sudo su, then su - postgres. Then follow the steps in the wiki to create a user and a database named pagure.

Private Repo on Pagure

One of my proposals for Pagure was to have private repositories. Private repositories are basically repositories that are visible only to you and the people you give permission to.

To be honest, I thought it would require a few tweaks and I would be good to go, but that wasn't the case, and the insights I got working on this feature were amazing. I worked on this project in primarily three stages. Each stage was a challenge of its own.

The three stages were:

  1. UI
  2. Database Query
  3. Tests

UI

The UI was supposed to have a checkbox saying “Private”; when a user ticks it, the existing project becomes private, or the new project is private from the moment it is created.

Achieving this was a joy ride: with Flask I just needed to make changes in the form and the settings page UI, and voilà!

I introduced a column Private in the project table and that was pretty much it. Nice and beautiful.

Database Query

This was the most challenging part for me. Since I had not worked much with databases, this was out of my comfort zone, and I actually went back to my database basics to check whether I was doing things right.

In Pagure we use SQLAlchemy as the ORM layer. ORM stands for object-relational mapper; it is basically used to map database tables to an object/class model of the data. SQLAlchemy is a really powerful tool.

While figuring out how to get all the admins who can view private projects, I struggled a lot, since I was working with a function that forms the core of Pagure; if things go wrong with this function, the whole project takes a hit.

So the challenge was to make minimal, independent changes that do not compromise the existing functionality and yet introduce the new one. I struggled to achieve it and failed many times. I kept working hard to get it going, constantly moving to the board to figure out a solution on paper, then switching back to my screen to code it out.
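
To give an idea of the kind of query involved, here is a simplified sketch using sqlite3 from the standard library. It is not the actual Pagure function or schema: the tables, columns and user names are hypothetical, and the real code goes through SQLAlchemy. The rule it implements is that a project is listed if it is public, if the user owns it, or if the user has been given access.

```python
import sqlite3

# Hypothetical, simplified schema and data.
SCHEMA = """
CREATE TABLE projects (
    id INTEGER PRIMARY KEY, name TEXT, owner TEXT, private INTEGER);
CREATE TABLE project_access (project_id INTEGER, username TEXT);
INSERT INTO projects VALUES
    (1, 'pagure', 'alice', 0),
    (2, 'secret-tool', 'bob', 1),
    (3, 'notes', 'alice', 1);
INSERT INTO project_access VALUES (2, 'alice');
"""


def visible_projects(conn, username):
    """List project names that the given user is allowed to see."""
    rows = conn.execute(
        "SELECT DISTINCT p.name FROM projects p"
        " LEFT JOIN project_access a ON a.project_id = p.id"
        " WHERE p.private = 0 OR p.owner = ? OR a.username = ?"
        " ORDER BY p.name",
        (username, username),
    )
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
print(visible_projects(conn, "alice"))  # ['notes', 'pagure', 'secret-tool']
print(visible_projects(conn, "bob"))    # ['pagure', 'secret-tool']
print(visible_projects(conn, None))     # ['pagure'] (anonymous visitor)
```

Keeping the privacy rule as one extra condition on an existing query is one way to leave the behavior for public projects untouched.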

I was so desperate to get this working that I even pinged Armin on IRC with my questions about Flask and SQLAlchemy. All this while, the best support I got was from my mentor, Pingou.

Finally, after struggling a lot, I arrived at a very beautiful solution, and done!

Just when I thought I was done, there came the question of writing tests. Since I had altered a very major piece of functionality, I needed to test every aspect of it.

Testing

Testing was a herculean task since I had not done a lot of testing, and I had a lot to learn. For starters, the DB used for testing is an in-memory DB and not the one used by the app.
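
The in-memory idea looks roughly like this. It is a minimal sketch with unittest and sqlite3 from the standard library, not Pagure's actual test setup (which works through a SQLAlchemy session); the schema and helper names are made up for illustration.

```python
import sqlite3
import unittest


def create_schema(conn):
    """Create a hypothetical, cut-down projects table."""
    conn.execute(
        "CREATE TABLE projects ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " private INTEGER NOT NULL DEFAULT 0)"
    )


def add_project(conn, name, private=False):
    conn.execute(
        "INSERT INTO projects (name, private) VALUES (?, ?)",
        (name, int(private)),
    )


class PrivateProjectTest(unittest.TestCase):
    def setUp(self):
        # A fresh in-memory DB per test: nothing touches the app's real DB.
        self.conn = sqlite3.connect(":memory:")
        create_schema(self.conn)

    def tearDown(self):
        self.conn.close()

    def _private_flag(self, name):
        query = "SELECT private FROM projects WHERE name = ?"
        return self.conn.execute(query, (name,)).fetchone()[0]

    def test_project_is_public_by_default(self):
        add_project(self.conn, "demo")
        self.assertEqual(self._private_flag("demo"), 0)

    def test_project_can_be_private(self):
        add_project(self.conn, "secret", private=True)
        self.assertEqual(self._private_flag("secret"), 1)


def run_tests():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PrivateProjectTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because `setUp` builds the database from scratch each time, tests cannot leak state into each other or into the development database.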

The session the app maintains has to be replicated so that the tests can use it, and I had to learn how to use pygit to actually initialize a repo with git init and work with it.

Towards the end of this PR my development style evolved from writing code and then testing it to writing the test first and then writing code that passes the test. It has been really amazing working on this feature, and I hope it will be integrated soon.

I think a little more work may be required on this feature. It feels really amazing to do this work.

The link to the branch on Pagure.

The link to the current Pull-Request.

Happy Hacking!