2018-05-16

How I use Git


Below are some thoughts on how I use Git and SourceTree.

First some terminology

Repository:
A database containing a bunch of objects of the following kinds
  • commits,
  • blobs (each of which represents the contents of a file at some time),
  • tree objects, and
  • annotated tags.
Each of these object kinds is described below. Each repository also contains
  • branches (also described below),
  • an index (described below), and
  • information about how to contact other repositories.
Typically, for each project, each developer has a repository on their local machine and there is also one repository that acts as a hub. Typically the hub repository exists on a hosting service such as GitHub or GitLab. Each of the developers' repositories knows the hub repository as "origin".
Blob:
A blob is just the contents of an ordinary file at some point in time. In some cases the data is compressed in the database; that's really not something you need to concern yourself with. Blobs are immutable, so once created they always have the same contents. The interesting thing is how blobs are addressed. Each blob is given an address that is a hash of its contents (together with a short header). Since Git uses a cryptographic hash function (SHA-1 by default, with 160 bits of output; newer versions of Git can optionally use SHA-256), the chances are very good that any two blobs that have the same address represent files with exactly the same content.
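As a rough illustration of content addressing, the sketch below computes a blob's address the way Git's default SHA-1 object format does: it hashes a short header (the word blob, the size in bytes, and a NUL byte) followed by the contents. The function name blob_address is mine, not part of Git.

```python
import hashlib

def blob_address(contents: bytes) -> str:
    """Return the SHA-1 address Git gives a blob with these contents."""
    # Git hashes "blob <size>\0" followed by the raw file contents.
    header = b"blob " + str(len(contents)).encode("ascii") + b"\0"
    return hashlib.sha1(header + contents).hexdigest()

# The address depends only on the contents, not on the file name or time.
print(blob_address(b"hello\n"))
print(blob_address(b"hello\n") == blob_address(b"hello\n"))   # True
print(blob_address(b"hello\n") == blob_address(b"hello!\n"))  # False
```

The result should match what git hash-object prints for a file with the same contents.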
Tree:
Just as a blob represents a snapshot of the contents of an ordinary file, a tree object represents a snapshot of the contents of a directory (aka folder). Tree objects are immutable. A tree object can be thought of as a sequence of tuples, each of the form (t, p, n, a) where t gives the type of object (blob, tree, etc.), p is a number representing file permissions, n is a file name, and a is the address of the object (i.e., its hash code). Tree objects are also addressed by secure hash codes.
Commit:
A commit object consists of
    • the address of a tree object (this represents the value of the commit's file tree),
    • the addresses of this commit's parents (these parents are themselves commits),
    • a message,
    • a time stamp,
    • the names of the author and the committer.
Commits are immutable objects. The address of a commit is a secure hash of its value. Thus two commit objects in different repositories with the same address will (with near 100% probability) have the same value. Usually one commit in any repository has zero parents. Commits that result from merges usually have two parents. Other commits have one parent. When a commit is created, its parents must already exist. So commits form a rooted directed acyclic graph. When we talk about a commit we might be talking about an object (which exists in one repository) or a value (which might be represented by objects in different repositories). It often doesn't matter which we mean.
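Here is a minimal sketch of the idea that a commit is an immutable value whose address is a hash of that value. The Commit class, its field names, and the serialization are all my own simplifications; Git's real on-disk format is different.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: fields cannot be changed after creation
class Commit:
    tree: str        # address of the tree object
    parents: tuple   # addresses of the parent commits
    message: str
    timestamp: int
    author: str
    committer: str

    def address(self) -> str:
        # Simplified serialization, NOT Git's real format.
        text = "\n".join([self.tree, *self.parents, self.message,
                          str(self.timestamp), self.author, self.committer])
        return hashlib.sha1(text.encode("utf-8")).hexdigest()

root = Commit("tree-0", (), "initial commit", 1526428800, "Ann", "Ann")
child = Commit("tree-1", (root.address(),), "add feature", 1526432400, "Bob", "Bob")

# The same value created in another repository gets the same address.
copy_of_child = Commit("tree-1", (root.address(),), "add feature", 1526432400, "Bob", "Bob")
print(child.address() == copy_of_child.address())  # True
```

Because a commit's value includes the addresses of its parents, a parent has to exist before its child can be built, which is why the graph cannot contain cycles.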
Branch:
A variable whose value is the address of a commit. Branches are mutable and are local to repositories; for example there might be a branch called master in my local repository and a branch called master in the hub repository; they are not the same and might, at some points in time, have different values. Each repository has two sets of branches: Local branches are intended for local work. Remote tracking branches represent local branches of other (remote) repositories; however the value of a remote tracking branch could be out of date with respect to the branch that it is tracking.
Currently checked out branch:
The branch that was most recently checked out. Making a new commit typically updates this branch.
Working copy:
A state of the file tree represented in a computer's file system. Each time you check out a branch, the working copy gets overwritten.
Index (also called the Staging Area):
A place in a repository where Git keeps changes that will become part of a commit in the future. You can think of the Index as a sort of mutable commit. The commit action takes all the changes in the index, makes a proper commit out of them, and clears the index.
Merge:
A merge operation combines two commits to create another commit. If we merge two commits x and y that have a least common ancestor z, then the result commit w=merge(x,y) will contain all changes from z to x and also all the changes from z to y. Here is an example where we consider a file tree that contains only one file, so the state of the file tree is simply a sequence of characters. Suppose z is a⏎b⏎c⏎d⏎e⏎f⏎ [The ⏎ represents the end of a line.] and x is a⏎c⏎d⏎e⏎f⏎ and y is a⏎b⏎c⏎d⏎e⏎f⏎g⏎. The changes from z to x are {delete the b between a and c}. The changes from z to y are {add a g after the f}. The union of the changes is {delete the b between a and c, add a g after the f}. So w is a⏎c⏎d⏎e⏎f⏎g⏎. Sometimes it's not clear how to merge files, and in that case there is a "merge conflict". When y is the least common ancestor of x and y (that is, y is an ancestor of x), there is no need to create a new commit, so merge(x,y)=merge(y,x)=x. This is called merge by fast-forward.
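The example above can be sketched in a few lines of Python. This is only a toy three-way merge under my own assumptions: a file is a list of lines, changes are found with the standard difflib module, and any two changes that overlap or start at the same place count as a conflict. Git's real merge machinery is more sophisticated.

```python
from difflib import SequenceMatcher

def changes(base, other):
    """Edits (start, end, new_lines) that turn `base` into `other`."""
    matcher = SequenceMatcher(a=base, b=other, autojunk=False)
    return [(i1, i2, tuple(other[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def merge(base, x, y):
    """Apply the union of the changes base->x and base->y to base."""
    edits = sorted(set(changes(base, x)) | set(changes(base, y)))
    for (s1, e1, _), (s2, _, _) in zip(edits, edits[1:]):
        if s2 < e1 or s2 == s1:
            raise ValueError("merge conflict")
    result = list(base)
    # Apply from the end so that earlier positions stay valid.
    for start, end, new_lines in reversed(edits):
        result[start:end] = new_lines
    return result

z = list("abcdef")    # each character stands for one line
x = list("acdef")     # deleted the b
y = list("abcdefg")   # added a g

print("".join(merge(z, x, y)))  # acdefg
print(merge(z, x, z) == x)      # True: nothing to combine, the result is just x
```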
Line of development:
A sequence of commits that may get added to over time. "Line of development" isn't really a Git concept, but I find it useful to think about lines of development. Often people use the term "branch" for this, but that's confusing because in Git a branch is a variable whose value is the address of a single commit; not a sequence of addresses of commits. Also, while each Git branch is associated with one repository, a line of development spans multiple repositories. I found Git much easier to use once I finally realized that branches and lines of development are different (but closely related) concepts. So next I'll try to explain with an example what I mean by a line of development.

More on "lines of development"

Consider this evolution of a system. In the pictures, commits are shown in the order they were created (from left to right).
[Arrows go from children to their parents.] There are three lines of development here: the shared line, the x line and the y line. The x and y lines represent two different features and might be done by two different programmers. The shared line represents the amalgam of all completed features. Once a feature is completed, the last commit on its line is merged into the shared line. Once we have finished with a feature, we can delete the branches associated with it, but the lines of development remain. Commit x3 is particularly important. This represents the programmer catching up with all features completed since they started work on their feature -- in this case, just y. It's a good idea to make these catch-up merges each time we notice the shared line has been added to. (In the example the developer on the x line could have caught up earlier.) Running unit tests after these merges is important, since it can alert us to any conflicts that aren't flagged by Git. It's particularly important to make these catch-up merges (if needed) before merging back into the shared line. This ensures that untested combinations of features never make it onto the shared line. (And, as we will see below, it also prevents merge conflicts from happening on GitHub.) A point not captured by the diagram above is that, if we allow fast-forward merges, not all the commits shown in the picture are different. We will have y1=shared1 and x4=shared2. SourceTree might display the graph above like this
which is simpler, in that it has fewer nodes, but doesn't clearly show the lines of development. As I said above, lines of development do not correspond to anything in Git. They are just a product of how we think about software development.

The five branches

Usually you only have to worry about two lines of development at a time: a shared line (typically called master) and a line that only you are working on. For illustration I'll call the shared line "shared" and the other line "feature". In implementation the lines of development are represented (sort of) by branches. But thanks to Git being distributed, line of development x is represented by actual branches in a number of places:
  • There is GitHub's x branch, i.e. a copy of the branch that is on the hub. [I'm assuming here that the central repository is GitHub, but it could just as well be GitLab or Bitbucket or a private server.]
  • There is a tracking branch in your repository; this is called origin/x.
  • And there is your local copy of the branch, which is called x.
That's 3 branches for each line of development and they can all have different values. I'll call them "GitHub's x", "my origin/x", and "my x". Plus everyone else may have one or two copies in their own repositories. So if 10 people are working on 1 feature each, that's 11 lines (10 feature lines + the shared line) and there could be up to 21 branches for each line of development (1 on GitHub and then each of the 10 local repositories can have a local and a tracking branch). So there are up to 11 × 21 = 231 branches in total. Luckily you usually only have to worry about 2 lines at a time and you only have to worry about the copies on GitHub and the copies on your own machine. And, of these, I don't ever use my origin/feature, so that's only 5 branches I have to worry about:
  • GitHub's shared,
  • my origin/shared,
  • my shared,
  • my feature,
  • GitHub's feature.
plus the working copy and the index. We try to maintain the following relationships at all times between the commits that are the values of these 5 branches. (Here ≤ means "is equal to or an ancestor of".)
my shared ≤ my origin/shared ≤ GitHub's shared
and
GitHub's feature ≤ my feature
It's also a good idea to try to fold any changes made to the shared line into our feature as soon as they show up on GitHub's shared branch. So we try to keep
my shared = my origin/shared = GitHub's shared ≤ my feature
true as much as practical. (I.e., that my feature is descended from my shared, which is the same as the tracking branch which is up to date.) We do this with catch-up merges. This way, when we read, edit, and test our code, we are reading, editing, and testing it in the context of all completed features. Furthermore, when a pull-request is merged we want
my shared = my origin/shared = GitHub's shared ≤ GitHub's feature = my feature
That way merge-conflicts won't happen on GitHub's server.
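To make the ≤ relation concrete, here is a small sketch that checks these relationships on a made-up history. The commit names, the parents table, and the branch values are all hypothetical; the history is a <- b <- c on the shared line, b <- x <- y on the feature line, and z is a catch-up merge of y and c.

```python
def ancestors_or_self(commit, parents):
    """All commits reachable from `commit` by following parent links."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, ()))
    return seen

def leq(a, b, parents):
    """a ≤ b: a is equal to b or is an ancestor of b."""
    return a in ancestors_or_self(b, parents)

# Each commit maps to the list of its parents.
parents = {"a": [], "b": ["a"], "c": ["b"],
           "x": ["b"], "y": ["x"], "z": ["y", "c"]}

branches = {
    "GitHub's shared": "c", "my origin/shared": "c", "my shared": "c",
    "GitHub's feature": "y", "my feature": "z",
}

checks = [
    ("my shared", "my origin/shared"),
    ("my origin/shared", "GitHub's shared"),
    ("GitHub's feature", "my feature"),
    ("my shared", "my feature"),
]
for lower, upper in checks:
    print(f"{lower} ≤ {upper}:", leq(branches[lower], branches[upper], parents))
```

In this state my feature (z) has been caught up but not yet pushed, so GitHub's feature is still at y. After a push the two would be equal, which is the state we want before a pull request is merged.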

Information flow

The flow of information that I use is shown in the figure. I'll explain each part below.

Basic operations

For the rest of the article I'll assume you are using SourceTree. Of course everything SourceTree does can also be done from the command line. Some of the basic operations of SourceTree work like this (somewhat simplified):
"Fetch" updates all your tracking branches. So Fetch means my origin/x := GitHub's x, for every x branch in GitHub's repository. Typically we use Fetch to bring changes made to GitHub's shared to my origin/shared.
"Pull" means update my current branch from GitHub's repository. So Pull means my origin/x := GitHub's x ; my x := merge( my x, my origin/x), where x is the currently checked-out branch. Typically this is a fast-forward merge. (Usually I do a Fetch first and then a Pull if x is behind origin/x. When x is behind origin/x the merge is done by "fast forward", i.e., we have my x := my origin/x). Typically we use Pull to bring changes made to GitHub's shared to my shared.
"Branch" means create a new branch. It means y := x where y is a new branch and x is the currently checked-out branch. Typically we use Branch when we start working on a new feature.
"Merge" means my x := merge(my x, my y) where x is the currently checked-out branch and y is another branch. Usually we either merge my shared into my feature or the other way around. In the flow I use, merges are always merging my shared with my feature to make a new value for my feature branch.
"Check out" updates the working copy to the value of a particular commit. When you check out a branch it updates the working copy to be the same as the branch's value and it makes that branch the current branch. In the flow this is used to check out my feature branch. Some operations in SourceTree only apply to the currently checked-out branch, so there are times you will check out a branch just so you can do something else with it, such as a pull.
"Stage" means moving changes that are in the working copy to the index.
"Commit" takes all the changes in the index, makes a proper commit out of them, and clears the index. The current value of the current branch will be the parent of the commit. The address of this new commit is then assigned to the current branch.
"Push" means update GitHub's copy of the branch; it also updates the tracking branch. So Push means GitHub's x := my x; my origin/x := GitHub's x, where x is the currently checked-out branch. In the workflow, Push is used to push commits on my feature branch to GitHub's feature branch.
"Make and merge a pull request (or merge request)". A pull request is a request for someone else to review the changes on a branch and to merge one branch into another. (Pull requests are called merge requests on GitLab, which is a better name in my opinion.) Pull requests are not a feature of Git, but rather of hosting services such as GitHub. SourceTree can help you create pull requests. The actual merging of the pull request is done using GitHub's web interface.

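Since I described several of the operations above as assignments, here is a toy model that takes that description literally. Each repository is just a dictionary from branch names to commit names. The history (a <- b <- c), the dictionaries, and the function names are all made up for illustration, and pull only handles the fast-forward case.

```python
parents = {"a": [], "b": ["a"], "c": ["b"]}   # hypothetical history a <- b <- c

def is_ancestor(a, b):
    """True if a equals b or is reachable from b through parent links."""
    return a == b or any(is_ancestor(a, p) for p in parents[b])

github = {"shared": "c"}                       # branches in the hub repository
mine = {"shared": "b", "origin/shared": "b"}   # my local and tracking branches

def fetch():
    for name, commit in github.items():
        mine["origin/" + name] = commit        # my origin/x := GitHub's x

def pull(x):
    fetch()
    if is_ancestor(mine[x], mine["origin/" + x]):
        mine[x] = mine["origin/" + x]          # fast-forward: my x := my origin/x
    else:
        raise NotImplementedError("a real merge commit would be needed here")

def branch(new, current):
    mine[new] = mine[current]                  # y := x

def push(x):
    github[x] = mine[x]                        # GitHub's x := my x
    mine["origin/" + x] = github[x]            # my origin/x := GitHub's x

pull("shared")                 # catch up the shared branch
branch("feature", "shared")    # start a feature
push("feature")                # publish it
print(mine)
print(github)
```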
Recipes for common tasks

Here are some recipes for doing some common tasks with SourceTree.

Catch up the shared branch

  1. In SourceTree, click on Fetch.
  2. If shared and origin/shared are the same, stop.
  3. Check out the shared branch by double clicking on "shared" under "Branches" on the left sidebar.
  4. Click on Pull to get the local shared branch up to date with origin/shared.
Here is an example. In this case someone has extended the shared line of development with a new commit, c. We first update origin/shared and then shared. HEAD indicates which branch is the checked out branch.
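For readers who prefer the command line, this recipe corresponds roughly to the Git commands produced by the sketch below. The function name and its defaults are mine, and the branch name shared is just the illustrative name used in this post. After a fetch, a fast-forward-only merge from origin/shared has the same effect as the Pull step in the fast-forward case.

```python
import shlex

def catch_up_shared(shared="shared", remote="origin"):
    """Git commands roughly equivalent to the SourceTree recipe."""
    return [
        ["git", "fetch", remote],                              # Fetch
        ["git", "checkout", shared],                           # check out shared
        ["git", "merge", "--ff-only", f"{remote}/{shared}"],   # fast-forward to origin/shared
    ]

# Print the commands rather than running them.
for command in catch_up_shared():
    print(shlex.join(command))
```

To run the commands for real, each list could be passed to subprocess.run(command, check=True) from inside a repository.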

Make a feature branch

  1. Catch up the shared branch (see above).
  2. Check out shared (if not already there).
  3. Click on Branch.
  4. Type "feature" as the New Branch. Click OK.
Here is an example. The branch operation does not create any new commits.

Make your own changes

  1. Check out feature (if not already the checked-out branch).
  2. Make changes to the files. Run tests. Etc.
  3. Back in SourceTree, press Cmd-R (Mac) or Ctrl-R (Windows), or choose View >> Refresh.
  4. Select "Uncommitted changes".
  5. Review all unstaged changes.
  6. Stage all changes you want as part of the commit.
  7. Click Commit. (This doesn't actually do the commit.)
  8. Enter a commit message.
  9. Click on the "Commit" button at lower right. (This does the commit.)
  10. Push the new commit to the origin by clicking Push and OK.
  11. If you've never pushed the branch before you may need to check a box in the previous step before clicking OK.
Pushing the new commit to the origin is optional, but it is good to do for a couple of reasons. One is that it saves your work remotely. The other is that it lets other people on your team see what you are doing. Here is an example. We already have one commit, x, on the feature line and make another, y.
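On the command line, the staging, committing, and pushing steps look roughly like the commands produced below. This is a sketch under my own assumptions: the function name is invented, the file name and message in the example call are placeholders, and --set-upstream is the command-line counterpart of telling Git to track the remote branch the first time you push.

```python
import shlex

def commit_and_push(paths, message, branch="feature", remote="origin"):
    """Git commands roughly equivalent to the SourceTree recipe."""
    return [
        ["git", "checkout", branch],
        ["git", "status"],                                   # review uncommitted changes
        ["git", "add", "--", *paths],                        # stage the chosen files
        ["git", "commit", "-m", message],                    # make the commit
        ["git", "push", "--set-upstream", remote, branch],   # optional, but saves your work
    ]

# Placeholder file name and message, printed rather than executed.
for command in commit_and_push(["notes.txt"], "Describe the change"):
    print(shlex.join(command))
```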

Catch up the feature branch

(Do this fairly frequently.)

  1. Catch up the shared branch (see above).
  2. If shared is an ancestor of feature you are caught up. Stop.
  3. Check out feature (if not already the checked-out branch).
  4. Click on Merge.
  5. Select shared.
  6. Click OK.
  7. Check for any merge conflicts. If there are merge conflicts they need to be resolved. That's a whole other story. (Maybe another blog post.)
  8. Even absent merge conflicts, there may be silent problems that prevent compilation or introduce bugs. So carefully inspect all differences between the merged version and the previous version of feature. Also recompile and run unit tests.
  9. Click on Push.
The final push is optional, but it saves your work. Also you need to do it if you are going to make a pull request -- more on that below. Here is an example. In this case someone has added a new commit, c, to the shared line and we have already pushed our commits, x and y, to the origin repository. We first update our local shared and then make a merge commit, z, to combine the changes from b to c with the changes from b to y. Finally the new commit is pushed.
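A rough command-line version of this recipe is sketched below. The function name and defaults are mine. The merge-base check exits with status 0 when shared is already an ancestor of feature, in which case the remaining commands are unnecessary. The inspection and testing step has no Git command; it belongs between the merge and the push.

```python
import shlex

def catch_up_feature(feature="feature", shared="shared", remote="origin"):
    """Git commands roughly equivalent to the SourceTree recipe."""
    return [
        # Catch up the shared branch.
        ["git", "fetch", remote],
        ["git", "checkout", shared],
        ["git", "merge", "--ff-only", f"{remote}/{shared}"],
        # Exit status 0 means shared is an ancestor of feature: stop here.
        ["git", "merge-base", "--is-ancestor", shared, feature],
        # Otherwise make the catch-up merge on the feature branch.
        ["git", "checkout", feature],
        ["git", "merge", shared],
        # Inspect the result and run the tests before this last step.
        ["git", "push", remote, feature],
    ]

for command in catch_up_feature():
    print(shlex.join(command))
```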

Merge your feature back to the shared branch

(Do this when you think it's complete and ready for review.)

  1. Catch up the feature branch. (See above.) Be sure to push the feature branch to the server.
  2. If there are any problems, such as merge conflicts or failed tests, make sure they are all resolved before going on.
  3. On GitHub, make a new "Pull Request", being careful that it is a request to pull feature into shared.
  4. At this point, you might want to ask someone else to review the pull request.
  5. Wait for comments or for someone else to merge the pull request.
  6. Or if no one else merges the pull request, merge it yourself.
On some projects, there may be a requirement that someone else reviews each pull request. When there are comments that need to be addressed, you can modify your feature branch and push it again. Pull requests are based on branches, not on commits. So when you push new commits on your branch they become part of the pull request. If there are changes to the shared branch between the pull request being made and the feature being merged, it's important to redo the process above so that, for example, all tests can be run on a caught-up version of the commit. Example: here we make a pull request. After suitable review it is merged by fast-forward.


Another example. In this case, the reviewer found some problems that I fixed with commit w. In the meantime someone else added to the shared branch (commit d). This new work didn't seem to require any further modifications from me. So I did another merge on my machine (commit m), tested, and pushed both commits w and m to the origin. Finally, the pull request is merged on GitHub. Because pull requests reference branches rather than commits, the meaning of the pull request changes as the shared and feature branches on origin change.

Merging a pull request should not create any merge commits; it is simply a fast forward. If I had not made commit m on my machine and merged the pull request with shared pointing to d and feature pointing to w, the same commit m would be created in the origin repository and shared would be set to point to m. But that would mean the head of the shared branch would contain a version of the file tree that had not been seen by anyone or fully tested.
           

Merging

There are three ways to merge. Say we are merging commit x and commit y. There is usually a unique most recent common ancestor; call it w. You can think of x-w as being the set of changes that would change w's tree into x's tree. Similarly y-w is the set of changes that would change w's tree into y's tree. I'll use w + C to mean the tree you get by applying a set of changes C to the tree of w, so in particular w + (x-w) = x, for all x and w. (I'm being very sloppy here at distinguishing between commits and their associated trees; really I should say w.tree + (x.tree - w.tree) = x.tree; I hope this isn't too confusing.)

* An ordinary merge makes a new commit, z, whose tree is equal to w + ((x-w) & (y-w)), where & is some way of combining the two sets of changes. When there are merge conflicts, the & operator isn't clearly defined and the developer needs to help Git decide how to combine the two sets of changes. The parents of the new commit are x and y.

* A fast-forward merge. When y is the least common ancestor of x and y (i.e. y=w), the set of changes (y-w) is the empty set, which means that w + ((x-w) & (y-w)) = w + ((x-w) & ∅) = w + (x-w) = x. So if we do a regular merge we would get a new commit z that has the same tree as x. So the only difference between z and x will be the parentage, date stamp, message, and possibly author. In this case there is the option of not creating a new commit and simply saying that the result of the merge is x.

* Rebasing merge. The idea of rebasing is to find a set of changes that can be applied to x to get the same tree that we would get with a regular merge. In the simplest case, suppose that y's parent is w. Let C = ((x-w) & (y-w)). Then we can make a new set of changes C' so that w+C = x+C'. Now we can make a new commit z whose parent is x and whose tree is x+C'. This is really just the same as an ordinary merge except that we don't make y a parent of the new commit. In general, rebasing works on one commit at a time, so if the sequence of commits from w to y is w, y1, y2, y3, y, then 4 new commits are created and chained onto x. Let's call these x1, x2, x3, z. The tree of x1 is w + ((x-w) & (y1-w)) and its parent is x. The tree of x2 is x1+(y2-y1) and its parent is x1. The tree of x3 is x2+(y3-y2) and its parent is x2. The tree of z is x3+(y-y3) and its parent is x3. At least this is what I think happens. z is the result of the merge. At this point, y1, y2, y3, and y are typically discarded. Essentially what we are doing is asking what the world would be like if I hadn't started working on y1, y2, y3, and y until after x was done. I.e., what sequence of commits would I have made on top of x to get the same effect as merging x and y?
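The notation above can be tried out with a toy model in which a tree is a dictionary from file names to contents. Everything here is my own simplification: diff plays the role of x-w, apply plays the role of w + C, and combine plays the role of &, treating two different changes to the same file as a conflict. The file names and contents are made up.

```python
def diff(old, new):
    """new - old: the changes that turn tree `old` into tree `new`."""
    names = set(old) | set(new)
    return {n: new.get(n) for n in names if old.get(n) != new.get(n)}

def apply(tree, changes):
    """tree + changes (a value of None means the file is deleted)."""
    result = dict(tree)
    for name, contents in changes.items():
        if contents is None:
            result.pop(name, None)
        else:
            result[name] = contents
    return result

def combine(c1, c2):
    """The & operator: union of two change sets, failing on a conflict."""
    for name in set(c1) & set(c2):
        if c1[name] != c2[name]:
            raise ValueError(f"merge conflict in {name}")
    return {**c1, **c2}

w  = {"a.txt": "1", "b.txt": "1"}
x  = {"a.txt": "2", "b.txt": "1"}                 # x changed a.txt
y1 = {"a.txt": "1", "b.txt": "2"}                 # y1 changed b.txt
y  = {"a.txt": "1", "b.txt": "2", "c.txt": "1"}   # y added c.txt

# Ordinary merge: one new tree; the new commit's parents are x and y.
merged = apply(w, combine(diff(w, x), diff(w, y)))

# Rebase: replay w->y1 and y1->y on top of x, one commit at a time.
x1 = apply(x, diff(w, y1))    # parent is x
z = apply(x1, diff(y1, y))    # parent is x1

print(merged == z)  # True
```

Both routes end with the same tree, but the rebase also produces the intermediate tree x1, which nobody has built or tested.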

Some people seem to like rebasing. I don't because:

  1. It creates versions of the software that are never tested. In our example, x1, x2, and x3.
  2. It creates a timeline that doesn't reflect reality. For one thing, the dates on x1, x2, and x3 will not be the same as those on y1, y2, and y3.
  3. It is complicated to roll back. Suppose we later decide that the changes from w to x were a mistake. We can't simply go back to y since commit y is now lost.
  4. It complicates life in other repositories.

Fast-forward merge is fine in most cases. But as mentioned above, I like to have at least one ordinary merge at the end of each line of development.
