
git beginner here, so forgive me.

As part of change management of a couple hundred servers, we do a regular 'git add .' on a central repository of half a million files, followed by a bit of munging, then a 'git commit -a -m ...' (rhel5, so 'git -a' is 'git -A' elsewhere).

'git add .' is very very slow at finding the additions and modifications (we don't care about the '-u' deletions at this stage, because of the future munging required) and staging them. I suspect it's actually staging every single file rather than just changes.

'git status -s' on a freshly changed repository prior to doing any git add is really quite quick (no, it's not a cold or undersized cache issue), and finds all additions, deletions and modifications. We could simply pipe that rather small output to 'git add' and it would be much much quicker at staging them (er, I think; but also, a bit of a kludge). Any known reason why git-add would appear to be recursing through the entire tree staging even unchanged files, and not just acting on the changed files that git-status obviously can find very quickly? Any missing bit of git magic I could be applying?

--
Tim Connors
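The "pipe the change list into git add" idea from the post above could be sketched roughly as follows. This is an untested-at-scale illustration, not a recommendation from the thread: `existing_only` and `stage_changed` are hypothetical helper names, it assumes an `xargs` that supports `-0` and a git new enough to have `ls-files --exclude-standard`, and whether it is actually faster than 'git add .' on a given git version is not something the thread establishes. Paths that no longer exist on disk are filtered out so that deletions are left unstaged, as the post asks.

```shell
# Sketch: stage only new and modified paths, leaving deletions alone.
# existing_only and stage_changed are hypothetical helper names.

# Read NUL-separated paths on stdin; re-emit only those still present on disk.
existing_only() {
    xargs -0 sh -c '
        for p in "$@"; do
            if [ -e "$p" ] || [ -L "$p" ]; then
                printf "%s\0" "$p"
            fi
        done' sh
}

# List modified and untracked paths, drop the missing ones, stage the rest.
# Run it from the directory you want scanned.
stage_changed() {
    git ls-files -z --modified --others --exclude-standard |
        existing_only |
        xargs -0 git add --
}
```

Calling `stage_changed` from the top of the working tree would then take the place of 'git add .'. NUL separators are used throughout so that unusual file names survive the pipeline.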

Try this?

  git add -i
  u
  1-$max
  a
  1-$max
  q
  git commit -m ...

(where $max is the maximum number printed in the list it just gave you prior)

On 13 December 2013 18:26, Tim Connors <tconnors@rather.puzzling.org> wrote:
> git beginner here, so forgive me.
>
> As part of change management of a couple hundred servers, we do a regular 'git add .' on a central repository of half a million files, followed by a bit of munging, then a 'git commit -a -m ...' (rhel5, so 'git -a' is 'git -A' elsewhere).
>
> 'git add .' is very very slow at finding the additions and modifications (we don't care about the '-u' deletions at this stage, because of the future munging required) and staging them. I suspect it's actually staging every single file rather than just changes.
>
> 'git status -s' on a freshly changed repository prior to doing any git add is really quite quick (no, it's not a cold or undersized cache issue), and finds all additions, deletions and modifications. We could simply pipe that rather small output to 'git add' and it would be much much quicker at staging them (er, I think; but also, a bit of a kludge). Any known reason why git-add would appear to be recursing through the entire tree staging even unchanged files, and not just acting on the changed files that git-status obviously can find very quickly? Any missing bit of git magic I could be applying?
>
> --
> Tim Connors
> _______________________________________________
> luv-main mailing list
> luv-main@luv.asn.au
> http://lists.luv.asn.au/listinfo/luv-main
--
Turning and turning in the widening gyre
The falcon cannot hear the falconer
Things fall apart; the center cannot hold
Mere anarchy is loosed upon the world

On 2013-12-13 18:26, Tim Connors wrote:
> git beginner here, so forgive me.
>
> As part of change management of a couple hundred servers, we do a regular 'git add .' on a central repository of half a million files, followed by a bit of munging, then a 'git commit -a -m ...' (rhel5, so 'git -a' is 'git -A' elsewhere).
>
> 'git add .' is very very slow at finding the additions and modifications (we don't care about the '-u' deletions at this stage, because of the future munging required) and staging them. I suspect it's actually staging every single file rather than just changes.
>
> 'git status -s' on a freshly changed repository prior to doing any git add is really quite quick (no, it's not a cold or undersized cache issue), and finds all additions, deletions and modifications. We could simply pipe that rather small output to 'git add' and it would be much much quicker at staging them (er, I think; but also, a bit of a kludge). Any known reason why git-add would appear to be recursing through the entire tree staging even unchanged files, and not just acting on the changed files that git-status obviously can find very quickly? Any missing bit of git magic I could be applying?
Hi Tim,

My first bet would be write load vs read load. git add has to not only check the hashes of each file, but for each new file, after it hashes it, it has to add it to its object database. git status, however, needs to do zero writing. I find particularly when using git over NFS, that the add is far slower than the status.

You could always do an strace to find out what it's doing in more detail.

Keep in mind that by its very nature, git would not stage unchanged files, because it would hash the file, determine immediately via hash table that the hash already existed in the object store, and not bother to store it again. It *does* have to go through the entire process of *calculating* the hash for each file every time though, as far as I know.

Hope this helps.

--
Regards,
Matthew Cengia
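As a rough illustration of the hashing step described above: git names a file's contents by the SHA-1 of a short header (the word `blob`, the size in bytes, a NUL byte) followed by the contents themselves, so it is not quite the plain sha1sum of the file. The sketch below reproduces that with standard tools. `git_blob_sha1` is a made-up helper name, and `sha1sum` from GNU coreutils is assumed to be installed.

```shell
# Sketch: compute the object ID git would assign to a file's contents.
# git_blob_sha1 is a hypothetical helper name, not a git command.
git_blob_sha1() {
    file=$1
    # Size in bytes; tr strips the padding some wc implementations add.
    size=$(wc -c < "$file" | tr -d ' ')
    # Hash "blob <size>\0" followed by the raw contents.
    { printf 'blob %s\0' "$size"; cat "$file"; } | sha1sum | cut -d' ' -f1
}
```

For a file inside a repository, `git_blob_sha1 somefile` should agree with `git hash-object somefile`, provided no content filters such as line-ending conversion are configured. If the resulting ID is already in the object store, there is nothing new to write.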

Matthew Cengia <mattcen@gmail.com> wrote:
> Keep in mind that by its very nature, git would not stage unchanged files, because it would hash the file, determine immediately via hash table that the hash already existed in the object store, and not bother to store it again. It *does* have to go through the entire process of *calculating* the hash for each file every time though, as far as I know.
Does it exclude unchanged files based on the last modification time? If so, it only needs to look at the directory entries. It already has the date/time of the commit relative to which changes are to be calculated.
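The modification-time approach asked about here can be sketched outside git with `find -newer`. This is only an illustration of the idea, not a description of what git does internally: `changed_since` is a hypothetical helper, and the stamp file is assumed to be something you touch yourself after each commit.

```shell
# Sketch of mtime-only change detection; changed_since is a hypothetical helper.
# Usage: changed_since STAMP_FILE [DIR]
changed_since() {
    stamp=$1
    dir=${2:-.}
    # Skip the repository metadata, then list regular files whose
    # modification time is later than the stamp file's.
    find "$dir" -name .git -prune -o -type f -newer "$stamp" -print
}
```

A run might look like `touch .last-commit` straight after committing, followed later by `changed_since .last-commit .` to list candidates. A check like this reads no file contents, which is where the speed comes from. It also sees nothing if a file's timestamp has been set back after an edit.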

On 2013-12-14 09:37, Jason White wrote:
> Matthew Cengia <mattcen@gmail.com> wrote:
>> Keep in mind that by its very nature, git would not stage unchanged files, because it would hash the file, determine immediately via hash table that the hash already existed in the object store, and not bother to store it again. It *does* have to go through the entire process of *calculating* the hash for each file every time though, as far as I know.
> Does it exclude unchanged files based on the last modification time? If so, it only needs to look at the directory entries. It already has the date/time of the commit relative to which changes are to be calculated.
No, it does not check the datestamp, and nor should it; git doesn't track the modification dates of individual files, it only cares about the date of commits, and heavy-weight tags. If it checked the date, the below example would allow me to trick git, and the user, into thinking the repo was unchanged, when in fact I'd changed the entire contents of file 'bar'. When it comes to files, git only cares about one thing: The sha1sum of the file. Elegant in its simplicity, even if it does take a bit longer to read data in order to verify its integrity.

mattcen@owen:tmp$ git init
Initialized empty Git repository in /tmp/tmp/.git/
mattcen@owen:tmp(master)$ echo foo > bar
mattcen@owen:tmp(master)$ git add bar
mattcen@owen:tmp(master)$ git commit -m "initial commit"
[master (root-commit) bd218a4] initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 bar
mattcen@owen:tmp(master)$ cp -a bar baz
mattcen@owen:tmp(master)$ echo quz > bar
mattcen@owen:tmp(master)$ touch -r baz bar
mattcen@owen:tmp(master)$ ls -l --time-style=full-iso
total 8
-rw-rw-r-- 1 mattcen mattcen 4 2013-12-14 10:21:07.000000000 +1100 bar
-rw-rw-r-- 1 mattcen mattcen 4 2013-12-14 10:21:07.000000000 +1100 baz
mattcen@owen:tmp(master)$ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#	modified:   bar
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#	baz
no changes added to commit (use "git add" and/or "git commit -a")

--
Regards,
Matthew Cengia
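For anyone repeating the experiment above, it may be worth looking at more than the mtime. The documented git index format stores several stat fields per tracked file, including ctime and size as well as mtime, and the `core.trustctime` setting controls whether a ctime difference counts. The sketch below uses no git at all. It redoes the `touch -r` trick in a scratch directory and reports which stat fields still differ. `mtime_ctime_demo` is a hypothetical name, and GNU coreutils (`stat -c`, `cp -a`, `touch -r`) are assumed.

```shell
# Sketch: restore a file's mtime after editing it, then compare stat fields.
# Assumes GNU coreutils; mtime_ctime_demo is a hypothetical helper name.
mtime_ctime_demo() {
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        echo foo > bar
        cp -a bar baz                      # baz keeps bar's original mtime
        ctime_before=$(stat -c %Z bar)
        sleep 1                            # make a ctime change visible at 1 s resolution
        echo quz > bar                     # same size, different content
        touch -r baz bar                   # put the old mtime back
        echo "mtime_same=$([ "$(stat -c %Y bar)" = "$(stat -c %Y baz)" ] && echo yes || echo no)"
        echo "size_same=$([ "$(stat -c %s bar)" = "$(stat -c %s baz)" ] && echo yes || echo no)"
        echo "ctime_changed=$([ "$(stat -c %Z bar)" != "$ctime_before" ] && echo yes || echo no)"
    )
    rm -rf "$dir"
}
```

Running `mtime_ctime_demo` prints one line per field. One possible reading, which this thread does not verify, is that a stat mismatch of this kind could be enough for git to flag 'bar' without re-hashing every file in the tree.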

Matthew Cengia <mattcen@gmail.com> wrote:
> On 2013-12-14 09:37, Jason White wrote:
>> Does it exclude unchanged files based on the last modification time? If so, it only needs to look at the directory entries. It already has the date/time of the commit relative to which changes are to be calculated.
> No, it does not check the datestamp, and nor should it; git doesn't track the modification dates of individual files, it only cares about the date of commits, and heavy-weight tags. If it checked the date, the below example would allow me to trick git, and the user, into thinking the repo was unchanged, when in fact I'd changed the entire contents of file 'bar'. When it comes to files, git only cares about one thing: The sha1sum of the file. Elegant in its simplicity, even if it does take a bit longer to read data in order to verify its integrity.
Yes, of course it's possible to change the date stamp of files and avoid the detection of modifications, when date stamps are relied on. However, it's obviously more efficient to check date stamps than to compute hashes of file contents, users generally don't change the former, and a version control system isn't a security tool designed to resist unexpected user behaviour. As I remember, cvs and svn both use date stamps. It's interesting that Git has opted for a different trade-off in this case, a reasonable decision to make, arguably, but one which has its efficiency costs.

On 2013-12-14 15:43, Jason White wrote:
> [...]
> contents, users generally don't change the former, and a version control system isn't a security tool designed to resist unexpected user behaviour. As
I disagree. Assuming you trust SHA1 (which is getting a bit long in the tooth), Git has end-to-end security: You can sign a tag with a GPG key and that tag points to a commit which can't be modified without changing its hash. The hash of the commit is dependent on all previous commits, so you can't change any of the previous commits either, without invalidating the signature.

If it were possible for a user to clone a git repo and then have somebody edit a file in the working tree while maintaining the datestamp, git should be able to detect that, otherwise the entire security model breaks down.

--
Regards,
Matthew Cengia
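The chaining argument above can be shown with a toy model. This is deliberately not git's real commit format: `chain_hash` is a hypothetical helper that hashes a parent ID together with some content, which is enough to show why editing an early entry changes every later ID. `sha1sum` from GNU coreutils is assumed.

```shell
# Toy model of hash chaining, not git's actual commit format.
# chain_hash is a hypothetical helper: each ID covers the parent ID plus the new content.
chain_hash() {
    parent=$1
    content=$2
    printf '%s\n%s\n' "$parent" "$content" | sha1sum | cut -d' ' -f1
}

# An honest two-entry history.
c1=$(chain_hash "" "first commit")
c2=$(chain_hash "$c1" "second commit")

# Tampered history: same second entry, but the first entry was edited.
t1=$(chain_hash "" "first commit (edited)")
t2=$(chain_hash "$t1" "second commit")

if [ "$c2" != "$t2" ]; then
    echo "tip changed: a signature over the old tip no longer matches"
fi
```

A signature made over the tip ID therefore covers everything the tip was derived from, which is the property the signed-tag argument relies on.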


On Sat, Dec 14, 2013 at 03:43:53PM +1100, Jason White wrote:
> ...
> Yes, of course it's possible to change the date stamp of files and avoid the detection of modifications, when date stamps are relied on. However, it's obviously more efficient to check date stamps than to compute hashes of file contents, users generally don't change the former, and a version control system isn't a security tool designed to resist unexpected user behaviour. As I remember, cvs and svn both use date stamps. It's interesting that Git has opted for a different trade-off in this case, a reasonable decision to make, arguably, but one which has its efficiency costs.

Don't forget, git is distributed. Neither cvs nor svn is by nature.

Karl
participants (6): Jason White, Karl Billeter, Matthew Cengia, Tim Connors, Toby Corkindale, trentbuck@gmail.com