Stuck git import for a few days

Asked by Peter Sabaini

The git import for the ceph-mon charm has been stuck for a few days (since 2023-10-02):

https://code.launchpad.net/~openstack-charmers/charm-ceph-mon/+git/charm-ceph-mon

The logs seem to indicate a timeout:

2023-10-02 15:33:18 INFO Starting job.
2023-10-02 15:33:18 INFO Getting existing repository from hosting service.
2023-10-02 15:33:36 INFO remote: Counting objects: 100% (3189/3189)
2023-10-02 15:33:36 INFO remote: Counting objects: 100% (3189/3189), done.
2023-10-02 15:33:37 INFO remote: Compressing objects: 100% (3091/3091)
2023-10-02 15:33:37 INFO remote: Compressing objects: 100% (3091/3091), done.
2023-10-02 15:33:37 INFO Receiving objects: 99% (27243/27518), 20.93 MiB | 41.28 MiB/s
2023-10-02 15:33:37 INFO remote: Total 27518 (delta 2323), reused 122 (delta 96)
2023-10-02 15:33:37 INFO Receiving objects: 100% (27518/27518), 20.93 MiB | 41.28 MiB/s
2023-10-02 15:33:37 INFO Receiving objects: 100% (27518/27518), 21.51 MiB | 23.34 MiB/s, done.
2023-10-02 15:33:42 INFO Resolving deltas: 100% (16470/16470)
2023-10-02 15:33:42 INFO Resolving deltas: 100% (16470/16470), done.
2023-10-02 15:33:42 INFO Fetching remote repository.
Traceback (most recent call last):
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/scripts/code-import-worker.py", line 112, in <module>
    sys.exit(script.main())
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/scripts/code-import-worker.py", line 107, in main
    return import_worker.run()
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/lib/lp/codehosting/codeimport/worker.py", line 581, in run
    return self._doImport()
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/lib/lp/codehosting/codeimport/worker.py", line 1197, in _doImport
    cwd="repository")
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/lib/lp/codehosting/codeimport/worker.py", line 1080, in _runGit
    for line in self._throttleProgress(git_process.stdout):
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/lib/lp/codehosting/codeimport/worker.py", line 1037, in _throttleProgress
    buffered, timeout=timeout):
  File "/srv/lp-codeimport/payloads/dfee8526b29e18a92919b26fe4a9b0587e7691ef-bionic/lib/lp/codehosting/codeimport/worker.py", line 1056, in _throttleProgress
    line = next(wrapped_file)
KeyboardInterrupt
Import failed:
Traceback (most recent call last):
Failure: twisted.internet.error.TimeoutError: User timeout caused connection failure.
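The traceback shows the worker blocked in `_throttleProgress`, waiting for the next line of git's output (`line = next(wrapped_file)`), when it was interrupted. The following is a minimal sketch of that general pattern, assuming the goal is to give up on a git subprocess that stops producing output. The helper `run_with_stall_timeout` and its parameters are hypothetical; this is not Launchpad's actual code-import worker.

```python
import queue
import subprocess
import threading


def run_with_stall_timeout(cmd, stall_timeout):
    """Run cmd and collect its output lines, killing it if it goes silent.

    Hypothetical helper for illustration only (not the Launchpad worker).
    Raises TimeoutError if no output line arrives within stall_timeout
    seconds; otherwise returns (exit_code, lines).
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # git writes progress to stderr
        text=True,  # decode bytes and translate line endings to "\n"
    )
    lines = queue.Queue()

    def pump():
        # Forward each output line to the queue; None marks end of output.
        for line in proc.stdout:
            lines.put(line.rstrip("\n"))
        lines.put(None)

    # Daemon thread: if a grandchild keeps the pipe open after we kill
    # the direct child, the blocked reader cannot hold up interpreter exit.
    threading.Thread(target=pump, daemon=True).start()

    output = []
    while True:
        try:
            line = lines.get(timeout=stall_timeout)
        except queue.Empty:
            # No output for stall_timeout seconds: treat the process as hung.
            proc.kill()
            proc.wait()
            raise TimeoutError(f"no output for {stall_timeout}s from {cmd[0]!r}")
        if line is None:
            break
        output.append(line)
    return proc.wait(), output
```

For example, `run_with_stall_timeout(["git", "fetch", "--progress", "origin"], 300)` would raise `TimeoutError` if git printed nothing for five minutes. `--progress` makes git report progress even when stderr is not a terminal, which keeps a healthy but slow fetch from looking stalled.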

However, a manual clone from https://opendev.org/openstack/charm-ceph-mon.git works fine:

git clone https://opendev.org/openstack/charm-ceph-mon.git
Cloning into 'charm-ceph-mon'...
remote: Enumerating objects: 7608, done.
remote: Counting objects: 100% (3003/3003), done.
remote: Compressing objects: 100% (789/789), done.
remote: Total 7608 (delta 2848), reused 2214 (delta 2214), pack-reused 4605
Receiving objects: 100% (7608/7608), 1.70 MiB | 537.00 KiB/s, done.
Resolving deltas: 100% (4974/4974), done.

I've also triggered an import manually, and it seems to be stuck as well.

Would you be able to help out?

Question information

Language:
English
Status:
Open
For:
Launchpad itself
Assignee:
Guruprasad

Ines Almeida (ines-almeida) said :
#3

Hi Peter,

I see that this code import has been retried a few times. Each time the same error appears on different code import workers, and those workers are working fine for other projects. So this seems to be very specific to this particular import. I'm not 100% sure how we can help at the moment, or whether this is indeed a Launchpad issue.

It was working on the 2nd of October (this Monday), and then it started failing. I assume the Launchpad configuration for the code import has remained the same?

I also noticed that there was exactly one commit between the last working import and the first failing one, though I don't see how that could be the root of the issue.

I'll ask whether anyone else on the team has any other ideas.

Peter Sabaini (peter-sabaini) said :
#4

Ines,

Indeed, the Launchpad configuration hasn't changed.

FTR, this is the MR where things stopped working: https://review.opendev.org/c/openstack/charm-ceph-mon/+/897011

It's a one-line change to a parameter; I can't see how this could affect things.

Jürgen Gmach (jugmac00) said :
#5

Ines Almeida (ines-almeida) said :
#6

Small update: I tried the same code import in our qastaging environment, and it succeeded: https://code.qastaging.launchpad.net/~ines-almeida/test-project-ines/+git/test-project-ines

This is further evidence that the issue lies within our production environment (and since the configuration didn't change, it shouldn't be caused by the production configuration either).

Peter Sabaini (peter-sabaini) said :
#7

@Ines thanks for the update

@Juergen, I see the import was retried and failed at 2023-10-09 06:52:34
http://launchpadlibrarian.net/691047798/openstack-charmers-charm-ceph-mon-+git-charm-ceph-mon.log

Colin Watson (cjwatson) said :
#8

The import seems to be running into a deadlock between "git-remote-https" and "git fetch-pack". It looks quite similar to the problem fixed by https://github.com/git/git/commit/b37fd14beb39b9f545bd72e42e1bdbb00bad4b3d, and I wonder whether we should try cherry-picking that patch into the git backport that we're running.

Peter Sabaini (peter-sabaini) said :
#9

Interesting!

The linked commit says this happens "when the server side prematurely throws an error and disconnects". Does this mean we're running into errors when fetching from upstream?

FWIW, I'm of course very much +1 on trying that cherry-pick if it helps us get unstuck.

Thanks!

Colin Watson (cjwatson) said :
#10

I didn't see any such errors, but it's always possible that upstream's analysis of the exact set of situations that can result in this bug is incomplete. (Alternatively, my educated guess could be wrong.)

Jürgen Gmach (jugmac00) said :
#11

Guruprasad (lgp171188) said :
#12

I backported what looked like a relevant upstream fix to the bionic git package that we use on the code import workers, but that didn't solve the problem, and the code import is still timing out. Colin from my team has suggested backporting the jammy version of git to bionic (if that is at all possible) to see whether that resolves the issue. So fixing this is going to take more time.

Peter Sabaini (peter-sabaini) said :
#13

Hey @Guruprasad, thanks for the update (and bummer about that fix).

Just for the avoidance of doubt, is the jammy->bionic git backport something you're planning to implement?

Cheers!

Guruprasad (lgp171188) said :
#14

Hi Peter, yes, I am going to try backporting the jammy version of git to bionic, if that is at all possible.
