Restore transfer bug (last dev version)

Asked by Bartek

Hello,

I'm afraid the dev version of the client sometimes has problems resuming interrupted file transfers. It's easy to spot with big files.
If the correct file size is 200 MB and only 50 MB has been downloaded, after a restart the client appears to download the whole file and append it to the partially downloaded piece, so the result is a 250 MB file. After the MD5 check the file is correctly detected as wrong and downloaded again.
But sometimes resuming works fine.

Question information

Language:
English
Status:
Solved
For:
Xibo
Assignee:
No assignee
Solved by:
Bartek

This question was reopened

Revision history for this message
Dan Garner (dangarner) said :
#1

Thanks for doing this testing, it is very useful!

So if you have the file partially downloaded, cut the client (or the LAN I guess) then restart the client (or plug the LAN back in) the client will sometimes append the whole file onto the partial download?

So if the file is 10MB and you stop it at 5MB, after the restart you will end up with a 15MB file? Which presumably will then get deleted and replaced with the proper 10MB file when the next collection interval occurs?

I wonder if for some reason the file is locked when it tries to resume (which is not really a resume, but a delete and add). I'll take a look and see what I can find!
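A minimal sketch of how the reported sizes could arise, assuming (this is a guess, the thread does not state the actual cause) that the partial file is opened in append mode while the download restarts from chunk 0. The helper names, the file name and the 500 * 1024 byte chunk size are all illustrative.

```python
import os
import tempfile

CHUNK = 500 * 1024  # assumed meaning of the "500kb" chunks mentioned in this thread
MB = 1024 * 1024


def fetch_all_chunks(path, source, mode):
    """Write every chunk of `source` to `path`, starting from chunk 0."""
    with open(path, mode) as fh:
        for offset in range(0, len(source), CHUNK):
            fh.write(source[offset:offset + CHUNK])
    return os.path.getsize(path)


def simulate(total_mb=10, partial_mb=5):
    """Return the final size in MB for append mode and truncate mode."""
    source = bytes(total_mb * MB)  # stand-in for the file on the server
    sizes = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "150.mpeg")
        for mode in ("ab", "wb"):
            # Recreate the interrupted download before each attempt.
            with open(path, "wb") as fh:
                fh.write(source[:partial_mb * MB])
            sizes[mode] = fetch_all_chunks(path, source, mode) // MB
    return sizes
```

With the defaults, append mode ("ab") ends at 15 MB and truncate mode ("wb") at the expected 10 MB, which matches the 5 MB + 10 MB example above.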

Thanks again

Revision history for this message
Bartek (czajka) said :
#2

OK, maybe I'll describe the whole scenario:

We have a dummy layout, 'Whole library' (WL), where all MPEG files are placed. I use it to sync a client so that it has all the materials before it is sent to its destination location. When the WL layout has downloaded correctly, the default layout in management is changed to the correct one and the terminal leaves the office.

Right now there is ~20 GB of stuff.
I was preparing a new terminal some days ago and started to sync WL. When a few GB had already been downloaded, I clicked the file that was downloading at that moment in the Windows file manager. Let's call it 150.mpeg. The file was then probably locked by Windows or something. Xibo left it unfinished (50/200 MB) and proceeded to the next file.
Some time later I decided to switch off the computer during the transfer, so another file, 210.mpeg, was only partially downloaded, let's say 150/250 MB.
Those files were downloaded in 500 KB chunks.

After turning Xibo on the next day, the client started the procedure from the beginning. Xibo correctly spotted that 150 was unfinished, and the file started to grow from 50 MB (not from 0). It reached a size much bigger than it should have: 250 MB instead of 200. Then it was deleted, probably after the MD5 check, and the client started downloading it from the beginning, but using some chunk size other than 500 KB. I don't know if that's important.
When the file was downloaded again from size 0, the download completed successfully.
The next files, 151-209, were correctly skipped as already downloaded OK.

The scenario repeated when Xibo got to 210.mpeg. As before, the file started to grow and reached a size much bigger than it should have. Then it was deleted and downloaded again from size 0 without problems.
The following files downloaded OK, and the layout downloaded OK to the end.

One more thing I can add: this effect also exists in the previous (first) version of the file corruption patch written for Mariusz. We are already using it on 75% of our terminals right now (1.0.4 was crashing on connection problems, which is the main reason, and we need the file restore function: better untested and buggy than none).

Revision history for this message
Bartek (czajka) said :
#3

Observations were made in Explorer in details mode, sorted by date (so the interesting files were at the top of the list and easy to monitor). The window was refreshed with Ctrl+R, and no files were clicked during the observations (on the second day, of course :)), so I think there was no chance that the files were locked in some way and that the download was appended instead of overwritten because of that.

Revision history for this message
Dan Garner (dangarner) said :
#4

Thanks again for your testing in this area, I think I have enough info to look into this now. I suspect that the client just needs to delete the current file if it detects that it is not correct.
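The "delete the file if it is not correct" idea can be sketched as below. This is only an illustration in Python, not the client's actual code; `md5_of` and `keep_or_delete` are hypothetical names, and it assumes the expected MD5 for each file is already known from the server.

```python
import hashlib
import os


def md5_of(path, block_size=1024 * 1024):
    """Hash the file in blocks so large media files are not read into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def keep_or_delete(path, expected_md5):
    """Return True if `path` exists and matches; delete it and return False otherwise."""
    if not os.path.exists(path):
        return False
    if md5_of(path) == expected_md5.lower():
        return True
    os.remove(path)  # wrong size or content: the next download starts from an empty file
    return False
```

Deleting before the next attempt means a restarted download can never be appended to stale data, at the cost of fetching the whole file again.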

Revision history for this message
Dan Garner (dangarner) said :
#5

Found the problem, this will be fixed by 1.0.5.

Cheers,
Dan

Revision history for this message
Bartek (czajka) said :
#6

One last thing on this subject :)

As far as I know, the client syncs downloaded files every 500 KB. How does it work inside? Does the client simply ask the server once to start streaming file X over HTTP, or does it keep asking the server every 500 KB, like: give me 500 KB chunk no. Y of file X?

I guess that the first answer is correct. But if it's the second, maybe some very simple transfer resume support could be implemented, e.g. when size(file on disk) << size(file on server) -> the client was probably shut down during the transfer -> open the file for writing and move the write pointer 1 MB back, to avoid resuming after a gap left by an unfinished write, optionally aligning to the 500 KB block size -> start downloading from there.
Of course there is no per-chunk checksum verification, sometimes the file will be damaged in some other way, etc., but it's probably statistically better than starting from 0, and as stable as deleting the file and starting from the beginning ;)
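The offset calculation in this proposal could look roughly like the sketch below. The function name is made up, "500 KB" is assumed to mean 500 * 1024 bytes, and "<<" is treated simply as "smaller than"; none of this is taken from the client's code.

```python
CHUNK_SIZE = 500 * 1024   # assumed chunk size in bytes
BACKOFF = 1024 * 1024     # step back 1 MB to skip a possibly half-written tail


def resume_offset(local_size, server_size, chunk_size=CHUNK_SIZE, backoff=BACKOFF):
    """Return the byte offset at which to continue downloading.

    0 means "start from the beginning", either because too little was
    downloaded or because the local file is not a plausible partial copy.
    """
    if local_size >= server_size:
        return 0
    safe = max(local_size - backoff, 0)
    # Align down to a chunk boundary so whole chunks can be requested.
    return (safe // chunk_size) * chunk_size
```

For the 50 MB of 200 MB case from the first post this gives an offset of 51,200,000 bytes, i.e. the download would continue from chunk number 100 rather than from 0.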

Revision history for this message
Dan Garner (dangarner) said :
#7

It is option 2.

I think that if we were to implement some resume support into the client we would want to checksum each 500Kb block so we knew exactly what was going on. There is already a blueprint for this here: https://blueprints.launchpad.net/xibo/+spec/dotnetclient-getfile-improvements
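A rough sketch of what per-block checksumming could look like, assuming the server could hand the client a list of MD5 digests, one per 500 KB block. That list is hypothetical: nothing in this thread says the server provides it. The function names are illustrative, and the data is held in memory only to keep the example short.

```python
import hashlib

CHUNK_SIZE = 500 * 1024  # assumed block size in bytes


def chunk_digests(data, chunk_size=CHUNK_SIZE):
    """MD5 of each block of `data` (what a server could publish per file)."""
    return [hashlib.md5(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def first_bad_chunk(local_data, expected_digests, chunk_size=CHUNK_SIZE):
    """Index of the first block that is missing, short or corrupt.

    Returns len(expected_digests) when every block matches, meaning the
    local copy is complete.
    """
    for index, expected in enumerate(expected_digests):
        chunk = local_data[index * chunk_size:(index + 1) * chunk_size]
        if hashlib.md5(chunk).hexdigest() != expected:
            return index
    return len(expected_digests)
```

The returned index tells the client exactly which block to request next, so an interrupted transfer resumes at the first unverified block and a corrupted one restarts from the point of damage.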

I understand your goal - having unstable connections as you do you are thinking that more often than not there is 0 corruption, the file just didn't complete before the connection dropped. Starting again may not solve the problem - as the connection may just drop again.

I will have a look at your suggestion and see if something is possible in the 1.0.5 code base (if it is I will update the blueprint I mentioned).

Revision history for this message
Alex Harrington (alexharrington) said :
#8

Hi Bartek

Dan and I discussed that but felt that it falls outside a bugfix so will be implemented in a development version.

The changes we've made for 1.0.5 are way beyond what I would consider reasonable in a stable release series already and had I realised the extent of those changes at the outset I would have pushed harder for all these fixes to go to a development cycle.

I'm therefore not keen on introducing yet another change.

Alex

--- original message ---
From: "Bartek" <email address hidden>
Subject: Re: [Question #93734]: Restore transfer bug (last dev version)
Date: 14th December 2009
Time: 4:50:01 pm

Question #93734 on Xibo changed:
https://answers.launchpad.net/xibo/+question/93734


Revision history for this message
Bartek (czajka) said :
#9

OK, deleting the file is better than leaving it broken, so it's also an acceptable solution.

I mentioned this very simple way of resuming because it could be done without any heavy change to the server or the client. I'd even say that resuming from chunk n is almost no change compared to deleting the file and starting from 0. In theory it's one size comparison inside an if, plus setting the counter and the write pointer to a position other than 0, then using the download code that already exists.
There is only one change of action: instead of deleting the file, the client needs to set the pointer somewhere inside it and keep writing.

Checking chunk MD5s probably needs a rethink of the whole procedure and, as far as I can guess, modifications on the server side, etc. It's probably also easy and short to code, but once you add up all the small changes like that, which aren't exactly bug fixes, we end up with an unchecked client that no longer resembles the 1.0.4 codebase, and that's not my goal.

Revision history for this message
Dan Garner (dangarner) said :
#10

It may be a small code change - but it is a big behaviour change.

The client assumes that if the file is there then it is good and will therefore try to play it. Now it should fail to play corrupt media and move on - but the overall assumption is that if it is there then it's OK. Which is why we delete files that fail the MD5 check when download is complete.

We could not consider leaving partially downloaded files unless we implemented a "validity" check on all layouts to ensure that all resources were correct. In other words - as long as the assumption "if it's there then it's OK" exists within the client, we cannot implement the resume functionality.

The Python client already has this solved in the long term - and this is definitely something we will be considering for 1.1 of the C# client.

Your input on this issue and the testing you have done is great - if you would transfer your thoughts on to the development blueprint (and maybe even a spec of how you think it should work) then that would be great... but we cannot put resume into 1.0 series of code, sorry!

Revision history for this message
Bartek (czajka) said :
#11

That sounds reasonable. I don't understand it 100%, but if you say that it can't be done, you're probably right ;)
As I've said many times, deletion is also a good solution, probably more expensive, but you (me?) can't have everything ;)

Revision history for this message
Bartek (czajka) said :
#12

I forgot to add:

You'll probably describe the blueprint better, if you think it's needed.
I'm only guessing at what can or can't be done.

Right now the functionality is OK in my opinion. The additions I've written about are extreme situations that shouldn't normally happen: file deletion from an existing layout, and files broken as a result of clicking them during transfer ;)
But I think it was worth mentioning ;)

And it's nice to hear that there is a chance of a 1.1 .NET client. 2x more bugs to hunt than in a single Python client ;)

Revision history for this message
Bartek (czajka) said :
#13

Hello,

Dan, I have one more question on this subject.
It's a GENERAL question, not targeted at 1.0.5, not even at stable :)

As far as I understand your explanations (many thanks for them!), there are a lot of architectural limitations due to the simple client file logic: there is a file -> it's OK, let's play it; there is no file -> we need to download it.

Maybe there is a simple solution that (again ;)) needs almost no new code, only changes to existing code. And that's why I need your knowledge as the code's author :)

Let's theoretically change the download routine to name incomplete files 'file.incomplete' (e.g. 145.mpeg.incomplete) instead of 'file'.
When the file has been correctly downloaded and checksummed, rename it back to 'file'.
When the file is later detected as broken by the file corruption routines, rename it back to 'file.incomplete' (or delete it if size == server size) until it is fixed.

So then, from the point of view of the rest of the routines, which are not '.incomplete'-aware, there is simply no file until it has been downloaded and checked by the modified file collector.
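A small sketch of the '.incomplete' naming idea, written in Python purely for illustration (the client discussed here is C#). `store_download` and `playable_files` are hypothetical names, and `chunks` stands for whatever the existing download code yields.

```python
import hashlib
import os

SUFFIX = ".incomplete"


def store_download(library_dir, name, chunks, expected_md5):
    """Write chunks to 'name.incomplete'; rename to 'name' only if the MD5 matches."""
    final_path = os.path.join(library_dir, name)
    temp_path = final_path + SUFFIX
    digest = hashlib.md5()
    with open(temp_path, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
            digest.update(chunk)
    if digest.hexdigest() != expected_md5:
        return False  # stays '.incomplete', so playback code never sees it
    os.replace(temp_path, final_path)  # the file appears under its real name in one step
    return True


def playable_files(library_dir):
    """What routines that know nothing about '.incomplete' would see in the library."""
    return sorted(n for n in os.listdir(library_dir) if not n.endswith(SUFFIX))
```

Because the rename happens only after the checksum passes, the existing "if it's there then it's OK" assumption keeps holding for every file that carries its final name.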

Advantages:
- the _simple_ file resume mentioned before can probably be implemented without any change on the server side or in the display routines.
- no problems with playing incomplete files (as in points 4 and 5 in the related subject) until they are completed: if no file is named as complete, it is simply skipped and the client doesn't hang (of course, incomplete as a result of server -> client transfer, not user -> server).
- handling the events that could break a transfer is not really necessary, because file errors will have no visible effects: the file is going to be skipped during playback and probably downloaded again when the client is next started, or when a new layout requests that file, and that's fine as a workaround.
- a more human-readable library: it's nice to see at first glance which files are incomplete instead of digging through file sizes.
- maybe something else ;)

So the question is: do you think the client could work as described?
If the answer is yes, maybe you could write such a one-off branch (maybe a donation is an option ;)) after the 1.0.5 release, or we can hire a programmer who will implement these changes for us. Then we can stick to that version in production (there will probably be no need to change to versions >= 1.0.6, as all the bugs spotted in 1.0.4 with our usage profile are going to be fixed in 1.0.5) and wait for the stable 1.2 release, testing 1.1 in the meantime :)

Revision history for this message
Dan Garner (dangarner) said :
#14

I think we need to talk about objectives before trying to decide on a solution (I agree that using a .incomplete MAY be an option, but I am still not clear on exactly what the behaviour should be).

Here are the objectives as I understand them:

- Client should detect incorrect files and decide if:
   a) They are corrupted and need to be deleted and started again
   b) They were interrupted and need to be resumed from the last chunk (not completely re-downloaded)
- Client shouldn't try to play any files that have not been successfully checksummed.
- If a layout which is already playing changes (becomes invalid), it should continue to play while the updated layout is downloaded.
- If a completed file which has been checksummed fails to play it should be "Black Listed" as corrupt media and not attempted again, unless it is updated on the server.
- Client should expose an easy way to determine which files are complete and valid (not sure about this one - in typical operation users are not expected to be looking in the library)
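To make the first two objectives concrete, they can be written down as a small decision table. This is only one possible reading of the list, not an agreed design; the `Action` names and the inputs to `decide` are invented for the example.

```python
from enum import Enum


class Action(Enum):
    DOWNLOAD = "download from the start (no local file)"
    RESUME = "resume from the last good chunk"
    RESTART = "delete and download again"
    PLAY = "file is valid and may be played"


def decide(local_size, server_size, md5_matches):
    """Pick an action for one file; `local_size` is None when the file is absent."""
    if local_size is None:
        return Action.DOWNLOAD
    if local_size < server_size:
        return Action.RESUME   # interrupted: shorter than the server copy
    if local_size == server_size and md5_matches:
        return Action.PLAY     # only checksummed files are ever played
    return Action.RESTART      # too long, or right size but wrong content
```

Under this reading, the 250 MB file from the original report (server size 200 MB) maps to RESTART, and the 50 MB partial maps to RESUME.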

Feel free to add any objectives I have missed (anyone)... but try not to talk about implementation methods.

Once we have an agreed list of things you would like the client to do I will write a specification for them and we can agree on it / an amount for a bounty if we decide we cannot do it in 1.0.5 and you want it developed specifically for you (and then rolled into a later version).

I hope this makes sense?

Revision history for this message
Bartek (czajka) said :
#15

On Sat, 19 Dec 2009, Dan Garner wrote:

Your list probably makes for a very good and predictable client
concept.
Below I've separated the 'must have' from the 'nice but not necessary' in our
particular situation. Probably not optimal for other users :)

> - Client should detect incorrect files and decide if:
> a) They are corrupted and need to be deleted and started again

If the file was downloaded and checksummed before, probably not really
necessary.
When I was observing the client repairing files, those files were always
appended to because of the bug, and that's probably where we had some
misunderstanding, because I assumed that resuming was a part of 'repairing'.

> b) They were interrupted and need to be resumed from the last chunk (not completely re-downloaded)

Yes, as you know it's quite important in our network.

> - Client shouldn't try to play any files that have not been successfully checksummed.

Yes, checksummed at least once in the past.

> - If a layout which is already playing changes (becomes invalid), it should continue to play
> while the updated layout is downloaded.

It would be nice. It's not really important for me, but probably important
for our layout designers. They don't know what an algorithm is ;), and it's hard
to explain to them how to use the system to avoid splash screens, and why
there is a splash screen at all ;)

> - If a completed file which has been checksummed fails to play it should be "Black Listed"
> as corrupt media and not attempted again, unless it is updated on the server.

Not necessary for us. We check each layout on a local display before
sending it to production displays.
Also, there are no user -> server transfer problems, as a result of combining FTP
upload to a remote PC with a VNC desktop on that PC, where a web browser connected
locally to the server is used to finally click the file upload to Xibo.

> - Client should expose an easy way to determine which files are complete and
> valid (not sure about this one - in typical operation users are not
> expected to be looking in the library)

Not necessary at all, only described as a nice little thing related to the
.incomplete suffix.

> Feel free to add any objectives I have missed (anyone)... but try not to
> talk about implementation methods.

The client IP simply read from $_SERVER['REMOTE_ADDR'], stored in the DB, and
displayed in the client list. I could probably implement it myself, but it's
hard to find the right place in the code, where the client is already recognised
and authenticated :)

I don't have any more ideas right now (maybe with the exception of the last frame
on screen for MPEGs with a specified duration; I haven't checked your bugfix in
that respect, maybe it's already there?).
Generally, all those requirements are the result of two things:

- Our network construction, client terminal concept, number of clients, and
usage profile. You probably know them well already ;)

- I believe in statistics :) There will probably be many more problems with the
underlying hardware and OS than with Xibo itself:

Windows is stored on read-only CF cards; the Xibo library and configs (and
nothing more) are stored on an HDD
with the cache disabled, so there are no unnecessary disk writes.
If a problem affects some already-stored file, Xibo probably
won't be able to fix it. There will probably be some hardware malfunction,
so human intervention will be necessary, regardless of Xibo's efforts
to act nicely...

> I hope this makes sense?

Yes. If you have any further comments, please share :)

PS. Some clients crashed recently on the devel version (compiled somewhere
near the first filecheck version after Mariusz's bug report).
Was the unstable connection -> client crash fix already in there?
If yes, maybe there is some way to debug them and help fix Xibo again
:)


Revision history for this message
Dan Garner (dangarner) said :
#16

OK great - I'll draft a better specification over the next few days and add it to the blueprint [https://blueprints.launchpad.net/xibo/+spec/dotnetclient-improve-filemanagement]. I'll include a "number of days" development effort in case anyone is interested in offering a bounty (assuming we don't decide to put it in 1.0.5). Can you "solve" this question and subscribe to the blueprint for updates.

Regarding the Client IP (which is really another question). What you suggest would work, but is a bit of a hack... the REMOTE_ADDR is not a reliable source of IP (for example when going through a proxy). There is a blueprint for this already [https://blueprints.launchpad.net/xibo/+spec/client-information-central]. If you did want to do your hack then you would put it in XMDS.php inside the RequiredFiles method.
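To illustrate why REMOTE_ADDR alone can mislead behind a proxy, here is a sketch in Python using WSGI-style variable names. The Xibo server itself is PHP, so this is not code for XMDS.php; `client_ip` and `trusted_proxies` are hypothetical, and it assumes a single known proxy that appends the address it saw to the X-Forwarded-For header.

```python
def client_ip(environ, trusted_proxies=frozenset()):
    """Best-effort client address from a WSGI-style environ dict.

    REMOTE_ADDR is the peer of the TCP connection, which is the proxy's
    address when a proxy sits in between. X-Forwarded-For is consulted
    only when that peer is a proxy we explicitly trust, because any
    client can send the header itself.
    """
    remote = environ.get("REMOTE_ADDR", "")
    forwarded = environ.get("HTTP_X_FORWARDED_FOR", "")
    if remote in trusted_proxies and forwarded:
        # The last entry is the one added by the trusted proxy itself.
        return forwarded.split(",")[-1].strip()
    return remote
```

On a fully routed network with no proxy or NAT between client and server, the header is absent and the function simply returns REMOTE_ADDR.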

Before we release 1.0.5 I'll build you a new EXE which you could run (if you don't mind). There have been a few things fixed recently that could have caused those crashes.

Thanks for all the time you are putting into this!

Revision history for this message
Bartek (czajka) said :
#17

OK, I've subscribed to the blueprint. But no rush, it's not very urgent because there is already some quite decent support for broken files.

About the client IP: of course you're right. There are a lot of different problems and possible combinations (proxy, transparent proxy, NAT, etc.), so it's hard to guess which IP should be exposed in the general case.
For example, when using a client behind NAT: is it better to expose the public IP, where some port redirect could exist, or the client's private IP, which is completely unusable from the server's network? Or maybe there are several NICs with different IPs, some VPN, some Ethernet, something else. So generally it's hard to say what the best option is :)

On the other hand, our network is based on OpenWrt routers (Linksys 3G and Linksys Ethernet) in every single location. Each one has its own IP range on the LAN (there are also IP cameras and, in some locations, IP restarters that need to be reachable, not only Xibo terminals; the camera's/restarter's IP must be reachable even when the Xibo PC dies, so using that PC as the VPN base was not an option :)), and LAN IPs are routed to the central server over VPNs between router and server. So the network is transparent and REMOTE_ADDR is fine. The terminal IP depends only on the router number; the internet connection used is unimportant because of the overlaying VPN. So thanks for your advice, I'll check RequiredFiles soon :)

Crashes in our current production revision (1.0.4 + something ;)) don't happen very often at all. Right now it's one crash every one to two weeks. On the 1.0.4 base it was something like one crash every two days. So I'm going to check the new version with pleasure, but waiting for any test results would take a long time, probably much longer than the expected 1.0.5 release date.