Sending work

Message boards : News : Sending work

Previous · 1 . . . 4 · 5 · 6 · 7

Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 1956
Credit: 6,218,130
RAC: 0
Message 111574 - Posted: 14 Dec 2011, 22:11:07 UTC - in response to Message 111572.  

This is quite a mission impossible from a BOINC perspective, isn't it? If the initial estimate is too far off, the WUs will ALL get terminated prematurely, and the server will NEVER get a valid result with which to adjust its estimate of the computation performance, which is exactly what it needs to set a good max elapsed time limit in the first place!


Yep, that occurred to me, too.

I already added some code to our plan-class stuff that should allow me to play around with the flops estimation a bit. I intend to do this tomorrow, together with some more analysis of the scheduler code (sched_version.cpp).
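For readers unfamiliar with the scheduler internals: a plan class can scale the host's benchmarked FLOPS into a "projected" speed, which the server then uses for runtime estimates and limits. A minimal sketch of that idea (the struct, field names, and the 20%-of-peak cap are illustrative assumptions, not the actual sched_version.cpp code):

```cpp
#include <algorithm>
#include <cmath>

// Illustrative host data; field names are assumptions,
// not the actual BOINC scheduler structures.
struct HostInfo {
    double p_fpops;        // benchmarked FLOPS of one CPU core
    double gpu_peak_flops; // theoretical peak FLOPS of the GPU
};

// A plan class projects the expected app speed from the host's
// benchmark. Overestimating leads to "maximum elapsed time exceeded"
// aborts; underestimating only makes client-side estimates pessimistic.
double projected_flops(const HostInfo& h, double speedup) {
    double flops = h.p_fpops * speedup;
    // Never project above a fraction of theoretical GPU peak:
    // real OpenCL apps reach only a small part of it.
    return std::min(flops, 0.2 * h.gpu_peak_flops);
}
```

Tuning the `speedup` factor is exactly the kind of "playing around" described above: it directly moves both the client's runtime estimate and the server-side abort threshold.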

BM
ID: 111574
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 111575 - Posted: 14 Dec 2011, 22:19:43 UTC
Last modified: 14 Dec 2011, 22:36:02 UTC

@Jord: Well, because I manually fiddled with client_state.xml, a lot of workunits finally completed (and even validated, though I'm not sure that matters), and my card took ca. 4500 s per WU.

I think your card takes around 33k sec to complete a WU, so no matter what the theoretical (computed) peak performance is, the server would be right to assign your card a roughly 7 times lower performance.

Why is your card slower? The debugging output actually gives a hint: because of the physical capabilities of your card, the app was forced to re-size its internal processing layout to make it fit. I'm afraid it will require some deep analysis to find out whether this re-sizing leads to differences in the result of the computation, and whether those differences are tolerable (==> validator adjustment) or intolerable (maybe the re-sizing has a bug).

It would be instructive to see whether the debugging output in question is common to all 4xxx series cards. Ah, but that's a different subject.

@Bernd: my host is now almost out of work (was on nomorework) so I'll give it a try tomorrow or whenever it's ready.

CU
HB
ID: 111575
robertmiles

Joined: 16 Nov 11
Posts: 19
Credit: 4,468,368
RAC: 0
Message 111576 - Posted: 14 Dec 2011, 22:54:54 UTC

Over on GPUGRID, I saw something about them finding that the HD4xxx series cards had some type of memory access problem - a limit on the amount of graphics memory each processor on the GPU can access before it starts using a much lower bandwidth path to the computer's main memory instead. I haven't kept up with whether more recent software updates have removed this restriction.
ID: 111576
Profile pragmatic prancing periodic problem child, left
Joined: 26 Jan 05
Posts: 1639
Credit: 70,000
RAC: 0
Message 111577 - Posted: 14 Dec 2011, 23:14:54 UTC - in response to Message 111575.  

I think your card takes around 33k sec to complete a WU, so no matter what the theoretical (computed) peak performance is, the server would be right to assign your card a roughly 7 times lower performance.

7 or 60? Quite some difference. But OK, I am running with a changed flops value, still only 11 digits long but different from what Albert gave me. Since its estimates are all too low (you're right about the ~32k seconds), I've made it think that the tasks are actually longer, not shorter.

Just too bad I'm still quite busy with Skyrim. That hacks into the time anything else can use the GPU. ;-)
Jord.

BOINC FAQ Service

They say most of your brain shuts down in cryo-sleep. All but the primitive side, the animal side. No wonder I'm still awake.
ID: 111577
Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 1956
Credit: 6,218,130
RAC: 0
Message 111580 - Posted: 15 Dec 2011, 19:02:22 UTC - in response to Message 111577.  
Last modified: 15 Dec 2011, 19:02:56 UTC

How (projected_)flops is calculated differs significantly depending on how many tasks your host has successfully computed with this app version. Maybe this value differs between your hosts.

Anyway, I did change the scheduler (the "projected_flops" supplied by the plan classes should be much lower now). At least it should not overestimate the actual flops anymore, which could lead to "maximum time exceeded" errors. Time estimates on the client side may be far off now, though. Have a try.

BM
ID: 111580
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 111581 - Posted: 15 Dec 2011, 20:44:28 UTC - in response to Message 111580.  
Last modified: 15 Dec 2011, 20:45:37 UTC

Hmmm... I get the same cut-off time as before (even though I reset the Albert project before allowing new work). In addition, the app now seems to be configured to use a full CPU core.

http://albert.phys.uwm.edu/result.php?resultid=66620

CU
HB
ID: 111581
Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 1956
Credit: 6,218,130
RAC: 0
Message 111582 - Posted: 15 Dec 2011, 21:16:35 UTC - in response to Message 111581.  

Ok, scheduler reverted. Needs further investigation.

BM
ID: 111582
Profile pragmatic prancing periodic problem child, left
Joined: 26 Jan 05
Posts: 1639
Credit: 70,000
RAC: 0
Message 111585 - Posted: 16 Dec 2011, 0:32:33 UTC - in response to Message 111580.  
Last modified: 16 Dec 2011, 0:33:15 UTC

How (projected_)flops is calculated differs significantly depending on how many tasks your host has successfully computed with this app version. Maybe this value differs between your hosts.

LOL, like zero times for me? None of the tasks I do validate, remember?

As for testing your over_flops, what do I do with the extra tasks? Stupid BOINC always fetches 6 tasks, no matter that it then takes ~9 days to do them... Though that does give you ~9 days to come up with a better schedule(r). ;-)
Jord.

BOINC FAQ Service

They say most of your brain shuts down in cryo-sleep. All but the primitive side, the animal side. No wonder I'm still awake.
ID: 111585
robertmiles

Joined: 16 Nov 11
Posts: 19
Credit: 4,468,368
RAC: 0
Message 111586 - Posted: 16 Dec 2011, 0:54:09 UTC
Last modified: 16 Dec 2011, 0:55:32 UTC

Have you thought of starting with a certain number of dummy tasks, to be replaced with similar information from tasks actually completed as soon as there are enough of them?

Some BOINC projects limit the number of tasks any computer can have downloaded and in progress at first, with this limit relaxed as soon as there are enough tasks successfully completed by that computer to get a better idea of how often it can handle yet another workunit.
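The BOINC scheduler does have a mechanism along these lines: each host carries a per-day result quota that shrinks when results fail and grows back when they validate. A simplified sketch of that idea (names and the exact halving/doubling policy are loosely modeled on the scheduler's max_results_day handling, not copied from it):

```cpp
#include <algorithm>

// Simplified per-host daily quota, loosely modeled on the BOINC
// scheduler's max_results_day mechanism (details simplified).
struct HostQuota {
    int max_results_day; // current per-day limit for this host
    int project_cap;     // project-wide daily result quota
};

// On an error or invalid result, cut the quota (never below 1),
// so a misbehaving host stops draining work.
void on_bad_result(HostQuota& q) {
    q.max_results_day = std::max(1, q.max_results_day / 2);
}

// On a valid result, grow the quota back toward the project cap.
void on_valid_result(HostQuota& q) {
    q.max_results_day = std::min(q.project_cap, q.max_results_day * 2);
}
```

Under a policy like this, a host returning only invalid work should converge to 1 task per device per day, which is the behavior Jord asks about further down the thread.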
ID: 111586
Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 1956
Credit: 6,218,130
RAC: 0
Message 111592 - Posted: 16 Dec 2011, 13:21:13 UTC - in response to Message 111586.  

Some BOINC projects limit the number of tasks any computer can have downloaded and in progress at first, with this limit relaxed as soon as there are enough tasks successfully completed by that computer to get a better idea of how often it can handle yet another workunit.


That's certainly an option to limit the effect of the runtime estimation / work fetch going mad. But actually I'd like to understand and fix what's going wrong in the first place.

For now I raised the FLOPS estimate, and thus the FLOPS limit, by a factor of 10 for newly generated workunits. It will take some time (usually about 1.5 d) until the first tasks from that batch are sent out, though.

BM
ID: 111592
robertmiles

Joined: 16 Nov 11
Posts: 19
Credit: 4,468,368
RAC: 0
Message 111594 - Posted: 16 Dec 2011, 17:30:54 UTC

I've read that at least some of the BOINC versions never initialize one of the variables often used in runtime estimation. You may want to add reporting of the variables you use so you can check for signs of this.
ID: 111594
Profile pragmatic prancing periodic problem child, left
Joined: 26 Jan 05
Posts: 1639
Credit: 70,000
RAC: 0
Message 111601 - Posted: 17 Dec 2011, 1:23:32 UTC - in response to Message 111592.  

That's certainly an option to limit the effect of the runtime estimation / work fetch going mad. But actually I'd like to understand and fix what's going wrong in the first place.

There is something weird going on with the number of tasks one can get per day. As you can see from my double zero credit & RAC, I haven't had a single task validate yet. So by now, the number of tasks I am allowed to download for the v1.19 app should be 1, maybe 2.

Yesterday it was 26, now it is 32. Why is it going up?
I am not returning any valid work. Shouldn't it, like in the old days, keep going down until it eventually gives me only 1 task per device (CPU core or GPU) per day? As it is now, I can continue doing 'bad work' ad infinitum.
Jord.

BOINC FAQ Service

They say most of your brain shuts down in cryo-sleep. All but the primitive side, the animal side. No wonder I'm still awake.
ID: 111601
Profile Bernd Machenschalk
Volunteer moderator
Project administrator
Project developer
Joined: 15 Oct 04
Posts: 1956
Credit: 6,218,130
RAC: 0
Message 111609 - Posted: 20 Dec 2011, 10:26:47 UTC
Last modified: 20 Dec 2011, 10:29:50 UTC

I incorporated D.A.'s recent fix for using a "conservative flops estimate" in case "we don't have enough statistics" (i.e. too few valid results) into the scheduler running on Albert.

Let's see whether this helps ...
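The shape of that fix, as described here, is a selection between two estimates depending on sample count. A hedged sketch of the logic (threshold and names are illustrative assumptions, not the actual commit):

```cpp
// Illustrative selection between a measured and a conservative FLOPS
// estimate; threshold and names are assumptions, not D.A.'s code.
double estimate_flops(int n_valid_results,
                      double measured_avg_flops,
                      double conservative_flops) {
    const int MIN_SAMPLES = 10; // assumed statistics threshold
    if (n_valid_results < MIN_SAMPLES) {
        // Too little data: use the deliberately low estimate so the
        // runtime limit (rsc_fpops_bound / flops) stays generous and
        // tasks are not aborted before the first valid results arrive.
        return conservative_flops;
    }
    return measured_avg_flops;
}
```

This breaks the chicken-and-egg problem described earlier in the thread: hosts with no validated results get a pessimistic speed, so their tasks can finish and feed the statistics.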

BM

PS: I also added some logging that should write the client's max runtime for every job sent to the scheduler log. You may spot it in the logs for your hosts.
ID: 111609
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 111622 - Posted: 22 Dec 2011, 22:46:41 UTC - in response to Message 111609.  
Last modified: 25 Dec 2011, 21:13:49 UTC

Hi!

I just got this:

2011-12-22 22:39:37.5065 [PID=14669]    [version] Checking plan class 'atiOpenCL'
2011-12-22 22:39:37.5065 [PID=14669]    [version] host_flops: 2.972295e+09, 	speedup: 15.00, 	projected_flops: 4.458442e+10, 	peak_flops: 4.176000e+12, 	peak_flops_factor: 1.00


Still, the estimated CPU time as displayed by boinccmd for such a task is below 50 seconds ... :-( It will actually take almost 100 times longer.
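The numbers in that log line are at least internally consistent: projected_flops is host_flops scaled by the plan class's speedup of 15. The relation itself is visible in the log; the code below is just a calculator for it (the runtime_estimate formula is the standard BOINC relation between fpops estimate and speed, shown with illustrative numbers):

```cpp
#include <cmath>

// Reproduce the relation visible in the scheduler log line:
// projected_flops = host_flops * speedup.
double projected_from_log(double host_flops, double speedup) {
    return host_flops * speedup;
}

// The client's initial runtime estimate is then roughly
// rsc_fpops_est / projected_flops; a projected_flops that is ~100x
// too high yields exactly the absurd "below 50 seconds" estimate.
double runtime_estimate(double rsc_fpops_est, double projected_flops) {
    return rsc_fpops_est / projected_flops;
}
```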

HB
ID: 111622
Profile Bikeman (Heinz-Bernd Eggenstein)
Volunteer moderator
Project administrator
Project developer
Joined: 28 Aug 06
Posts: 1483
Credit: 1,864,017
RAC: 0
Message 111625 - Posted: 25 Dec 2011, 21:16:15 UTC - in response to Message 111622.  

I guess I got a few tasks from the old batch.

Now everything is fine: the runtime estimate is reasonably pessimistic, and tasks validate OK.

HB
ID: 111625
Profile Oliver Behnke
Volunteer moderator
Project administrator
Project developer

Joined: 4 Sep 07
Posts: 130
Credit: 8,545,955
RAC: 0
Message 111646 - Posted: 4 Jan 2012, 11:58:47 UTC - in response to Message 111573.  

No hint from Oliver either. Where is he, by the way? Seems like he evaporated. ;-)


Sort of, holiday season... :-)

Happy new year!
ID: 111646
Profile Oliver Behnke
Volunteer moderator
Project administrator
Project developer

Joined: 4 Sep 07
Posts: 130
Credit: 8,545,955
RAC: 0
Message 111647 - Posted: 4 Jan 2012, 12:02:15 UTC - in response to Message 111575.  

It would be instructive to see whether the debugging output in question is common to all 4xxx series cards. Ah, but that's a different subject.


They will. The 4xxx series doesn't support local memory; it's emulated via global memory, which incurs a big performance penalty. Also, this series only allows 64 work items per work group when local memory is used, hence the re-sizing. However, I doubt that the re-sizing actually affects the accuracy of the computation, but if it does, it needs to be fixed!
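The re-sizing described here can be pictured as shrinking a 2D work-group tile until it fits the device limit. This is only an illustration of the clamping idea, not the app's actual layout code; the 16x16 starting tile is an assumption:

```cpp
// Illustrative work-group "re-size": halve the larger tile dimension
// until the total work items fit the device limit (64 on HD 4xxx
// when local memory is used). Not the actual Albert/Einstein app code.
struct WorkGroup { int x, y; };

WorkGroup fit_work_group(WorkGroup wg, int max_items) {
    while (wg.x * wg.y > max_items) {
        if (wg.x >= wg.y) wg.x /= 2; else wg.y /= 2;
    }
    return wg;
}
```

A 16x16 tile (256 work items) would shrink to 8x8 under a 64-item limit; whether reduced tiles accumulate floating-point results in a different order (and hence produce slightly different output) is exactly the accuracy question raised above.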

Oliver
ID: 111647
oz

Joined: 28 Feb 05
Posts: 10
Credit: 1,285,478
RAC: 0
Message 111660 - Posted: 6 Jan 2012, 20:38:39 UTC

Hi,

I also have a task aborted due to:

exceeded elapsed time limit 19036.53 (28000000.00G/1470.86G)

Afterwards the GPU is in a bad state and requires a reboot. All other downloaded OpenCL tasks are then started by BOINC and immediately aborted with:

Output file p2030.20100913.G44.55+00.20.N.b6s0g0.00000_2424_1_3 for task p2030.20100913.G44.55+00.20.N.b6s0g0.00000_2424_1 absent

This only stops once the daily task quota is reached.
I have previously finished atiOpenCL tasks with ~50000 s runtimes successfully.
System: Linux Ubuntu Oneiric
OpenCL: ATI GPU 0: Juniper (driver version CAL 1.4.1646, device version OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213), 1024MB)
Catalyst 11.11
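The abort message above actually shows the formula at work: the limit is the workunit's fpops bound divided by the flops rate the server assigned to this host (28,000,000 GFLOP / 1470.86 GFLOPS, roughly 19,036.5 s). Checking that arithmetic:

```cpp
#include <cmath>

// The elapsed-time limit reported in the abort message:
// limit = rsc_fpops_bound / assigned_flops, both values
// being visible in the "(28000000.00G/1470.86G)" part.
double elapsed_time_limit(double rsc_fpops_bound, double flops) {
    return rsc_fpops_bound / flops;
}
```

So a host that genuinely needs ~50,000 s is aborted at ~19,000 s unless either the fpops bound is raised or the assigned flops rate is lowered, which is the knob Bernd has been adjusting throughout this thread.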
ID: 111660
Profile Oliver Behnke
Volunteer moderator
Project administrator
Project developer

Joined: 4 Sep 07
Posts: 130
Credit: 8,545,955
RAC: 0
Message 111776 - Posted: 31 Jan 2012, 11:21:05 UTC - in response to Message 111647.  


They will. The 4xxx series doesn't support local memory; it's emulated via global memory, which incurs a big performance penalty. Also, this series only allows 64 work items per work group when local memory is used, hence the re-sizing. However, I doubt that the re-sizing actually affects the accuracy of the computation, but if it does, it needs to be fixed!


Well, it turned out it does indeed! We'll fix it ASAP.

Oliver
ID: 111776
Profile Oliver Behnke
Volunteer moderator
Project administrator
Project developer

Joined: 4 Sep 07
Posts: 130
Credit: 8,545,955
RAC: 0
Message 111779 - Posted: 1 Feb 2012, 12:19:53 UTC - in response to Message 111776.  

Ok, bug fix implemented and tested. We'll release v1.20 shortly...

Oliver
ID: 111779




This material is based upon work supported by the National Science Foundation (NSF) under Grant PHY-0555655 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2024 Bruce Allen for the LIGO Scientific Collaboration