
Posts by Claggy

1) Message boards : Problems and Bug Reports : Intel GPU tasks error out (Message 113281)
Posted 28 Jul 2014 by Claggy
Post:
Edit: Never mind. Half an hour later, it got some work. Odd. But no longer an issue.


Maybe not a one-off quirk after all. My TITAN hasn't been able to get any work for half a day now. It asks, and the server says it gets nothing, with no explanation why.

If you look at the task list you'll see the reason why: all its tasks were aborted via the GUI yesterday. Now it's got a small allowance for a few of the app versions,
and it has managed to pick up 5 tasks today.

Claggy
2) Message boards : News : Project server code update (Message 113276)
Posted 26 Jul 2014 by Claggy
Post:
My Ubuntu C2D T8100 laptop has been crunching both Astropulse_v7 and Gamma-ray pulsar search #3 v1.12 (FGRPSSE) tasks at the same time.
The Astropulse tasks from the four app_versions were each initially estimated at something like one hundred and fifty hours; once their 100 validations were in, their estimates dropped to a value below reality.

All tasks for computer 68093

Application details for host 68093

With Gamma-ray pulsar search #3 v1.12 (FGRPSSE) the same has happened: the task durations are also underestimated, meaning BOINC over-fetches and can't complete the tasks in time (I think it underestimated from the start, though).
Now that its validations have passed 11, BOINC has a better grasp of how long these tasks take, hasn't fetched so many, and is slowly catching up again.
Its cache setting is about one to one and a half days (it's remote from me at the moment).

All tasks for computer 10230

Application details for host 10230

Shouldn't the estimates after the 100 overall validations, but before the 11 per-host app_version validations, still be a bit conservative, so they don't cause over-fetch?
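To see why an underestimate causes over-fetch, here's a minimal sketch of the arithmetic (illustrative only, not the actual BOINC work-fetch code, and the numbers are made up):

# Illustrative only: an underestimated task duration inflates work fetch.
# The client fetches enough tasks to fill its cache based on the
# *estimated* duration, but drains the cache at the *real* rate.

cache_seconds = 1.5 * 24 * 3600   # ~1.5 day cache setting
real_duration = 12 * 3600         # a task really takes ~12 hours
est_duration = 2 * 3600           # ...but is estimated at ~2 hours (made up)

tasks_fetched = cache_seconds / est_duration   # what the client asks for
tasks_doable = cache_seconds / real_duration   # what it can actually finish

print("fetched ~%.0f tasks, can complete ~%.0f" % (tasks_fetched, tasks_doable))
# fetched ~18 tasks, can complete ~3 -> deadline trouble until estimates adjust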

Claggy
3) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113256)
Posted 11 Jul 2014 by Claggy
Post:
That WU is completed and is showing as 'Completed, waiting for validation', but the 'In progress' wingman is an OpenCL 1.1 Intel GPU, so it's either going to be inconclusive or a validate error.

A lot of my other WUs were round-1 inconclusives with OpenCL 1.1 Intel GPUs, so they should validate straight away.

Claggy
4) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113255)
Posted 11 Jul 2014 by Claggy
Post:
I hope the validate error problem has been fixed; I'm just about to start my first, and all my wingmen that have completed this WU already have validate errors:

https://albert.phys.uwm.edu/workunit.php?wuid=603947

Otherwise this is going to be another really long journey.

Claggy
5) Message boards : Problems and Bug Reports : Getting Validate errors on Gamma-ray pulsar search #3 tasks (Message 113254)
Posted 11 Jul 2014 by Claggy
Post:
Has anyone done anything to fix this?

Claggy
6) Message boards : Problems and Bug Reports : Getting Validate errors on Gamma-ray pulsar search #3 tasks (Message 113248)
Posted 10 Jul 2014 by Claggy
Post:
Over the last couple of days my three main hosts here have been picking up validate errors; my wingmen that complete the tasks get validate errors too:

All Gamma-ray pulsar search #3 tasks for computer 8143

All Gamma-ray pulsar search #3 tasks for computer 9008

All Gamma-ray pulsar search #3 tasks for computer 10230

Claggy
7) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113244)
Posted 6 Jul 2014 by Claggy
Post:
Exactly. Specifically during what we are calling "stage 2 of the onramp", between 100 global validations for the project as a whole, and 11 local validations for the individual host - the phase during which flops determined by "PFC avg" can be seen in the server logs.

And to get to those 100 global validations and 11 local validations, tasks need to validate; having masses of hosts throwing inconclusives into the works is slowing down the process of recovering from the -197 errors, at least for Gamma-ray pulsar search #3.
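For anyone following along, a rough sketch of those onramp stages (my reading of the behaviour described above, not the actual scheduler code; the 100/11 thresholds are the figures quoted):

# Rough sketch (not the real scheduler code) of which speed figure feeds
# a task's duration estimate, where est_duration = wu_flops / speed.

def estimation_speed(app_validations, host_av_validations,
                     default_speed, pfc_avg_speed, host_apr):
    if app_validations < 100:
        # Stage 1: too little project-wide data; a default is used.
        return default_speed
    if host_av_validations < 11:
        # Stage 2 of the onramp: speed derived from the project-wide
        # "PFC avg" -- the phase where the wild estimates show up.
        return pfc_avg_speed
    # Stage 3: enough completions, so the host app_version's own
    # Average Processing Rate (APR) takes over.
    return host_apr

def est_duration(wu_flops, speed):
    return wu_flops / speed   # seconds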

Claggy
8) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113238)
Posted 6 Jul 2014 by Claggy
Post:
Got my first invalid, where my task was matched against two Intel GPUs running OpenCL 1.1:

Workunit 603716

Claggy
9) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113231)
Posted 5 Jul 2014 by Claggy
Post:
And to put the icing on the cake, all three of my returned results have been paired with different anonymous Intel HD 2500 GPUs running with the dodgy OpenCL 1.1 driver that Claggy noticed. Inconclusive, the lot of them. It's going to take a while to get the server averages back into kilter...

I've got something like 26 inconclusives spread across all these Intel GPU hosts, all of them running OpenCL 1.1 drivers; most of them are anonymous, with an i3-3220, HD Graphics 2500, and BOINC 7.0.64:

https://albert.phys.uwm.edu/show_host_detail.php?hostid=4792
https://albert.phys.uwm.edu/show_host_detail.php?hostid=5414
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9043
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9046
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9048
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9041
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9045
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9089
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9090
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9091
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9094
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9095
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9099
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9101
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9106
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9114
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9115
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9119
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9122
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9129
https://albert.phys.uwm.edu/show_host_detail.php?hostid=10714

Can we have an OpenCL 1.2 requirement put into the FGRPopencl-intel_gpu app please?
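The gate I have in mind is something like the sketch below (illustrative only; on a real BOINC server it would live in the scheduler's plan-class logic, and the device-version parsing here is simplified):

# Illustrative sketch of the requested requirement: reject hosts whose
# Intel GPU reports an OpenCL device version below 1.2.

def meets_opencl_requirement(device_version, minimum=(1, 2)):
    # device_version looks like "OpenCL 1.1" or "OpenCL 1.2 AMD-APP (1348.5)"
    parts = device_version.split()
    if len(parts) < 2 or parts[0] != "OpenCL":
        return False
    major, minor = (int(x) for x in parts[1].split(".")[:2])
    return (major, minor) >= minimum

assert meets_opencl_requirement("OpenCL 1.2 AMD-APP (1348.5)")
assert not meets_opencl_requirement("OpenCL 1.1")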

Claggy
10) Message boards : News : Project server code update (Message 113219)
Posted 2 Jul 2014 by Claggy
Post:
Yeah, I've got something similar: 13.01 credits for 150 minutes of HD7770 work:

https://albert.phys.uwm.edu/workunit.php?wuid=619367

Claggy
11) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113205)
Posted 1 Jul 2014 by Claggy
Post:
My HD7770's estimates have just got to the point where one of the apps for Binary Radio Pulsar Search (Perseus Arm Survey) now completes without error;
the other app version is still erroring out at 422 seconds.

All Binary Radio Pulsar Search (Perseus Arm Survey) tasks for computer 8143

Claggy
12) Message boards : News : Project server code update (Message 113203)
Posted 1 Jul 2014 by Claggy
Post:
I wonder why Claggy's laptop gets such variable credit?


Multiple tasks on a smaller GPU, each running longer, will generate higher raw peak flop claims (PFCs), which are then averaged with the wingman's (yellow triangle on the dodgy diagram). So the result can be anywhere from the normal range to jackpot, as we previously assessed, depending on the wingman's claim. Though the prevalence of the jackpot conditions is less obvious, the noise in the system is still there.

I'm just running a single GPU task on both my GPU hosts (the T8100's 128MB 8400M GS doesn't count).
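For reference, the claim arithmetic described in the quote works roughly like this (a simplified sketch, assuming a raw claim of peak flops times elapsed time; not the actual CreditNew code, and the figures are made up):

# Simplified sketch of the "raw peak flop claim" (PFC) jackpot effect:
# the assumption is pfc ~ device peak flops * elapsed seconds.

def raw_pfc(peak_gflops, elapsed_s):
    return peak_gflops * 1e9 * elapsed_s

fast_host = raw_pfc(3584.0, 600)    # big GPU, one task, 10 minutes
slow_host = raw_pfc(200.0, 30000)   # small GPU, several tasks at once,
                                    # so each runs ~8.3 hours wall-clock

# The two claims get averaged, so one inflated wingman claim drags the
# workunit's credit (and the server's running averages) upward:
averaged = (fast_host + slow_host) / 2
print("%.3g vs %.3g -> averaged %.3g" % (fast_host, slow_host, averaged))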

Claggy
13) Message boards : News : Project server code update (Message 113178)
Posted 29 Jun 2014 by Claggy
Post:
I can't quickly find the client GFLOPS peak number for Claggy's ATI 'Capeverde' with "based on PFC avg: 34968.78G". I'd like to look for the variable (presumably a struct member) where we might expect GFLOPS peak to be stored, and see what it's multiplied by in those initial stages before 11 completions establish an APR. We might expect 0.1 from the words, but we seem to be using >10 by the numbers.

17/06/2014 18:17:17 | | CAL: ATI GPU 0: AMD Radeon HD 7700 series (Capeverde) (CAL version 1.4.1848, 1024MB, 984MB available, 3584 GFLOPS peak)
17/06/2014 18:17:17 | | OpenCL: AMD/ATI GPU 0: AMD Radeon HD 7700 series (Capeverde) (driver version 1348.5 (VM), device version OpenCL 1.2 AMD-APP (1348.5), 1024MB, 984MB available, 3584 GFLOPS peak)
17/06/2014 18:17:17 | | OpenCL CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (OpenCL driver vendor: Advanced Micro Devices, Inc., driver version 1348.5 (sse2,avx), device version OpenCL 1.2 AMD-APP (1348.5))
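For what it's worth, the ratio is easy to check from the figures above: 34968.78 / 3584 ≈ 9.76, so the PFC-avg-based figure is roughly ten times the client's calculated peak, not the tenth of it the wording would lead us to expect.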

Claggy
14) Message boards : News : Project server code update (Message 113170)
Posted 29 Jun 2014 by Claggy
Post:
Ah, alright.
Yeah, only interested in fixing current code, rather than diagnosing/patching old versions :)

I was thinking that they were using Einstein customisations here that might not be needed; looking at robl's Einstein log shows it's the durations that get scaled there:

http://einstein.phys.uwm.edu/hosts_user.php?userid=613597

2014-06-29 09:28:50.6296 [PID=17986] [send] [HOST#7536795] Sending app_version 483 einsteinbinary_BRP5 7 139 BRP5-cuda32-nv270; 49.97 GFLOPS
2014-06-29 09:28:50.6312 [PID=17986] [send] est. duration for WU 193304662: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6312 [PID=17986] [HOST#7536795] Sending [RESULT#443159459 PB0024_00191_182_0] (est. dur. 18527.18 seconds)
2014-06-29 09:28:50.6324 [PID=17986] [send] est. duration for WU 193307638: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6324 [PID=17986] [send] [WU#193307638] meets deadline: 18527.18 + 18527.18 < 1209600
2014-06-29 09:28:50.6332 [PID=17986] [send] [HOST#7536795] Sending app_version 483 einsteinbinary_BRP5 7 139 BRP5-cuda32-nv270; 49.97 GFLOPS
2014-06-29 09:28:50.6347 [PID=17986] [send] est. duration for WU 193307638: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6347 [PID=17986] [HOST#7536795] Sending [RESULT#443165551 PB0024_00141_24_0] (est. dur. 18527.18 seconds)
2014-06-29 09:28:50.6356 [PID=17986] [send] est. duration for WU 193249827: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6356 [PID=17986] [send] [WU#193249827] meets deadline: 37054.37 + 18527.18 < 1209600
2014-06-29 09:28:50.6364 [PID=17986] [send] [HOST#7536795] Sending app_version 483 einsteinbinary_BRP5 7 139 BRP5-cuda32-nv270; 49.97 GFLOPS
2014-06-29 09:28:50.6380 [PID=17986] [send] est. duration for WU 193249827: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6381 [PID=17986] [HOST#7536795] Sending [RESULT#443038987 PB0023_01561_144_0] (est. dur. 18527.18 seconds)
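(Note that every estimate in that log is scaled by the same factor, 18527.18 / 9004.88 ≈ 2.06, i.e. a single per-host multiplier, presumably that host's duration correction factor.)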

Claggy
15) Message boards : News : Project server code update (Message 113168)
Posted 29 Jun 2014 by Claggy
Post:
I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.

Doesn't the main project have this adjustment because they have a single DCF there? But we don't use DCF here, so this adjustment shouldn't be used?

Claggy


There is no adjustment; the adjustment is a lie. <dont_use_dcf/> is hard-wired active for all clients >= 7.0.28.

But only on projects that don't use DCF; Einstein on my i7-2600K/HD7770 has a DCF of:

<duration_correction_factor>1.267963</duration_correction_factor>

Claggy


Well, you've lost me there, because every scheduler reply to a >= 7.0.28 client, according to the scheduler code, pushes <dont_use_dcf/> [and there is no configuration switch for it].

Einstein has an older scheduler than Albert (or at least an older server version):

29/06/2014 11:45:58 | Einstein@Home | sched RPC pending: Requested by user
29/06/2014 11:45:58 | Einstein@Home | [sched_op] Starting scheduler request
29/06/2014 11:45:58 | Einstein@Home | Sending scheduler request: Requested by user.
29/06/2014 11:45:58 | Einstein@Home | Not requesting tasks: "no new tasks" requested via Manager
29/06/2014 11:45:58 | Einstein@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
29/06/2014 11:45:58 | Einstein@Home | [sched_op] ATI work request: 0.00 seconds; 0.00 devices
29/06/2014 11:46:00 | Einstein@Home | Scheduler request completed
29/06/2014 11:46:00 | Einstein@Home | [sched_op] Server version 611
29/06/2014 11:46:00 | Einstein@Home | Project requested delay of 60 seconds
29/06/2014 11:46:00 | Einstein@Home | [sched_op] Deferring communication for 00:01:00
29/06/2014 11:46:00 | Einstein@Home | [sched_op] Reason: requested by project
29/06/2014 11:46:05 | Albert@Home | sched RPC pending: Requested by user
29/06/2014 11:46:05 | Albert@Home | [sched_op] Starting scheduler request
29/06/2014 11:46:05 | Albert@Home | Sending scheduler request: Requested by user.
29/06/2014 11:46:05 | Albert@Home | Reporting 2 completed tasks
29/06/2014 11:46:05 | Albert@Home | Not requesting tasks: don't need
29/06/2014 11:46:05 | Albert@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
29/06/2014 11:46:05 | Albert@Home | [sched_op] ATI work request: 0.00 seconds; 0.00 devices
29/06/2014 11:46:08 | Albert@Home | Scheduler request completed
29/06/2014 11:46:08 | Albert@Home | [sched_op] Server version 703
29/06/2014 11:46:08 | Albert@Home | Project requested delay of 60 seconds
29/06/2014 11:46:08 | Albert@Home | [sched_op] handle_scheduler_reply(): got ack for task h1_0997.10_S6Direct__S6CasAf40_997.55Hz_1017_1
29/06/2014 11:46:08 | Albert@Home | [sched_op] handle_scheduler_reply(): got ack for task p2030.20130202.G202.32-01.96.N.b2s0g0.00000_2384_5
29/06/2014 11:46:08 | Albert@Home | [sched_op] Deferring communication for 00:01:00
29/06/2014 11:46:08 | Albert@Home | [sched_op] Reason: requested by project
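A minimal sketch of the difference being discussed, assuming the newer scheduler simply keys the tag off the requesting client's version (illustrative; not the actual scheduler source):

# Newer server code is said to push <dont_use_dcf/> to any client
# >= 7.0.28, with no configuration switch; an older server (such as
# Einstein's version 611 above) predates the feature, so its clients
# keep a live duration_correction_factor.

def scheduler_reply_tags(client_version, server_has_feature):
    tags = []
    if server_has_feature and client_version >= (7, 0, 28):
        tags.append("<dont_use_dcf/>")
    return tags

print(scheduler_reply_tags((7, 2, 42), server_has_feature=True))   # Albert-like
print(scheduler_reply_tags((7, 2, 42), server_has_feature=False))  # Einstein-like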


Claggy
16) Message boards : News : Project server code update (Message 113166)
Posted 29 Jun 2014 by Claggy
Post:
I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.

Doesn't the main project have this adjustment because they have a single DCF there? But we don't use DCF here, so this adjustment shouldn't be used?

Claggy


There is no adjustment; the adjustment is a lie. <dont_use_dcf/> is hard-wired active for all clients >= 7.0.28.

But only on projects that don't use DCF; Einstein on my i7-2600K/HD7770 has a DCF of:

<duration_correction_factor>1.267963</duration_correction_factor>

Albert, of course, has: <dont_use_dcf/>

Claggy
17) Message boards : News : Project server code update (Message 113163)
Posted 29 Jun 2014 by Claggy
Post:
I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.

Doesn't the main project have this adjustment because they have a single DCF there? But we don't use DCF here, so this adjustment shouldn't be used?

Claggy
18) Message boards : News : Project server code update (Message 113141)
Posted 26 Jun 2014 by Claggy
Post:
A lot of my Gamma-ray pulsar search #3 v1.11 results are coming out as inconclusive; in each case they are matched with an Intel GPU, and in each case that Intel GPU is running OpenCL 1.1 drivers. Shouldn't that app be restricted to Intel GPUs with OpenCL 1.2 drivers?

Validation inconclusive Gamma-ray pulsar search #3 tasks for computer 8143

Claggy
19) Message boards : News : Project server code update (Message 113139)
Posted 25 Jun 2014 by Claggy
Post:

I think that's further evidence of the kind of instability we need to cure.


Yes, local estimates need to be responsive to running conditions. It's unfortunate that the existing mechanism for that was disabled instead of completed/fixed.

Seti Beta deployed the Blankit-based optimised AP v7 yesterday; there the estimates are the other way round.
My Ubuntu 12.04 C2D T8100 took ~12 hours on its first WU; shame the estimates start at ~228 hours.

Claggy


LoL, do the results mix with traditional non-Blankit versions? That's going to mess with cross-app normalisation big time.

AP v6 should only mix with AP v6, and AP v7 should only mix with AP v7.

Even better: the SSE2 app is optimised, the SSE app is non-optimised; the difference in runtimes is going to be huge. I see carnage ahead.
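To spell out why that mix is carnage, here's a rough sketch of cross-version normalisation as I understand it (simplified; not the actual CreditNew validator code, and the app_version names and averages are made up):

# A version's claim is scaled by (most efficient version's average PFC /
# this version's average PFC), so slow unoptimised builds don't out-claim
# optimised ones -- but only after enough validated results have settled
# the averages.

version_pfc_avg = {
    "AP7-sse2-opt": 1.0e15,  # optimised build: low average PFC per task
    "AP7-sse": 5.0e15,       # unoptimised build: ~5x the flops, same work
}

def normalised_claim(raw_pfc, version):
    best = min(version_pfc_avg.values())   # most efficient version
    return raw_pfc * (best / version_pfc_avg[version])

# Once settled, both versions claim comparably for the same workunit:
print(normalised_claim(5.0e15, "AP7-sse"))       # ~1e15
print(normalised_claim(1.0e15, "AP7-sse2-opt"))  # ~1e15
# Until the averages converge, the 5x runtime gap feeds straight into
# wild credit and estimate swings.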

Claggy
20) Message boards : News : Project server code update (Message 113137)
Posted 25 Jun 2014 by Claggy
Post:

I think that's further evidence of the kind of instability we need to cure.


Yes, local estimates need to be responsive to running conditions. It's unfortunate that the existing mechanism for that was disabled instead of completed/fixed.

Seti Beta deployed the Blankit-based optimised AP v7 yesterday; there the estimates are the other way round.
My Ubuntu 12.04 C2D T8100 took ~12 hours on its first WU; shame the estimates start at ~228 hours.

Claggy





This material is based upon work supported by the National Science Foundation (NSF) under Grant PHY-0555655 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2024 Bruce Allen for the LIGO Scientific Collaboration