
Posts by Claggy

1) Message boards : Problems and Bug Reports : Intel GPU tasks error out (Message 113281)
Posted 28 Jul 2014 by Claggy
Post:
Edit: Never mind. Half an hour later, it got some work. Odd. But no longer an issue.


Maybe not a one-off quirk after all. My TITAN hasn't been able to get any work for half a day now. It asks, and the server says it gets nothing, with no explanation why.

If you look at the task list you'll see the reason why: all its tasks were aborted via the GUI yesterday. Now it's got a small allowance for a few of the app versions,
and it has managed to pick up 5 tasks today.

Claggy
2) Message boards : News : Project server code update (Message 113276)
Posted 26 Jul 2014 by Claggy
Post:
My Ubuntu C2D T8100 laptop has been crunching both Astropulse_v7 and Gamma-ray pulsar search #3 v1.12 (FGRPSSE) tasks at the same time.
The Astropulse tasks from the four app_versions were each initially estimated at something like one hundred and fifty hours; once their 100 validations were in, their estimates dropped to a value below reality.

All tasks for computer 68093

Application details for host 68093

With Gamma-ray pulsar search #3 v1.12 (FGRPSSE) the same has happened: the task durations are also underestimated, meaning BOINC over-fetches and can't complete the tasks in time (I think it underestimated from the start, though).
Now that its validations have passed 11, BOINC has a better grasp of how long these tasks take, hasn't fetched so many, and is slowly catching up again.
Its cache setting is about one to one and a half days (it's remote from me at the moment).

All tasks for computer 10230

Application details for host 10230

Shouldn't the estimates after the 100 overall validations, but before the 11 per-host app_version validations, still be a bit conservative, so they don't cause over-fetch?
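To see why an underestimate causes over-fetch, here's a minimal sketch of the arithmetic (illustrative only, not the actual BOINC work-fetch code, and the numbers are made up):

# Illustrative only: an underestimated task duration inflates work fetch.
# The client fetches enough tasks to fill its cache based on the
# *estimated* duration, but drains the cache at the *real* rate.

cache_seconds = 1.5 * 24 * 3600   # ~1.5 day cache setting
real_duration = 12 * 3600         # a task really takes ~12 hours
est_duration = 2 * 3600           # ...but is estimated at ~2 hours (made up)

tasks_fetched = cache_seconds / est_duration   # what the client asks for
tasks_doable = cache_seconds / real_duration   # what it can actually finish

print("fetched ~%.0f tasks, can complete ~%.0f" % (tasks_fetched, tasks_doable))
# fetched ~18 tasks, can complete ~3 -> deadline trouble until estimates adjust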

Claggy
3) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113256)
Posted 11 Jul 2014 by Claggy
Post:
That WU is completed and is showing as 'Completed, waiting for validation', but the 'In progress' wingman is an OpenCL 1.1 Intel GPU, so it's either going to be inconclusive or a validate error.

A lot of my other WUs were round-1 inconclusives with OpenCL 1.1 Intel GPUs, so they should validate straight away.

Claggy
4) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113255)
Posted 11 Jul 2014 by Claggy
Post:
I hope the validate error problem has been fixed; I'm just about to start my first, and all my wingmen that have completed this WU already have validate errors:

https://albert.phys.uwm.edu/workunit.php?wuid=603947

Otherwise this is going to be another really long journey.

Claggy
5) Message boards : Problems and Bug Reports : Getting Validate errors on Gamma-ray pulsar search #3 tasks (Message 113254)
Posted 11 Jul 2014 by Claggy
Post:
Has anyone done anything to fix this?

Claggy
6) Message boards : Problems and Bug Reports : Getting Validate errors on Gamma-ray pulsar search #3 tasks (Message 113248)
Posted 10 Jul 2014 by Claggy
Post:
Over the last couple of days my three main hosts here have been picking up validate errors; my wingmen that complete the tasks get validate errors too:

All Gamma-ray pulsar search #3 tasks for computer 8143

All Gamma-ray pulsar search #3 tasks for computer 9008

All Gamma-ray pulsar search #3 tasks for computer 10230

Claggy
7) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113244)
Posted 6 Jul 2014 by Claggy
Post:
Exactly. Specifically during what we are calling "stage 2 of the onramp", between 100 global validations for the project as a whole, and 11 local validations for the individual host - the phase during which flops determined by "PFC avg" can be seen in the server logs.

And to get to those 100 global validations and 11 local validations, tasks need to validate; having masses of hosts throwing inconclusives into the works is slowing down the process of recovering from the -197 errors, at least for Gamma-ray pulsar search #3.
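For anyone following along, a rough sketch of those onramp stages (my reading of the behaviour described above, not the actual scheduler code; the 100/11 thresholds are the figures quoted):

# Rough sketch (not the real scheduler code) of which speed figure feeds
# a task's duration estimate, where est_duration = wu_flops / speed.

def estimation_speed(app_validations, host_av_validations,
                     default_speed, pfc_avg_speed, host_apr):
    if app_validations < 100:
        # Stage 1: too little project-wide data; a default is used.
        return default_speed
    if host_av_validations < 11:
        # Stage 2 of the onramp: speed derived from the project-wide
        # "PFC avg" -- the phase where the wild estimates show up.
        return pfc_avg_speed
    # Stage 3: enough completions, so the host app_version's own
    # Average Processing Rate (APR) takes over.
    return host_apr

def est_duration(wu_flops, speed):
    return wu_flops / speed   # seconds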

Claggy
8) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113238)
Posted 6 Jul 2014 by Claggy
Post:
Got my first invalid, where my task was matched against two Intel GPUs running OpenCL 1.1:

Workunit 603716

Claggy
9) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113231)
Posted 5 Jul 2014 by Claggy
Post:
And to put the icing on the cake, all three of my returned results have been paired with different anonymous Intel HD 2500 GPUs running with the dodgy OpenCL 1.1 driver that Claggy noticed. Inconclusive, the lot of them. It's going to take a while to get the server averages back into kilter...

I've got something like 26 inconclusives spread across all these Intel GPU hosts, all of them running OpenCL 1.1 drivers; most of them are anonymous, with an i3-3220, HD Graphics 2500, and BOINC 7.0.64:

https://albert.phys.uwm.edu/show_host_detail.php?hostid=4792
https://albert.phys.uwm.edu/show_host_detail.php?hostid=5414
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9043
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9046
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9048
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9041
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9045
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9089
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9090
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9091
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9094
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9095
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9099
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9101
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9106
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9114
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9115
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9119
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9122
https://albert.phys.uwm.edu/show_host_detail.php?hostid=9129
https://albert.phys.uwm.edu/show_host_detail.php?hostid=10714

Can we have an OpenCL 1.2 requirement put into the FGRPopencl-intel_gpu app please?
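The gate I have in mind is something like the sketch below (illustrative only; on a real BOINC server it would live in the scheduler's plan-class logic, and the device-version parsing here is simplified):

# Illustrative sketch of the requested requirement: reject hosts whose
# Intel GPU reports an OpenCL device version below 1.2.

def meets_opencl_requirement(device_version, minimum=(1, 2)):
    # device_version looks like "OpenCL 1.1" or "OpenCL 1.2 AMD-APP (1348.5)"
    parts = device_version.split()
    if len(parts) < 2 or parts[0] != "OpenCL":
        return False
    major, minor = (int(x) for x in parts[1].split(".")[:2])
    return (major, minor) >= minimum

assert meets_opencl_requirement("OpenCL 1.2 AMD-APP (1348.5)")
assert not meets_opencl_requirement("OpenCL 1.1")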

Claggy
10) Message boards : News : Project server code update (Message 113219)
Posted 2 Jul 2014 by Claggy
Post:
Yeah, I've got something similar: 13.01 credits for 150 minutes of HD7770 work:

https://albert.phys.uwm.edu/workunit.php?wuid=619367

Claggy
11) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113205)
Posted 1 Jul 2014 by Claggy
Post:
My HD7770's estimates have just got to the point where one of the apps for Binary Radio Pulsar Search (Perseus Arm Survey) now completes without error;
the other app version is still erroring out at 422 seconds.

All Binary Radio Pulsar Search (Perseus Arm Survey) tasks for computer 8143

Claggy
12) Message boards : News : Project server code update (Message 113203)
Posted 1 Jul 2014 by Claggy
Post:
I wonder why Claggy's laptop gets such variable credit?


Multiple tasks on a smaller GPU, each running longer, will generate higher raw peak flop claims (PFCs), which are then averaged with the wingman's (yellow triangle on the dodgy diagram). So the result can be anywhere from the normal range to jackpot, as we previously assessed, depending on the wingman's claim. Though the prevalence of the jackpot conditions is less obvious, the noise in the system is still there.

I'm just running a single GPU task on both my GPU hosts (the T8100's 128MB 8400M GS doesn't count).
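For reference, the claim arithmetic described in the quote works roughly like this (a simplified sketch, assuming a raw claim of peak flops times elapsed time; not the actual CreditNew code, and the figures are made up):

# Simplified sketch of the "raw peak flop claim" (PFC) jackpot effect:
# the assumption is pfc ~ device peak flops * elapsed seconds.

def raw_pfc(peak_gflops, elapsed_s):
    return peak_gflops * 1e9 * elapsed_s

fast_host = raw_pfc(3584.0, 600)    # big GPU, one task, 10 minutes
slow_host = raw_pfc(200.0, 30000)   # small GPU, several tasks at once,
                                    # so each runs ~8.3 hours wall-clock

# The two claims get averaged, so one inflated wingman claim drags the
# workunit's credit (and the server's running averages) upward:
averaged = (fast_host + slow_host) / 2
print("%.3g vs %.3g -> averaged %.3g" % (fast_host, slow_host, averaged))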

Claggy
13) Message boards : News : Project server code update (Message 113178)
Posted 29 Jun 2014 by Claggy
Post:
I can't quickly find the client GFLOPS peak number for Claggy's ATI 'Capeverde' with "based on PFC avg: 34968.78G". I'd like to look for the variable (presumably a struct member) where we might expect GFLOPS peak to be stored, and see what it's multiplied by in those initial stages before 11 completions establish an APR. We might expect 0.1 from the words, but we seem to be using >10 by the numbers.

17/06/2014 18:17:17 | | CAL: ATI GPU 0: AMD Radeon HD 7700 series (Capeverde) (CAL version 1.4.1848, 1024MB, 984MB available, 3584 GFLOPS peak)
17/06/2014 18:17:17 | | OpenCL: AMD/ATI GPU 0: AMD Radeon HD 7700 series (Capeverde) (driver version 1348.5 (VM), device version OpenCL 1.2 AMD-APP (1348.5), 1024MB, 984MB available, 3584 GFLOPS peak)
17/06/2014 18:17:17 | | OpenCL CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (OpenCL driver vendor: Advanced Micro Devices, Inc., driver version 1348.5 (sse2,avx), device version OpenCL 1.2 AMD-APP (1348.5))
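For what it's worth, the ratio is easy to check from the figures above: 34968.78 / 3584 ≈ 9.76, so the PFC-avg-based figure is roughly ten times the client's calculated peak, not the tenth of it the wording would lead us to expect.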

Claggy
14) Message boards : News : Project server code update (Message 113170)
Posted 29 Jun 2014 by Claggy
Post:
Ah, alright.
Yeah, only interested in fixing current code, rather than diagnosing/patching old versions :)

I was thinking that they were using Einstein customisations here that might not be needed; looking at robl's Einstein log shows it's the durations that get scaled there:

http://einstein.phys.uwm.edu/hosts_user.php?userid=613597

2014-06-29 09:28:50.6296 [PID=17986] [send] [HOST#7536795] Sending app_version 483 einsteinbinary_BRP5 7 139 BRP5-cuda32-nv270; 49.97 GFLOPS
2014-06-29 09:28:50.6312 [PID=17986] [send] est. duration for WU 193304662: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6312 [PID=17986] [HOST#7536795] Sending [RESULT#443159459 PB0024_00191_182_0] (est. dur. 18527.18 seconds)
2014-06-29 09:28:50.6324 [PID=17986] [send] est. duration for WU 193307638: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6324 [PID=17986] [send] [WU#193307638] meets deadline: 18527.18 + 18527.18 < 1209600
2014-06-29 09:28:50.6332 [PID=17986] [send] [HOST#7536795] Sending app_version 483 einsteinbinary_BRP5 7 139 BRP5-cuda32-nv270; 49.97 GFLOPS
2014-06-29 09:28:50.6347 [PID=17986] [send] est. duration for WU 193307638: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6347 [PID=17986] [HOST#7536795] Sending [RESULT#443165551 PB0024_00141_24_0] (est. dur. 18527.18 seconds)
2014-06-29 09:28:50.6356 [PID=17986] [send] est. duration for WU 193249827: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6356 [PID=17986] [send] [WU#193249827] meets deadline: 37054.37 + 18527.18 < 1209600
2014-06-29 09:28:50.6364 [PID=17986] [send] [HOST#7536795] Sending app_version 483 einsteinbinary_BRP5 7 139 BRP5-cuda32-nv270; 49.97 GFLOPS
2014-06-29 09:28:50.6380 [PID=17986] [send] est. duration for WU 193249827: unscaled 9004.88 scaled 18527.18
2014-06-29 09:28:50.6381 [PID=17986] [HOST#7536795] Sending [RESULT#443038987 PB0023_01561_144_0] (est. dur. 18527.18 seconds)
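(Note that every estimate in that log is scaled by the same factor, 18527.18 / 9004.88 ≈ 2.06, i.e. a single per-host multiplier, presumably that host's duration correction factor.)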

Claggy
15) Message boards : News : Project server code update (Message 113168)
Posted 29 Jun 2014 by Claggy
Post:
I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.

Doesn't the main project have this adjustment because they have a single DCF there? But we don't use DCF here, so this adjustment shouldn't be used?

Claggy


There is no adjustment; the adjustment is a lie. <dont_use_dcf/> is hard-wired active for all clients >= 7.0.28.

But only on projects that don't use DCF; Einstein on my i7-2600K/HD7770 has a DCF of:

<duration_correction_factor>1.267963</duration_correction_factor>

Claggy


Well, you've lost me there, because every scheduler reply to a >= 7.0.28 client, according to the scheduler code, pushes <dont_use_dcf/> [and there is no configuration switch for it].

Einstein has an older scheduler than Albert (or at least an older server version):

29/06/2014 11:45:58 | Einstein@Home | sched RPC pending: Requested by user
29/06/2014 11:45:58 | Einstein@Home | [sched_op] Starting scheduler request
29/06/2014 11:45:58 | Einstein@Home | Sending scheduler request: Requested by user.
29/06/2014 11:45:58 | Einstein@Home | Not requesting tasks: "no new tasks" requested via Manager
29/06/2014 11:45:58 | Einstein@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
29/06/2014 11:45:58 | Einstein@Home | [sched_op] ATI work request: 0.00 seconds; 0.00 devices
29/06/2014 11:46:00 | Einstein@Home | Scheduler request completed
29/06/2014 11:46:00 | Einstein@Home | [sched_op] Server version 611
29/06/2014 11:46:00 | Einstein@Home | Project requested delay of 60 seconds
29/06/2014 11:46:00 | Einstein@Home | [sched_op] Deferring communication for 00:01:00
29/06/2014 11:46:00 | Einstein@Home | [sched_op] Reason: requested by project
29/06/2014 11:46:05 | Albert@Home | sched RPC pending: Requested by user
29/06/2014 11:46:05 | Albert@Home | [sched_op] Starting scheduler request
29/06/2014 11:46:05 | Albert@Home | Sending scheduler request: Requested by user.
29/06/2014 11:46:05 | Albert@Home | Reporting 2 completed tasks
29/06/2014 11:46:05 | Albert@Home | Not requesting tasks: don't need
29/06/2014 11:46:05 | Albert@Home | [sched_op] CPU work request: 0.00 seconds; 0.00 devices
29/06/2014 11:46:05 | Albert@Home | [sched_op] ATI work request: 0.00 seconds; 0.00 devices
29/06/2014 11:46:08 | Albert@Home | Scheduler request completed
29/06/2014 11:46:08 | Albert@Home | [sched_op] Server version 703
29/06/2014 11:46:08 | Albert@Home | Project requested delay of 60 seconds
29/06/2014 11:46:08 | Albert@Home | [sched_op] handle_scheduler_reply(): got ack for task h1_0997.10_S6Direct__S6CasAf40_997.55Hz_1017_1
29/06/2014 11:46:08 | Albert@Home | [sched_op] handle_scheduler_reply(): got ack for task p2030.20130202.G202.32-01.96.N.b2s0g0.00000_2384_5
29/06/2014 11:46:08 | Albert@Home | [sched_op] Deferring communication for 00:01:00
29/06/2014 11:46:08 | Albert@Home | [sched_op] Reason: requested by project
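A minimal sketch of the difference being discussed, assuming the newer scheduler simply keys the tag off the requesting client's version (illustrative; not the actual scheduler source):

# Newer server code is said to push <dont_use_dcf/> to any client
# >= 7.0.28, with no configuration switch; an older server (such as
# Einstein's version 611 above) predates the feature, so its clients
# keep a live duration_correction_factor.

def scheduler_reply_tags(client_version, server_has_feature):
    tags = []
    if server_has_feature and client_version >= (7, 0, 28):
        tags.append("<dont_use_dcf/>")
    return tags

print(scheduler_reply_tags((7, 2, 42), server_has_feature=True))   # Albert-like
print(scheduler_reply_tags((7, 2, 42), server_has_feature=False))  # Einstein-like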


Claggy
16) Message boards : News : Project server code update (Message 113166)
Posted 29 Jun 2014 by Claggy
Post:
I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.

Doesn't the main project have this adjustment because they have a single DCF there? But we don't use DCF here, so this adjustment shouldn't be used?

Claggy


There is no adjustment; the adjustment is a lie. <dont_use_dcf/> is hard-wired active for all clients >= 7.0.28.

But only on projects that don't use DCF; Einstein on my i7-2600K/HD7770 has a DCF of:

<duration_correction_factor>1.267963</duration_correction_factor>

Albert, of course, has: <dont_use_dcf/>

Claggy
17) Message boards : News : Project server code update (Message 113163)
Posted 29 Jun 2014 by Claggy
Post:
I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.

Doesn't the main project have this adjustment because they have a single DCF there? But we don't use DCF here, so this adjustment shouldn't be used?

Claggy
18) Message boards : News : Project server code update (Message 113141)
Posted 26 Jun 2014 by Claggy
Post:
A lot of my Gamma-ray pulsar search #3 v1.11 results are coming out as inconclusive; in each case they are matched with an Intel GPU, and in each case that Intel GPU is running OpenCL 1.1 drivers. Shouldn't that app be restricted to Intel GPUs with OpenCL 1.2 drivers?

Validation inconclusive Gamma-ray pulsar search #3 tasks for computer 8143

Claggy
19) Message boards : News : Project server code update (Message 113139)
Posted 25 Jun 2014 by Claggy
Post:

I think that's further evidence of the kind of instability we need to cure.


Yes, local estimates need to be responsive to running conditions. It's unfortunate that the existing mechanism for that was disabled instead of completed/fixed.

Seti Beta deployed the Blankit-based optimised AP v7 yesterday; there the estimates are the other way round.
My Ubuntu 12.04 C2D T8100 took ~12 hours on its first WU; shame the estimates start at ~228 hours.

Claggy


LoL, do the results mix with traditional non-Blankit versions? That's going to mess with cross-app normalisation big time.

AP v6 should only mix with AP v6, and AP v7 should only mix with AP v7.

Even better: the SSE2 app is optimised, the SSE app is non-optimised; the difference in runtimes is going to be huge. I see carnage ahead.
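To spell out why that mix is carnage, here's a rough sketch of cross-version normalisation as I understand it (simplified; not the actual CreditNew validator code, and the app_version names and averages are made up):

# A version's claim is scaled by (most efficient version's average PFC /
# this version's average PFC), so slow unoptimised builds don't out-claim
# optimised ones -- but only after enough validated results have settled
# the averages.

version_pfc_avg = {
    "AP7-sse2-opt": 1.0e15,  # optimised build: low average PFC per task
    "AP7-sse": 5.0e15,       # unoptimised build: ~5x the flops, same work
}

def normalised_claim(raw_pfc, version):
    best = min(version_pfc_avg.values())   # most efficient version
    return raw_pfc * (best / version_pfc_avg[version])

# Once settled, both versions claim comparably for the same workunit:
print(normalised_claim(5.0e15, "AP7-sse"))       # ~1e15
print(normalised_claim(1.0e15, "AP7-sse2-opt"))  # ~1e15
# Until the averages converge, the 5x runtime gap feeds straight into
# wild credit and estimate swings.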

Claggy
20) Message boards : News : Project server code update (Message 113137)
Posted 25 Jun 2014 by Claggy
Post:

I think that's further evidence of the kind of instability we need to cure.


Yes, local estimates need to be responsive to running conditions. It's unfortunate that the existing mechanism for that was disabled instead of completed/fixed.

Seti Beta deployed the Blankit-based optimised AP v7 yesterday; there the estimates are the other way round.
My Ubuntu 12.04 C2D T8100 took ~12 hours on its first WU; shame the estimates start at ~228 hours.

Claggy





This material is based upon work supported by the National Science Foundation (NSF) under Grant PHY-0555655 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2024 Bruce Allen for the LIGO Scientific Collaboration