Project server code update

jason_gee - Message 113030 - Posted: 17 Jun 2014, 17:33:20 UTC - in response to Message 113028.

I've seen that annotation before, somewhere. rr_sim I think - can you look at a sample please, to check local boinc log against server values?

Yes, we were there the other day digging out where Whetstone was hiding: sched_version.cpp, the estimate_flops() functions. There's one for non-anon, and another slightly different one for anon. For non-anon, before statistics are gathered it's Boinc Whetstone for CPU (incidentally SIMD aware on Android but not x86), and some mystery guesstimate for GPUs.

Those mystery guesstimates for GPUs are one of the major quarries for our quest. Claggy's ATI is running at 2.95 Teraflops, to put it in simpler numbers.

Yep. Also be aware in that area, just to complicate matters, that there is a scheduler config option David's thrown in, enabling a random multiplier across the project_flops for each app_version, so that app versions get juggled at least before stats are gathered. I'm getting the distinct impression he's 'lost' the old 0.1 GPU flops scaling there (haven't come across it yet anyway, still looking), meaning that'll probably be using the raw client-supplied marketing flops value, possibly multiplied by some random number...

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. - C Babbage
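
The starting-point logic described in this post can be sketched roughly as follows. This is not the code in sched_version.cpp: the enum, function names, parameters and the jitter range are all hypothetical, chosen only to illustrate "Whetstone for CPU, a scaled peak figure for GPU, optionally multiplied by a random factor".

```cpp
#include <cassert>
#include <random>

// Hypothetical device kinds; not BOINC's actual types.
enum class Device { Cpu, Gpu };

// Starting-point speed estimate (FLOPS) for a host with no statistics yet.
//   whetstone_flops : CPU benchmark result reported by the client
//   peak_flops      : vendor "marketing" peak reported for the GPU
//   gpu_scale       : fraction of peak assumed achievable (0.1 is the older
//                     scaling mentioned in the thread; assumption here)
inline double initial_flops_estimate(Device dev, double whetstone_flops,
                                     double peak_flops, double gpu_scale) {
    assert(gpu_scale > 0.0 && gpu_scale <= 1.0);
    return dev == Device::Cpu ? whetstone_flops : gpu_scale * peak_flops;
}

// Optional jitter so competing app versions all get tried before stats exist.
// The multiplier range [lo, hi) is an assumption for illustration only.
inline double jittered(double flops, std::mt19937& rng,
                       double lo = 0.9, double hi = 1.1) {
    std::uniform_real_distribution<double> mult(lo, hi);
    return flops * mult(rng);
}
```

With a 3584 GFLOPS peak figure and a 0.1 scale, the GPU starting point would be about 358 GFLOPS; without the scale, the raw peak figure goes straight into the estimate, which is the situation suspected above.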

Richard Haselgrove - Message 113031 - Posted: 17 Jun 2014, 17:34:11 UTC - in response to Message 113029.

> > Unfortunately I missed the server log for a fetch - just got a 'report only' RPC instead. Could you grab a log if it does another work_fetch, please?
>
> I did another request, and suspended network: https://albert.phys.uwm.edu/host_sched_logs/8/8143
>
> Claggy

[version] [AV#911] (FGRPopencl-ati) adjusting projected flops based on PFC avg: 2950.33G

jason_gee - Message 113032 - Posted: 17 Jun 2014, 17:37:24 UTC - in response to Message 113031. Last modified: 17 Jun 2014, 17:40:19 UTC

> [version] [AV#911] (FGRPopencl-ati) adjusting projected flops based on PFC avg: 2950.33G

~~That's not TeraFlops (speed), That's peak flop count, as in # of operations.~~ (verifying in code now) scratch that looks broken, walking the lot with beer

Richard Haselgrove - Message 113033 - Posted: 17 Jun 2014, 17:41:22 UTC - in response to Message 113032.

> > [version] [AV#911] (FGRPopencl-ati) adjusting projected flops based on PFC avg: 2950.33G
>
> ~~That's not TeraFlops (speed), That's peak flop count, as in # of operations.~~ (verifying in code now) scratch that looks broken, walking the lot with beer

The server is using it as a speed for estimation purposes. Maybe that's our problem.

Eyrie - Message 113034 - Posted: 17 Jun 2014, 17:42:19 UTC - in response to Message 113032.

> scratch that looks broken, walking the lot with beer

Peanut gallery: that's like saying that water is wet after falling in and getting soaked... Enjoy the beer. Valium might be the better choice.

Queen of Aliasses, wielder of the SETI rolling pin, Mistress of the red shoes, Guardian of the orange tree, Slayer of very small dragons.

Claggy - Message 113035 - Posted: 17 Jun 2014, 17:47:40 UTC - in response to Message 113032. Last modified: 17 Jun 2014, 18:35:27 UTC

> > [version] [AV#911] (FGRPopencl-ati) adjusting projected flops based on PFC avg: 2950.33G
>
> ~~That's not TeraFlops (speed), That's peak flop count, as in # of operations.~~ (verifying in code now) scratch that looks broken, walking the lot with beer

Boinc startup says:

17/06/2014 18:17:17 \| \| CAL: ATI GPU 0: AMD Radeon HD 7700 series (Capeverde) (CAL version 1.4.1848, 1024MB, 984MB available, 3584 GFLOPS peak)

17/06/2014 18:17:17 \| \| OpenCL: AMD/ATI GPU 0: AMD Radeon HD 7700 series (Capeverde) (driver version 1348.5 (VM), device version OpenCL 1.2 AMD-APP (1348.5), 1024MB, 984MB available, 3584 GFLOPS peak)

17/06/2014 18:17:17 \| \| OpenCL CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (OpenCL driver vendor: Advanced Micro Devices, Inc., driver version 1348.5 (sse2,avx), device version OpenCL 1.2 AMD-APP (1348.5))

The GTX460 always had a lot lower GFLOPS peak value, but was a lot more effective at Seti v6, v7 and AP v6, the exception being here, and the OpenCL Gamma-ray pulsar search #3 1.07 app, where the HD7770 was a little faster:

https://albert.phys.uwm.edu/host_app_versions.php?hostid=8143

| Gamma-ray pulsar search #3 1.07 windows_x86_64 | (FGRPopencl-ati) | (FGRPopencl-nvidia) |
| --- | --- | --- |
| Number of tasks completed | 13 | 12 |
| Max tasks per day | 45 | 44 |
| Number of tasks today | 0 | 0 |
| Consecutive valid tasks | 13 | 12 |
| Average processing rate | 3.55 GFLOPS | 2.87 GFLOPS |
| Average turnaround time | 0.37 days | 0.88 days |

http://boinc.berkeley.edu/dev/forum_thread.php?id=8767&postid=51659

04/12/2013 21:25:07 \| \| CUDA: NVIDIA GPU 0: GeForce GTX 460 (driver version 331.58, CUDA version 6.0, compute capability 2.1, 1024MB, 854MB available, 1075 GFLOPS peak)

04/12/2013 21:25:07 \| \| CAL: ATI GPU 0: AMD Radeon HD 7700 series (Capeverde) (CAL version 1.4.1848, 1024MB, 984MB available, 3584 GFLOPS peak)

04/12/2013 21:25:07 \| \| OpenCL: NVIDIA GPU 0: GeForce GTX 460 (driver version 331.58, device version OpenCL 1.1 CUDA, 1024MB, 854MB available, 1075 GFLOPS peak)

04/12/2013 21:25:07 \| \| OpenCL: AMD/ATI GPU 0: AMD Radeon HD 7700 series (Capeverde) (driver version 1348.4 (VM), device version OpenCL 1.2 AMD-APP (1348.4), 1024MB, 984MB available, 3584 GFLOPS peak)

04/12/2013 21:25:07 \| \| OpenCL CPU: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz (OpenCL driver vendor: Advanced Micro Devices, Inc., driver version 1348.4 (sse2,avx), device version OpenCL 1.2 AMD-APP (1348.4))

Claggy

Eyrie - Message 113036 - Posted: 17 Jun 2014, 17:49:29 UTC - in response to Message 113033. Last modified: 17 Jun 2014, 17:51:04 UTC

> The server is using it as a speed for estimation purposes. Maybe that's our problem.

Of course it's speed, it's APR later - 'based on' is our problem - something is being factored in incorrectly. AFAIK on SETI there's no such gross overestimation of GPU speed.

@Claggy what is the peak flop count for that card? (sorry if you posted that already)

Edit: ta. peak flops x pfc_ave? the latter being <1?

jason_gee - Message 113037 - Posted: 17 Jun 2014, 17:58:00 UTC - in response to Message 113036. Last modified: 17 Jun 2014, 17:59:16 UTC

Yes, this is bizarre. Once stats are gathered:

    if (av.pfc.n > MIN_VERSION_SAMPLES) {
        hu.projected_flops = hu.peak_flops/av.pfc.get_avg();
        if (config.debug_version_select) {
            log_messages.printf(MSG_NORMAL,
                "[version] [AV#%d] (%s) adjusting projected flops based on PFC avg: %.2fG\n",
                av.id, av.plan_class, hu.projected_flops/1e9
            );
        }

Dodgy average aside (we know all about the problems of sampled averages there, particularly with very few samples), it looks like the ratio of the marketing flops estimate (from the client) to operations (effective claimed).

Going to check if he's tweaked the definition of pfc here, because flops rate over average operations would give average time in seconds to me... checking that pfc with that beer...

[Edit:] no sign of our 0.1x scaling for GPU either, at least in albert code.
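
To make the relation in the quoted line concrete, here is a minimal sketch that applies it and inverts it. Two assumptions are made: the server's peak_flops for this host equals the 3584 GFLOPS the client reports at startup, and the PFC average is a dimensionless ratio. Neither is confirmed in the thread, and the function names are hypothetical.

```cpp
#include <cassert>

// Mirrors the quoted line: projected speed = reported peak / average PFC ratio.
inline double projected_flops(double peak_flops, double pfc_avg) {
    assert(pfc_avg > 0.0);
    return peak_flops / pfc_avg;
}

// Inverts the same relation to recover the PFC average implied by a log line.
inline double implied_pfc_avg(double peak_flops, double projected) {
    assert(projected > 0.0);
    return peak_flops / projected;
}
```

Under those assumptions, `implied_pfc_avg(3584e9, 2950.33e9)` comes out at about 1.21, so the projected speed stays within roughly 20% of the marketing peak. The host page quoted earlier in the thread lists an average processing rate of 3.55 GFLOPS for the same app version; if the two figures are comparable, the projection is more than 800 times higher.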

Richard Haselgrove - Message 113038 - Posted: 17 Jun 2014, 18:31:02 UTC

Jason, with the high-scoring late validations, your average is now above par, at 1003.97. And your median is higher still, at 1168.97.

Eyrie - Message 113039 - Posted: 17 Jun 2014, 19:09:51 UTC. Last modified: 17 Jun 2014, 19:11:13 UTC

Ok, so it is effectively using a scaled (marketing) peak flops value - iow a totally unrealistic estimate. We do need something as a starting point though. Those peak flops are as inadequate as using 10X CPU speed was.

Eve comes in at 91e9 peak flops. From SETI (too small to run here) her GPU is slightly faster than her CPU. CPU needs ~2h for BRP. So roughly the GPU tasks would take 32 hours. That makes her about 32x slower than a 780 - that's the span we are dealing with and it will only grow larger as GPUs get ever faster.

91*32 = 2912 - which is about the figure we saw earlier for fast GPUs - so the slope of the peak flops is not too bad, but the offset is. With an APR of 33 for the 780 and about 1 for Eve we are looking at a ~90x overestimate. For BRP at least.

That scaling value that is being applied must bring the estimates into the correct magnitude over on SETI... any chance to get that number from Eric?

I don't know. If you underestimate the speed, you cache too few tasks - more frequent top up - only a problem if you really can't connect for longer periods of time as you'd run dry (not really a problem either ;) ). It's the overestimation that runs afoul of the built-in safety-checks.

So how about using 1/100 of peak flops as a GPU starting point? I mean you have to start _somewhere_ ... Any problems with underestimating I've failed to consider?
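
The arithmetic in this post can be checked with a few lines. The helper names are hypothetical and the inputs are the round figures quoted above (91 GFLOPS peak and an APR of about 1 for Eve, an APR of 33 for the 780, a 32x speed ratio between the two cards).

```cpp
#include <cassert>

// Ratio between a speed estimate and the measured average processing rate
// (APR), both expressed in the same units (GFLOPS here).
inline double overestimate_factor(double estimated_gflops,
                                  double measured_apr_gflops) {
    assert(measured_apr_gflops > 0.0);
    return estimated_gflops / measured_apr_gflops;
}

// Scales the slow card's peak figure by the observed speed ratio to predict
// what peak figure a card that much faster would report.
inline double extrapolated_peak(double slow_peak_gflops, double speed_ratio) {
    return slow_peak_gflops * speed_ratio;
}
```

`extrapolated_peak(91.0, 32.0)` gives 2912, `overestimate_factor(2912.0, 33.0)` gives about 88, and `overestimate_factor(91.0, 1.0)` gives 91. Both ends of the range land near 90x, which is where the suggestion of roughly 1/100 of peak flops as a starting point comes from.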

jason_gee - Message 113040 - Posted: 17 Jun 2014, 19:12:44 UTC - in response to Message 113038.

> Jason, with the high-scoring late validations, your average is now above par, at 1003.97 And your median is higher still, at 1168.97

Good. Better late than never :D

Yes, we'll definitely need to stabilise CPU here first. GPU is going to take a bit more digging yet, and whether or not there is any connection at estimate, scheduler or validation needs to be determined before that one's tackled in detail.

There are definitely those dicey averages in play (everywhere) to start with, then also I'm surprised to be finding reliance on those (nearly useless) GPU marketing flops figures embedded even after stats are gathered. Until the primary CPU scales are fixed, and averages of all kinds are replaced with damped values, any particular odd logic choice in there is likely to be obliterated in the noise anyway. (Paraphrasing the comments about chaos burying the noise, lol)

jason_gee - Message 113041 - Posted: 17 Jun 2014, 19:16:47 UTC - in response to Message 113039. Last modified: 17 Jun 2014, 19:22:28 UTC

> Ok, so it is effectively using a scaled (marketing) peak flops value - iow a totally unrealistic estimate. We do need something as a starting point though. Those peak flops are as inadequate as using 10X CPU speed was. ...

I agree, though 'true' averages can be fine and established quickly. 10% of the marketing flops should be near enough ballpark for a new host to get it going... Which scaling, or combination of scalings, is breaking the initial GPU estimate is a mystery to me at the moment, though I have no doubt it'll be much easier to spot with new hostIds in phase 2, when all the averages get replaced with actively controlled dampers.

Pass1 (starting point): CPU coarse scaling correction -- look for unexpected effects (e.g. are the GPU apps completely unconnected as expected here)

Pass2 (replace sampled averages with controllers, actively damped) -- look for GPU scaling errors, particularly new hostids / apps

Pass3 -- GPU scaling logic refinement if needed (probably is)

Got enough to draw up something for passes one and two, will get a coffee & a break, then get to some documenting and coding.
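
The thread does not say what form the "actively damped" controllers in pass 2 would take. As one possible reading, a plain exponentially damped estimate is sketched below; the class name and the gain value are assumptions, and this is a stand-in technique rather than the design being proposed.

```cpp
#include <cassert>

// Exponentially damped running estimate: each new sample moves the estimate
// a fixed fraction (the gain) of the way towards that sample, so a single
// wild sample cannot drag the estimate far.
class DampedEstimate {
public:
    DampedEstimate(double initial, double gain)
        : value_(initial), gain_(gain) {
        assert(gain > 0.0 && gain <= 1.0);
    }

    // Folds one sample into the estimate and returns the new value.
    double update(double sample) {
        value_ += gain_ * (sample - value_);
        return value_;
    }

    double value() const { return value_; }

private:
    double value_;
    double gain_;
};
```

With a gain of 0.1 and a starting value of 100, one outlier of 1000 moves the estimate only to 190, and later samples near 100 pull it back gradually. A plain mean over two samples would have jumped to 550.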

Eyrie - Message 113042 - Posted: 17 Jun 2014, 19:22:03 UTC - in response to Message 113041.

> > Ok, so it is effectively using a scaled (marketing) peak flops value - iow a totally unrealistic estimate. We do need something as a starting point though. Those peak flops are as inadequate as using 10X CPU speed was. ...
>
> I agree, though 'true' averages can be fine and established quickly. 10% of the marketing flops should be near enough ballpark for a new host to get it going... ...

Didn't I just extensively calculate that 1% is more like it?!

jason_gee - Message 113043 - Posted: 17 Jun 2014, 19:24:25 UTC - in response to Message 113042. Last modified: 17 Jun 2014, 19:32:12 UTC

> > I agree, though 'true' averages can be fine and established quickly. 10% of the marketing flops should be near enough ballpark for a new host to get it going... ...
>
> Didn't I just extensively calculate that 1% is more like it?!

Yes, I'm talking from the intent written in code and comments at this point, not what it's actually achieving. If I were to comment on what it's actually achieving, I would have to invent some more words.

[Edit:] something like "Bandaids on top of fudge factors applied to magic numbers" comes to mind, though doesn't quite capture it.

Eyrie - Message 113044 - Posted: 17 Jun 2014, 19:35:09 UTC

Um, no. It's achieving chaos. :D

Chaos theory tells us that that means that at least 3 coupled differential equations are in play :) 'three is chaos'. To get the system into a steady-state means either uncoupling or stabilising sub-equations.

From a mathematical pov this is quite fascinating. I doubt you'd as easily produce a chaotic system if you were actually trying to get one. :D

jason_gee - Message 113045 - Posted: 17 Jun 2014, 19:51:47 UTC - in response to Message 113044. Last modified: 17 Jun 2014, 19:52:06 UTC

...

jason_gee - Message 113046 - Posted: 17 Jun 2014, 19:51:49 UTC - in response to Message 113044. Last modified: 17 Jun 2014, 19:55:22 UTC

> um, no. It's achieving chaos. :D Chaos theory tells us that that means that at least 3 coupled differential equations are in play :) 'three is chaos'. To get the system into a steady-state, means either uncoupling or stabilising sub-equations. From a mathematical pov this is quite fascinating. I doubt you'd as easily produce a chaotic system if you were actually trying to get one. :D

Yes, reminds me of a tongue in cheek comment I made suggesting the climate people might be interested in this... oh well.

Yes we can: after poking the CPU app scale in pass 1, in pass 2 place the two scaling equations (scheduler & validation) into separate time domains so they stop interacting in weird ways, and damp the third, which is stochastic, non-linear and non-deterministic (elapsed time based samples), then look for more logic issues.

I'm pretty convinced that there is a logic breakage there for new GPU hosts, but can't put my finger on it yet. It'll fall out during the first 2 passes I reckon.

[Edit:] I see the boinc messageboard echo in here works fine :)

Eyrie - Message 113047 - Posted: 17 Jun 2014, 19:55:13 UTC - in response to Message 113046.

> [Edit:] I see the boinc messageboard echo in here works fine :)

Beg your pardon?

jason_gee - Message 113048 - Posted: 17 Jun 2014, 19:56:06 UTC - in response to Message 113047. Last modified: 17 Jun 2014, 19:56:21 UTC

> > [Edit:] I see the boinc messageboard echo in here works fine :)
>
> Beg your pardon?

Double posts seem to happen a lot (to me anyway) [not this time]

Eyrie - Message 113049 - Posted: 17 Jun 2014, 19:58:07 UTC - in response to Message 113048.

> Double posts seem to happen a lot (to me anyway) [not this time]

Your resident moderator(s) will probably be pleased if you red-x them for hiding. That's tongue in cheek. For once it's not me getting those reports :D