Project server code update


Claggy Send message Joined: 29 Dec 06 Posts: 78 Credit: 4,040,969 RAC: 0 Message 113141 - Posted: 26 Jun 2014, 9:10:30 UTC Last modified: 26 Jun 2014, 9:11:12 UTC
A lot of my Gamma-ray pulsar search #3 v1.11 results are coming out as inconclusive. In each case they are matched with an Intel GPU, and in each case that Intel GPU is running OpenCL 1.1 drivers. Shouldn't that app be restricted to Intel GPUs with OpenCL 1.2 drivers?
See: Validation inconclusive Gamma-ray pulsar search #3 tasks for computer 8143
Claggy
ID: 113141 · Reply Quote

Richard Haselgrove Send message Joined: 10 Dec 05 Posts: 450 Credit: 5,409,572 RAC: 0 Message 113142 - Posted: 26 Jun 2014, 17:17:20 UTC
OK, the effect of my configuration change continues and is even clearer. I simply changed the nature (but not the number) of the tasks running on the CPU while this BRP4G test was running on the GPU. Here are the runtime stats of the two runs:

           (before)   (after)
Maximum     4191.43   5034.97
Minimum     4061.45   4417.27
Average     4128.11   4707.30
Median      4127.66   4668.20
Std Dev       20.45    181.84
nSamples        339        43

and the corresponding graph:

[Graph: runtime before and after the configuration change]

I'm told some new hosts are coming online, so that we can watch and examine the "new host / stable (!) project" scenario in detail. I'll add them to the graphs - probably replacing the old hosts on the log graph, since none of them are returning much data now - as soon as I see successful BRP4G tasks coming back in.
ID: 113142 · Reply Quote
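For anyone wanting to reproduce this kind of summary from their own task list, here is a minimal Python sketch. It assumes the runtimes have already been collected into a plain list; the sample values are hypothetical, and the choice of the sample standard deviation is an assumption, since the post does not say which form was used.

```python
import statistics


def summarize(runtimes):
    """Return the six summary figures quoted in the post for a list of runtimes."""
    return {
        "Maximum": max(runtimes),
        "Minimum": min(runtimes),
        "Average": statistics.mean(runtimes),
        "Median": statistics.median(runtimes),
        # Sample standard deviation (assumption: the post does not state the form used).
        "Std Dev": statistics.stdev(runtimes),
        "nSamples": len(runtimes),
    }


# Hypothetical runtimes in seconds, not taken from any real host.
before = [4100.0, 4125.0, 4130.0, 4150.0]
for name, value in summarize(before).items():
    print(f"{name:<9}{value:>10.2f}")
```

Running `summarize` once on the "before" tasks and once on the "after" tasks gives the two columns side by side.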

jason_gee Send message Joined: 4 Jun 14 Posts: 109 Credit: 1,043,639 RAC: 0	Message 113143 - Posted: 26 Jun 2014, 23:41:32 UTC - in response to Message 113142. Last modified: 26 Jun 2014, 23:42:46 UTC Once you're happy with that, there are other ways to simulate 'perfectly normal running conditions' that may induce similar divergent behaviour (or worse). One would be to downclock the GPU while Boinc's running (simulating a lower power state, driver timeout/failsafe, deliberate underclock, extended use of the GPU without suspending Boinc etc ...) I think a key takeaway is that the mechanism isn't really adaptive to reasonably normal variable running conditions. On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. - C Babbage ID: 113143 · Reply Quote

juan BFB Send message Joined: 10 Dec 12 Posts: 8 Credit: 1,674,320 RAC: 0 Message 113145 - Posted: 27 Jun 2014, 0:14:34 UTC Last modified: 27 Jun 2014, 0:19:42 UTC
Starting to crunch with my hosts. I compared the first crunched WUs against the ones already validated by Jason's 780SC. My 780FTW host is apparently crunching the BRP4G WUs almost 20% faster, but it's receiving 2-3x more credit. I'm running only 1 WU at a time on each GPU. Theoretically I'd expect similar credit, or am I missing something?
https://albert.phys.uwm.edu/result.php?resultid=1514929
https://albert.phys.uwm.edu/result.php?resultid=1515731
ID: 113145 · Reply Quote

jason_gee Send message Joined: 4 Jun 14 Posts: 109 Credit: 1,043,639 RAC: 0 Message 113146 - Posted: 27 Jun 2014, 0:35:38 UTC - in response to Message 113145.
> Starting to crunch with my hosts. I compared the first crunched WUs against the ones already validated by Jason's 780SC. My 780FTW host is apparently crunching the BRP4G WUs almost 20% faster, but it's receiving 2-3x more credit. I'm running only 1 WU at a time on each GPU. Theoretically I'd expect similar credit, or am I missing something?
> https://albert.phys.uwm.edu/result.php?resultid=1514929
> https://albert.phys.uwm.edu/result.php?resultid=1515731
Yes, existing CreditNew (no mods yet) with new app+host in all its glory. It's one of the big parts we're studying, because of its importance to keeping new users and applications or devices coming on-line. That's the onramp period, as the system tries to establish how fast you're crunching. It doesn't do it very well, but at least you're getting high credit and giving me some. Thanks! :P
You will be crunching faster than me because I'm doing lots of stuff with my machine lately, and haven't tweaked anything.... also I only have an old Core2Duo driving it.
ID: 113146 · Reply Quote

Eyrie Send message Joined: 20 Feb 14 Posts: 47 Credit: 2,410 RAC: 0 Message 113147 - Posted: 27 Jun 2014, 8:31:58 UTC
Right, so Richard's were sort of converging but are all over the place now. For Juan, I'd prefer to wait until Richard has done all the hard work and produced a graph :) [Thanks Richard, really appreciated]
Queen of Aliasses, wielder of the SETI rolling pin, Mistress of the red shoes, Guardian of the orange tree, Slayer of very small dragons.
ID: 113147 · Reply Quote

Richard Haselgrove Send message Joined: 10 Dec 05 Posts: 450 Credit: 5,409,572 RAC: 0 Message 113149 - Posted: 27 Jun 2014, 11:54:12 UTC - in response to Message 113147.
> Right, so Richard's were sort of converging but are all over the place now. For Juan, I'd prefer to wait until Richard has done all the hard work and produced a graph :) [Thanks Richard, really appreciated]
Two more for your viewing pleasure. I've started to take out the older hosts, which are returning very few tasks these days, but they served their purpose. Red is now Juan's 10351 (the one he linked two tasks from) - classic view for a new host.
And this is mine, still showing scatter from the new configuration.
We'll have to wait a few days before Juan will fit on the same scale (although he validated a couple of my oldies overnight - thank you). I'll keep the configuration stable until Sunday night/Monday morning, but I'll have to flip back then - I have some held tasks with deadlines.
ID: 113149 · Reply Quote

juan BFB Send message Joined: 10 Dec 12 Posts: 8 Credit: 1,674,320 RAC: 0 Message 113150 - Posted: 27 Jun 2014, 12:31:22 UTC Last modified: 27 Jun 2014, 12:34:08 UTC
Nice graph Richard. Maybe you could consider adding one of my 2x690 hosts, 10512 or 10352, since there are no 690s on the graph and they produce a lot of WUs. Now I understand what you are all talking about with new hosts: their RAC oscillates a lot and converges to a relatively stable range (1.5-2.5 K) no matter the GPU or the host used.
ID: 113150 · Reply Quote

Richard Haselgrove Send message Joined: 10 Dec 05 Posts: 450 Credit: 5,409,572 RAC: 0 Message 113151 - Posted: 27 Jun 2014, 12:42:51 UTC - in response to Message 113150.
> Nice graph Richard. Maybe you could consider adding one of my 2x690 hosts, 10512 or 10352, since there are no 690s on the graph and they produce a lot of WUs. Now I understand what you are all talking about with new hosts: their RAC oscillates a lot and converges to a relatively stable range (1.5-2.5 K) no matter the GPU or the host used.
Yes, I'm planning to refresh the graph with new hosts, and they might be suitable. What is most helpful is finding hosts with a nice, steady, continuous flow of data, and as little variation as possible in the running conditions (so that any noise in the credit granted can be attributed to external causes). The sheer number of tasks pushed through isn't particularly important, but the consistency is.
It didn't help that Zombie took the two hosts I'd picked off to another project (he's still running other hosts - they crop up in my wingmate lists from time to time), and Mikey leaving the project because it isn't exporting public stats would rule him out. It's quite time-consuming to switch things over, so bear with me - for the time being at least, old results aren't being deleted here, so there's no rush.
ID: 113151 · Reply Quote

juan BFB Send message Joined: 10 Dec 12 Posts: 8 Credit: 1,674,320 RAC: 0 Message 113152 - Posted: 27 Jun 2014, 13:18:40 UTC - in response to Message 113151. Last modified: 27 Jun 2014, 13:25:39 UTC
> What is most helpful is finding hosts with a nice, steady, continuous flow of data, and as little variation as possible in the running conditions (so that any noise in the credit granted can be attributed to external causes).
If you have some time, choose any one of my hosts (or more than one if you wish) and tell me. I will leave the host continuously crunching only Albert for a week or more if needed, and since they are running 24/7 with almost no other apps running, they could give you some of the continuous flow of data you are looking for. If I can, I wish to help all I can to finally fix the creditscrew problem.
ID: 113152 · Reply Quote

Snow Crash Send message Joined: 11 Aug 13 Posts: 10 Credit: 5,011,603 RAC: 0	Message 113153 - Posted: 27 Jun 2014, 16:51:24 UTC - in response to Message 113152. RH - Please let me know if it would be more helpful to simply switch my 7950 from BRP5 to BRP4 or to "remove project" / "add project" (presumably that would create a new host and therefore start credit calcs fresh). Also, is it easier for you if I only run 1 WU at a time? ID: 113153 · Reply Quote

Richard Haselgrove Send message Joined: 10 Dec 05 Posts: 450 Credit: 5,409,572 RAC: 0 Message 113154 - Posted: 27 Jun 2014, 17:04:28 UTC - in response to Message 113153.
> RH - Please let me know if it would be more helpful to simply switch my 7950 from BRP5 to BRP4 or to "remove project" / "add project" (presumably that would create a new host and therefore start credit calcs fresh). Also, is it easier for you if I only run 1 WU at a time?
Remove project / add project doesn't normally change the HostID - BOINC is designed to recycle the numbers, if for example it recognises the IP address and hardware configuration. It doesn't matter if it's one at a time or multiples at a time, but it's probably best if you don't mix task types (whether from this project or across projects).
If I do start monitoring your host - thanks for the offer - it would help the other observers if you could tell us a bit about any configuration details which can't be observed from the outside - and GPU utilisation factor is one of those.
Don't bust a gut changing things over. I need a bit of a breather, and to set up and get used to a replacement monitor; and Bernd needs to test some more new server code fixes next week, which will give us a new set of apps (designated as 'beta', but in reality the same as the existing ones) with blank application_details records to have a go at.
ID: 113154 · Reply Quote

Richard Haselgrove Send message Joined: 10 Dec 05 Posts: 450 Credit: 5,409,572 RAC: 0 Message 113155 - Posted: 28 Jun 2014, 13:08:26 UTC
With new hosts and a new monitor, let's see how that looks. I've knocked out the old data (and with it, the extreme data points) - but even so, Juan's new machines show very wide scatter. Here's that in figures:

              Jason   Holmis   Claggy     Juan     Juan     Juan       RH       RH
Host:         11363     2267     9008    10352    10512    10351     5367     5367
GPU:        GTX 780  GTX 660  GT 650M  GTX 690  GTX 690  GTX 780  GTX 670  GTX 670

Credit for BRP4G, GPU
Maximum     2708.58  2197.18  10952.0  7209.47   6889.8   6652.9  4137.85
Minimum      115.82    88.84   153.90  1667.23  1244.41  1546.02  1355.49
Average     1326.79  1277.87  3631.58  2728.70  2198.10  2463.06  2007.02
Median      1541.35  1411.09  2426.03  2135.67  1948.04  2091.49  1910.19
Std Dev      628.07   690.05  2712.34  1403.91   942.62   969.59   305.80
nSamples         76      102       71       52       43       44      459

Runtime (seconds)                                                (before)  (after)
Maximum     5027.36  5088.99  11295.0  5605.83   8922.7   3182.0  4191.43  5099.40
Minimum     3239.20  3294.83  8122.09  3081.97  3854.24   1852.2  4061.45  4284.52
Average     3645.57  4549.28  8902.94  4411.88  6305.41   2342.3  4128.08  4686.13
Median      3535.46  4769.05  8847.82  3673.33  5127.40   1864.0  4127.35  4672.83
Std Dev      344.17   456.55   508.22   998.49  1932.50   615.41    20.40   204.66
nSamples                                                              365       94

Turnround (days)
Maximum        6.09     3.91     2.75     0.08     0.45     0.22     0.91
Minimum        0.13     0.07     0.13     0.04     0.05     0.02     0.15
Average        1.94     1.46     0.90     0.05     0.09     0.03     0.67
Median         1.46     1.54     0.79     0.04     0.06     0.03     0.69
Std Dev        1.78     1.00     0.65     0.01     0.06     0.03     0.12

All three of Juan's machines are showing a very wide variation in runtime - he'll have to explain that by local observation, I can't pick it up from the website.
ID: 113155 · Reply Quote
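One way to put "very wide variation in runtime" on a common scale is the coefficient of variation (standard deviation divided by the mean). The Python sketch below is only an illustration, not something used in the thread; the (Std Dev, Average) pairs are the runtime figures from the post above, and the function name is made up for this example.

```python
def coefficient_of_variation(std_dev, average):
    """Relative scatter: standard deviation as a fraction of the mean."""
    if average == 0:
        raise ValueError("average must be non-zero")
    return std_dev / average


# (Std Dev, Average) runtime pairs in seconds, copied from the post above.
runtime_stats = {
    "Juan 10352": (998.49, 4411.88),
    "Juan 10512": (1932.50, 6305.41),
    "Juan 10351": (615.41, 2342.3),
    "RH 5367 (before)": (20.40, 4128.08),
    "RH 5367 (after)": (204.66, 4686.13),
}
for host, (sd, avg) in runtime_stats.items():
    print(f"{host:<18}{coefficient_of_variation(sd, avg):6.1%}")
```

On these figures the three Juan hosts come out well above 20%, while RH's host stays below 5% in both configurations.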

treblehit Send message Joined: 12 Mar 05 Posts: 5 Credit: 35,119 RAC: 0 Message 113156 - Posted: 28 Jun 2014, 20:11:41 UTC - in response to Message 113151.
> What is most helpful is finding hosts with a nice, steady, continuous flow of data, and as little variation as possible in the running conditions (so that any noise in the credit granted can be attributed to external causes). The sheer number of tasks pushed through isn't particularly important, but the consistency is. <snip> It's quite time-consuming to switch things over, so bear with me - for the time being at least, old results aren't being deleted here, so there's no rush.
On the basis of that guidance I am going to provide multiple weak systems that will run only Albert and will remain untouched after initial setup. Also, I'll go "natural" without multiple work units or doing anything with the clocks. These will be new hosts (really low-powered hosts) so won't carry any prior statistics or other baggage with them. I'll get on it, shortly.
If you need something different, I think Juan and I are both ready to make any sacrifice of "credits" if we are being helpful.
ID: 113156 · Reply Quote

treblehit Send message Joined: 12 Mar 05 Posts: 5 Credit: 35,119 RAC: 0 Message 113157 - Posted: 29 Jun 2014, 4:58:42 UTC Last modified: 29 Jun 2014, 5:00:38 UTC
Computer 11519
Pretending to be a new user. New install of GPU, new install of drivers, new install of BOINC. First work fetch of BRP4G-opencl-ati has estimated runtime of 10 seconds. Obviously, they are erroring-out.

Run time 3 min 40 sec
Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED

I know what the fix is, but I'm not concerned with fixing it. I'm concerned with helping you fix it. What do you want me to do?

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
Activated exception handling...
[22:05:40][3552][INFO ] Starting data processing...
[22:05:41][3552][INFO ] Using OpenCL platform provided by: Advanced Micro Devices, Inc.
[22:05:41][3552][INFO ] Using OpenCL device "Juniper" by: Advanced Micro Devices, Inc.
[22:05:41][3552][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[22:05:41][3552][INFO ] Header contents:
------> Original WAPP file: ./p2030.20130202.G202.32-01.96.N.b0s0g0.00000_DM209.60
------> Sample time in microseconds: 65.4762
------> Observation time in seconds: 274.62705
------> Time stamp (MJD): 56326.065838408722
------> Number of samples/record: 0
------> Center freq in MHz: 1214.289551
------> Channel band in MHz: 0.33605957
------> Number of channels/record: 960
------> Nifs: 1
------> RA (J2000): 62454.7106018
------> DEC (J2000): 83413.5978003
------> Galactic l: 0
------> Galactic b: 0
------> Name: G202.32-01.96.N
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 4194304
------> Trial dispersion measure: 209.6 cm^-3 pc
------> Scale factor: 0.00111372
[22:05:46][3552][INFO ] Seed for random number generator is 1168661235.
[22:05:56][3552][INFO ] Derived global search parameters:
------> f_A probability = 0.08
------> single bin prob(P_noise > P_thr) = 1.32531e-008
------> thr1 = 18.139
------> thr2 = 21.241
------> thr4 = 26.2686
------> thr8 = 34.6478
------> thr16 = 48.9581
[22:06:42][3552][INFO ] Checkpoint committed!
[22:07:44][3552][INFO ] Checkpoint committed!
[22:08:46][3552][INFO ] Checkpoint committed!
[22:09:20][3552][INFO ] OpenCL shutdown complete!
[22:09:20][3552][WARN ] BOINC wants us to quit prematurely or we lost contact! Exiting...
</stderr_txt>
ID: 113157 · Reply Quote

jason_gee Send message Joined: 4 Jun 14 Posts: 109 Credit: 1,043,639 RAC: 0 Message 113158 - Posted: 29 Jun 2014, 5:22:00 UTC Last modified: 29 Jun 2014, 5:24:31 UTC
Thanks, I had hoped newhost+app onramp for GPUs would improve, but see that it hasn't. I'm not surprised, given we know two precise mechanisms there: default GPU efficiency pinned at 10% (0.1), and improperly applied normalisation (you can't normalise time estimates without a functional host_scale, which is disabled for the onramp period). New user, host &/or application is central to this effort, so thanks again for the information.
At this point you could either choose to jigger the bounds of tasks (allowing it to reach where host_scale kicks in) or alternatively let it go on erroring & see what happens (I imagine it'd just keep erroring & reduce quota to 1/day). Both options have merit so it's your choice, though I think the jiggering option has been pretty thoroughly used, and the second one is more likely in common usage cases.
Up to you
Jason
ID: 113158 · Reply Quote
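To make the failure mode concrete, here is a minimal Python sketch of how an overstated speed estimate can lead to a 10 second runtime estimate followed by EXIT_TIME_LIMIT_EXCEEDED, as reported earlier in the thread. It assumes the runtime estimate is the task's rsc_fpops_est divided by the projected FLOPS, and that the elapsed-time limit is rsc_fpops_bound divided by the same figure. Every number below is hypothetical and chosen only to make the arithmetic easy to follow, and the function names are made up for this example.

```python
def estimated_runtime(rsc_fpops_est, projected_flops):
    """Runtime estimate in seconds (assumed: estimated operations / projected speed)."""
    return rsc_fpops_est / projected_flops


def elapsed_time_limit(rsc_fpops_bound, projected_flops):
    """Elapsed-time limit in seconds (assumed: operation bound / projected speed)."""
    return rsc_fpops_bound / projected_flops


# Hypothetical task and host figures, not taken from the project.
fpops_est = 2.8e14              # estimated floating-point operations for one task
fpops_bound = 20 * fpops_est    # assumed bound: 20x the estimate
realistic_flops = 7.0e10        # speed the host actually achieves
overstated_flops = 2.8e13       # speed the server wrongly projects

true_runtime = estimated_runtime(fpops_est, realistic_flops)
estimate = estimated_runtime(fpops_est, overstated_flops)
limit = elapsed_time_limit(fpops_bound, overstated_flops)

print(f"real runtime  : {true_runtime:.0f} s")
print(f"estimate shown: {estimate:.0f} s")
print(f"time limit    : {limit:.0f} s")
print("aborted" if true_runtime > limit else "completes")
```

With these made-up figures the task needs 4000 s but is cut off at 200 s, which is the same pattern as a task that really needs over an hour being stopped after a few minutes.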

treblehit Send message Joined: 12 Mar 05 Posts: 5 Credit: 35,119 RAC: 0 Message 113159 - Posted: 29 Jun 2014, 6:05:36 UTC - in response to Message 113158.
> At this point you could either choose to jigger the bounds of tasks (allowing it to reach where host_scale kicks in) or alternatively let it go on erroring & see what happens (I imagine it'd just keep erroring & reduce quota to 1/day).
That's what happened. Down to 1 wu/day and I'm done for the day. Man, am I ever glad I drove that one hour round trip in a 15mpg vehicle to try to get a steady stream of work headed Albert's direction. There's always the 1 wu I'll get tomorrow. <heavy sigh>
ID: 113159 · Reply Quote

jason_gee Send message Joined: 4 Jun 14 Posts: 109 Credit: 1,043,639 RAC: 0 Message 113160 - Posted: 29 Jun 2014, 7:04:37 UTC - in response to Message 113159. lol, yeah, all in a good cause though :) obvious breakage like that makes the case put forward in some quarters that it's working fine look a tad on the ridiculous side. The more 'normal' situations like that, that simply don't work, the better we understand, and can push to get it fixed once and for all. ID: 113160 · Reply Quote

Richard Haselgrove Send message Joined: 10 Dec 05 Posts: 450 Credit: 5,409,572 RAC: 0 Message 113162 - Posted: 29 Jun 2014, 9:36:06 UTC
From treblehit's server log https://albert.phys.uwm.edu/host_sched_logs/11/11519

2014-06-29 09:21:30.4581 [PID=3880 ] [version] [AV#738] (BRP5-opencl-ati) adjusting projected flops based on PFC avg: 16250.85G
2014-06-29 09:21:30.4581 [PID=3880 ] [version] Best app version is now AV738 (0.89 GFLOP)
2014-06-29 09:21:30.4581 [PID=3880 ] [version] [AV#738] (BRP5-opencl-ati) adjusting projected flops based on PFC avg: 16250.85G
2014-06-29 09:21:30.4581 [PID=3880 ] [version] Best version of app einsteinbinary_BRP5 is [AV#738] (16250.85 GFLOPS)

I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.
ID: 113162 · Reply Quote
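The sanity cap suggested above could be sketched roughly as follows in Python. This is an illustration of the idea only, not the scheduler's actual code: the function name and the cap_fraction parameter are made up for this example, and the peak figure is a hypothetical placeholder rather than the real value for treblehit's GPU. Only the 16250.85 GFLOPS figure comes from the log.

```python
def capped_projected_flops(pfc_based_flops, peak_flops, cap_fraction=1.0):
    """Limit a PFC-derived speed to a fraction of the client-reported peak FLOPS."""
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    if not 0 < cap_fraction <= 1:
        raise ValueError("cap_fraction must be in (0, 1]")
    return min(pfc_based_flops, cap_fraction * peak_flops)


pfc_based = 16250.85e9   # "PFC avg" projection from the server log above
peak = 1.0e12            # hypothetical client-reported peak, for illustration only

print(f"uncapped: {pfc_based / 1e9:.2f} GFLOPS")
print(f"capped  : {capped_projected_flops(pfc_based, peak) / 1e9:.2f} GFLOPS")
```

A cap_fraction below 1.0 would correspond to the "rule-of-thumb fraction" of peak speed mentioned in the post; what that fraction should be is left open here.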

Claggy Send message Joined: 29 Dec 06 Posts: 78 Credit: 4,040,969 RAC: 0 Message 113163 - Posted: 29 Jun 2014, 10:29:45 UTC - in response to Message 113162.
> I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.
Doesn't the main project have this adjustment because they have a single DCF there? But we don't use DCF here, so surely this adjustment shouldn't be used?
Claggy
ID: 113163 · Reply Quote