Project server code update


Claggy

Send message
Joined: 29 Dec 06
Posts: 78
Credit: 4,040,969
RAC: 0
Message 113141 - Posted: 26 Jun 2014, 9:10:30 UTC
Last modified: 26 Jun 2014, 9:11:12 UTC

A lot of my Gamma-ray pulsar search #3 v1.11 results are coming out as inconclusive. In each case they are matched with an Intel GPU, and in each case that Intel GPU is running OpenCL 1.1 drivers. Shouldn't that app be restricted to Intel GPUs with OpenCL 1.2 drivers?

Validation inconclusive Gamma-ray pulsar search #3 tasks for computer 8143

Claggy
ID: 113141 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113142 - Posted: 26 Jun 2014, 17:17:20 UTC

OK, the effect of my configuration change continues and is even clearer. I simply changed the nature (but not the number) of the tasks running on the CPU while this BRP4G test was running on the GPU.

Here are the runtime stats of the two runs (Maximum / Minimum / Average / Median / Std Dev / nSamples):

            (before)    (after)
Maximum     4191.43     5034.97
Minimum     4061.45     4417.27
Average     4128.11     4707.30
Median      4127.66     4668.20
Std Dev     20.45       181.84
nSamples    339         43
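
A quick way to put numbers on the shift, as a minimal sketch that uses only the averages and standard deviations quoted in this post. The function and variable names are illustrative, not part of any BOINC tooling.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from before to after (0.14 means +14%)."""
    return (after - before) / before


# Figures copied from the before/after runtime stats in this post (seconds).
avg_before, avg_after = 4128.11, 4707.30
sd_before, sd_after = 20.45, 181.84

avg_shift = relative_change(avg_before, avg_after)
sd_ratio = sd_after / sd_before

print(f"average runtime change: {avg_shift:+.1%}")
print(f"std dev ratio (after / before): {sd_ratio:.1f}x")
```

On these figures the tasks run roughly 14% longer on average after the change, with about nine times the spread.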


and the corresponding graph



I'm told some new hosts are coming online, so that we can watch and examine the "new host / stable (!) project" scenario in detail. I'll add them to the graphs - probably replacing the old hosts on the log graph, since none of them are returning much data now - as soon as I see successful BRP4G tasks coming back in.
ID: 113142 · Report as offensive     Reply Quote
jason_gee

Send message
Joined: 4 Jun 14
Posts: 109
Credit: 1,043,639
RAC: 0
Message 113143 - Posted: 26 Jun 2014, 23:41:32 UTC - in response to Message 113142.  
Last modified: 26 Jun 2014, 23:42:46 UTC

Once you're happy with that, there are other ways to simulate 'perfectly normal running conditions' that may induce similar divergent behaviour (or worse). One would be to downclock the GPU while BOINC is running (simulating a lower power state, driver timeout/failsafe, deliberate underclock, extended use of the GPU without suspending BOINC, etc.). I think a key takeaway is that the mechanism isn't really adaptive to reasonably normal variable running conditions.
On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. - C Babbage
ID: 113143 · Report as offensive     Reply Quote
juan BFB

Send message
Joined: 10 Dec 12
Posts: 8
Credit: 1,674,320
RAC: 0
Message 113145 - Posted: 27 Jun 2014, 0:14:34 UTC
Last modified: 27 Jun 2014, 0:19:42 UTC

Starting to crunch with my hosts. Comparing my first crunched WUs against ones already validated by Jason's 780SC, my 780FTW host is apparently crunching the BRP4G WUs almost 20% faster, but it's receiving 2-3x more credit. I'm running only 1 WU at a time on each GPU. In theory I'd expect similar credit, or am I missing something?

https://albert.phys.uwm.edu/result.php?resultid=1514929
https://albert.phys.uwm.edu/result.php?resultid=1515731
ID: 113145 · Report as offensive     Reply Quote
jason_gee

Send message
Joined: 4 Jun 14
Posts: 109
Credit: 1,043,639
RAC: 0
Message 113146 - Posted: 27 Jun 2014, 0:35:38 UTC - in response to Message 113145.  

Starting to crunch with my hosts. Comparing my first crunched WUs against ones already validated by Jason's 780SC, my 780FTW host is apparently crunching the BRP4G WUs almost 20% faster, but it's receiving 2-3x more credit. I'm running only 1 WU at a time on each GPU. In theory I'd expect similar credit, or am I missing something?

https://albert.phys.uwm.edu/result.php?resultid=1514929
https://albert.phys.uwm.edu/result.php?resultid=1515731


Yes, existing CreditNew (no mods yet) with new app+host in all its glory. One of the big parts we're studying, because of its importance to keeping new users and applications or devices coming on-line.

That's the onramp period as the system tries to establish how fast you're crunching. It doesn't do it very well, but at least you're getting high credit and giving me some, Thanks! :P

You will be crunching faster than me because I'm doing lots of stuff with my machine lately, and haven't tweaked anything.... also I only have an old Core2Duo driving it.
ID: 113146 · Report as offensive     Reply Quote
Eyrie

Send message
Joined: 20 Feb 14
Posts: 47
Credit: 2,410
RAC: 0
Message 113147 - Posted: 27 Jun 2014, 8:31:58 UTC

Right, so Richard's were sort of converging but are all over the place now.

For Juan, I'd prefer to wait until Richard has done all the hard work and produced a graph :) [Thanks Richard, really appreciated]
Queen of Aliasses, wielder of the SETI rolling pin, Mistress of the red shoes, Guardian of the orange tree, Slayer of very small dragons.
ID: 113147 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113149 - Posted: 27 Jun 2014, 11:54:12 UTC - in response to Message 113147.  

Right, so Richard's were sort of converging but are all over the place now.

For Juan, I'd prefer to wait until Richard has done all the hard work and produced a graph :) [Thanks Richard, really appreciated]

Two more for your viewing pleasure.

I've started to take out the older hosts, which are returning very few tasks these days, but they served their purpose. Red is now Juan's 10351 (the one he linked two tasks from) - classic view for a new host.



And this is mine, still showing scatter from the new configuration. We'll have to wait a few days before Juan will fit on the same scale (although he validated a couple of my oldies overnight - thank you). I'll keep the configuration stable until Sunday night/Monday morning, but I'll have to flip back then - I have some held tasks with deadlines.

ID: 113149 · Report as offensive     Reply Quote
juan BFB

Send message
Joined: 10 Dec 12
Posts: 8
Credit: 1,674,320
RAC: 0
Message 113150 - Posted: 27 Jun 2014, 12:31:22 UTC
Last modified: 27 Jun 2014, 12:34:08 UTC

Nice graph Richard. Maybe you could consider adding one of my 2x690 hosts, 10512 or 10352, since there are no 690s on the graph and they produce a lot of WUs.

Now I understand what you are all saying about new hosts: their RAC oscillates a lot and then converges to a relatively stable range (1.5-2.5 K), no matter the GPU or the host used.
ID: 113150 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113151 - Posted: 27 Jun 2014, 12:42:51 UTC - in response to Message 113150.  

Nice graph Richard. Maybe you could consider adding one of my 2x690 hosts, 10512 or 10352, since there are no 690s on the graph and they produce a lot of WUs.

Now I understand what you are all saying about new hosts: their RAC oscillates a lot and then converges to a relatively stable range (1.5-2.5 K), no matter the GPU or the host used.

Yes, I'm planning to refresh the graph with new hosts, and they might be suitable.

What is most helpful is finding hosts with a nice, steady, continuous flow of data, and as little variation as possible in the running conditions (so that any noise in the credit granted can be attributed to external causes).

The sheer number of tasks pushed through isn't particularly important, but the consistency is. It didn't help that Zombie took the two hosts I'd picked off to another project (he's still running other hosts - they crop up in my wingmate lists from time to time), and Mikey leaving the project because it isn't exporting public stats would rule him out.

It's quite time-consuming to switch things over, so bear with me - for the time being at least, old results aren't being deleted here, so there's no rush.
ID: 113151 · Report as offensive     Reply Quote
juan BFB

Send message
Joined: 10 Dec 12
Posts: 8
Credit: 1,674,320
RAC: 0
Message 113152 - Posted: 27 Jun 2014, 13:18:40 UTC - in response to Message 113151.  
Last modified: 27 Jun 2014, 13:25:39 UTC

What is most helpful is finding hosts with a nice, steady, continuous flow of data, and as little variation as possible in the running conditions (so that any noise in the credit granted can be attributed to external causes).

If you have some time, choose any one of my hosts (or more than one if you wish) and tell me. I will leave the host continuously crunching only Albert for a week, or more if needed. Since they are running 24/7 with almost no other apps running, they could give you some of the continuous flow of data you are looking for. I'd like to help all I can to finally fix the creditscrew problem.
ID: 113152 · Report as offensive     Reply Quote
Snow Crash

Send message
Joined: 11 Aug 13
Posts: 10
Credit: 5,011,603
RAC: 0
Message 113153 - Posted: 27 Jun 2014, 16:51:24 UTC - in response to Message 113152.  

RH - Please let me know if it would be more helpful to simply switch my 7950 from BRP5 to BRP4 or to "remove project" / "add project" (presumably that would create a new host and therefore start credit calcs fresh). Also, is it easier for you if I only run 1 WU at a time?
ID: 113153 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113154 - Posted: 27 Jun 2014, 17:04:28 UTC - in response to Message 113153.  

RH - Please let me know if it would be more helpful to simply switch my 7950 from BRP5 to BRP4 or to "remove project" / "add project" (presumably that would create a new host and therefore start credit calcs fresh). Also, is it easier for you if I only run 1 WU at a time?

Remove project / add project doesn't normally change the HostID - BOINC is designed to recycle the numbers, if for example it recognises the IP address and hardware configuration.

Doesn't matter if it's one at a time or multiple at a time, but it's probably best if you don't mix task types (whether from this project or across projects). If I do start monitoring your host - thanks for the offer - it would help the other observers if you could tell us a bit about any configuration details which can't be observed from the outside - and GPU utilisation factor is one of those.

Don't bust a gut changing things over. I need a bit of a breather, and to set up and get used to a replacement monitor. Bernd also needs to test some more new server code fixes next week, which will give us a new set of apps (designated as 'beta', but in reality the same as the existing ones) with blank application_details records to have a go at.
ID: 113154 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113155 - Posted: 28 Jun 2014, 13:08:26 UTC

With new hosts and a new monitor, let's see how that looks.



I've knocked out the old data (and with it, the extreme data points) - but even so, Juan's new machines show very wide scatter.

Here's that in figures:

            Jason     Holmis    Claggy    Juan      Juan      Juan      RH        RH
Host:       11363     2267      9008      10352     10512     10351     5367      5367
GPU:        GTX 780   GTX 660   GT 650M   GTX 690   GTX 690   GTX 780   GTX 670   GTX 670

Credit for BRP4G, GPU
Maximum     2708.58   2197.18   10952.0   7209.47   6889.8    6652.9    4137.85
Minimum     115.82    88.84     153.90    1667.23   1244.41   1546.02   1355.49
Average     1326.79   1277.87   3631.58   2728.70   2198.10   2463.06   2007.02
Median      1541.35   1411.09   2426.03   2135.67   1948.04   2091.49   1910.19
Std Dev     628.07    690.05    2712.34   1403.91   942.62    969.59    305.80

nSamples    76        102       71        52        43        44        459

Runtime (seconds)                                                       (before)  (after)
Maximum     5027.36   5088.99   11295.0   5605.83   8922.7    3182.0    4191.43   5099.40
Minimum     3239.20   3294.83   8122.09   3081.97   3854.24   1852.2    4061.45   4284.52
Average     3645.57   4549.28   8902.94   4411.88   6305.41   2342.3    4128.08   4686.13
Median      3535.46   4769.05   8847.82   3673.33   5127.40   1864.0    4127.35   4672.83
Std Dev     344.17    456.55    508.22    998.49    1932.50   615.41    20.40     204.66
nSamples                                                                365       94

Turnround (days)
Maximum     6.09      3.91      2.75      0.08      0.45      0.22      0.91
Minimum     0.13      0.07      0.13      0.04      0.05      0.02      0.15
Average     1.94      1.46      0.90      0.05      0.09      0.03      0.67
Median      1.46      1.54      0.79      0.04      0.06      0.03      0.69
Std Dev     1.78      1.00      0.65      0.01      0.06      0.03      0.12

All three of Juan's machines are showing a very wide variation in runtime - he'll have to explain that by local observation, I can't pick it up from the website.
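
To compare runtime variation across hosts of different speeds, one option is the coefficient of variation (standard deviation divided by average). The sketch below uses only the runtime averages and standard deviations from the table in this post; the host labels and function name are illustrative.

```python
# (host label, average runtime in s, std dev in s), copied from the Runtime rows.
runtime_stats = [
    ("Jason 11363", 3645.57, 344.17),
    ("Holmis 2267", 4549.28, 456.55),
    ("Claggy 9008", 8902.94, 508.22),
    ("Juan 10352", 4411.88, 998.49),
    ("Juan 10512", 6305.41, 1932.50),
    ("Juan 10351", 2342.3, 615.41),
    ("RH 5367 (before)", 4128.08, 20.40),
    ("RH 5367 (after)", 4686.13, 204.66),
]


def coefficient_of_variation(mean: float, std_dev: float) -> float:
    """Relative spread: std dev as a fraction of the mean."""
    return std_dev / mean


cv = {host: coefficient_of_variation(mean, sd) for host, mean, sd in runtime_stats}

# List hosts from most to least variable.
for host, value in sorted(cv.items(), key=lambda item: item[1], reverse=True):
    print(f"{host:18s} {value:6.1%}")
```

On these figures Juan's three hosts all come out above 20%, while every other host sits at roughly 10% or below.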
ID: 113155 · Report as offensive     Reply Quote
treblehit

Send message
Joined: 12 Mar 05
Posts: 5
Credit: 35,119
RAC: 0
Message 113156 - Posted: 28 Jun 2014, 20:11:41 UTC - in response to Message 113151.  



What is most helpful is finding hosts with a nice, steady, continuous flow of data, and as little variation as possible in the running conditions (so that any noise in the credit granted can be attributed to external causes).

The sheer number of tasks pushed through isn't particularly important, but the consistency is. <snip>

It's quite time-consuming to switch things over, so bear with me - for the time being at least, old results aren't being deleted here, so there's no rush.


On the basis of that guidance I am going to provide multiple weak systems that will run only Albert and will remain untouched after initial setup. Also, I'll go "natural" without multiple work units or doing anything with the clocks.

These will be new hosts (really low-powered hosts) so won't carry any prior statistics or other baggage with them.

I'll get on it, shortly.

If you need something different, I think Juan and I are both ready to make any sacrifice of "credits" if we are being helpful.
ID: 113156 · Report as offensive     Reply Quote
treblehit

Send message
Joined: 12 Mar 05
Posts: 5
Credit: 35,119
RAC: 0
Message 113157 - Posted: 29 Jun 2014, 4:58:42 UTC
Last modified: 29 Jun 2014, 5:00:38 UTC

Computer 11519

Pretending to be a new user. New install of GPU, new install of drivers, new install of BOINC.

First work fetch of BRP4G-opencl-ati has estimated runtime of 10 seconds.

Obviously, they are erroring-out.

Run time 3 min 40 sec

Exit status 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED
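
For anyone following along, a rough sketch of how a 10-second estimate can end in this abort. As I understand the client, both the runtime estimate and the elapsed-time limit are the workunit's FLOP figures (rsc_fpops_est and rsc_fpops_bound) divided by the speed projected for the app version, so an inflated projection shrinks both. Every number below is an assumption picked to reproduce the 10 s estimate and the 3 min 40 sec (220 s) run time; none of them was read from this host.

```python
def estimated_runtime(rsc_fpops_est: float, projected_flops: float) -> float:
    """Runtime estimate in seconds: estimated FLOPs / projected speed."""
    return rsc_fpops_est / projected_flops


def max_elapsed_time(rsc_fpops_bound: float, projected_flops: float) -> float:
    """Elapsed-time limit in seconds: FLOP bound / projected speed."""
    return rsc_fpops_bound / projected_flops


# ASSUMED values, for illustration only.
projected_flops = 16e12                  # inflated speed projection, FLOPS
rsc_fpops_est = 10 * projected_flops     # chosen so the estimate is 10 s
rsc_fpops_bound = 22 * rsc_fpops_est     # ratio chosen so the limit is 220 s
real_runtime = 8000.0                    # hypothetical true runtime on a slow GPU

estimate = estimated_runtime(rsc_fpops_est, projected_flops)
limit = max_elapsed_time(rsc_fpops_bound, projected_flops)
would_abort = real_runtime > limit

print(f"estimate {estimate:.0f} s, limit {limit:.0f} s, abort: {would_abort}")
```

The point of the sketch is only that the limit scales with the same projected speed as the estimate, so a task that really needs thousands of seconds cannot finish inside it.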

I know what the fix is, but I'm not concerned with fixing it. I'm concerned with helping you fix it.

What do you want me to do?

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
Activated exception handling...
[22:05:40][3552][INFO ] Starting data processing...
[22:05:41][3552][INFO ] Using OpenCL platform provided by: Advanced Micro Devices, Inc.
[22:05:41][3552][INFO ] Using OpenCL device "Juniper" by: Advanced Micro Devices, Inc.
[22:05:41][3552][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
[22:05:41][3552][INFO ] Header contents:
------> Original WAPP file: ./p2030.20130202.G202.32-01.96.N.b0s0g0.00000_DM209.60
------> Sample time in microseconds: 65.4762
------> Observation time in seconds: 274.62705
------> Time stamp (MJD): 56326.065838408722
------> Number of samples/record: 0
------> Center freq in MHz: 1214.289551
------> Channel band in MHz: 0.33605957
------> Number of channels/record: 960
------> Nifs: 1
------> RA (J2000): 62454.7106018
------> DEC (J2000): 83413.5978003
------> Galactic l: 0
------> Galactic b: 0
------> Name: G202.32-01.96.N
------> Lagformat: 0
------> Sum: 1
------> Level: 3
------> AZ at start: 0
------> ZA at start: 0
------> AST at start: 0
------> LST at start: 0
------> Project ID: --
------> Observers: --
------> File size (bytes): 0
------> Data size (bytes): 0
------> Number of samples: 4194304
------> Trial dispersion measure: 209.6 cm^-3 pc
------> Scale factor: 0.00111372
[22:05:46][3552][INFO ] Seed for random number generator is 1168661235.
[22:05:56][3552][INFO ] Derived global search parameters:
------> f_A probability = 0.08
------> single bin prob(P_noise > P_thr) = 1.32531e-008
------> thr1 = 18.139
------> thr2 = 21.241
------> thr4 = 26.2686
------> thr8 = 34.6478
------> thr16 = 48.9581
[22:06:42][3552][INFO ] Checkpoint committed!
[22:07:44][3552][INFO ] Checkpoint committed!
[22:08:46][3552][INFO ] Checkpoint committed!
[22:09:20][3552][INFO ] OpenCL shutdown complete!
[22:09:20][3552][WARN ] BOINC wants us to quit prematurely or we lost contact! Exiting...

</stderr_txt>
ID: 113157 · Report as offensive     Reply Quote
jason_gee

Send message
Joined: 4 Jun 14
Posts: 109
Credit: 1,043,639
RAC: 0
Message 113158 - Posted: 29 Jun 2014, 5:22:00 UTC
Last modified: 29 Jun 2014, 5:24:31 UTC

Thanks,
I had hoped newhost+app onramp for GPUs would improve, but see that it hasn't. I'm not surprised, given we know two precise mechanisms there: default GPU efficiency pinned at 10% (0.1), and improperly applied normalisation (you can't normalise time estimates without a functional host_scale, which is disabled for the onramp period).

New user, host &/or application is central to this effort, so thanks again for the information. At this point you could either choose to jigger the bounds of tasks (allowing it to reach where host_scale kicks in) or alternatively let it go on erroring & see what happens (I imagine it'd just keep erroring & reduce quota to 1/day).

Both options have merit so it's your choice, though I think the jiggering option has been pretty thoroughly used, and the second one more likely in common usage cases. Up to you

Jason
ID: 113158 · Report as offensive     Reply Quote
treblehit

Send message
Joined: 12 Mar 05
Posts: 5
Credit: 35,119
RAC: 0
Message 113159 - Posted: 29 Jun 2014, 6:05:36 UTC - in response to Message 113158.  

At this point you could either choose to jigger the bounds of tasks (allowing it to reach where host_scale kicks in) or alternatively let it go on erroring & see what happens (I imagine it'd just keep erroring & reduce quota to 1/day).


That's what happened. Down to 1 wu/day and I'm done for the day.

Man, am I ever glad I drove that one hour round trip in a 15mpg vehicle to try to get a steady stream of work headed Albert's direction.

There's always the 1 wu I'll get tomorrow. <heavy sigh>
ID: 113159 · Report as offensive     Reply Quote
jason_gee

Send message
Joined: 4 Jun 14
Posts: 109
Credit: 1,043,639
RAC: 0
Message 113160 - Posted: 29 Jun 2014, 7:04:37 UTC - in response to Message 113159.  

lol, yeah, all in a good cause though :) obvious breakage like that makes the case put forward in some quarters that it's working fine look a tad on the ridiculous side. The more 'normal' situations like that, that simply don't work, the better we understand, and can push to get it fixed once and for all.
ID: 113160 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113162 - Posted: 29 Jun 2014, 9:36:06 UTC

From treblehit's server log https://albert.phys.uwm.edu/host_sched_logs/11/11519

2014-06-29 09:21:30.4581 [PID=3880 ] [version] [AV#738] (BRP5-opencl-ati) adjusting projected flops based on PFC avg: 16250.85G
2014-06-29 09:21:30.4581 [PID=3880 ] [version] Best app version is now AV738 (0.89 GFLOP)
2014-06-29 09:21:30.4581 [PID=3880 ] [version] [AV#738] (BRP5-opencl-ati) adjusting projected flops based on PFC avg: 16250.85G
2014-06-29 09:21:30.4581 [PID=3880 ] [version] Best version of app einsteinbinary_BRP5 is [AV#738] (16250.85 GFLOPS)

I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.
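
A minimal sketch of the sanity cap suggested above, written in Python rather than in the server's own code. The function name, the peak-FLOPS figure and the cap fraction are all hypothetical; only the 16250.85 GFLOPS projection comes from the log.

```python
def capped_projected_flops(pfc_based_flops: float,
                           peak_flops: float,
                           cap_fraction: float = 1.0) -> float:
    """Limit a PFC-derived speed projection to a fraction of the device's peak FLOPS."""
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    if not 0 < cap_fraction <= 1:
        raise ValueError("cap_fraction must be in (0, 1]")
    return min(pfc_based_flops, cap_fraction * peak_flops)


pfc_projection = 16250.85e9   # from the scheduler log above, in FLOPS
assumed_peak = 1000e9         # HYPOTHETICAL peak for the device, not a measured value

projected = capped_projected_flops(pfc_projection, assumed_peak)
print(f"projected speed after cap: {projected / 1e9:.2f} GFLOPS")
```

A projection already below the cap passes through unchanged, so the cap only bites on implausible values like the one in the log.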
ID: 113162 · Report as offensive     Reply Quote
Claggy

Send message
Joined: 29 Dec 06
Posts: 78
Credit: 4,040,969
RAC: 0
Message 113163 - Posted: 29 Jun 2014, 10:29:45 UTC - in response to Message 113162.  

I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.

Doesn't the main project have this adjustment because they have a single DCF there? We don't use DCF here, so shouldn't this adjustment be left unused?

Claggy
ID: 113163 · Report as offensive     Reply Quote

This material is based upon work supported by the National Science Foundation (NSF) under Grant PHY-0555655 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2024 Bruce Allen for the LIGO Scientific Collaboration