
Posts by Richard Haselgrove

21) Message boards : News : Project server code update (Message 113212)
Posted 2 Jul 2014 by Richard Haselgrove
Post:
BTW, I've slowed down my crunchers here, since I don't believe quantity is what you're looking for, and now they will produce a stable number of daily WUs.

I think that's probably a good idea. We're already at the stage where my last 12 consecutive validations have been against one or other of your hosts (5 different machines, I think). And the machines are all pretty similar, to each other and to mine: GTX 670/690/780, running Win7/64 or (in one case) Server 2008.

In order to see (now) and test (later) BOINC's behaviour in the real world, we probably need a reasonable variation in hosts to give us realistic variation in the times and credits.

Bernd has launched a new 'BRP5' (Perseus Arm Survey) v1.40, with a Beta app tag on it, to test that new feature in the BOINC scheduler. I'm in the process of switching my machine over to run that instead. Some company would be nice, but be warned: we're half expecting to fall over the 'EXIT_TIME_LIMIT_EXCEEDED' problem at some stage with BRP5 Beta, so hosts running it probably need to be watched quite closely for strange estimated runtimes, and you need to be ready to take action to correct it.
22) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113208)
Posted 2 Jul 2014 by Richard Haselgrove
Post:
I think we're going to have real problems with the Gamma-ray pulsar search #3 app for a while.

I posted that my host 11362 was getting runtime estimates of 12 minutes, time allowed 4 hours, at a projected 20 GFLOPS.

Turns out that two of the three tasks I've returned so far would have exceeded bounds if I hadn't inoculated them. So my GTX 470 GPU is running at an effective rate of 1 GFLOPS or less. As is described elsewhere, this app is very much still a work-in-progress, where very little work is done on the GPU, and most of it still on the CPU - it wants a full CPU core, and uses it to the hilt.

Similarly, TJ's GTX 660 has been taking around three hours for the matching tasks over at the main Einstein project. So that makes even more of a mockery of the server dishing out a bounds limit of four minutes for his machine - his speed must be mis-estimated by a factor of 1,000 or so.
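
For what it's worth, here's the back-of-envelope arithmetic behind that "factor of 1,000" (a rough sketch only; the 20x bound-to-estimate ratio is the same one I'm seeing on my own host - 12 minutes estimated, 4 hours allowed - and TJ's three-hour runtime is as reported above):

# Rough sketch of the mis-estimation factor for TJ's GTX 660, using the
# figures reported in this thread (assumption: the bound is ~20x the estimate,
# as on my own host).
bound_seconds = 4 * 60                 # his tasks error out at about four minutes
bound_to_estimate_ratio = 20           # max time is roughly 20x the estimated runtime
implied_estimate = bound_seconds / bound_to_estimate_ratio   # ~12 s
actual_runtime = 3 * 3600              # ~three hours on the matching Einstein tasks
print("implied estimate: %.0f s" % implied_estimate)                        # 12 s
print("mis-estimation factor: %.0f" % (actual_runtime / implied_estimate))  # ~900, i.e. a factor of 1,000 or so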

And to put the icing on the cake, all three of my returned results have been paired with different anonymous Intel HD 2500 GPUs running with the dodgy OpenCL 1.1 driver that Claggy noticed. Inconclusive, the lot of them. It's going to take a while to get the server averages back into kilter...
23) Message boards : News : Project server code update (Message 113207)
Posted 2 Jul 2014 by Richard Haselgrove
Post:
I wonder why Claggy's laptop gets such variable credit?


Multiple tasks on a smaller GPU, each running longer, will generate higher raw peak flop claims (pfc's); then that's averaged with the wingman's (yellow triangle on the dodgy diagram). So the result can be anywhere from the normal range to a jackpot, as we previously assessed, depending on the wingman's claim. Though the prevalence of the jackpot conditions is less obvious, the noise in the system is still there.

I'm just running a single GPU task on both my GPU hosts (the T8100's 128MB 8400M GS doesn't count).

Claggy

Could be the wingmen. (There's a number of combinations of wingmen types that'll give random results between two regions. Two similar wingmen tend to cancel with averaging and become 'normal')

Conversely, when he's paired with me - now back to lower, stable, runtimes - no jackpot, no bonus. Sorry 'bout that.
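
To put that averaging point in toy-model form (this is not the actual CreditNew code, just an illustration of the effect Jason describes; the claim values are invented):

# Toy model of the pfc averaging described above (not the real CreditNew code;
# the numbers are invented for illustration).
def granted(my_claim, wingman_claim):
    # As described in this thread: the two claims are averaged, so one
    # inflated claim drags the grant upwards.
    return (my_claim + wingman_claim) / 2.0

normal = 1500.0                    # a "normal" claim for these tasks (invented)
inflated = 10 * normal             # a jackpot-sized claim from an odd wingman

print(granted(normal, normal))     # 1500 - two similar claims cancel out
print(granted(normal, inflated))   # 8250 - one inflated claim gives a jackpot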
24) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113202)
Posted 1 Jul 2014 by Richard Haselgrove
Post:
I have the same errors, but my wing(wo)men with nVidia cards also have this error. If the task is done by a CPU, then it is validated. So I think it has something to do with the GPU app.
Only Gamma-ray pulsar search #3 v1.11 (FGRPopencl-nvidia) has this error (on my side).

@ tjreuter,

Could you possibly unhide your host(s) at this project, or give us a direct link to the one you're having problems with?

It would help us to give you more specific advice, and it would also help us (and the project) to understand more clearly why this problem happens in the first place.

Rigs should be visible now, Richard. However, I have checked the Gamma-ray pulsar search out. (At Einstein@Home they work, though.)

Yes, visible now, thanks.

I assume we're talking about

Error Gamma-ray pulsar search #3 tasks for computer 7731 - tasks issued yesterday.

Unfortunately, Application details for host 7731 shows no APR for that app, because none of the tasks completed successfully.

And the server log https://albert.phys.uwm.edu/host_sched_logs/7/7731 isn't much use either, because the last scheduler contact was to report work only, with no new work requested.

What I'd like to see, if at all possible, is a copy of the server log for an example of a work request where an FGRP task was issued. It would look something like

2014-07-01 17:18:03.1608 [PID=30917] [version] Checking plan class 'FGRPopencl-nvidia'
2014-07-01 17:18:03.1608 [PID=30917] [version] plan_class_spec: parsed project prefs setting 'gpu_util_fgrp' : true : 1.000000
2014-07-01 17:18:03.1609 [PID=30917] [version] [AV#913] (FGRPopencl-nvidia) using conservative projected flops: 20.12G
2014-07-01 17:18:03.1609 [PID=30917] [version] Best app version is now AV913 (29.38 GFLOP)
2014-07-01 17:18:03.1610 [PID=30917] [version] [AV#913] (FGRPopencl-nvidia) 11362
2014-07-01 17:18:03.1610 [PID=30917] [version] Best version of app hsgamma_FGRP3 is [AV#913] (20.12 GFLOPS)
2014-07-01 17:18:03.1610 [PID=30917] [send] est delay 0, skipping deadline check
2014-07-01 17:18:03.1629 [PID=30917] [send] Sending app_version hsgamma_FGRP3 2 111 FGRPopencl-nvidia; projected 20.12 GFLOPS
2014-07-01 17:18:03.1630 [PID=30917] [CRITICAL] No filename found in [WU#605548 LATeah0109C_32.0_99_-5.66e-10]
2014-07-01 17:18:03.1630 [PID=30917] [send] est. duration for WU 605548: unscaled 745.62 scaled 745.95
2014-07-01 17:18:03.1630 [PID=30917] [send] [HOST#11362] sending [RESULT#1453006 LATeah0109C_32.0_99_-5.66e-10_0] (est. dur. 745.95s (0h12m25s95)) (max time 14912.31s (4h08m32s31))

Note that in my case (from host 11362) the server is estimating - last line - that the task will run for 746 seconds (which is what I'm seeing locally too), and won't be thrown out with a time limit error for over four hours.

That's calculated from "using conservative projected flops: 20.12G" a few lines above (which is a new one on me). Since your tasks error out in under 4 minutes, I assume the initial estimates must have been 20 times smaller than that - 12 seconds or something.
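
Working backwards from those log lines (back-of-envelope only; the scheduler's own arithmetic may differ in detail):

# Back-of-envelope reconstruction of the last two [send] lines above
# (the scheduler's own arithmetic may differ in detail).
projected_flops = 20.12e9            # "using conservative projected flops: 20.12G"
est_duration    = 745.95             # seconds, from the [send] line
max_time        = 14912.31           # seconds, from the [send] line

rsc_fpops_est   = est_duration * projected_flops   # ~1.5e13 flops for the workunit
rsc_fpops_bound = max_time * projected_flops       # ~3.0e14 flops, i.e. 20x the estimate

# For the same workunit to be killed after only ~4 minutes, the server would
# have to be projecting something like:
print(rsc_fpops_bound / 240)         # ~1.25e12, i.e. ~1,250 GFLOPS - wildly optimistic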

What I'd ideally like to see is a similar server log from your machine, showing the GFlops value it's using to calculate your runtime. You have to be quick to catch it: there seem to be very few tasks around at the moment, and I had to try several times. Then, you have to capture the server log within a minute, otherwise another attempt will overwrite the successful one (unless you set NNT before your computer asks again). There's something very odd about the way the Albert server is setting these estimated speeds, and we haven't fully got to the bottom of it yet.
25) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113200)
Posted 1 Jul 2014 by Richard Haselgrove
Post:
I have the same errors, but my wing(wo)men with nVidia cards also have this error. If the task is done by a CPU, then it is validated. So I think it has something to do with the GPU app.
Only Gamma-ray pulsar search #3 v1.11 (FGRPopencl-nvidia) has this error (on my side).

@ tjreuter,

Could you possibly unhide your host(s) at this project, or give us a direct link to the one you're having problems with?

It would help us to give you more specific advice, and it would also help us (and the project) to understand more clearly why this problem happens in the first place.
26) Message boards : News : Project server code update (Message 113197)
Posted 1 Jul 2014 by Richard Haselgrove
Post:
Latest scattergram.

I've reverted my 5367 to normal running (early afternoon yesterday), so my timings *should* be lower and steadier - doesn't really seem to show in credit yet. I wonder why Claggy's laptop gets such variable credit?
27) Message boards : Problems and Bug Reports : Errors - 197 (0xc5) EXIT_TIME_LIMIT_EXCEEDED (Message 113196)
Posted 1 Jul 2014 by Richard Haselgrove
Post:
Specifically:

Fetch some FGRP (Gamma-ray pulsar search) work.
Exit BOINC completely
Edit <rsc_fpops_bound> as Eyrie describes (there's a rough script sketch at the end of this post). You'll find it in the <workunit> definition for each of the tasks you've downloaded.
Restart BOINC, and allow the tasks to run and report as usual. Probably best to set 'No New Tasks' while you do this.

Once you've reported and validated 11 tasks, the procedure should no longer be necessary. If you didn't get 11 validations from the first batch, repeat as needed.
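
For anyone who'd rather script the edit than do it by hand, here's a rough sketch (assumptions: BOINC is completely stopped, client_state.xml has the standard layout, and a 100x bound is generous enough - adjust to taste, and keep the backup):

# Rough sketch only: multiply <rsc_fpops_bound> for FGRP workunits in
# client_state.xml by a generous factor. Run it with BOINC fully stopped,
# from inside the BOINC data directory, and keep the .bak copy.
import re
import shutil

STATE = "client_state.xml"    # adjust to your BOINC data directory
FACTOR = 100                  # how much extra headroom to allow (assumption)

shutil.copyfile(STATE, STATE + ".bak")
with open(STATE, encoding="utf-8") as f:
    text = f.read()

def bump_bound(m):
    return "<rsc_fpops_bound>%.0f</rsc_fpops_bound>" % (float(m.group(1)) * FACTOR)

def fix_workunit(m):
    block = m.group(0)
    # app name as it appears in the server logs quoted elsewhere in this thread
    if "hsgamma_FGRP" in block:
        block = re.sub(r"<rsc_fpops_bound>([^<]+)</rsc_fpops_bound>", bump_bound, block)
    return block

text = re.sub(r"<workunit>.*?</workunit>", fix_workunit, text, flags=re.DOTALL)
with open(STATE, "w", encoding="utf-8") as f:
    f.write(text)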
28) Message boards : News : Project server code update (Message 113189)
Posted 29 Jun 2014 by Richard Haselgrove
Post:
Um, if you don't mind, I think it might be best to wait a little while. The administrators on this project are based in Europe, and, as you know, Jason is ahead of our time zone, in Australia. I think it might be better to wait 12 hours or so, until we have a chance to compare notes by email when the lab opens in the morning.

After all, we don't want to use up our entire supply of unattached new hosts in one hit, or else we won't have anything left to test Jason's patches with....
29) Message boards : News : Project server code update (Message 113185)
Posted 29 Jun 2014 by Richard Haselgrove
Post:
At least one of those must be upside down.


In a sense, yes. GPU app+device+conditions efficiency would be actual/peak, and must be less than 1 (and it is; e.g. it should be around 0.05 for a single-task CUDA GPU). Normalisation could be viewed as turning it upside down. It'll raise the GFlops & shrink the time estimate artificially --> the exact opposite of the kind of behaviour we want for new hosts/apps.

Things will become a bit clearer when I have the next dodgy diagram ready. Getting bogged down in broken code is a bit of a red herring at the moment, as there are design-level issues to tackle first.

In particular, debugging the normalisation, including the absurd GFlops numbers it produces, is pointless in the context of estimates. That's because neither the time nor the GFlops should be being normalised [AT ALL], so it all gets disabled in estimates, and restricted to credit-related uses where it's applicable, to get the same credit claims from different apps.

Well, we do (crudely) have two separate cases to deal with.

1) initial attach. We have to get rid of that divide-by-almost-zero, or hosts can't run. They get the absurdly low runtime estimate/bound and error when they exceed it.

2) steady state. In my (political) opinion, trying to bring back client-side DCF will be flogging one dead horse too many. We need some sort of server-side control of runtime estimates, so that client scheduling works and user expectations are met. I'm happy to accept that the new version will be different to the one we have now, and look forward to seeing it.

OK, I'll get out of your hair, and take my coffee downstairs to grab some more stats.
30) Message boards : News : Project server code update (Message 113182)
Posted 29 Jun 2014 by Richard Haselgrove
Post:
See edit to my last. In my view, if the relevant numbers are all <<1, we should be multiplying by them, not dividing by them.

Out of coffee error - going shopping. Back soon.
31) Message boards : News : Project server code update (Message 113180)
Posted 29 Jun 2014 by Richard Haselgrove
Post:
app version pfc is normalised to 0.1 (design flaw), and any real samples would have driven it toward 0.05 or lower. So that text should be 10-20x+ marketing flops, and is NOT the intent, nor remotely correct design. It's gibberish.

The advice given to project administrators in http://boinc.berkeley.edu/trac/wiki/AppPlanSpec is:

<gpu_peak_flops_scale>x</gpu_peak_flops_scale>
scale GPU peak speed by this (default 1).

I'm wondering whether they put in 0.1, expecting this to be a multiplier (real flops are lower than peak flops), but ended up dividing by 0.1 instead? And from what you say, 'default 1' doesn't match the code either?

Edit: the alternative C++ documentation for plan_classes is in http://boinc.berkeley.edu/trac/wiki/PlanClassFunc. There, the example is

.21            // estimated GPU efficiency (actual/peak FLOPS)

At least one of those must be upside down.
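
Spelling out that 'upside down' suspicion in numbers (just a sketch of the two readings, not the actual scheduler code; the peak figure is the 192 GFLOPS client value I quoted from the thread opener):

# Sketch of the two possible readings of a scale/efficiency factor of ~0.1
# (not the actual scheduler code; peak_flops is the client figure quoted
# elsewhere in this thread).
peak_flops = 192e9                  # "client 192 GFLOPS peak"

scale = 0.1                         # what the wiki text suggests as a multiplier
as_intended = peak_flops * scale    # 19.2 GFLOPS  - a plausible real-world speed
upside_down = peak_flops / scale    # 1920 GFLOPS  - 10x *above* marketing peak

print(as_intended / 1e9, upside_down / 1e9)
# If the estimate uses the upside-down figure, runtimes come out ~100x too
# short compared with the intended reading - exactly the kind of error that
# trips EXIT_TIME_LIMIT_EXCEEDED.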
32) Message boards : News : Project server code update (Message 113176)
Posted 29 Jun 2014 by Richard Haselgrove
Post:
Now, on the server side, that 'Best version of app' string comes from sched_version.cpp (scheduler built-in functions) and uses the following resources:
app->name, bavp->avp->id, bavp->host_usage.projected_flops/1e9

That projected_flops is set during app version selection: as the number of samples will be < 10, flops will be adjusted based on the pfc samples average for the app version (there will be 100 of those from other users).

Since that's normalised elsewhere (see the red ellipse on the dodgy diagram), the net effect translates the pfc of 0.1 used for the original estimate to 1, so peak_flops is x10-20.

Richard, do you want code line numbers for that?

That's OK, I can do a text search in sched_version.cpp same as you.

What would perhaps be most useful would be an expanded table of all those TLA variable names, with your assessment of what David intended them to mean, and of what they actually mean in practice.

Looking back at the thread openers, I reported:
client 192 GFLOPS peak, based on PFC avg: 2124.60G

I can't quickly find the client GFLOPS peak number for Claggy's ATI 'Capeverde' with "based on PFC avg: 34968.78G". I'd like to look for the variable (presumably a struct member) where we might expect GFLOPS peak to be stored, and see what it's multiplied by in those initial stages before 11 completions establish an APR. We might expect 0.1 from the words, but we seem to be using >10 by the numbers.
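
The arithmetic, for the record (simple ratios of figures already quoted in this thread; I don't have the Capeverde's peak figure to do the same check there):

# Ratio check on the figures quoted at the start of the thread (my host):
client_peak = 192.0        # GFLOPS, "client 192 GFLOPS peak"
pfc_avg     = 2124.60      # GFLOPS, "based on PFC avg: 2124.60G"
print(pfc_avg / client_peak)   # ~11.1 - i.e. >10x peak, not 0.1x peak
# The same check can't be done for Claggy's Capeverde (PFC avg 34968.78G)
# without its client peak figure.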
33) Message boards : News : Project server code update (Message 113174)
Posted 29 Jun 2014 by Richard Haselgrove
Post:
Ah, all right.
Yeah, only interested in fixing the current code, rather than diagnosing/patching old versions :)

Yes, concentrating on the current code and moving it forward is certainly the right approach - but it's probably worth just being aware of the steps we moved through to reach this point, because it can influence compatibility problems that could arise in the future.

As we've discussed, DCF was deprecated from client v7.0.28, and in the server code from a little earlier. But not everything in the BOINC world moves in lockstep, so we have older and newer servers in use, and we also have older and newer clients in use.

Older servers take account of client DCF when scaling runtime estimates prior to allocating work:
[send] active_frac 0.999987 on_frac 0.999802 DCF 0.776980

Newer servers don't:
[send] on_frac 0.999802 active_frac 0.999987 gpu_active_frac 0.999978

Those are both the same machine (the one I've been graphing here), which explains why on_frac and active_frac are identical. But the first line comes from the Einstein server log, and the second line from the Albert server log.

So, even my late-alpha version of BOINC (v7.3.19) is maintaining, using and reporting DCF against an 'old server' project which needs it. Good compatibility choice.

But the reverse case is not so happy. An older client (I'm talking standard stock clients here, not Jason's specially-tweaked client) will go on using and reporting DCF as before, because it doesn't parse the <dont_use_dcf/> tag. But the newer server code has discarded DCF completely, and doesn't scale its internal runtime estimates when presented with a work request from a client which is still using it.

This can - and does - result in servers allocating vastly different volumes of work from what the client expects, because the estimation process doesn't have all the same inputs.

Say, for the sake of argument, that an 'old' (pre-v7.0.28) client has got itself into a state with DCF=100, and asks for 1 day of work. For the BRP4G tasks we're studying here, we'd all expect the server to allocate maybe 20 tasks, and the client to agree with the server's calculation of estimated runtime, slightly over 1 day. But if the client is using DCF, and the server isn't, that can appear as a 100-day work cache when the client does the local calculation. That's a case where server-client compatibility breaks down, and breaks down badly.
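
In code, that worked example looks something like this (a rough sketch; the per-task estimate is a ballpark BRP4G figure from the tables in this thread, and the real client and server calculations have more inputs than this):

# Rough sketch of the DCF mismatch described above.
server_est_per_task = 4500.0         # seconds, roughly what the server thinks (ballpark)
requested_seconds   = 86400.0        # client asks for 1 day of work
dcf                 = 100.0          # a pre-v7.0.28 client in a bad state

# New server: ignores DCF, so it sends enough tasks to cover one day.
tasks_sent = round(requested_seconds / server_est_per_task)   # ~19-20 tasks

# Old client: still multiplies every estimate by its DCF locally.
client_view_seconds = tasks_sent * server_est_per_task * dcf
print(tasks_sent, client_view_seconds / 86400.0)   # ~19 tasks, seen locally as ~100 days of work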
34) Message boards : News : Project server code update (Message 113162)
Posted 29 Jun 2014 by Richard Haselgrove
Post:
From treblehit's server log https://albert.phys.uwm.edu/host_sched_logs/11/11519

2014-06-29 09:21:30.4581 [PID=3880 ] [version] [AV#738] (BRP5-opencl-ati) adjusting projected flops based on PFC avg: 16250.85G
2014-06-29 09:21:30.4581 [PID=3880 ] [version] Best app version is now AV738 (0.89 GFLOP)
2014-06-29 09:21:30.4581 [PID=3880 ] [version] [AV#738] (BRP5-opencl-ati) adjusting projected flops based on PFC avg: 16250.85G
2014-06-29 09:21:30.4581 [PID=3880 ] [version] Best version of app einsteinbinary_BRP5 is [AV#738] (16250.85 GFLOPS)

I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.
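
Something along those lines might look like this (a sketch only, not a patch against the real scheduler; the cap fraction and the 7950-class peak figure are assumptions):

# Sketch of the sanity cap suggested above (not a patch to the real scheduler;
# the 1.0x cap fraction is an arbitrary placeholder).
def capped_projected_flops(pfc_based_flops, client_peak_flops, cap_fraction=1.0):
    # Never project more than some fraction of the client's own calculated
    # peak speed, however wild the PFC average is.
    return min(pfc_based_flops, cap_fraction * client_peak_flops)

# With treblehit's figure from the log above (2.87e12 is an assumed peak for
# a 7950-class GPU):
print(capped_projected_flops(16250.85e9, 2.87e12) / 1e9)   # capped at ~2870 GFLOPS, not 16250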
35) Message boards : News : Project server code update (Message 113155)
Posted 28 Jun 2014 by Richard Haselgrove
Post:
With new hosts and a new monitor, let's see how that looks.

I've knocked out the old data (and with it, the extreme data points) - but even so, Juan's new machines show very wide scatter.

Here's that in figures:

                  Jason     Holmis    Claggy    Juan      Juan      Juan      RH        RH
Host:             11363     2267      9008      10352     10512     10351     5367      5367
                  GTX 780   GTX 660   GT 650M   GTX 690   GTX 690   GTX 780   GTX 670   GTX 670

Credit for BRP4G, GPU
Maximum           2708.58   2197.18   10952.0   7209.47   6889.8    6652.9    4137.85
Minimum           115.82    88.84     153.90    1667.23   1244.41   1546.02   1355.49
Average           1326.79   1277.87   3631.58   2728.70   2198.10   2463.06   2007.02
Median            1541.35   1411.09   2426.03   2135.67   1948.04   2091.49   1910.19
Std Dev           628.07    690.05    2712.34   1403.91   942.62    969.59    305.80
nSamples          76        102       71        52        43        44        459

Runtime (seconds)                                                   (before)  (after)
Maximum           5027.36   5088.99   11295.0   5605.83   8922.7    3182.0    4191.43   5099.40
Minimum           3239.20   3294.83   8122.09   3081.97   3854.24   1852.2    4061.45   4284.52
Average           3645.57   4549.28   8902.94   4411.88   6305.41   2342.3    4128.08   4686.13
Median            3535.46   4769.05   8847.82   3673.33   5127.40   1864.0    4127.35   4672.83
Std Dev           344.17    456.55    508.22    998.49    1932.50   615.41    20.40     204.66
nSamples                                                            365       94

Turnround (days)
Maximum           6.09      3.91      2.75      0.08      0.45      0.22      0.91
Minimum           0.13      0.07      0.13      0.04      0.05      0.02      0.15
Average           1.94      1.46      0.90      0.05      0.09      0.03      0.67
Median            1.46      1.54      0.79      0.04      0.06      0.03      0.69
Std Dev           1.78      1.00      0.65      0.01      0.06      0.03      0.12

All three of Juan's machines are showing a very wide variation in runtime - he'll have to explain that by local observation; I can't pick it up from the website.
36) Message boards : News : Project server code update (Message 113154)
Posted 27 Jun 2014 by Richard Haselgrove
Post:
RH - Please let me know if it would be more helpful to simply switch my 7950 from BRP5 to BRP4 or to "remove project" / "add project" (presumably that would create a new host and therefore start credit calcs fresh). Also, is it easier for you if I only run 1 WU at a time?

Remove project / add project doesn't normally change the HostID - BOINC is designed to recycle the numbers, if for example it recognises the IP address and hardware configuration.

Doesn't matter if it's one at a time or multiple at a time, but it's probably best if you don't mix task types (whether from this project or across projects). If I do start monitoring your host - thanks for the offer - it would help the other observers if you could tell us a bit about any configuration details which can't be observed from the outside - and GPU utilisation factor is one of those.

Don't bust a gut changing things over. I need a bit of a breather, and to set up and get used to a replacement monitor; and Bernd needs to test some more new server code fixes next week, which will give us a new set of apps (designated as 'beta', but in reality the same as the existing ones) with blank application_details records to have a go at.
37) Message boards : News : Project server code update (Message 113151)
Posted 27 Jun 2014 by Richard Haselgrove
Post:
Nice graph, Richard. Maybe you could consider adding one of my 2x690 hosts, 10512 or 10352, since there is no 690 on the graph and they produce a lot of WUs.

Now I understand what you all are talking about with new hosts: their RAC oscillates a lot and then converges to a relatively stable range (1.5-2.5K) no matter the GPU or the host used.

Yes, I'm planning to refresh the graph with new hosts, and they might be suitable.

What is most helpful is finding hosts with a nice, steady, continuous flow of data, and as little variation as possible in the running conditions (so that any noise in the credit granted can be attributed to external causes).

The sheer number of tasks pushed through isn't particularly important, but the consistency is. It didn't help that Zombie took the two hosts I'd picked off to another project (he's still running other hosts - they crop up in my wingmate lists from time to time), and Mikey's departure from the project (because it isn't exporting public stats) rules him out.

It's quite time-consuming to switch things over, so bear with me - for the time being at least, old results aren't being deleted here, so there's no rush.
38) Message boards : News : Project server code update (Message 113149)
Posted 27 Jun 2014 by Richard Haselgrove
Post:
Right, so Richard's were sort of converging but are all over the place now.

For Juan, I prefer to wait until Richard has done all the hard work and produced a graph :) [Thanks Richard, really appreciated]

Two more for your viewing pleasure.

I've started to take out the older hosts, which are returning very few tasks these days, but they served their purpose. Red is now Juan's 10351 (the one he linked two tasks from) - classic view for a new host.

And this is mine, still showing scatter from the new configuration. We'll have to wait a few days before Juan will fit on the same scale (although he validated a couple of my oldies overnight - thank you). I'll keep the configuration stable until Sunday night/Monday morning, but I'll have to flip back then - I have some held tasks with deadlines.

39) Message boards : News : Project server code update (Message 113142)
Posted 26 Jun 2014 by Richard Haselgrove
Post:
OK, the effect of my configuration change continues and is even clearer. I simply changed the nature (but not the number) of the tasks running on the CPU while this BRP4G test was running on the GPU.

Here are the runtime stats of the two runs:

            (before)   (after)
Maximum     4191.43    5034.97
Minimum     4061.45    4417.27
Average     4128.11    4707.30
Median      4127.66    4668.20
Std Dev     20.45      181.84
nSamples    339        43


and the corresponding graph

I'm told some new hosts are coming online, so that we can watch and examine the "new host / stable (!) project" scenario in detail. I'll add them to the graphs - probably replacing the old hosts on the log graph, since none of them are returning much data now - as soon as I see successful BRP4G tasks coming back in.
40) Message boards : News : Project server code update (Message 113134)
Posted 25 Jun 2014 by Richard Haselgrove
Post:
Updating both graphs, to show a new effect.

This morning, I was asked to change the running configuration on my host 5367, for an unrelated reason. As a result, the maximum runtime for these tasks went up from 4137.85 seconds to 4591.35 - nearly 11%.

The first task back after that - before APR had a chance to respond, obviously - is the high outlier at 2474.34.

I think that's further evidence of the kind of instability we need to cure.


