
Posts by jason_gee

21) Message boards : News : Project server code update (Message 113171)
Posted 29 Jun 2014 by jason_gee
Post:
Now the server side: that 'Best version of app' string comes from sched_version.cpp (scheduler built-in functions) and uses the following resources:
app->name, bavp->avp->id, bavp->host_usage.projected_flops/1e9

That projected_flops is set during app version selection. As the number of samples will be < 10, flops will be adjusted based on the pfc samples average for the app version (there will be ~100 of those from other users).

Since that's normalised elsewhere (see the red ellipse on the dodgy diagram), the net effect translates the pfc of 0.1 used for the original estimate to 1, so the estimate assumes peak_flops (x10-20 the real rate).
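
Roughly, the selection logic described above looks something like this (a paraphrased sketch with simplified names and thresholds, not the literal sched_version.cpp source):

// Paraphrased sketch of the projected_flops selection path (names,
// thresholds and the 10% default are illustrative, not the literal source).
double choose_projected_flops(int host_samples, double host_elapsed_avg,
                              int av_pfc_samples, double av_pfc_avg,
                              double peak_flops) {
    if (host_samples < 10) {
        // Too few validated results for this host + app version, so the
        // host's own statistics are ignored...
        if (av_pfc_samples > 0) {
            // ...and projected_flops is derived from the app version's PFC
            // average, gathered from other users' hosts (~100 samples).
            // This is the "adjusting projected flops based on PFC avg"
            // line seen in the scheduler log.
            return peak_flops / av_pfc_avg;
        }
        // No version statistics either: fall back to the onramp default,
        // GPUs assumed to run at 10% of peak.
        return 0.1 * peak_flops;
    }
    // Enough samples: use the host's own measured rate (flops inferred
    // from its elapsed-time average, normalised per estimated flop).
    return 1.0 / host_elapsed_avg;
}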
22) Message boards : News : Project server code update (Message 113169)
Posted 29 Jun 2014 by jason_gee
Post:
Ah, all right,
Yeah, only interested in fixing the current code rather than diagnosing/patching old versions :)
23) Message boards : News : Project server code update (Message 113167)
Posted 29 Jun 2014 by jason_gee
Post:
I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.

Doesn't the Main project have this adjustment because they have a single DCF there? But we don't use DCF here, so this adjustment shouldn't be used?

Claggy


There is no adjustment, the adjustment is a lie. <dont_use_dcf> is hard wired active for all clients >= 7.0.28.

But only on projects that don't use DCF; Einstein on my i7-2600K/HD7770 has a DCF of:

<duration_correction_factor>1.267963</duration_correction_factor>

Claggy


Well, you've lost me there, because every scheduler reply to a >= 7.0.28 client, according to the scheduler code, pushes <dont_use_dcf/> [and there is no configuration switch for it].
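
The relevant bit of behaviour is roughly this (a hedged paraphrase, not the literal scheduler source):

#include <cstdio>

// Hedged paraphrase of the scheduler-reply behaviour described above.
void write_dcf_flag(FILE* fout, int client_version) {
    // Clients 7.0.28 and newer are always told to ignore their duration
    // correction factor; there is no config.xml switch to turn this off.
    if (client_version >= 70028) {
        fprintf(fout, "    <dont_use_dcf/>\n");
    }
}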
24) Message boards : News : Project server code update (Message 113165)
Posted 29 Jun 2014 by jason_gee
Post:
I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.

Doesn't the Main project have this adjustment because they have a single DCF there? But we don't use DCF here, so this adjustment shouldn't be used?

Claggy


There is no adjustment, the adjustment is a lie. <dont_use_dcf> is hard wired active for all clients >= 7.0.28.
25) Message boards : News : Project server code update (Message 113164)
Posted 29 Jun 2014 by jason_gee
Post:
From treblehit's server log https://albert.phys.uwm.edu/host_sched_logs/11/11519

2014-06-29 09:21:30.4581 [PID=3880 ] [version] [AV#738] (BRP5-opencl-ati) adjusting projected flops based on PFC avg: 16250.85G
2014-06-29 09:21:30.4581 [PID=3880 ] [version] Best app version is now AV738 (0.89 GFLOP)
2014-06-29 09:21:30.4581 [PID=3880 ] [version] [AV#738] (BRP5-opencl-ati) adjusting projected flops based on PFC avg: 16250.85G
2014-06-29 09:21:30.4581 [PID=3880 ] [version] Best version of app einsteinbinary_BRP5 is [AV#738] (16250.85 GFLOPS)

I do think we ought to try and work out exactly where those figures come from. As with the numbers Claggy and I saw right at the beginning of this thread, they are vastly higher than any known 'peak FLOPs' value calculated and displayed by the BOINC client for any known GPU. At the very most, that calculated speed (or some rule-of-thumb fraction of it) should be used as a sanity cap on the PFC avg number - once we've understood what PFC avg is in this context, and how it came to be that way.


Sure, first from the client perspective:
Referring to the dodgy diagram, factoring in the bad onramp-period default pfc_scale of 0.1 for GPUs and the inactive host_scale (x1) results in:

wu pfc ('peak flop claim') est = 0.1*1*wu_est (10% of minimum possible)
device peak_flops likely standard GPU ~20x actual rate (app, card & system dependent)
--> est about 1/200th of required elapsed --> bound exceeded
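
The same arithmetic as a minimal sketch (illustrative values only):

#include <cstdio>

// Back-of-envelope version of the onramp estimate above (illustrative values).
int main() {
    double wu_est       = 1.0;   // nominal work estimate (normalised)
    double pfc_scale    = 0.1;   // onramp default for GPUs: 10% efficiency assumed
    double host_scale   = 1.0;   // host scaling inactive during onramp
    double gpu_optimism = 20.0;  // peak_flops typically ~20x the real app rate

    double claim        = pfc_scale * host_scale * wu_est;  // pfc 'claim' = 0.1 * wu_est
    double est_fraction = claim / gpu_optimism;              // fraction of the real elapsed time
    printf("estimate is ~1/%.0f of the required elapsed time\n", 1.0 / est_fraction);
    return 0;
}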

Now digging through server end...
26) Message boards : News : Project server code update (Message 113160)
Posted 29 Jun 2014 by jason_gee
Post:
lol, yeah, all in a good cause though :) Obvious breakage like that makes the case put forward in some quarters, that it's working fine, look a tad on the ridiculous side. The more 'normal' situations like that that simply don't work, the better we understand, and the harder we can push to get it fixed once and for all.
27) Message boards : News : Project server code update (Message 113158)
Posted 29 Jun 2014 by jason_gee
Post:
Thanks,
I had hoped the new-host+app onramp for GPUs would improve, but see that it hasn't. I'm not surprised, given we know two precise mechanisms there: the default GPU efficiency pinned at 10% (0.1), and improperly applied normalisation (you can't normalise time estimates without a functional host_scale, which is disabled for the onramp period).

New user, host &/or application is central to this effort, so thanks again for the information. At this point you could either choose to jigger the bounds of the tasks (allowing it to reach the point where host_scale kicks in), or alternatively let it go on erroring & see what happens (I imagine it'd just keep erroring & reduce the quota to 1/day).

Both options have merit, so it's your choice, though I think the jiggering option has been pretty thoroughly used, and the second one is more likely in common usage cases. Up to you.

Jason
28) Message boards : News : Project server code update (Message 113146)
Posted 27 Jun 2014 by jason_gee
Post:
Starting to crunch with my hosts. Comparing the first crunched WUs against those already validated by jason's 780SC, my 780FTW host is apparently crunching the BRP4G WUs almost 20% faster, but it's receiving 2-3x more credit. I'm running 1 WU at a time on each GPU only. Theoretically I'd expect similar credit, or am I missing something?

https://albert.phys.uwm.edu/result.php?resultid=1514929
https://albert.phys.uwm.edu/result.php?resultid=1515731


Yes, existing CreditNew (no mods yet) with new app+host in all its glory. One of the big parts we're studying, because of its importance to keeping new users and applications or devices coming on-line.

That's the onramp period, as the system tries to establish how fast you're crunching. It doesn't do it very well, but at least you're getting high credit and giving me some. Thanks! :P

You will be crunching faster than me because I'm doing lots of stuff with my machine lately, and haven't tweaked anything.... also I only have an old Core2Duo driving it.
29) Message boards : News : Project server code update (Message 113143)
Posted 26 Jun 2014 by jason_gee
Post:
Once you're happy with that, there are other ways to simulate 'perfectly normal running conditions' that may induce similar divergent behaviour (or worse). One would be to downclock the GPU while Boinc's running (simulating a lower power state, driver timeout/failsafe, deliberate underclock, extended use of the GPU without suspending Boinc etc ...) I think a key takeaway is that the mechanism isn't really adaptive to reasonably normal variable running conditions.
30) Message boards : News : Project server code update (Message 113138)
Posted 25 Jun 2014 by jason_gee
Post:

I think that's further evidence of the kind of instability we need to cure.


Yes, local estimates need to be responsive to running conditions. It's unfortunate that the existing mechanism for that was disabled instead of completed/fixed.

Seti Beta deployed the Blunkit-based Optimised AP v7 yesterday; there the estimates are the other way round:
my Ubuntu 12.04 C2D T8100 took ~12 hours on its first WU, shame the estimates start at ~228 hours.

Claggy


LoL, do the results mix with traditional non-blankit versions? That's going to mess with cross-app normalisation bigtime.
31) Message boards : News : Project server code update (Message 113136)
Posted 25 Jun 2014 by jason_gee
Post:

I think that's further evidence of the kind of instability we need to cure.


Yes, local estimates need to be responsive to running conditions. It's unfortunate that the existing mechanism for that was disabled instead of completed/fixed.
32) Message boards : News : Project server code update (Message 113133)
Posted 23 Jun 2014 by jason_gee
Post:
No, that's OK, thanks. When you're dealing with control systems it's the intuition that counts, as control is a subjective, experience-based thing. With what we have we CAN already say there is apparent convergence (after a long time), which is good enough for computers, but not all that crash hot for human perception/intuition.

It's quite acceptable for the mechanism to be applying some small offset either way, to cover some little-understood phenomena... that's why it's a control system and not a fixed knob.

As we're dealing with human concepts, the keys will be improvements in convergence and noise, which are the things that are failing projects and users in this GPU-only app example.

[Edit:] if we get that 'right', plus the CPU coarse scaling issues, then the projects that cross-normalise should also see stabilisation (at an intuitive level)
33) Message boards : News : Project server code update (Message 113131)
Posted 23 Jun 2014 by jason_gee
Post:
Has the median for your results there, restricted to the second half of the month, settled to around 2340? If so, then the global (app_version) pfc_scale for the GPU app has settled roughly where expected. That indicates the GPU portion is not normalised against a CPU version, and that the 300-400 variation is likely the remaining host_scale and averaging instabilities.

Without the CPU app's ~2.25x (down)scaling attractor, I feel the level is reasonable/correct, though (IMO) a month is far too long to dial in a new app version from too wide a start, and the noise is unnecessary & induced.

extra notes & predictions before going forward:
In this context (CPU-app free, and critically *AFTER* the app_version scale has converged), the *correct* claim will be the lower of the two, as David had implemented, with the averaging between the two claims adding noise. With damping of the scales, the awards would become a relatively smooth curve. That final wingman average (IIRC Eric added it) is critically important elsewhere, to mitigate the underclaim induced by the CPU downscaling error. With that, the upward spread is overclaim by the high-pfc hosts.
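
For clarity, the two claim-combination policies mentioned (lower-of-two versus wingman average) in sketch form; purely illustrative:

#include <algorithm>

// The "lower of the two" policy as originally implemented.
double claim_lower(double claim_a, double claim_b) {
    return std::min(claim_a, claim_b);
}

// The final wingman average added later; averaging a high and a low
// claim is what injects the extra noise discussed above.
double claim_average(double claim_a, double claim_b) {
    return 0.5 * (claim_a + claim_b);
}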

Summarising: we should be able to confidently address the coarse scaling errors on both CPU and initial GPU, which will speed new-application and new-host convergence. We should be able to remove the bulk of the noise and remove the sensitivity to the initial project estimate, which combined will address all of the main concerns people have with the current implementation, from both user and project perspectives. So once the weather settles a bit here in Oz (and I've cleaned up the royal mess in the yard), it'll be time to roll up the sleeves & get patching.

At this point I doubt any new major observations will jump out, though I'd like to keep an eye on things while code-digging in the background. Unstable is as unstable does, and things can jump out and surprise us, though since that data appears to characterise all the known issues well, then IMO we have a good baseline to improve on.
34) Message boards : News : Project server code update (Message 113120)
Posted 20 Jun 2014 by jason_gee
Post:
Additional notes.

Cherry-picking.
It is very difficult to prevent it selectively, as it can be done not only by aborting, but also by killing the process via task manager. And aborting could really be cherry-picking, or:
- missed deadlines (could be sorted),
- unexpected reasons (HW fault of host...),
- end of a challenge (Pentathlon, PG challenge series - PG explicitly asks for aborting unneeded tasks),
- overestimated fpops and the consequent prevention of panic mode...
Lowering the daily quota (according to the actual formula N = N - n_of_errored_tasks) is not enough to prevent it, because it can simply be bypassed by the occasional finished task.
It is mission impossible, from my POV.

Regulation process.
Reaching the asymptote can be accelerated using granted-credit boundaries (independently of rsc_fpops_est) on the validator side for a sort/batch of WUs, if it makes sense. Yes, it is additional work for developers and administrators, and could be partially automated using a "reference" machine. On the other hand, it could theoretically be the wrong way, because it gives two regulation parameters for the same quantity.
I feel you did not impose credit bounds, so as to see the design or implementation flaws in the raw algorithm.



Yep, there are fixed bounds in play, and the 'safeties' are being tripped as new apps come online, actually introducing more problems (like a car airbag that goes off at the slightest bump, and then has a spike mounted on the steering wheel behind it).

Part of the cause of that is the gross scaling error during the onramp period. Further looks, and decisions on where the bounds should really be set, are needed after the gross scaling errors in CPU and GPU are improved. It appears that they may be too tight even with good initial scaling, because of the diversity in applications and how people use their machines... Basically the system seems to assume dedicated crunching to some degree, which is quite a false assumption. Then you have the multiple-tasks-per-GPU situation, which is not factored in anywhere.

The reason some of these weaknesses are known is in part that I experienced none of the client side failsafes, because I use a modified client where I widened them. I could see the scales hard-lock to max limits and still give estimates too short. This means the initial server side scale being applied was way out of whack in comparison to the estimate of GPU speed.

It turns out one assumption, that GPUs run at 10% of their peak flops, is commented in the server code as being 'generous'. In fact, even for the common case of 1 GPU task per GPU, this is not generous, but results in estimates that are too short; combined with the initial app scalings it results in estimates divided by ~300-1000x... (common scaling case 0.05x0.05=0.0025 -> 1/400) before even considering the multiple-tasks-per-GPU case. Double application of scales that overlap there is also a problem, easily remedied.
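
The failsafe being tripped is essentially the elapsed-time bound derived from rsc_fpops_bound; a minimal sketch of the assumed form (not the literal client code):

// When the server's projected_flops is far too high, rsc_fpops_bound
// divided by that rate yields a wall-clock limit much shorter than the
// real runtime, and the task errors out as "time limit exceeded" even
// though nothing is actually wrong.
bool time_limit_exceeded(double elapsed_secs, double projected_flops,
                         double rsc_fpops_bound) {
    double max_elapsed = rsc_fpops_bound / projected_flops;
    return elapsed_secs > max_elapsed;
}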

So that's how the initial scalings, the stability, and the possibly overtight failsafe margins interact, and addressing the first two issues should give a clear indication of where the third should be adjusted into safe limits.
35) Message boards : News : Project server code update (Message 113117)
Posted 20 Jun 2014 by jason_gee
Post:
Yep, the same concepts of feedback, slew rates, over/undershoot, and damping/oscillation apply just the same.

On the cherry-picking, I think it takes all sorts to make the world work, and some would consider that cheating, and others a useful tool. I tend to think it's not that black and white, and that exploits usually point to deeper design flaws. As those flaws affect other things, then they should be fixed for those reasons, and the cherry-pickers can move to look for more flaws ;)

Yeah with proper scaling & normalisation, some things become obvious that are not so obvious when buried in noise. Some of that can be exploits (and so design or implementation flaws), and some can be legitimate special situations. The Jackpot situation found here was an unexpected weakness, and something that will have to be examined closely as the system is stabilised.

Thanks for the input, I appreciate bouncing the ideas back and forwards a lot.
36) Message boards : News : Project server code update (Message 113115)
Posted 19 Jun 2014 by jason_gee
Post:
Thanks nenym,

Yes, some of the objectives will not be as clear here straight away, mostly because of the consistency of the tasks, and because until now it's been fixed credit. It's those same features that make this a great sandbox to put the system under the microscope.

More detailed objectives & experimental procedure are being drafted, but very short overview is this:

Time estimates and credit as estimate of 'work':
- On projects with multiple applications, like at Seti, there exists a mismatch between applications which can lead to hoarding and, in some cases, mass aborting to juggle work & cherry-pick. The intent written into the system basically says that should not happen as much as it does (about a 2x discrepancy)
- +/- 30%+ makes RAC or credit pretty useless (IMO and some others) for its purpose (to users) of comparing host to host, hardware to hardware, application to application etc
- The randomness can and does upset bringing new hosts online, in particular when estimates start out really bad, such as here, where newly started GPU hosts hit time-exceeded errors. That's bad juju for retaining new users, and possibly for application development
- It also appears that the current system penalises optimisation.

So yes, on one hand it's easy to regard the credit system as academic and not critical for the science, but on the other it becomes critical for time estimation, which is key to scheduling from server side right through to the client.

That covers why I feel it has to be addressed.

As for why have a scaling credit system at all? In another direction, fixed estimates and credit make work for project developers (often with little funding) every time there is a major new application or platform. Something that dials in automatically saves effort and money in the long run (like cruise control).

As for whether such a chaotic system can be corrected? Yes, if you use control theory. Here is an example from CPU 'shorties' on a host at Seti Beta. Those tasks are all the same length, so they should really get similar credit etc. The smoother curve is one that used a simple 'PID controller' to replace the credit calculation. It was started deliberately off target so as to see how it settled. Note also that the CPU credit system there has a scale error, so for easy comparison the new smooth & correct line is divided by 3 (it would be up over 90 credits if not scaled down).



Here is the not scaled down version:


and here is a picture of a PID controller in mechanical form:
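
And in software form, a minimal PID sketch of the kind used for that smoother curve (hypothetical gains, not the actual patch):

// Minimal PID controller sketch (gains are hypothetical).
struct PID {
    double kp, ki, kd;          // proportional, integral, derivative gains
    double integral   = 0.0;
    double prev_error = 0.0;

    // target: where credit per task should settle; measured: what the raw
    // calculation would award this time; dt: one step per validated result.
    double step(double target, double measured, double dt) {
        double error      = target - measured;
        integral         += error * dt;
        double derivative = (error - prev_error) / dt;
        prev_error        = error;
        return kp * error + ki * integral + kd * derivative;
    }
};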
37) Message boards : News : Project server code update (Message 113113)
Posted 19 Jun 2014 by jason_gee
Post:
sample size is a bit low for good statistics but with THAT graph you can't expect a reasonable SD.
better not try plotting credit against runtime - at least not when you expect the linear regression to have any sensible R^2 value...


Yeah, David's using 10 validations and 100 for his average sample sets, the host and app-version scales. That's why the Nyquist limit kicks in to create artefacts when scales are adjusted with each validation: the frequency of change is higher than the Nyquist limit. Ideal would be continuous damped averages (a controller). Still musing whether to do that as a separate pass 2, or combine it with the CPU coarse scale correction. Probably easier to monitor/analyse the effects if separate, so I'll keep going along those lines.
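
By 'continuous damped averages' I mean something along these lines (illustrative smoothing constant):

// Exponentially weighted (damped) running average; each validation nudges
// the estimate a fraction of the way towards the new sample instead of
// recomputing over a small fixed window.
struct DampedAverage {
    double value  = 0.0;
    bool   primed = false;
    double alpha  = 0.1;   // smaller alpha = heavier damping / slower response

    void add_sample(double x) {
        if (!primed) { value = x; primed = true; return; }
        value += alpha * (x - value);
    }
};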

Off for a break, then back to more documentation for the patch passes. Hopefully skeletons will be ready for inspection soon.
38) Message boards : News : Project server code update (Message 113112)
Posted 19 Jun 2014 by jason_gee
Post:
Probably time for a stats show, then.

What happens, to the std deviations in particular, if you filter out the on-ramping convergence attempt period, of say 10-20 results' worth?

I think it would be difficult to define an on-ramp in this case. Zombie, in particular, has been crunching for ages - I'm not sure how come we started the whole new convergence with the server code upgrade (I'd have expected all the runtime averages to have been in play for a long time, just disguised by the project's fixed credit policy). As you'll have been seeing, there hasn't been much tweaking in this area of code since it was first deployed four years ago.

What I would like to say to David is that, with fixed-size workunits, and a runtime standard deviation of 20 seconds in 4,000, I expect credit with an SD of 5 in 1000.


Hah! Good idea, and there's the rub. Without damping those averages, it won't happen.

I felt the need for the number reset, to observe what any new application would do when installed, the point being that the number of platforms coming online is accelerating.

Take your own host (SD ~20 seconds -> variance 400 seconds -> 10%), then the same with your identical wingman. When averaging, Murphy says he'll be 10% high and you'll be 10% low, so now it's 20%. Then both your host scales factor in whatever percentage you use your hosts for other things, plus background tasks. That's a natural-world input, unfiltered/unconditioned (which is a no-no).
39) Message boards : News : Project server code update (Message 113108)
Posted 19 Jun 2014 by jason_gee
Post:
'Bubble graph'? What the heck does a bubble graph graph?
Yes, I see differently sized bubbles. What does the bubble size stand for?

seconds elapsed per credit, so smaller is better paying.

That one's less for technical value, more for the fractal art competition anyway ;)

Oh, you are plotting against n, not against N...

Sorry, my maths genes need a serious lie-down now.


*points Boinc bony fractal finger* everyone's an art critic :P
40) Message boards : News : Project server code update (Message 113105)
Posted 19 Jun 2014 by jason_gee
Post:
Probably time for a stats show, then.


What happens, to the std deviations in particular, if you filter out the on-ramping convergence attempt period, of say 10-20 results' worth?





This material is based upon work supported by the National Science Foundation (NSF) under Grant PHY-0555655 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2024 Bruce Allen for the LIGO Scientific Collaboration