Project server code update

jason_gee

Joined: 4 Jun 14
Posts: 109
Credit: 1,043,639
RAC: 0
Message 113184 - Posted: 29 Jun 2014, 13:03:37 UTC - in response to Message 113180.  
Last modified: 29 Jun 2014, 13:15:27 UTC

[quote]At least one of those must be upside down.[/quote]

In a sense, yes. GPU app+device+conditions efficiency would be actual/peak, and must be less than 1 (and it is: it should be around 0.05 for a single-task CUDA GPU, for example). Normalisation could be viewed as turning it upside down. It'll raise the GFlops & shrink the time estimate artificially --> the exact opposite of the kind of behaviour we want for new hosts/apps.

Things will become a bit clearer when I have the next dodgy diagram ready. Getting bogged down in broken code is a bit of a red herring at the moment, as there are design-level issues to tackle first.

In particular, debugging the normalisation, including the absurd GFlops numbers it produces, is pointless in the context of estimates. That's because neither the time nor the GFlops should be being normalised [AT ALL], so it all gets disabled in estimates and restricted to credit-related uses, where it's applicable for getting the same credit claims from different apps.
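
To put rough numbers on that inversion, a minimal Python sketch with made-up but representative values (an illustration only, not the actual server code):

    # Illustrative only: dividing by a small efficiency "normalises" a
    # GPU's speed back up towards peak and shrinks the time estimate.
    PEAK_GFLOPS = 500.0       # hypothetical GPU peak speed
    EFFICIENCY = 0.05         # actual/peak for a single-task CUDA app
    TASK_GFLOP = 450_000.0    # hypothetical flop content of one task

    actual_gflops = PEAK_GFLOPS * EFFICIENCY        # 25 GFLOPS really delivered
    sane_estimate = TASK_GFLOP / actual_gflops      # 18000 s, about right

    # Normalising divides by the efficiency, i.e. turns it upside down:
    normalised_gflops = actual_gflops / EFFICIENCY  # 500 GFLOPS, absurd
    bogus_estimate = TASK_GFLOP / normalised_gflops # 900 s, 20x too short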
On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. - C Babbage

Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113185 - Posted: 29 Jun 2014, 13:30:28 UTC - in response to Message 113184.  

[quote]In particular, debugging the normalisation, including the absurd GFlops numbers it produces, is pointless in the context of estimates. That's because neither the time nor the GFlops should be being normalised [AT ALL], so it all gets disabled in estimates and restricted to credit-related uses, where it's applicable for getting the same credit claims from different apps.[/quote]

Well, we do (crudely) have two separate cases to deal with.

1) initial attach. We have to get rid of that divide-by-almost-zero, or hosts can't run: they get an absurdly low runtime estimate/bound and error out when they exceed it (see the sketch after point 2 below).

2) steady state. In my (political) opinion, trying to bring back client-side DCF will be flogging one dead horse too many. We need some sort of server-side control of runtime estimates, so that client scheduling works and user expectations are met. I'm happy to accept that the new version will be different to the one we have now, and look forward to seeing it.
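
For case 1, the failure mode looks roughly like this, in illustrative Python rather than the real scheduler code (the names and the 0.01 floor are mine, not BOINC's):

    # Illustrative only: not BOINC's actual code or identifiers.
    def projected_flops(raw_flops, scale):
        # On a freshly attached host 'scale' can be ~0; dividing by it
        # inflates the speed to absurd GFlops levels. Flooring the
        # divisor is one crude guard.
        return raw_flops / max(scale, 0.01)

    def est_duration(fpops_est, flops):
        return fpops_est / flops    # seconds; tiny when flops is inflated

    def max_time(fpops_bound, flops):
        return fpops_bound / flops  # tasks are aborted past this bound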

OK, I'll get out of your hair, and take my coffee downstairs to grab some more stats.

jason_gee
Joined: 4 Jun 14
Posts: 109
Credit: 1,043,639
RAC: 0
Message 113186 - Posted: 29 Jun 2014, 13:37:12 UTC - in response to Message 113185.  

[quote]Well, we do (crudely) have two separate cases to deal with. [...] OK, I'll get out of your hair, and take my coffee downstairs to grab some more stats.[/quote]

LoL, always appreciate bouncing it around, thanks. At the moment it's a bit like pointing to a bucket of kittens and saying 'that's not the flower-pot I ordered!'. Yeah, it's possible to debate intent versus function further, but when push comes to shove it's just wrong & gives wacky numbers. Not really any more complicated than that, in some sense ;)
On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. - C Babbage

Snow Crash
Joined: 11 Aug 13
Posts: 10
Credit: 5,011,603
RAC: 0
Message 113187 - Posted: 29 Jun 2014, 18:32:40 UTC

June 29, 2014 18:00 UTC
[url]https://albert.phys.uwm.edu/show_host_detail.php?hostid=9649[/url]
BRP4G  2x using 1 cpu thread each (app_config), GPU utilization = 92%
       running an additional 4x Skynet POGs cpu WUs
GPU    7950 mem=1325, gpu=1150, pcie v2 x16
OS     Win7 x64 Home Premium
CPU    980X running at 3.41 GHz with HT off
MEM    Triple channel 1600 (7.7.7.20.2)

treblehit
Joined: 12 Mar 05
Posts: 5
Credit: 35,119
RAC: 0
Message 113188 - Posted: 29 Jun 2014, 20:01:58 UTC - in response to Message 113185.

[quote]1) initial attach. We have to get rid of that divide-by-almost-zero, or hosts can't run: they get an absurdly low runtime estimate/bound and error out when they exceed it.[/quote]

I'll be bringing more machines online today in a desperate attempt to provide steady, un-fiddled-with, untweaked, vanilla BRP4G work for you.

I just need instructions: A) let them fail so you can see that, or B) somehow prevent them from failing so that you have the reliable work-flow.

Instructions, please.

Bret

Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113189 - Posted: 29 Jun 2014, 20:22:36 UTC - in response to Message 113188.  

Um, if you don't mind, I think it might be best to wait a little time. The administrators on this project are based in Europe, and as you know Jason is ahead of our time-zone, in Australia. I think it might be better to wait 12 hours or so, until we have a chance to compare notes by email when the lab opens in the morning.

After all, we don't want to use up our entire supply of unattached new hosts in one hit, or else we won't have anything left to test Jason's patches with....

treblehit
Joined: 12 Mar 05
Posts: 5
Credit: 35,119
RAC: 0
Message 113190 - Posted: 29 Jun 2014, 23:39:59 UTC - in response to Message 113189.  

[quote]Um, if you don't mind, I think it might be best to wait a little time.[/quote]

I completely understand, Richard. I was reluctant to bring it up in the first place.

Unfortunately for me I have to deal with the hardware side of it when I can, so I'm going to cope with that today. I'll get it ready to connect remotely when you guys are ready for it.

Let me know. You both know how to find me when and if you want me.

In the meantime, I'm going to detach this host and go away to stop being a distraction.

I only started this because "She Who Must Be Obeyed" had indicated you guys needed a reliable and unchanging stream of BRP4G tasks over on the GPU User's Group team message board.


Bret

jason_gee
Joined: 4 Jun 14
Posts: 109
Credit: 1,043,639
RAC: 0
Message 113191 - Posted: 30 Jun 2014, 2:26:09 UTC - in response to Message 113189.  
Last modified: 30 Jun 2014, 2:30:37 UTC

[quote]After all, we don't want to use up our entire supply of unattached new hosts in one hit, or else we won't have anything left to test Jason's patches with....[/quote]


Yes, unhooking that normalisation (which divides by ~0.1, multiplies the GPU GFlops by ~10 to absurd levels, and shrinks time estimates) is going to take quite some preparation to do *safely*. That same mechanism is hooked into credit (where it does make sense), so quite a lot of backwards & forwards for clarification, discussion and debate will be needed to get it 'right', and part of that's going to be me communicating effectively (which isn't always easy :)).

The other aspect is that some bandaids will be painful to rip off, and still other odd artefacts might be hiding inside... and the only way to tell for sure is to open it up.

The next few days will tell if we're all on the same page (looking from different angles is fine, though). To me, we are well through the tricky bits of understanding the current system enough to say it needs to be a lot better.
On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. - C Babbage

Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113197 - Posted: 1 Jul 2014, 10:55:05 UTC

Latest scattergram.

I've reverted my 5367 to normal running (early afternoon yesterday), so my timings *should* be lower and steadier - doesn't really seem to show in credit yet. I wonder why Claggy's laptop gets such variable credit?

jason_gee
Joined: 4 Jun 14
Posts: 109
Credit: 1,043,639
RAC: 0
Message 113198 - Posted: 1 Jul 2014, 12:34:59 UTC - in response to Message 113197.  
Last modified: 1 Jul 2014, 13:02:43 UTC

[quote]I wonder why Claggy's laptop gets such variable credit?[/quote]


Multiple tasks on a smaller GPU, each running longer, will generate higher raw peak flop claims (pfc's), which are then averaged with the wingman's (the yellow triangle on the dodgy diagram). So the result can be anywhere from the normal range to the jackpot, as we previously assessed, depending on the wingman's claim. Though the prevalence of the jackpot conditions is less obvious, the noise in the system is still there.
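
Roughly, with made-up numbers in Python (an illustration of the averaging, not the actual credit code):

    # Illustrative only: a slower host running longer claims a bigger
    # raw peak flop count (pfc), and credit follows the averaged claim.
    def claimed_pfc(peak_flops, runtime_s):
        return peak_flops * runtime_s   # raw peak flop claim

    normal_wingman = claimed_pfc(1.5e12, 12_000)   # ~1.8e16 flops claimed
    slow_laptop = claimed_pfc(3.0e11, 80_000)      # ~2.4e16 flops claimed

    average_claim = (normal_wingman + slow_laptop) / 2
    # Pair two similar hosts and the average looks 'normal'; pair
    # mismatched ones and it can land anywhere up to the jackpot.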
On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. - C Babbage

Claggy
Joined: 29 Dec 06
Posts: 78
Credit: 4,040,969
RAC: 0
Message 113203 - Posted: 1 Jul 2014, 20:22:26 UTC - in response to Message 113198.  

[quote]Multiple tasks on a smaller GPU, each running longer, will generate higher raw peak flop claims (pfc's) [...][/quote]

I'm just running a single GPU task on both my GPU hosts (the T8100's 128MB 8400M GS doesn't count).

Claggy

jason_gee
Joined: 4 Jun 14
Posts: 109
Credit: 1,043,639
RAC: 0
Message 113206 - Posted: 2 Jul 2014, 3:11:12 UTC - in response to Message 113203.  
Last modified: 2 Jul 2014, 3:18:06 UTC

[quote]I'm just running a single GPU task on both my GPU hosts (the T8100's 128MB 8400M GS doesn't count).[/quote]


Could be the wingmen. (There's a number of combinations of wingman types that'll give random results between two regions. Two similar wingmen tend to cancel with averaging and become 'normal'.)
On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. - C Babbage

Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113207 - Posted: 2 Jul 2014, 10:26:08 UTC - in response to Message 113206.  

[quote]Could be the wingmen. (There's a number of combinations of wingman types that'll give random results between two regions. Two similar wingmen tend to cancel with averaging and become 'normal'.)[/quote]

Conversely, when he's paired with me - now back to lower, stable, runtimes - no jackpot, no bonus. Sorry 'bout that.

jason_gee
Joined: 4 Jun 14
Posts: 109
Credit: 1,043,639
RAC: 0
Message 113209 - Posted: 2 Jul 2014, 10:54:43 UTC - in response to Message 113207.  

[quote]Conversely, when he's paired with me - now back to lower, stable, runtimes - no jackpot, no bonus. Sorry 'bout that.[/quote]


LoL, yep, throwing the dice to get an answer is as good as any ;)
On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. - C Babbage

juan BFB
Joined: 10 Dec 12
Posts: 8
Credit: 1,674,320
RAC: 0
Message 113211 - Posted: 2 Jul 2014, 17:28:06 UTC
Last modified: 2 Jul 2014, 17:30:46 UTC

@Richard/Claggy

Should I continue to crunch BRP4G only, or do you suggest crunching another type of WU too? (I can only do GPU work here.)

BTW, I slowed down my crunchers here, since I don't believe quantity is what you're looking for; they will now produce a stable number of daily WUs.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113212 - Posted: 2 Jul 2014, 19:15:34 UTC - in response to Message 113211.  

[quote]BTW, I slowed down my crunchers here, since I don't believe quantity is what you're looking for; they will now produce a stable number of daily WUs.[/quote]

I think that's probably a good idea. We're already at the stage where my last 12 consecutive validations have been against one or other of your hosts (5 different machines, I think). And the machines are all pretty similar, to each other and to mine: GTX 670/690/780, running Win7/64 or (in one case) Server 2008.

In order to see (now) and test (later) BOINC's behaviour in the real world, we probably need a reasonable variation in hosts to give us realistic variation in the times and credits.

Bernd has launched a new 'BRP5' (Perseus Arm Survey) v1.40, with a Beta app tag on it, to test that new feature in the BOINC scheduler. I'm in the process of switching my machine over to run that instead. Some company would be nice, but be warned: we're half expecting to fall over the 'EXIT_TIME_LIMIT_EXCEEDED' problem at some stage with BRP5 Beta, so hosts running it probably need to be watched quite closely for strange estimated runtimes, and you need to be ready to take action to correct it.

Holmis
Joined: 4 Jan 05
Posts: 104
Credit: 2,104,736
RAC: 0
Message 113213 - Posted: 2 Jul 2014, 19:51:55 UTC - in response to Message 113212.  
Last modified: 2 Jul 2014, 19:52:14 UTC

[quote]... some company would be nice, but be warned: we're half expecting to fall over the 'EXIT_TIME_LIMIT_EXCEEDED' problem at some stage with BRP5 Beta...[/quote]

I just downloaded my first v1.40 BRP5 task and I'd say it's looking pretty good so far! The estimated completion time shown in BOINC is 5h03m08s.
These are the relevant lines from the scheduler log:

2014-07-02 19:35:03.2067 [PID=25783] [version] Best version of app einsteinbinary_BRP5 is [AV#934] (24.74 GFLOPS)
2014-07-02 19:35:03.2067 [PID=25783] [send] est delay 0, skipping deadline check
2014-07-02 19:35:03.2067 [PID=25783] [version] get_app_version(): getting app version for WU#625766 (PB0020_006A1_164) appid:27
2014-07-02 19:35:03.2067 [PID=25783] [version] returning cached version: [AV#934]
2014-07-02 19:35:03.2067 [PID=25783] [send] est delay 0, skipping deadline check
2014-07-02 19:35:03.3000 [PID=25783] [send] Sending app_version einsteinbinary_BRP5 2 140 BRP5-cuda32-nv301; projected 24.74 GFLOPS
2014-07-02 19:35:03.3001 [PID=25783] [send] est. duration for WU 625766: unscaled 18188.26 scaled 18306.56
2014-07-02 19:35:03.3001 [PID=25783] [send] [HOST#2267] sending [RESULT#1514790 PB0020_006A1_164_4] (est. dur. 18306.56s (5h05m06s55)) (max time 363765.12s (101h02m45s11))

And I've got this in the application details:

Binary Radio Pulsar Search (Perseus Arm Survey) 1.40 windows_intelx86 (BRP5-cuda32-nv301)
Number of tasks completed   0
Max tasks per day	    0
Number of tasks today	    1
Consecutive valid tasks	    0
Average turnaround time	    0.00 days

For v1.39 the tasks took less than 5 hours and the APR was 21.91 GFlops.
Whatever was changed seems to be working with regard to the initial estimates, assuming the app and workload are more or less the same. Keep up the good work!
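
As a rough cross-check of the log above (working backwards in Python, so the task's flop count is inferred rather than read from the server):

    # The unscaled estimate is just the task's flop content divided by
    # the projected speed, so the flop estimate must be about:
    projected_gflops = 24.74
    unscaled_est_s = 18188.26
    rsc_fpops_est = unscaled_est_s * projected_gflops * 1e9  # ~4.5e14 flops

    # And the max-time bound in the log is exactly 20x the estimate:
    print(363765.12 / 18188.26)   # ~20.0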

Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113214 - Posted: 2 Jul 2014, 20:03:50 UTC - in response to Message 113213.  

Nothing's been changed yet...

I got something similar - 25.25 GFlops and 4h57m02s24:

2014-07-02 17:43:24.7141 [PID=19995] [version] [AV#934] (BRP5-cuda32-nv301) using conservative projected flops: 25.25G
2014-07-02 17:43:24.7141 [PID=19995] [version] Best app version is now AV934 (102.01 GFLOP)
2014-07-02 17:43:24.7142 [PID=19995] [version] Checking plan class 'BRP5-opencl-ati'
2014-07-02 17:43:24.7142 [PID=19995] [version] plan_class_spec: parsed project prefs setting 'gpu_util_brp' : true : 0.480000
2014-07-02 17:43:24.7142 [PID=19995] [version] plan_class_spec: No AMD GPUs found
2014-07-02 17:43:24.7142 [PID=19995] [version] [AV#937] app_plan() returned false
2014-07-02 17:43:24.7142 [PID=19995] [version] Checking plan class 'BRP5-opencl-intel_gpu'
2014-07-02 17:43:24.7142 [PID=19995] [version] plan_class_spec: parsed project prefs setting 'gpu_util_brp' : true : 0.480000
2014-07-02 17:43:24.7142 [PID=19995] [version] [AV#935] Skipping Intel GPU version - user prefs say no Intel GPU
2014-07-02 17:43:24.7142 [PID=19995] [version] [AV#934] (BRP5-cuda32-nv301) using conservative projected flops: 25.25G
2014-07-02 17:43:24.7142 [PID=19995] [version] Best version of app einsteinbinary_BRP5 is [AV#934] (25.25 GFLOPS)
2014-07-02 17:43:24.7142 [PID=19995] [send] est delay 0, skipping deadline check
2014-07-02 17:43:24.7142 [PID=19995] [version] get_app_version(): getting app version for WU#625736 (PB0020_006A1_104) appid:27
2014-07-02 17:43:24.7143 [PID=19995] [version] returning cached version: [AV#934]
2014-07-02 17:43:24.7143 [PID=19995] [send] est delay 0, skipping deadline check
2014-07-02 17:43:24.7197 [PID=19995] [send] Sending app_version einsteinbinary_BRP5 2 140 BRP5-cuda32-nv301; projected 25.25 GFLOPS
2014-07-02 17:43:24.7198 [PID=19995] [send] est. duration for WU 625736: unscaled 17819.43 scaled 17822.25
2014-07-02 17:43:24.7198 [PID=19995] [send] [HOST#5367] sending [RESULT#1523511 PB0020_006A1_104_6] (est. dur. 17822.25s (4h57m02s24)) (max time 356388.68s (98h59m48s67))

But note the line I've picked out, 'using conservative projected flops': that means there are fewer than 100 completed tasks for this app_version yet, across the project as a whole.

The worry is that when 100 tasks have been completed, but before you have completed 11 tasks on your host (needed to use APR), you'll see 'adjusting projected flops based on PFC avg' and some absurdly large number. That'll be when the errors (if any) start.
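
In outline, the progression looks something like this (illustrative Python; the names and structure are mine, not actual BOINC identifiers, with the 100 and 11 thresholds as above):

    # Illustrative only: which speed figure drives the runtime estimate.
    def speed_estimate(av_completed, host_completed,
                       conservative, pfc_based, host_apr):
        if av_completed < 100:    # app version still young project-wide
            return conservative   # "using conservative projected flops"
        if host_completed < 11:   # host has no usable APR yet
            return pfc_based      # "adjusting projected flops based on PFC avg"
        return host_apr           # host's own Average Processing Rate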

Holmis
Joined: 4 Jan 05
Posts: 104
Credit: 2,104,736
RAC: 0
Message 113215 - Posted: 2 Jul 2014, 20:12:52 UTC - in response to Message 113214.  

Roger that, will keep a close watch on things until I've completed my first 11 tasks then.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 450
Credit: 5,409,572
RAC: 0
Message 113217 - Posted: 2 Jul 2014, 21:04:15 UTC

Well, here's the first conundrum:

All Binary Radio Pulsar Search (Perseus Arm Survey) tasks for computer 5367

After 200 minutes of solid GTX 670 work on Perseus, I earn the princely sum of ... 15 credits!