OSG

OSG Docdb

2007

Version 2.0


Document Name

OSG Resource and Service Validation Phase III Project Plan

Authors

Arvind Gopu, ???


Objective

This is the project plan for the third phase (version 3) of the RSV project.

The goal of the OSG Resource and Service Validation (RSV) project is to provide OSG site administrators, users and support personnel with dependable information about the state and capabilities of the resources accessible to, and the Grid services provided on, the OSG. An additional goal is to deliver this information to global ATLAS and CMS through the Worldwide LHC Computing Grid (WLCG), as agreed to in the OSG Project Plan and WBS.
Table of Contents


1 Overview

1.1 Background/Review of Current Work

1.2 RSV Phase III

1.3 Schematic of RSV Infrastructure (current and future work)

2 List of Goals for RSV Phase III

2.1 Outline of Goals, Timeline and Effort

2.2 Description of tasks outlined above

2.2.1 Define and implement version 3 of probe set

2.2.2 Implement mechanism for on-demand execution of probes

2.2.3 Implement built-in mechanism to update probes and core modules

2.2.4 Enhance initial versions of OSG availability calculation algorithms and corresponding applications

2.2.5 Enhance initial version of web interface for OSG resource and service availability

2.2.6 Complete VORS replacement (and initiate VORS deprecation)

2.2.7 Further improve configuration steps to be more site-administrator friendly

2.2.8 Generalized Nagios Wrapper to run RSV probes within Nagios

3 Deployment

4 Exploratory Tasks planned in RSV Phase III

4.1 Enhance RSV-SAM transport system to be more robust

4.2 Explore further integration of RSV with OIM system and/or Operations dashboard a.k.a MyOSG

5 References





1 Overview


The RSV project will continue to extend the existing infrastructure and build new capabilities for deployment and operation on the OSG. The aim is to provide OSG staff, resources, services, VO administrators, partner grid infrastructures and users with dependable information about the state and capabilities of the OSG and of the resources accessible from the infrastructure. This work builds on the OSG RSV project phases I and II discussed in [1], [4].

1.1 Background/Review of Current Work


A schematic of the RSV infrastructure is shown in Figure 1. The items listed in the project plans for phases I and II of RSV are mostly complete.

The status of the first two phases of the RSV project is summarized below:



  • Version 2 of the RSV probe set is available. The latest probe tarball version is 2.3.5, and includes __ probes that test various elements of OSG CE, __ probes that test various elements of OSG SEs [3].

  • A Condor-cron based scheduling infrastructure is available as part of OSG 1.0.0.

    • Any site administrator who wants to install RSV monitoring separately from their CE(s) can also do so using version 1 of the package available through the VDT 1.10.1 software distribution; a single monitoring host can be used to monitor multiple CEs.

  • All OSG resources running OSG CE 1.0.0 are able to run RSV monitoring and to upload RSV test results to a central RSV collector database using a mechanism provided by the Gratia team; the collector and the central RSV database are maintained by the GOC.

  • RSV results collected from OSG resources that are part of the WLCG inter-operability agreement are uploaded from the central RSV collector to the WLCG SAM system using an automated uploading mechanism (also maintained by the GOC).



1.2 RSV Phase III


Phase III of the RSV project will build on the work done in the previous two phases. The main goals include: adding more probes to the existing probe set, including more security probes as well as probes that test worker nodes; improving the existing web interface to RSV data stored in the central database; making the configuration steps more site-administrator friendly by using configuration details stored in the OSG-wide config.ini; and so forth. [AG: Need more??]

1.3 Schematic of RSV Infrastructure (current and future work)





2 List of Goals for RSV Phase III

2.1 Outline of Goals, Timeline and Effort


The tasks to be completed for RSV phase III are listed in the table below with expected timelines and responsible parties. The tasks are expanded on in section 2.2. Future phases of RSV (versions) will likely include more probes, some possibly contributed by collaborators, bug fixes, and performance enhancements as necessary. A project plan for future phase(s), as needed, will be submitted towards the end of this phase.


| Task | Start Date | Expected Finish Date | Responsible |
|---|---|---|---|
| Define version 3 of Probe set including major updates to version 2 probes | | 11/7/2008 | RSV dev team, Karthik Arunachalam, and OSG security group |
| Implement probes (version 3 set defined above) | | 12/5/2008 | Soichi Hayashi, Karthik Arunachalam, Anand Padmanaban and other security group members, Arvind Gopu |
| Implement mechanism to allow administrators to execute one or more probes manually on demand | | 1/9/2009 | Arvind Gopu, Soichi Hayashi, Scot Kronenfeld |
| Implement built-in mechanism to update probes and core | | | Arvind Gopu, Soichi Hayashi, Scot Kronenfeld? |
| Further improve configuration steps to be more site-administrator friendly; possibly use config.ini type config file; Auto-reconfigure on restart | | | Scot Kronenfeld, Arvind Gopu, Soichi Hayashi, Suchandra Thapa |
| Enhance initial versions of OSG availability calculation algorithms and corresponding applications | | | Soichi Hayashi, Brian Bockelman, Arvind Gopu |
| Enhance initial version of web interface for OSG resource and service availability | | | Soichi Hayashi, Arvind Gopu |
| Complete VORS replacement | | 12/12/2008 | Soichi Hayashi, Arvind Gopu |
| Central GOC Monitoring; including generalized Nagios Wrapper to run critical RSV probes within Nagios | | | Arvind Gopu, Soichi Hayashi, Tom Lee |
| Exploratory Tasks | | | |
| Explore integration of RSV with OIM system and Operations dashboard a.k.a MyOSG | | | Soichi Hayashi, Arvind Gopu |
| Install messaging broker at GOC level to make SAM transport more robust and allow bi-directional traffic of records (from/to EGEE) | | | Arvind Gopu, Soichi Hayashi |

2.2 Description of tasks outlined above

2.2.1 Define and implement version 3 of probe set


Version 3 of the RSV probe set will include some new probes as well as updates to existing probes based on feedback from users and/or administrators. Highlights include: enhanced security metric probes that separate site-level CA/CRL tests from central CA/CRL tests for the OSG security group; a few probes that will enable the GOC to centrally monitor critical services on OSG resources; and some prototype performance metric probes that will help test whether the existing infrastructure, based on the current WLCG probe specification [2], works for non-status metrics (carry-over items from phase II). Performance metrics will be calculated based on various measurements gathered from a resource. As before, probes will be created using the Grid Monitoring Probes specifications defined at [2], and the previous practice of placing documentation for each probe on a public web site [8] will be continued.


| Probe | Possible Source (if similar test already exists) |
|---|---|
| Enhanced Site level CA/CRL metrics (See section 2.2.1.1.1) | RSV v2 probe set |
| CAs supported by a resource (for VO managers’ benefit; see section 2.2.1.1.1) | |
| Central CA/CRL metrics (See section 2.2.1.1.2) | New/Existing security team scripts |
| Proxy warning metric (for site administrators’ benefit) | New |
| Print remote environment (See section 2.2.1.3) | |
| Enhanced directory status metrics (that check for disk space, etc.) | |
| Job manager metric that tests worker node functionality (required for WLCG interop MOU conformance) | New |
| VOMS Monitoring probes (for GOC/security team monitoring) | New |
| GUMS Monitoring probe (for GOC/security team monitoring) | New |
| Two to three performance metric probes, possibly including tests for gridFTP and SE performance | New |

2.2.1.1 Enhanced CA/CRL metrics



Background:
The current CA metric (considered an availability metric) has undergone a couple of major revisions in RSV phases I and II:

  • The first version of the metric ran openssl expiry-date checks on every CA file (*.0) on a CE. This penalized a site administrator if a CA was lax in replacing its expired certificate.

  • The second version tried to address this by comparing the version number of the installed CA package, instead of inspecting each CA, against a centrally published version number.

  • The third major revision accounted for the recent migration of OSG-blessed CA certificates to the GOC; this version still checks for version number but figures out which CA distribution (GOC or VDT) is in use before doing so.

The current metric has multiple weaknesses:

  • The most obvious issue with comparing the package version number of the CA certificates is that this only works on systems that use the vdt-update-certs mechanism to get and update their CAs. It does not work on sites using RPMs/YUM or other methods to get and update their CAs.

  • The workaround some site administrators have put in place is to install a fake version file that looks like the one the VDT uses. This is flawed because an administrator can easily make the metric report the latest CAs without actually having the updated CAs installed.

The current CRL metric runs openssl expiry checks on each CRL (*.r0) file. It used to be considered an availability metric but currently is not; it is likely to be deemed one again in the near future to comply with the WLCG MOU. This metric has a known problem:

  • While the metric checks the expiry date, i.e. the nextUpdate date of a CRL, that does not necessarily cover the security requirements when a CA releases a new CRL before the nextUpdate date. In such cases, the metric might falsely report that the CRL is up to date.
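The gap described above can be illustrated with a minimal sketch. It assumes the thisUpdate and nextUpdate dates of the local CRL, and the thisUpdate date of an authoritative copy, are already available as dates; the function names and the comparison against an authoritative copy are illustrative assumptions, not the existing metric's implementation.

```python
from datetime import datetime

def expiry_only_check(next_update, now):
    """Current-style check: pass as long as nextUpdate is still in the future."""
    return "OK" if now < next_update else "CRITICAL"

def freshness_check(local_this_update, authoritative_this_update):
    """Hypothetical stricter check: the local CRL must be at least as recent
    as the copy most recently issued by the CA."""
    if local_this_update >= authoritative_this_update:
        return "OK"
    return "CRITICAL"

now = datetime(2008, 11, 10)
local_this_update = datetime(2008, 11, 1)
local_next_update = datetime(2008, 12, 1)
# The CA issued a new CRL early, before the old nextUpdate date.
authoritative_this_update = datetime(2008, 11, 8)

print(expiry_only_check(local_next_update, now))
print(freshness_check(local_this_update, authoritative_this_update))
```

In this example the expiry-only check passes while the stricter comparison flags the stale CRL.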



2.2.1.1.1 Proposed Site level metrics

Proposed CA-Certificate metric (Availability metric: Yes)

The proposed metric will get openssl expiry date information for all the available CAs on a CE; this requires a site administrator to keep track of which CAs are on their resource, which we believe is fair. The metric will then compare those dates to the corresponding expiry dates published by a central web interface for each OSG-approved CA, and publish critical/warning results only if the expiry date of the site's CA is older than the expected expiry date. The proposed metric thus runs a more concrete check on each CA on a site's resource while not unfairly penalizing a site for a lazy CA, etc.

The GOC is willing to host the above central web interface that publishes expected CA expiry dates, one that can be automated to pick up the dates from an unpacked tarball provided by the OSG security group.

The above discussion does not address how we will handle CAs provided by non GOC sources and/or CAs a site might have but are not part of the OSG recommended list. In the former case, the non-GOC source, say VDT, could publish expiry dates similar to the GOC publisher. In the latter case, we propose to produce a critical result for an expired CA that is not known to the OSG security group but is installed on a site’s resource.
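The comparison logic proposed in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the CA identifiers, the dictionary layout, and the choice to report CRITICAL rather than WARNING are hypothetical, and reporting OK for an unknown CA that has not expired is an assumption the plan does not settle.

```python
from datetime import datetime

def ca_metric_status(site_expiry, expected_expiry, now):
    """Return one status per CA found on the site.

    site_expiry:     {ca_id: expiry date read from the site's CA file}
    expected_expiry: {ca_id: expiry date published centrally}
    """
    results = {}
    for ca_id, site_date in site_expiry.items():
        expected = expected_expiry.get(ca_id)
        if expected is None:
            # CA unknown to the central publisher: flag it only if expired.
            results[ca_id] = "CRITICAL" if site_date < now else "OK"
        elif site_date < expected:
            # The site holds an older certificate than the published one.
            results[ca_id] = "CRITICAL"
        else:
            # Site matches the published CA, even if the CA itself is lax.
            results[ca_id] = "OK"
    return results

now = datetime(2008, 11, 1)
site = {
    "ca-a": datetime(2008, 6, 1),    # site has not picked up the renewal
    "ca-b": datetime(2008, 10, 1),   # expired, but so is the published copy
    "ca-c": datetime(2008, 1, 1),    # expired and not on the central list
}
expected = {
    "ca-a": datetime(2009, 6, 1),
    "ca-b": datetime(2008, 10, 1),
}
print(ca_metric_status(site, expected, now))
```

Here "ca-b" is reported as OK because the site matches what was published centrally, which is the case where a site should not be penalized for a lazy CA.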



Proposed CRL metric (Availability metric: No; but will likely be in the near future)

This proposed metric will work in ways almost completely identical to the above proposed CA-certificate metric.

The GOC does not host CRLs at this time. We need to establish who will provide the authoritative (expected) CRLs, and how they will be updated (preferably seamlessly) if they are hosted by the GOC. We believe the OSG security group is in the best position to provide the aforementioned CRLs.

Proposed CA list metric to help VOs track support of CAs by their sites

This is a new metric that will look at the list of CAs supported by a site's resource, and will compare that to a list of CAs the VO has opted to support. The VO-supported-CAs mapping is maintained in the OIM system. The VO manager is tasked with maintaining the mapping of the CAs they wish to support.

This metric will enable a VO to keep track of their sites, and what CAs they support.
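A minimal sketch of the list comparison described above is given below. The CA names are made up, and the shape of the result (which differences are reported, and how) is an assumption; the plan does not specify the metric's output.

```python
def compare_ca_lists(site_cas, vo_cas):
    """Compare CAs installed on a site's resource with CAs a VO supports."""
    site, vo = set(site_cas), set(vo_cas)
    return {
        # CAs the VO relies on that the site does not have installed.
        "missing_on_site": sorted(vo - site),
        # CAs present on the site that the VO has not opted to support.
        "not_supported_by_vo": sorted(site - vo),
    }

# Hypothetical inputs: the VO list would come from the OIM mapping.
site_cas = ["CA-Alpha", "CA-Beta", "CA-Gamma"]
vo_cas = ["CA-Alpha", "CA-Delta"]
print(compare_ca_lists(site_cas, vo_cas))
```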

2.2.1.1.2 Proposed GOC/OSG management level metrics



Proposed central-CA-certificate-watch metric (Deemed critical to security team)

This metric will essentially be a replica of the first version of the site-level CA-certificate metric. The GOC's central monitoring infrastructure will run openssl expiry date checks on all CAs that the OSG supports. This metric will enable the security group to stay alert to CAs that have expired or will expire soon, and act accordingly.



Proposed central-CRL-watch metric (Deemed critical to security team)

This metric will be identical to the above CA metric, except that it will be for CRLs.
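A minimal sketch of such a watch check follows. It assumes the dates are obtained with `openssl x509 -enddate -noout -in <file>` for CA certificates and `openssl crl -nextupdate -noout -in <file>` for CRLs, which print lines of the form `notAfter=...` and `nextUpdate=...`. The 30-day warning window and the function names are hypothetical, and the parsing assumes an English locale for month abbreviations.

```python
from datetime import datetime, timedelta

def parse_openssl_date(line):
    """Parse a line such as 'notAfter=Dec  5 12:00:00 2008 GMT'."""
    _, _, value = line.strip().partition("=")
    # Whitespace in the format matches one or more spaces in the input.
    return datetime.strptime(value, "%b %d %H:%M:%S %Y %Z")

def watch_status(expiry, now, warn_days=30):
    """Classify an expiry date: expired, expiring soon, or fine."""
    if expiry <= now:
        return "CRITICAL"
    if expiry - now <= timedelta(days=warn_days):
        return "WARNING"
    return "OK"

expiry = parse_openssl_date("notAfter=Dec  5 12:00:00 2008 GMT")
print(watch_status(expiry, datetime(2008, 11, 20)))
```

The same two functions would serve the CRL watch by feeding them the `nextUpdate=` line instead.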


2.2.2 Implement mechanism for on-demand execution of probes


The RSV team will implement a simple mechanism to allow for on-demand execution of one or more probes at any given time. This will augment the current Condor-cron based periodic probe scheduling system. When RSV indicates a problem with a certain component and the administrator has fixed the issue, the mechanism will allow them to propagate the latest metric results immediately, without waiting for the next scheduled probe run.

2.2.3 Implement built-in mechanism to update probes and core modules


The RSV team will implement a simple mechanism to retrieve updates to RSV components. We expect to implement a mechanism similar to the one used for CA-certificate updates by VDT. A prototype of this is already available. This will provide a way to pull bug fixes, etc. promptly, without depending on the usual software release cycles.

While the mechanism might be similar to the CA-certificate updater, the RSV team will make no recommendations on automating such an update process; that is left to site administrators to decide on.


2.2.4 Enhance initial versions of OSG availability calculation algorithms and corresponding applications


The RSV team and the measurements-metrics team will continue working on making the current availability calculation algorithms/implementations more robust. Currently, the measurements-metrics team has produced an application to calculate availability/reliability numbers for OSG resources using the algorithm devised by WLCG/EGEE. The GOC runs this application, and sends out daily availability/reliability reports [5]. The WLCG algorithm has known deficiencies, and does not always produce accurate numbers. The RSV team has also implemented an algorithm [6] that is more robust and accurate; they will continue to refine this algorithm/implementation.

2.2.5 Enhance initial version of web interface for OSG resource and service availability


In phase II, the RSV team implemented a simple-to-use web interface for viewing OSG resource status based on RSV records uploaded by the individual resources’ clients [6]. We will pursue further enhancements to this web interface based on user and site-administrator feedback.

2.2.6 Complete VORS replacement (and initiate VORS deprecation)


Several existing metric results from RSV phases I and II have been used to produce information similar to what the VORS tool [12] produces; for example, the VO support mapping [13]. The few remaining metrics that are not yet available will be implemented (as outlined in section 2.2.1), and then the RSV web interface will provide a complete set of replacements for all the functions provided by VORS.

2.2.7 Further improve configuration steps to be more site-administrator friendly


In phase II, the RSV team implemented improvements to the process of configuring RSV, including the integration of RSV configuration into the OSG-wide configuration [11]. Yet, with the growth of RSV’s monitoring scope, some site administrators still consider the process complicated. Additionally, changes to RSV configuration are not picked up on restart of the service; several site administrators have requested that we consider implementing automatic reconfiguration on service restart. The RSV team will attempt to simplify the configuration process, possibly by using a config file similar to the one used by the OSG-wide configuration process. Implementing such a change is expected to make it possible to reconfigure on each restart of the osg-rsv service.
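As a rough illustration of reading settings from a config.ini-style file, the sketch below uses Python's standard configparser module. The option names (`enabled`, `ce_hosts`) and host names are hypothetical and are not taken from the actual OSG configuration file; only the idea of an RSV section comes from [11].

```python
import configparser

# Hypothetical file content; real option names may differ.
SAMPLE = """
[RSV]
enabled = True
ce_hosts = ce1.example.edu, ce2.example.edu
"""

def load_rsv_settings(text):
    """Parse the RSV section of an ini-style configuration string."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["RSV"]
    hosts = section.get("ce_hosts", fallback="")
    return {
        "enabled": section.getboolean("enabled", fallback=False),
        "ce_hosts": [h.strip() for h in hosts.split(",") if h.strip()],
    }

print(load_rsv_settings(SAMPLE))
```

Calling a function like this each time the service starts is one way the settings could be re-read on restart.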

2.2.8 Generalized Nagios Wrapper to run RSV probes within Nagios


The OSG GOC plans to use a Nagios based instance of RSV to do critical monitoring on all OSG resources at a central monitoring level. Towards this goal, the RSV team will build a Nagios infrastructure including a wrapper that can run RSV probes within Nagios. The wrapper will provide two functions: It will be able to digest arguments, etc. in a format understood by Nagios, convert them to a format that RSV probes accept, and run the appropriate probes. The wrapper will also be able to take RSV probe output and convert it into a form accepted by Nagios for display. The RSV team will explore the possibility of building on existing related work done by OSG colleagues at Brookhaven National Laboratory and by the WLCG grid monitoring group.
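The output-conversion half of such a wrapper might look like the sketch below. It assumes probe output consists of `key: value` lines ending with an `EOT` marker and carrying `metricStatus` and `summaryData` fields, in the spirit of the probe specification [2]; the exact field names and the sample metric name are assumptions. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```python
NAGIOS_EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

def parse_probe_output(text):
    """Collect 'key: value' lines up to the EOT marker."""
    fields = {}
    for line in text.splitlines():
        if line.strip() == "EOT":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def to_nagios(text):
    """Return (exit_code, status_line) in the form Nagios expects."""
    fields = parse_probe_output(text)
    status = fields.get("metricStatus", "UNKNOWN").upper()
    if status not in NAGIOS_EXIT_CODES:
        status = "UNKNOWN"
    summary = fields.get("summaryData", "no summary provided")
    return NAGIOS_EXIT_CODES[status], f"{status} - {summary}"

# Hypothetical probe output used only for illustration.
sample = """metricName: org.example.sample-metric
metricStatus: CRITICAL
timestamp: 2008-11-07T12:00:00Z
summaryData: CA certificate expired
EOT
"""
print(to_nagios(sample))
```

A real wrapper would also translate Nagios-style arguments into probe arguments and exit with the returned code; that half is omitted here.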

3 Deployment


Deployment of RSV phase III will be accomplished according to the timetable in section 2.1. The GOC will also have in place a Nagios based infrastructure to run a minimal yet critical set of RSV tests on all OSG resources.

4 Exploratory Tasks planned in RSV Phase III

4.1 Enhance RSV-SAM transport system to be more robust


As stated in previous project plans, RSV records received from individual clients running on OSG resources that are part of the WLCG MoU are uploaded to the WLCG SAM system [7]. While this transport works fairly reliably on the OSG end, the RSV team has concluded that installing a messaging broker at the GOC will make the system more robust while also allowing bi-directional transport of metric results, i.e. OSG users will be able to view the metric status of EGEE resources that are part of the WLCG MoU without depending on EGEE tools/interfaces.

4.2 Explore further integration of RSV with OIM system and/or Operations dashboard a.k.a MyOSG


We plan to further explore integration between the OIM system and RSV.

5 References


[1] OSG RSV Project: Phase 1 project plan: http://osg-docdb.opensciencegrid.org/cgi-bin/ShowDocument?docid=579

[2] WLCG Grid Monitoring Standards: https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringProbeStandard

[3] Latest RSV Probe set: http://rsv.grid.iu.edu/downloads/OSG_RSV_Probes-2.3.5.tar.gz

[4] OSG RSV Phase II project plan: http://osg-docdb.opensciencegrid.org/cgi-bin/ShowDocument?docid=731

[5] RSV Reporting Overview: https://twiki.grid.iu.edu/bin/view/MeasurementsAndMetrics/RsvReportsOverview

[6] RSV Resource Status (Central): http://rsv.grid.iu.edu/resources

[7] Links to RSV, SAM and GridView Summaries: https://twiki.grid.iu.edu/bin/view/Operations/RsvSAMGridView

[8] RSV Probes Help Pages: http://rsv.grid.iu.edu/documentation/help/

[9] WLCG Availability Algorithm: https://twiki.grid.iu.edu/pub/Operations/RSVPeriodicReporting/Gridview_Service_Availability_Computation-1.pdf

[10] RSV Availability Calculation Algorithm: [need reference from Soichi here]

[11] OSG-wide Configuration Process: https://twiki.grid.iu.edu/bin/view/ReleaseDocumentation/ConfigurationFileHelp#Introduction_to_RSV_Section

[12] VORS Web Interface: http://vors.grid.iu.edu/

[13] RSV based VO-resource Mapping Viewer: http://rsv.grid.iu.edu/vo

