Friday, December 18, 2009

To wrap up the year, we have GFAL/lcg_util v1.11.13, FTS 2.2.4 and DPM/LFC v1.7.4.

Details are in the data management release notes.

This FTS snapshot already has a secure preview of the administrative web interface:

Friday, December 11, 2009

GFAL/lcg_util in SVN

GFAL and lcg_util are the first candidates for migration to SVN.

The new repository is available and I have managed to run remote ETICS builds on all platforms using the glite-data-dm-util_R_1_11_12_3 and glite-data-gfal_R_1_11_12_3 configurations.

While SVN feels much better when renaming files, tagging involves a certain degree of complexity:



svn copy https://svn.cern.ch/reps/glitedm/trunk/gfal \
https://svn.cern.ch/reps/glitedm/tags/glite-data-gfal_R_1_11_12_3 \
-m glite-data-gfal_R_1_11_12_3

compared to CVS

    cvs tag glite-data-gfal_R_1_11_12_3

or git

   git tag glite-data-gfal_R_1_11_12_3

Wednesday, November 25, 2009

FTS releases

The current, 'WLCG approved' version of FTS is FTS 2.1.

gLite has already released FTS 2.2 (i.e. FTS 2.2.0); however, ATLAS has
discovered some shortcomings with the checksum support in the
Pilot service, which were fixed in the upcoming FTS 2.2.1 and
FTS 2.2.2 releases.

FTS 2.2.2 has been certified by gLite and installed on the CERN FTS
Pilot service and will be running until the beginning of December
to reach the 'WLCG approved' status.

Currently we are working on FTS 2.2.3 to address a few more issues.

Have a look at the FTS patch status page for more details on older releases!

releases

My magic URL for upcoming data management releases points into Savannah.

We usually create a 'patch' (i.e. release) when we have a draft idea of what should
go into a release. For example, FTS 2.2.3 and GFAL 1.11.13 are created with all the
bugs attached that we intend to fix by the release date.

The first noteworthy state is 'Ready for certification', when the developers have
finished their work and there are already RPMs created. At this point we usually
upload the packages into our Release Candidate repository for the convenience of
early testers.

The next noteworthy state is 'Certified', when the release has passed all regression
tests and the new features seem to be working.

After this state there are a few weeks of testing (i.e. waiting to see whether there is any unexpected
behaviour) in the pre-production testbed (PPS) and then comes the gLite release.

Thursday, November 19, 2009

Unifying LCG_Util and GFAL version numbers

A usual source of confusion is which LCG_Util version requires which GFAL library version. After almost every release, somebody installed the wrong packages somewhere. Now the confusion is over: from the next release on, we will always release these two components together, under the same version numbers (but with different tag prefixes, of course). We will create the first such release pair this week, with version number 1.11.12-1 (the next GFAL version number). This means that there will be a gap in the LCG_Util numbering: version 1.7.8-1 will jump to 1.11.12-1. Stay tuned.

Wednesday, November 4, 2009

Debugging tricks

When you want to debug the command-line tools of the projects, you immediately find that the commands are in fact shell scripts. They are actually libtool wrapper scripts that set up several things before calling the binaries themselves. You need to invoke gdb in the following way:

libtool --mode=execute gdb _command_

From this point, everything should work as usual.

Next, you may run into the following trouble when debugging:
[Thread debugging using libthread_db enabled]
Error while reading shared library symbols:
Cannot find new threads: generic error
For me, it occurred on SLC5, when the code used the dynamic linking library (dlopen, etc.). You can eliminate the problem by linking libpthread directly to your executable. For example:

lcg_del_LDADD = $(COMMON_LIBS) -lpthread

Good luck!

Tuesday, November 3, 2009

GFAL and LCG_Util test bed developments

Currently, the GFAL test suite contains integration and regression tests only. Those tests are developed and executed by the certification process. We need something more flexible: basically, we need unit/white-box tests that check the validity of the GFAL code up to the boundaries of its dependencies. We have started to create a unit test suite for both GFAL and LCG_Util, for debugging and internal validation purposes. The unit test suite requires some redesign of the code (redesign for testability). The pattern behind it is dependency injection. The code covered by unit tests will never call external library code directly (except for the standard C library functions); it will call it through replaceable function pointers. In production, the pointers point to the original functions; a white-box test, however, can replace a set of functions with dummy ones simulating a scenario.
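The idea of replaceable function pointers can be sketched in a few lines. GFAL itself is C, so this is only a Python analogue under assumptions: `deps`, `is_empty` and `run_with_fakes` are hypothetical names, and `os.path.getsize` merely stands in for an external library call.

```python
import os
from types import SimpleNamespace

# Production wiring: every entry points to the real function.
# In C this would be a struct of function pointers.
deps = SimpleNamespace(getsize=os.path.getsize)

def is_empty(path):
    """Code under test: it reaches the outside world only through `deps`."""
    return deps.getsize(path) == 0

def run_with_fakes(fakes, func, *args):
    """Temporarily replace entries of `deps`, call func, then restore them."""
    saved = {name: getattr(deps, name) for name in fakes}
    try:
        for name, fake in fakes.items():
            setattr(deps, name, fake)
        return func(*args)
    finally:
        for name, real in saved.items():
            setattr(deps, name, real)
```

A white-box test can then simulate a scenario without touching the file system, for example `run_with_fakes({"getsize": lambda p: 0}, is_empty, "/no/such/file")`.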

We do not change the whole GFAL code at once, to avoid regressions. We change the code gradually as we solve the Savannah tasks. All the appropriate Savannah tasks will come with unit tests as well, and only the code affected by a task will be changed. What we will get is a "hybrid": in some cases, functions will be called directly, in other cases indirectly. Ideally, we reach full unit test coverage when the function call methods get unified.

We will demonstrate the power of the unit test suite by solving the RFE: Extra parameter in lcg-cp for a better TURL construction.

After unit tests, we need a controllable regression/integration test suite. As we cannot control the certification test bed, and it is tightly bound to the certification test environment, we are creating our own test bed, better integrated into the development environment. We do this by copying the certification tests into our source tree and adapting them to our environment; then we start adding tests covering our tasks and purposes.

Friday, October 16, 2009

Chrome for ETICS

I have tried Chrome with the ETICS web interface and the speed improvement of the JavaScript engine is impressive.

The only catch is the authentication, which you have to solve by command line tools, as Google expects that the OS provides the GUI for certificate handling.

To solve the bootstrapping problem here are some useful commands, given that you have libnss3-tools and ca-policy-igtf-classic installed:


cd ~/.globus
openssl pkcs12 -export -inkey userkey.pem -in usercert.pem -out usercert.p12
pk12util -d sql:$HOME/.pki/nssdb -i ~/.globus/usercert.p12
certutil -d sql:$HOME/.pki/nssdb -A -t "C,," -n cern-root -i /etc/grid-security/certificates/d254cc30.0
certutil -d sql:$HOME/.pki/nssdb -A -t "C,," -n cern-online -i /etc/grid-security/certificates/1d879c6c.0

Monday, September 14, 2009

GFAL release 1.11.11-1

The release contains the following minor fixes:

- IPv6 compliance
- Manual page update

The release tag is:

glite-data-gfal_R_1_11_11_1

IPv6 compliance in FTS and GFAL

There were several Savannah tasks targeting IPv6 compliance; they have now been resolved. The related tasks are:

#41844: IPv6 bug; LCG-utils client functionality immediately broken by IPv6

#41278: IPv6 bug: non compliant address in source code (hard coded IPv4: 127.0.0.1)
#41585: [FTA] IPv6 bug: non compliant name resolving function (gethostbyname_r)
#41586: [FTA] IPv6 bug: non compliant name resolving function in source code (gethostbyname_r)

See the resolution details in the comments of the individual tasks. Basically, the general solution was:

- remove dependency on the pre-compiled gSoap library
- take stdsoap2.c directly from the gSoap sources
- compile the above file with WITH_IPV6 defined

The release tags including the IPv6 compliance are:

glite-data-srm-api-c_R_1_1_0_12
glite-data-srm2-api-c_R_2_2_0_6
glite-data-transfer-cli_R_3_7_2_1
glite-data-transfer-agents_R_3_4_2_1
glite-data-gfal_R_1_11_11_1

These tags also identify the affected components: they are the ones that implement SOAP communication with gSoap. The IPv6 functionality is completely encapsulated in gSoap, so we did not have to change the implementation; it was only a configuration issue.
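Several of the tasks above complain about gethostbyname_r, which is limited to IPv4. As an illustration only (this is not the gSoap or FTA code, and `resolve_all` is a hypothetical helper), protocol-independent name resolution through getaddrinfo looks roughly like this in Python:

```python
import socket

def resolve_all(host, port):
    """Return (family, address) pairs for host, covering IPv4 and IPv6."""
    infos = socket.getaddrinfo(host, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual address for both address families.
    return [(family, sockaddr[0]) for family, _type, _proto, _canon, sockaddr in infos]
```

The caller no longer assumes an address family: a client can try each returned address in turn instead of hard-coding 127.0.0.1 or an IPv4-only lookup.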

Tuesday, September 8, 2009

FTS from Python

I was writing a Python binding for the transfer-cli functionality.

This is how far I got:


import fts

# Query the FileTransfer service
f = fts.FileTransferService()
print("# FTS using endpoint: %s" % f.endpoint())
print("# FTS service version: %s" % f.version())
print("# FTS interface version: %s" % f.interface_version())
print("# FTS schema version: %s" % f.schema_version())

# Query the ChannelManagement service and list the channels
c = fts.ChannelManagement()
print("# CM using endpoint: %s" % c.endpoint())
print("# CM service version: %s" % c.version())
print("# CM interface version: %s" % c.interface_version())
print("# CM schema version: %s" % c.schema_version())
print(c.channel_names())


And the output was:

# FTS using endpoint: https://lxvm0307.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer
# FTS service version: 3.7.0-1
# FTS interface version: 3.7.0
# FTS schema version: 3.4.1
# CM using endpoint: https://lxvm0307.cern.ch:8443/glite-data-transfer-fts/services/ChannelManagement
# CM service version: 3.7.0-1
# CM interface version: 3.7.0
# CM schema version: 3.4.1
('ASGC-CERN', 'BNL-CERN', 'CERN-ASGC', 'CERN-BNL', 'CERN-CERN', 'CERN-FNAL', 'CERN-GRIDKA',
'CERN-IN2P3', 'CERN-INFN', 'CERN-NDGF', 'CERN-NIKHEF', 'CERN-PIC', 'CERN-RAL', 'CERN-SARA',
'CERN-TRIUMF', 'FNAL-CERN', 'GRIDKA-CERN', 'IN2P3-CERN', 'INFN-CERN', 'NDGF-CERN', 'NIKHEF-CERN',
'PIC-CERN', 'RAL-CERN', 'SARA-CERN', 'TRIUMF-CERN', 'CERN-STAR')


The minimum goal is to have submit and status checking working.

Thursday, August 27, 2009

Fixes in GFAL and lcg_util

lcg_util: should use -1 length for gridftp/CKSM

In GFAL, we always calculate checksums for the whole file (if it is needed). However, the corresponding Globus API function (globus_ftp_client_cksm) was used with the wrong length parameter: it was 0 (calculate the checksum of 0-length data) instead of -1 (calculate until the end of the file). This error also revealed an inconsistency in DPM, which interpreted this value slightly differently from the API specification.
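The difference between the two length values is easy to see on a toy example. The sketch below is not the Globus API; `range_adler32` is a hypothetical helper that only mimics the convention described above (-1 means "until the end of the data"), using Adler-32 from the Python standard library.

```python
import zlib

def range_adler32(data, offset=0, length=-1):
    """Adler-32 of data[offset:offset+length]; length == -1 means 'to the end'."""
    if length == -1:
        chunk = data[offset:]
    else:
        chunk = data[offset:offset + length]
    return zlib.adler32(chunk) & 0xFFFFFFFF

whole = range_adler32(b"Wikipedia", 0, -1)   # checksum of the whole content
nothing = range_adler32(b"Wikipedia", 0, 0)  # checksum of zero bytes
```

With length 0 the result is the Adler-32 of empty input, which is the same for every file, so it can never detect a corrupted transfer.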

GFAL: shall handle abbreviated checksum names


This is a workaround for the way DPM names the checksum algorithms. Actually, the endpoints should follow the GridFTP specification, but DPM has already implemented a different name set. It will be changed in DPM as well; however, the change may not be deployed everywhere soon. So, internally and temporarily, GFAL detects and converts the DPM names to the GridFTP conventions.
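Such a conversion is essentially a lookup table. The sketch below is a guess at its shape, not the GFAL code: the short names on the left-hand side are an assumption about the DPM abbreviations, and `normalize_checksum_name` is a hypothetical function.

```python
# Assumed DPM short names mapped to GridFTP-style algorithm names.
DPM_TO_GRIDFTP = {"AD": "ADLER32", "CS": "CRC32", "MD": "MD5"}

def normalize_checksum_name(name):
    """Return the GridFTP-style name; names that are not abbreviations pass through."""
    upper = name.strip().upper()
    return DPM_TO_GRIDFTP.get(upper, upper)
```

Names that already follow the GridFTP convention are returned unchanged (apart from upper-casing), so the workaround can stay in place after DPM is fixed.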

The fixes have been sent for integration and certification.

Monday, August 24, 2009

GFAL activities in Savannah

The following GFAL bugs have been certified:

Unable to get a tURL in a full space
See the description and the solution here.

GFAL: Problems with LDAP queries on SL5
The LDAP filters contained spaces between the filter parts and the operators. It seems that it is not allowed, but some GFAL releases (on different platforms probably) have worked with spaces.
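To illustrate what removing those spaces means, here is a small sketch that compacts a filter such as the one in the BDII log quoted in the "Deprecated LDAP API functions in GFAL" post. It is not the GFAL fix (there, the hard-wired filter templates were edited); `compact_filter` is a hypothetical helper that only touches whitespace in structural positions, never inside attribute values.

```python
import re

def compact_filter(ldap_filter):
    """Remove whitespace around parentheses and operators of an LDAP filter."""
    f = ldap_filter.strip()
    # "( | (" -> "(|(" : space around an operator that follows an opening paren
    f = re.sub(r"\(\s*([&|!])\s*(?=\()", r"(\1", f)
    # ") (" or ") )" -> ")(" or "))" : space between a closing paren and the next paren
    f = re.sub(r"\)\s+(?=[()])", ")", f)
    return f

raw = (" (| (GlueSEUniqueID=lxb7608v1.cern.ch) (& (GlueServiceType=srm*)"
       " (GlueServiceEndpoint=*://lxb7608v1.cern.ch:*)))")
compact = compact_filter(raw)
```

Here `compact` becomes `(|(GlueSEUniqueID=lxb7608v1.cern.ch)(&(GlueServiceType=srm*)(GlueServiceEndpoint=*://lxb7608v1.cern.ch:*)))`.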

At the same time, we introduced a new activity:
updating deprecated LDAP calls

OpenLDAP still continues supporting deprecated functions, so the priority is low.

Thursday, August 13, 2009

playing with git

I have set up a CVS-to-git export cron job for the gLite data management and CASTOR CVS modules so that we can try git.

On RedHat derivatives you can install from the DAG repository

yum install git git-cvs git-svn qgit

On Ubuntu you can install the dependencies as

sudo apt-get install git-core git-svn git-cvs qgit giggle

And then you can check out a module:

$ time git clone git://lxtank02.cern.ch/org.glite.data.srm-util-cpp
...
real 0m0.803s
user 0m0.340s
sys 0m0.044s

At this point my checkout was a bit lost with the branches,
so I needed a bit of help to find the way back to the true path:

$ cd org.glite.data.srm-util-cpp
$ git merge origin/origin

At this point you can start branching at will to play with the GFAL code.

You can sync up later with CVS using
$ git pull

Once we move from CVS to SVN, committing through git will also become feasible.


To see the efficiency of the storage here is a small experiment:

$ git clone git://lxtank02.cern.ch/CASTOR2
$ du -sh CASTOR2/.git
37M CASTOR2/.git
$ cd CASTOR2; time git pull
...
real 0m0.273s
user 0m0.092s
sys 0m0.124s

$ rm -rf CASTOR2
$ cvs -d ':pserver:anonymous@isscvs.cern.ch:/local/reps/castor' co CASTOR2
$ du -sh CASTOR2
58M CASTOR2
$ cd CASTOR2; time cvs up
...
real 0m2.700s
user 0m0.156s
sys 0m0.136s

In plain words the storage size of all the versions going back to 1999 (37MB) is smaller than the workspace (58MB).

Monday, August 10, 2009

Wrong LDAP search filters?

Today, I tried to replace the deprecated LDAP functions, related to this post. As I had never done anything with LDAP (new skill ;) ), I first wanted to get familiar with it. Google Code Search helped a lot; however, it turned out that the problem might not be related to the deprecated functions, because the LDAP API developers still maintain them for backward compatibility. What I did in this context was:
  • Created the ldap_facade module, to hide the calling details of the deprecated functions (which function is called in fact, some of the always-the-same LDAP API function parameters, etc.).
  • Added the ldap_facade_init and ldap_facade_search functions only, and left the rest (may be added later, if change is needed there).
To check whether I was able to connect to the LDAP server and execute the search that appeared in the log attached to the bug, I installed the Apache LDAP browser plugin for Eclipse, created a connection to the server, and copied the search filter. Here, I made an observation: the plugin did not accept the search filter, because it started with a space...

I checked it in the code, the hard-wired template filters really started with a single space. If I removed it, then the test command went further, and tried to do some SRM operations.

So, the questions:
  • Is it true that the LDAP search filter string cannot start with a white space?
  • If it is true, then is it a bug in the code? Other filters start with space as well.
  • If it is a bug, why have we not seen it so far?

Friday, August 7, 2009

Deprecated LDAP API functions in GFAL

On SLC5, certification of the following patch failed:

https://savannah.cern.ch/patch/?3119

The error can be reproduced in the following way:

  1. Check out and build org.glite.data with GFAL (instructions here). The next steps are a continuation of the linked build process.
  2. Build the org.glite.data.dm-util package:

    cd ~/org.glite.data.dm-util/build
    make install

  3. Execute the following commands:

    cd src
    export LCG_GFAL_INFOSYS=lcg-bdii.cern.ch:2170
    ./lcg-cr -d srm://lxb7608v1.cern.ch/dpm/cern.ch/home/dteam/test_rm_02 -D srmv2 -vv /etc/redhat-release
(about the LCG_GFAL_INFOSYS: see this). The command results:

Using grid catalog type: lfc
Using grid catalog : (null)
Checksum type: None
[INFO] BDII server: lcg-bdii.cern.ch:2170/o=grid
[INFO] BDII filter: (| (GlueSEUniqueID=lxb7608v1.cern.ch) (& (GlueServiceType=srm*) (GlueServiceEndpoint=*://lxb7608v1.cern.ch:*)))
[INFO] Trying to use BDII: lcg-bdii.cern.ch:2170/o=grid (timeout 60)
[BDII][ldap_search_st][] lcg-bdii.cern.ch:2170 > Bad search filter
[GFAL][bdii_query_send][EINVAL] No accessible BDII
lcg_cr: Invalid argument

That is the same result that can be seen in the bug. After analysis, we found that several LDAP C API functions got deprecated on this platform. Ideas at this point:

- We may as well create a facade to the LDAP API.
Pros: unit-testability without real LDAP servers (mock functionality). Future LDAP API changes are isolated.
Cons: The LDAP API does not change frequently, so it is probably not worth the effort.

TODO-s:

- Change the deprecated LDAP functions
- Create a regression test for the bug
- Re-send the patch for certification.

small fixes: transfer-agents and transfer-cli

There were a couple of other updates:

glite-data-transfer-agents v3.4.1-1

Really fixing #47507: SRMv2.2 as default.
This is a two character fix, which finally made it to the release.

If you cannot wait then there are some workarounds. The original idea of adding
FTA_GLOBAL_ACTIONS_SRMVERSION="2.2"
worked only for the VO agents, so one also has to add the following lines to the Yaim config:
FTA_TYPEDEFAULT_SRMCOPY_ACTIONS_SRMVERSION="2.2"
FTA_TYPEDEFAULT_URLCOPY_ACTIONS_SRMVERSION="2.2"

... or simply submit a full SURL to FTS including the endpoint of the SRMv2 server.


glite-data-transfer-cli v3.7.1-1
  • Fixing a regression introduced in March 2008: the overwrite flag (-o) should not require an argument.
  • Updated the test suite to the latest FTS service.

transfer-url-copy version 3.2.1-rel2 released

The affected module is: org.glite.data.transfer-url-copy.

The changes are:
  • Warning removal
  • Partial implementation of the code review results: descriptive enum-s to signal the actual checksum checking use case.
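To show what "descriptive enum-s" buys over a pair of booleans, here is a minimal sketch. The component itself is C++, and every name below (`ChecksumMode`, its members, `mode_from_flags`) is hypothetical; the real use cases in transfer-url-copy may differ.

```python
from enum import Enum

class ChecksumMode(Enum):
    """Hypothetical checksum checking use cases, one descriptive name each."""
    DISABLED = "no checksum checking"
    USER_SUPPLIED = "compare the destination with a user-supplied value"
    SOURCE_VS_DEST = "compare the source and the destination"

def mode_from_flags(enabled, has_user_value):
    """Translate the old pair of booleans into a single descriptive value."""
    if not enabled:
        return ChecksumMode.DISABLED
    return ChecksumMode.USER_SUPPLIED if has_user_value else ChecksumMode.SOURCE_VS_DEST
```

Code that branches on `ChecksumMode.SOURCE_VS_DEST` documents itself, and the meaningless flag combination (disabled, yet with a user value) can no longer be represented.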
The new release tag is:

glite-data-transfer-url-copy_R_3_2_1_2

The functionality and the behaviour have not been changed.

See the component in the CVS.

Building GFAL

OK, it's time to build this package. The selected platform is SLC5 (Scientific Linux @ CERN 5), because the world will soon move to this platform, and there are some problems there.

The build is done using ETICS, a special software configuration management system funded by the European Commission, CERN, etc. The steps I took were (ETICS had already been installed):

mkdir GFAL
cd GFAL
etics-workspace-setup
etics-checkout --ignorelocking --continueonerror --project-config glite_branch_3_2_0_dev org.glite.data
etics-build --continueonerror org.glite.data

GFAL is the (empty) project directory, I will refer to it as $WORKSPACE. This series of commands builds the whole org.glite.data suite.

Great, the suite build was successful; let's see the GFAL build. It is best to start working under e-env:

cd $WORKSPACE
org.glite.data/bin/e-env

Then:

cd $WORKSPACE/org.glite.data.gfal/build
make install

We should have a couple of executables with names gfal_test*. Let's execute the tests. During the tests, I had a valid proxy credential in the LCG Deployment Team Virtual Organization (dteam).

./gfal_version

Returned:

GFAL-client-1.11.8-1

touch a;
./gfal_teststat a

Returned:

stat successful
mode = 100644
nlink = 1
uid = 1000
gid = 1000
size = 0

COOL! And then the other test commands.

Thursday, August 6, 2009

Checksum code review

The latest FTS development was adding checksum support to verify that the data has been transferred properly, and that the source/destination files have not been altered. The related requirement specification can be found in the wiki:

FtsChecksums

The feature has been transferred to the package certification process.

Today, we had a code review with Rosa and Ákos, we reviewed the checksum-related code. After a discussion about some fancy C++, Boost, STL features + some potential Google interview questions :), we had two findings that will be changed:

- The system determines the actual checksum use case and stores it in bool variables - enum-s should be used instead, with descriptive names.
- The asynchronous SRM operations are called synchronously, so the same send/poll pairs always go together in the code. They should be merged into one function that encapsulates the new exponential backoff functionality as well.
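The second finding can be sketched as a single helper that sends a request and then polls it with exponentially growing delays. This is only an outline of the idea, not the FTS code (which is C++): `send_and_poll`, its parameters and the default values are assumptions, and `send`/`poll` stand for whatever SRM calls are paired in practice.

```python
import time

def send_and_poll(send, poll, first_delay=1.0, factor=2.0, max_delay=60.0,
                  max_polls=10, sleep=time.sleep):
    """Send an asynchronous request, then poll it with exponential backoff.

    send() returns a request token; poll(token) returns (done, result).
    """
    token = send()
    delay = first_delay
    for _ in range(max_polls):
        done, result = poll(token)
        if done:
            return result
        sleep(delay)                            # wait before the next poll
        delay = min(delay * factor, max_delay)  # grow the delay, up to a cap
    raise TimeoutError("request %r not finished after %d polls" % (token, max_polls))
```

Passing `sleep` as a parameter keeps the helper unit-testable: a test can record the requested delays instead of actually waiting.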

We found no bugs and the changes above will not modify the behaviour, so we do not need a new release now.

Wednesday, August 5, 2009

GFAL

Today, I have started to get familiar with the GFAL package. I had a discussion with Rémi Mollon, who is the current package owner and will stop maintaining the package in October.

GFAL provides a POSIX-compliant C and Java API to access data in a grid environment. An important package is based on GFAL: the LCG Util package, which is in fact maintained together with GFAL.

We had an idea to merge some components of FTS and GFAL, especially the SRM access layer. In FTS, it is the org.glite.data.srm-utils-cpp component, written in C++. In GFAL, it is written in C. Re-implementing org.glite.data.srm-utils-cpp on top of the GFAL SRM access layer would be beneficial:

- Maintaining would be much easier
- There are several features that are under development now, and must be implemented in both packages (for instance, exponential backoff for failed or long requests, see FTS Request; there is a similar GFAL Request).
- org.glite.data.srm-utils-cpp has good unit test coverage -> the appropriate GFAL code would also be tested better if we execute the FTS test suite.

TODO: I need to check the feasibility of the above idea before proposing anything. So, in the next days, I will do the following on GFAL side:

- Organize a coffee with the responsible people from the experiments :)
- Set up a GFAL development environment
- Compile, run the tests, solve the complications coming from certificates, security, etc.
- Re-implement SrmLs on top of the GFAL SRM access layer (this is the most heavily used SRM call in FTS)
- Execute the FTS tests.

LCG Util


Today, I have started to get familiar with the LCG Utils package. I had a discussion with Rémi Mollon, who is the current package owner and will stop maintaining the package in October.

First, I wanted to explore the users of the project: they are the ATLAS and LHCb experiments at CERN.

Then, we went through the dependencies between LCG Utils and the rest of the gLite project. The most important is:

LCG Utils is the main end-user command-line tool for data management provided by LCG. It implements high-level file management operations, such as copying files.

It depends on GFAL.

TODO: identify the main users and stakeholders on the experiment side. Have a coffee with them :)