COCHRANE CENTRAL ADVISORY GROUP (CCAG) REPORT

 

 

1.  How many meetings, and of what type (e.g. face-to-face, by teleconference), has your Advisory Group had since October 2004?  Is this what you expected in your previous report?

 

•           The last face-to-face meeting of the full CCAG was 3 October 2004 at the 12th Cochrane Colloquium (Ottawa, ON, Canada). The final version of the minutes of the Ottawa meeting has been circulated and approved. All CCAG minutes will be available at www.cochrane.us shortly.

 

•           A teleconference took place on 9 December 2004 with the CCAG. Topics discussed included: maintenance of CENTRAL and supporting documentation; Chapter 5 of the Cochrane Reviewers' Handbook; and the Master List update mailing process. Throughout the past six months, the CCAG E-mail Discussion List has been used extensively to discuss, and reach consensus on, the final contents of the CENTRAL Management Plan, and to nominate two additional CRG TSC representatives to serve on the CCAG.

 

•           The CCAG's next face-to-face meeting is scheduled to take place at the 13th Cochrane Colloquium in Melbourne, Australia, in October 2005.

 

2.  Supply an up-to-date list of the members of your Advisory Group (as of January 14, 2005).

 

Kay Dickersin (CCAG Convenor, USCC)

Davina Ghersi (CCSG Representative, Breast Cancer Group)

William Gillespie (Coordinating Editor Representative, Musculoskeletal Injuries Group)

Elena Glatman (CENTRAL Activities Coordinator, USCC)

Diane Haughton (RGC Representative, Neonatal Group)

Carol Lefebvre (Information Specialist, UKCC)

Steff Lewis (CCSG Representative, Stroke Group)

Eric Manheimer (Field/Network Representative, Complementary Medicine Field)

Hugh McGuire (TSC Representative, Depression, Anxiety and Neurosis Group)

Marijke Moll (Field/Network Representative, Rehabilitation and Related Therapies Field)

Indy Rutks (TSC Representative, Prostate Group)

 

Andrew Cullis, Pauline Howarth, Deborah Pentesco-Gilbert, Karen Robinson, Hazim Timimi, and Susan Wieland are ex officio members of the CCAG.


3.  Summarize any significant actions taken by your Advisory Group since your last report (for the CCSG meeting in Ottawa in October 2004), and significant actions planned for the next six months until the next meeting of the CCSG in Providence in April 2005.

 

Previous six months:

•                      Submissions for Issues 1 and 2, 2005 of CENTRAL have been processed at the USCC, and the logs on the handsearch and specialized register submissions have been sent to TSCs and the CCAG. This task continues to involve extensive quality control at the USCC, although individual feedback and instructions are sent to TSCs during and after processing. One of the underlying problems resulting in errors in CENTRAL submissions each quarter is a lack of training among TSCs, for example, in using a ProCite database to maintain a specialized register. Errors also result from understaffed entities (no appointed TSC), or from a large TSC workload in which fixing a register for CENTRAL submission is not a top priority. More detailed information on problems encountered during processing of submissions can be found on the USCC website at http://www.cochrane.us/logs.htm.

 

•                      Updating of the CENTRAL Management Plan (CMP) has been initiated.

 

•                      The Master List update mailing process was modified to meet TSCs' needs.

 

Next six months:

 

•           An updated version of the CMP will be available at http://www.cochrane.us/central.htm.

 

•           Work with the Cochrane Collaboration Steering Group, the Information Management Systems Group, Wiley Interscience, the Cochrane Publishing Policy Group, and TSCs to further develop and refine CENTRAL.

 

•           Decide upon a final set of fields to include in CENTRAL for publication in The Cochrane Library; begin to standardize journal titles in CENTRAL; and develop systems and rules for publishing references to ongoing and unpublished studies. Continue the work on improving CENTRAL begun in the previous six months.

 

4.  Does your Advisory Group have any questions that you would like the Steering Group to answer?  If so, please list them.

 

Does the CCSG have any suggestions for improving the situation with understaffed entities (due to funding issues, as reported by the MRG), or where TSCs work only part-time and do not have enough resources to fix specialized registers, since this problem affects the integrity of CENTRAL?

 

5.  Does your Advisory Group wish to raise any problems, and recommended solutions, which you would like the Steering Group to discuss?  If so, please list them.

 

Would there be any funding available to assist with new CENTRAL development and refinement in the future? (See pages 3-8).

 

6.  Do you foresee any problems in keeping within the budget you submitted for the current financial year (April 2004 to March 2005)?

                       

No. The CCAG budget for conference calls is adequate; however, our concerns lie with the CENTRAL budget.

 

7. What are your budgetary requirements for the period April 2005 to March 2006?  Please provide a breakdown if appropriate.  (As a reminder, the Steering Group sets the budget for each Group at its non-Colloquium meeting.)

 

A budget of £2,000 (approximately US $3,000), requested for teleconference calls only and submitted for the financial year April 2005 to March 2006, will be sufficient.

 

On January 4, 2005, the USCC submitted a request for funding to the CCSG, specifically for MEDLINE retagging activities that are not covered by other sources.

 

 

Kay Dickersin, Convenor, CCAG

22 February 2005


Development of a New and Improved CENTRAL

 

A.  Introduction

 

In February 2004, the Cochrane Collaboration Steering Group (CCSG) approved development of a new study-based Cochrane Central Register of Controlled Trials (CENTRAL) and its integration into the new Information Management System.  Review groups and others would enter their trial report records online directly into a database, which in turn would be submitted to the CENTRAL publisher.

 

The aim of rebuilding CENTRAL would be to develop a regularly updated, clean, non-redundant, study-based database of controlled trials, containing all relevant records from the 50+ Cochrane specialized registers, as well as records indexed as CONTROLLED CLINICAL TRIAL [PT] and RANDOMIZED CONTROLLED TRIAL [PT] in MEDLINE.  (Note that MEDLINE is the subset of PubMed in which records are indexed using Medical Subject Headings [MeSH].)

 

In March and April 2004, the US Cochrane Center performed a series of pilot investigations of methods proposed for rebuilding CENTRAL.

 

B.  Methods

 

            B.1  Database

 

            The proposed database would be relational and study-based.  We propose using Oracle (9.2 or 10g) for the new CENTRAL repository due to the anticipated size and projected rate of growth.  Compared to other options (e.g., Access, FoxPro), Oracle offers the best flexibility, speed, and built-in data integrity mechanisms to ensure current and accurate data.  For testing purposes Oracle 9.2 was used.  The server is a Dell PowerEdge 2600 with 136 GB of disk space, 1 GB of RAM, and a single Xeon processor running at 3.06 GHz.

 

            B.2  Populating Database

 

            We anticipate rebuilding and updating CENTRAL by downloading MEDLINE records indexed as RCT (PT) or CCT (PT), tagged as human, and importing them.  PubMed provides scripts (e-utilities) for this purpose.  There are advantages to rebuilding CENTRAL via the PubMed utilities.  For one thing, all 62 fields can be downloaded from PubMed's website.  Importing the records into a preliminary table with the PMID as the primary key eliminates all duplicates from PubMed, and downloading directly from the source avoids the typographical errors that manual entry can introduce.  A minimal sketch of such a staging table follows.
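
            As a concrete illustration of the primary-key de-duplication just described, here is a minimal SQL sketch.  The table and column names are ours, illustrative only, and not part of any agreed CENTRAL design:

                CREATE TABLE pubmed_stage (
                    pmid     NUMBER PRIMARY KEY,   -- PubMed unique identifier
                    authors  VARCHAR2(4000),
                    title    VARCHAR2(4000)
                );

                -- A second attempt to insert the same PMID raises ORA-00001
                -- (unique constraint violated), so duplicates never enter the table.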

 

            For the records that are either not part of PubMed (Embase, etc.) or do not have any ID, we will import those records directly into Oracle.  To eliminate duplicates, queries against the PubMed data will be based on the author and title fields.

 

B.3  Process

 

The process comprises two major streams, non-PubMed data and PubMed data.[1]  The non-PubMed data process begins with the extraction of all data from the specialized registers.

 

Importation of the specialized registers is done in multiple phases to ensure no data loss and to try to eliminate duplicates during the import phase, as opposed to trying to remove duplicates after an import is complete.

 

                        B.3.1  Extraction of records from existing registers

 

                    The Cochrane Eyes and Vision Group (CEVG) register was used as the specialized register for a pilot project.  The CEVG sent two different versions of its specialized register, and both were used for the pilot: one is the complete register, and the other is the subset of information sent to the US Cochrane Center for CENTRAL.  We analyzed data extraction from both files.

 

A.  The initial extract was based on the most current Procite file (the subset of fields in the complete register) sent to the US Cochrane Center for Issue 2, 2004 of The Cochrane Library.

             a.            Step 1: All 13 fields were extracted, based on the mapping file modified for the CEVG database, into a comma-delimited file with quotation marks indicating text.  Further modification of the extract was completed in a text editor, using a global search and replace to insert "\n" in the appropriate locations.

B.  The second extraction was based on the Procite database containing the complete register, sent to the US Cochrane Center by the CEVG for the pilot.  This database included all fields in the CEVG specialized register.

                             a.     Step 1: Determine the extract configuration.

                             b.     Step 2: Select all records and extract them into a text-delimited file.

                             c.     Step 3: Modify the extracted text in a generic text editor to allow correct import processing.

________________________

[1] Please refer to the process flow diagram.

 

 

            B.3.1.2  Import of records into MS Access

 

            After extraction from the Procite files, all records were imported into MS Access for preliminary processing.

 

            A.            Each table column was converted to the correct size and data type to ensure no data were truncated.

 

            B.            After import, the maximum length of each column was found and compared to the Procite file.  This provided further assurance that no data were lost during the import phase.

 

            C.            Determine CEVG register-specific issues.  We determined that approximately 70% of the MEDLINE records had a "19" prepended to the PubMed ID.[2]  (A corrective SQL sketch follows this list.)

 

            D.            Export all data to a new table within MS Access.
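
            The following SQL sketch illustrates the kind of correction implied by item C above.  The table and column names are hypothetical, and the length test assumes that genuine PMIDs at the time had at most eight digits:

                UPDATE cevg_stage
                   SET pubmed_id = SUBSTR(pubmed_id, 3)   -- drop the leading "19"
                 WHERE pubmed_id LIKE '19%'
                   AND LENGTH(pubmed_id) > 8;             -- over-long IDs only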

           

            B.3.1.3  Export of records from MS Access

 

            Copy the newly created table to a new MS Access database.  This step is completed due to a limitation in the Access-to-Oracle migration utility.  For each converted database, all tables are imported directly into Oracle into their own tablespace.  To ensure consistency, we wanted to limit the import to only the single table.  The ID field of the table was cleaned after import: many ID fields contained text as well as numeric values, so a new column (dbtype) was created, and any text was moved into the new column and removed from the ID field.
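
            A possible implementation of this cleanup is sketched below using TRANSLATE, since Oracle 9.2 lacks the regular-expression functions introduced in 10g.  The table name is hypothetical, and the '~' placeholder character is assumed not to occur in any ID:

                UPDATE register_import
                   SET dbtype = TRANSLATE(id, '~0123456789', '~'),   -- keep non-digits
                       id     = TRANSLATE(id,
                                  '~' || TRANSLATE(id, '~0123456789', '~'),
                                  '~')                                -- keep digits
                 WHERE TRANSLATE(id, '~0123456789', '~') IS NOT NULL; -- mixed IDs only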

 

            B.3.1.4  Import of MS Access records into Oracle

 

            Several different alternatives were tested, and we looked at a variety of factors including speed of import, compatibility of software, and the amount of time needed to integrate the process.  The alternatives included:

 

                 1.     Linking tables directly from Oracle to MS Access

                 2.     Creation of SQL Loader scripts

                 3.     The Access-to-Oracle migration tool

 

_____________________

[2] We randomly selected 10% of all PubMed IDs that contained extra digits to ensure that, after removing the "19", the correct record was pulled from the PubMed website.

 

            Based on our pilot testing, the Oracle MS Access to Oracle Migration Utility provided the most direct, seamless import into Oracle.  The import time was negligible, the column widths and data were maintained, Oracle's reporting of the process was complete, and the results were easily repeatable.

 

            Certain versions of Oracle Enterprise Manager (OEM) include utilities to help administer Oracle databases, among them the MS Access to Oracle Migration Utility.  The utility completes a number of tasks in order to import the records, including:

 

                 1.     Automatic creation of a unique tablespace, in the Oracle DB of your choice, for the tables to import.

 

                 2.     Automatic creation of table definition.

 

                 3.     Automatic creation of indexes as defined by MS Access.

 

                 4.     Automatic creation of sequence and triggers.

 

                 5.     Automatic granting of rights.

 

            B.3.2  Import of PubMed records into Oracle

 

            NLM provides scripts on its website for large data extractions.  We modified the scripts to suit our environment and download requirements.  Records were extracted from MEDLINE using the search parameters "Randomized Controlled Trial" and "Controlled Clinical Trial" as the publication types and "Human" as the MeSH term.  A total of 242,289 records were downloaded in XML format in a batch process that took 51 minutes.  The script initially produces a list of all PMIDs, which are then extracted in "Abstract" format.  PMIDs were sent in batches of 10,000 per query.

 

            B.3.2.1  Steps involved

 

                 1.     Modify the script provided by NLM to batch 10,000 IDs and to change the query parameters and output format.

 

                 2.     Run a crontab job to pull records in off hours (between 9 PM and 5 AM).  On the initial test, 242,289 records were pulled in 51 minutes, for a file size of 1.4 GB.

                 3.     Copy newly created XML file to Oracle server.

 

                 4.     Run script to transform XML data using XSL file created for import process.

 

                 5.     Import transformed XML into Oracle.

 

            B.3.2.2  Process and Timing Results

 

            The record retrieval process from PubMed takes approximately 2 hours from start to completion.  The initial part of the script retrieves the PMIDs for all human (MeSH heading) CCTs or RCTs.  The second part of the script extracts the full records in XML format.  The article requests are batched in groups of 10,000, per the recommendation of NLM.

 

            A total of 38 files are created during the extraction/download process.

 

            The import of the records into Oracle was completed using a combination of the DOM (Document Object Model) and SAX (Simple API for XML) models to parse the XML files.  The SAX parser reads the individual records into memory, and Oracle's DOM parser completes the import into Oracle.

           

            Preliminary fields imported into Oracle (a simplified table sketch follows the list):

            1.    Abstract
            2.    Affiliation
            3.    Article identifier
            4.    Author
            5.    Comment in
            6.    Comment on
            7.    Publication date
            8.    Grant number
            9.    Issue
            10.   ISSN
            11.   NLM unique ID for journals
            12.   Language
            13.   MeSH terms
            14.   Pagination
            15.   PubMed unique identifier
            16.   Publication type
            17.   Source
            18.   Title
            19.   Volume
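
            To make the field list concrete, a simplified single-table sketch with plausible Oracle datatypes is shown below.  This is illustrative only: the actual pilot design is normalized across twelve tables (see B.3.2.3), and all names and column sizes here are our assumptions:

                CREATE TABLE pubmed_preliminary (
                    pmid        NUMBER PRIMARY KEY,  -- PubMed unique identifier
                    title       VARCHAR2(4000),
                    author      VARCHAR2(4000),
                    source      VARCHAR2(500),
                    pub_date    VARCHAR2(50),        -- publication date as supplied
                    volume      VARCHAR2(50),
                    issue       VARCHAR2(50),
                    issn        VARCHAR2(20),
                    lang        VARCHAR2(50),
                    pagination  VARCHAR2(100),
                    pub_type    VARCHAR2(200),
                    abstract    CLOB                 -- abstracts can exceed VARCHAR2 limits
                );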

 

            B.3.2.3  Initial Database Design

 

            The initial database was created in Oracle 9.2.  We used a standard database (i.e., not XML) to allow multiple export functions.

 

            It was determined that specialized register records will be stored in a different table structure from PubMed records.  This will allow faster processing of records to check for duplicates and to change records from non-PubMed to PubMed style.

 

            The initial database design for PubMed records consists of twelve tables.  The primary key is an autonumber generated by Oracle with a prefix of 55, with the sequence starting at 55000000.
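
            A minimal sketch of this key-generation scheme follows (object names are illustrative).  Because the sequence starts at 55000000, every generated key carries the 55 prefix until the sequence passes 55999999:

                CREATE SEQUENCE pubmed_pk_seq
                    START WITH 55000000
                    INCREMENT BY 1;

                CREATE OR REPLACE TRIGGER pubmed_pk_trg
                    BEFORE INSERT ON pubmed
                    FOR EACH ROW
                BEGIN
                    -- Oracle 9.2 requires SELECT ... INTO from dual here
                    SELECT pubmed_pk_seq.NEXTVAL INTO :NEW.record_id FROM dual;
                END;
                /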

 

            Based on testing, the import of the 300,000+ records will take approximately 6 hours.

 

            B.3.3  Elimination of duplicates and import of non-PubMed records

 

            There is overlap between the specialized registers and PubMed data.  Due to data integrity issues in the specialized register files themselves, all records needed to be compared to the newly downloaded PubMed records.

 

            PLSQL scripts were created to compare each record (PubMed download vs. specialized register) on its author and title fields.  If the author and title matched an imported PubMed record, the specialized register record was skipped.  If the record did not match, its ID field was examined: the value of that field, together with whether the dbtype field was populated, determined the unique ID of the record.  The record was then imported into the table.
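
            The sketch below shows the general shape such a PLSQL script might take.  All table and column names are our assumptions, and matching is shown as a simple case-insensitive equality test rather than the pilot's actual logic:

                DECLARE
                    v_hits NUMBER;
                BEGIN
                    FOR rec IN (SELECT id, dbtype, author, title
                                  FROM register_stage) LOOP
                        -- Look for an author/title match among imported PubMed records
                        SELECT COUNT(*) INTO v_hits
                          FROM pubmed
                         WHERE UPPER(author) = UPPER(rec.author)
                           AND UPPER(title)  = UPPER(rec.title);
                        IF v_hits = 0 THEN
                            -- No match: keep the record; its unique ID is derived
                            -- from the id and dbtype fields as described above
                            INSERT INTO non_pubmed (id, dbtype, author, title)
                            VALUES (rec.id, rec.dbtype, rec.author, rec.title);
                        END IF;
                    END LOOP;
                    COMMIT;
                END;
                /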

 

            PubMed tables:

 

                    1.     PubMed

                    2.     MeSH heading

                    3.     Chemical list

                    4.     Abstracts

 

            Non-PubMed tables:

 

                    1.     Non-PubMed

                    2.     MeSH headings

                    3.     Abstracts

 

            B.3.4  Handsearching Database

            B.3.4.1 Data Extraction

 

            The July 2004 handsearching database (Procite) is being investigated to determine the best way to remove any duplicates and apply consistent characteristics to each of the datatypes.  The handsearching records (108,719) were all exported using the three workforms associated with the database.  Each extraction was then imported into MS Access.  Each column was predefined to provide ample space for large fields (e.g., abstracts); the import function in MS Access allows the user to place the contents into a new or custom table.  We chose to import the following fields:

 

            1.    Author

            2.    Title

            3.    Medium

            4.    Journal title

            5.    Date of publication

            6.    Volume

            7.    Issue

            8.    Location in work (pagination)

            9.    Author role

            10.    Call number

            11.    Location/URL

            12.    Abstract

            13.    Keywords

            14.    CODEN

            15.    Availability

            16.    Notes

            17.    Author, monographic

            18.    Proceedings title

            19.    ISBN

            20.    Place of meeting

            21.    Place of publication

            22.    Editor

 

            B.3.4.2  Cleanup and Combining of data

 

            Each field of the handsearching database was mapped to the initial PubMed database design.  Additional tables/fields will need to be added to accommodate the conference proceedings records.

 

            Since there is overlap between the handsearching database and the PubMed records, all records that contain an ID will be compared to the PubMed import.  If there is a numeric match, then a comparison will be completed of the author and title; if there is an exact match, the record from the handsearching database will be deleted.  An additional author/title comparison will be completed for all records.  All non-duplicates will be imported into the same format as the PubMed records, with a different unique ID (based on an algorithm).
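
            The two-step check could look something like the following sketch.  All names are hypothetical (the id column is assumed to be numeric), and real matching would need to normalize punctuation and case more carefully than shown here:

                -- Step 1: numeric ID matches a PMID and author/title also match
                DELETE FROM handsearch h
                 WHERE h.id IS NOT NULL
                   AND EXISTS (SELECT 1 FROM pubmed p
                                WHERE p.pmid = h.id
                                  AND UPPER(p.author) = UPPER(h.author)
                                  AND UPPER(p.title)  = UPPER(h.title));

                -- Step 2: a further author/title-only pass over remaining records
                DELETE FROM handsearch h
                 WHERE EXISTS (SELECT 1 FROM pubmed p
                                WHERE UPPER(p.author) = UPPER(h.author)
                                  AND UPPER(p.title)  = UPPER(h.title));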

 

 

            C.  Known Limitations of the Pilot

 

            As with any pilot, not all possible issues can be addressed.  We recognize certain limitations exist, but do not feel that any are insurmountable or will change the overall process.

 

1.         The pilot was based on two existing Procite files from the CEVG.  The main issue is determining the proper export format to ensure that all data are exported correctly with their associated records.  Each group within the Cochrane Collaboration may use a different software package (e.g., MeerKat, Reference Manager, Procite, EndNote).  Based on our preliminary findings with the Procite files and a basic understanding of MeerKat, we are confident that all existing records from each specialized register can be exported completely and converted into the new database.

 

2.         New ID system required.  We cannot base our ID numbering system on the IDs granted by PubMed or any other database.  A new ID numbering system will need to be designed to prevent duplication of IDs and to assure administrators that data are being stored properly.  We have investigated this and propose using a unique starting numeric sequence depending on whether the information is PubMed or non-PubMed; these IDs will not overlap.  We also anticipate records that have no ID, which we will need to review to determine whether they are already included in a bibliographic source (e.g., PubMed) and, if so, to download them.