COCHRANE CENTRAL ADVISORY GROUP (CCAG)
REPORT
1. How many meetings, and of what type (e.g., face-to-face, by teleconference), has your Advisory Group had since October 2004? Is this what you expected in your previous report?
The last face-to-face meeting of the full CCAG was 3 October 2004 at the 12th Cochrane Colloquium (Ottawa, ON, Canada). The final version of the minutes of the Ottawa meeting has been circulated and approved. All CCAG minutes will be available at www.cochrane.us shortly.
A teleconference of the CCAG took place on 9 December 2004. Topics discussed included: maintenance of CENTRAL and supporting documentation; Chapter 5 of the Cochrane Reviewers' Handbook; and the Master List update mailing process. Throughout the past six months, the CCAG e-mail discussion list has been used extensively for discussion about, and development of a consensus on, the final contents of the CENTRAL Management Plan, and for the nomination of two new CRG TSC representatives to serve on the CCAG.
The CCAG's next face-to-face meeting is scheduled to take place at the 13th Cochrane Colloquium in Melbourne, Australia, in October 2005.
2. Supply an up-to-date list of the members of your Advisory Group (as of January 14, 2005).
Kay Dickersin (CCAG Convenor, USCC)
Davina Ghersi (CCSG Representative, Breast Cancer Group)
William Gillespie (Coordinating Editor Representative, Musculoskeletal Injuries Group)
Elena Glatman (CENTRAL Activities Coordinator, USCC)
Diane Haughton (RGC Representative, Neonatal Group)
Carol Lefebvre (Information Specialist, UKCC)
Steff Lewis (CCSG Representative, Stroke Group)
Eric Manheimer (Field/Network Representative, Complementary Medicine Field)
Hugh McGuire (TSC Representative, Depression, Anxiety and Neurosis Group)
Marijke Moll (Field/Network Representative, Rehabilitation and Related Therapies Field)
Indy Rutks (TSC Representative, Prostate Group)

Andrew Cullis, Pauline Howarth, Deborah Pentesco-Gilbert, Karen Robinson, Hazim Timimi, and Susan Wieland are ex officio members of the CCAG.
3. Summarize any significant actions taken by your Advisory Group since your last report (for the CCSG meeting in Ottawa in October 2004), and significant actions planned for the next six months until the next meeting of the CCSG in Providence in April 2005.
Previous six months:
Submissions for Issues 1 and 2, 2005 of CENTRAL have been processed at the USCC, and the logs on the handsearch and specialized register submissions have been sent to TSCs and the CCAG. This task continues to involve extensive quality control at the USCC, and individual feedback and instructions are sent to TSCs during and after processing. One of the underlying problems resulting in errors in CENTRAL submissions each quarter is a lack of training among TSCs, for example, in using a ProCite database to maintain a specialized register. Errors also result from understaffed entities (no appointed TSC), or from a large TSC workload where fixing a register for CENTRAL submission is not a top priority. More detailed information on problems encountered during processing of submissions can be found on the USCC website at http://www.cochrane.us/logs.htm.
Updating of the CENTRAL Management Plan (CMP) has been initiated.
The Master List update mailing process was modified to meet TSCs' needs.
Next six months:
An updated version of the CMP will be available at http://www.cochrane.us/central.htm.
We will work with the Cochrane Collaboration Steering Group, the Information Management Systems Group, Wiley InterScience, the Cochrane Publishing Policy Group, and TSCs to further develop and refine CENTRAL.
We will decide upon a final set of fields to include in CENTRAL for publication in The Cochrane Library; begin to standardize journal titles in CENTRAL; and develop systems and rules for publishing references to ongoing and unpublished studies. We will continue the work on improving CENTRAL begun in the previous six months.
4. Does your Advisory Group have any questions that you would like the Steering Group to answer? If so, please list them.
Does the CCSG have any suggestions for improving the situation with understaffed entities (due to funding issues, as reported by the MRG), or where TSCs work only part-time and do not have enough resources to fix specialized registers? This problem affects the integrity of CENTRAL.
5. Does your Advisory Group wish to raise any problems, and recommended solutions, which you would like the Steering Group to discuss? If so, please list them.
Would any funding be available to assist with new CENTRAL development and refinement in the future? (See pages 3-8.)
6. Do you foresee any problems in keeping within the budget you submitted for the current financial year (April 2004 to March 2005)?
No. The CCAG budget for conference calls is adequate; however, our concerns lie with the CENTRAL budget.
7. What are your budgetary requirements for the period April 2005 to March 2006? Please provide a breakdown if appropriate. (As a reminder, the Steering Group sets the budget for each Group at its non-Colloquium meeting.)
A budget of £2,000 (approximately US $3,000), requested for teleconference calls only, is sufficient for the financial year April 2005 to March 2006.
On January 4, 2005 the USCC submitted a request for
funding to the CCSG specifically for MEDLINE retagging activities that are not
covered by other sources.
Kay Dickersin, Convenor, CCAG
22 February 2005
Development of a New and Improved CENTRAL
A. Introduction
In February 2004, the Cochrane Collaboration Steering Group (CCSG) approved development of a new study-based Cochrane Central Register of Controlled Trials (CENTRAL) and its integration into the new Information Management System. Review groups and others would enter their trial report records online, directly into a database, which in turn would be submitted to the CENTRAL publisher.
The aim of rebuilding CENTRAL would be to develop a regularly updated, clean, non-redundant study-based database of controlled trials, containing all relevant records from the 50+ Cochrane specialized registers, as well as records indexed as CONTROLLED CLINICAL TRIAL [PT] and RANDOMIZED CONTROLLED TRIAL [PT] in MEDLINE. (Note: MEDLINE is the subset of PubMed in which records are indexed using Medical Subject Headings [MeSH].)
In March-April 2004, the US Cochrane Center performed a series of pilot investigations of methods proposed for rebuilding CENTRAL.
B. Methods
B.1 Database
The proposed database would be relational and study-based. We propose using Oracle (9.2 or 10g) for the new CENTRAL repository because of its anticipated size and projected rate of growth. Compared to other options (e.g., Access, FoxPro), Oracle offers the best flexibility and speed, and built-in data integrity mechanisms to ensure current and accurate data. For testing purposes, Oracle 9.2 was used. The server is a Dell PowerEdge 2600 with 136 GB of disk space, 1 GB of RAM, and a single Xeon processor running at 3.06 GHz.
B.2 Populating Database
We anticipate rebuilding and updating CENTRAL by downloading MEDLINE records indexed as RCT [PT] or CCT [PT] and tagged as human, and importing them. PubMed provides scripts (the E-utilities) for this purpose. There are advantages to rebuilding CENTRAL via the PubMed utilities. For one thing, all 62 fields can be downloaded from PubMed's website. Importing into a preliminary table with PMID as the primary key will eliminate all duplicates from PubMed, and any other typographical errors will also be eliminated.
For records that are either not part of PubMed (Embase, etc.) or do not have any ID, we will import those records directly into Oracle. To eliminate duplicates, queries against the PubMed data will be based on the author and title fields.
B.3 Process
The process comprises two major functions: non-PubMed data and PubMed data.1 The non-PubMed data process begins with the extraction of all data from the specialized registers. Importation of the specialized registers is done in multiple phases to ensure no data loss and to eliminate duplicates during the import phase, rather than trying to remove duplicates after an import is complete.
B.3.1 Extraction of records from existing registers
The Cochrane Eyes and Vision Group (CEVG) register was used as the specialized register for a pilot project. The CEVG sent two different versions of their specialized register, and both were used for the pilot. One is the complete register, and the other is the subset of information sent to the US Cochrane Center for CENTRAL. We analyzed data extraction from both files.
A. The initial extract was based on the most current Procite file (the subset of fields in the complete register) sent to the US Cochrane Center for Issue 2, 2004 of The Cochrane Library.
a. Step 1: All 13 fields were extracted, based on the mapping file modified for the CEVG database, into a comma-delimited file with quotation marks indicating text. The extract was further modified in a text editor, using a global search and replace, to insert "\n" in the appropriate locations.
B. The second extraction was based on the Procite database and the complete register, sent to the US Cochrane Center by the CEVG for the pilot. This database included all fields in the CEVG specialized register.
a. Step 1: Determine the extract configuration.
b. Step 2: Select all records and extract them into a delimited text file.
c. Step 3: Modify the extracted text in a generic text editor to allow correct import processing.
________________________
1 Please refer to the process flow diagram.
B.3.1.2 Import of records into MS Access
After extraction from the Procite files, all records were imported into MS Access for preliminary processing.
A. Each table column was converted to the correct size (i.e., data type) to ensure no data were truncated.
B. After import, the maximum length of each column was found and compared to the Procite file. This provided further assurance that no data were lost during the import phase.
C. Determine CEVG register-specific issues. We determined that ~70% of the MEDLINE records had a "19" appended to the beginning of the PubMed ID.1
D. Export all data to a new table within MS Access.
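The register-specific cleanup in item C can be sketched as follows. This is an illustrative Python helper, not part of the pilot tooling; it assumes, as the footnote implies, that stripping the spurious leading "19" from an over-long ID recovers the true PMID, and that genuine PMIDs of the period were at most 8 digits (our assumption).

```python
import re

def clean_pmid(raw_id: str) -> str:
    """Strip a spurious leading "19" from an over-long PubMed ID.

    Hypothetical helper: an ID longer than 8 digits that begins with "19"
    is assumed to carry the appended prefix described in the report.
    """
    digits = re.sub(r"\D", "", raw_id)   # keep the numeric part only
    if len(digits) > 8 and digits.startswith("19"):
        return digits[2:]                # drop the appended "19"
    return digits
```

A real 7- or 8-digit PMID that happens to begin with 19 is left untouched by the length guard.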
B.3.1.3 Export of records from MS Access
Copy the newly created table to a new MS Access database. This step is necessary due to a limitation in the Access-to-Oracle migration utility. For each converted database, all tables are imported directly into Oracle into their own tablespace. To ensure consistency, we wanted to limit the import to only the single table.
The ID field of the table was cleaned after import. Many ID fields contained text as well as numeric values. A new column (dbtype) was created; any text was moved into the new column and removed from the ID field.
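The dbtype cleanup amounts to splitting each mixed ID into its numeric and text parts. A minimal sketch in Python (the pilot did this inside Oracle; the function name and return shape are ours):

```python
def split_id(raw_id: str):
    """Separate a mixed ID into (numeric_id, dbtype).

    Illustrative only: mirrors the cleanup step in which any text found
    in the ID field was moved to a new "dbtype" column, leaving the ID
    field purely numeric.
    """
    digits = "".join(ch for ch in raw_id if ch.isdigit())
    text = "".join(ch for ch in raw_id if not ch.isdigit()).strip(" -_")
    return digits or None, text or None
```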
B.3.1.4 Import of MS Access records into Oracle
Several different alternatives were tested, and we looked at a variety of factors including speed of import, compatibility of software, and the amount of time needed to integrate the process. The alternatives included:
1. Linking tables directly from Oracle to MS Access
2. Creation of SQL*Loader scripts
3. The Access-to-Oracle Migration tool
_____________________
1 We randomly selected 10% of all PubMed IDs that contained extra digits to ensure that, after removing the "19", the correct record was pulled from the PubMed website.
Based on our pilot testing, the MS Access to Oracle Migration Utility provided the most direct and seamless import into Oracle. The import time was negligible, the column widths and data were maintained, Oracle's reporting of the process was complete, and the results were easily repeatable.
Certain versions of Oracle Enterprise Manager (OEM) include utilities to help administer Oracle databases, among them the MS Access to Oracle Migration Utility. The utility completes a number of tasks in order to import the records, including:
1. Automatic creation of a unique tablespace, in the Oracle database of your choice, for the tables to import.
2. Automatic creation of table definitions.
3. Automatic creation of indexes as defined by MS Access.
4. Automatic creation of sequences and triggers.
5. Automatic granting of rights.
B.3.2 Import of PubMed records into Oracle
NLM provides scripts on their website for large data extractions. We modified the scripts to suit our environment and download requirements. Records were extracted from MEDLINE using the search parameters Randomized Controlled Trial and Controlled Clinical Trial as the publication type and Human as the MeSH term. A total of 242,289 records were downloaded in XML format in a batch process that took 51 minutes. The script first produces a list of all PMIDs, which are then extracted in Abstract format. PMIDs were sent in batches of 10,000 per query.
B.3.2.1 Steps involved
1. Modify the script provided by MEDLINE to batch 10,000 IDs, and change the query parameters and output format.
2. Run a crontab job to pull records in off hours (between 9 PM and 5 AM). On the initial test, 242,289 records were pulled in 51 minutes, for a file size of 1.4 GB.
3. Copy the newly created XML file to the Oracle server.
4. Run a script to transform the XML data using the XSL file created for the import process.
5. Import the transformed XML into Oracle.
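The batching in steps 1-2 can be illustrated with NCBI's E-utilities URLs. This is a sketch only; the pilot used NLM's own scripts. The search term shown is our rendering of the report's parameters, and the batch size follows the 10,000-per-query figure above.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Our rendering of the report's search: RCT or CCT publication type,
# limited to humans (used with ESearch to obtain the PMID list).
TERM = ('(randomized controlled trial[pt] OR controlled clinical trial[pt])'
        ' AND humans[mh]')

def efetch_urls(pmids, batch_size=10_000):
    """Yield EFetch URLs for the PMIDs in batches of 10,000.

    Sketch of the batching step only; cron scheduling, retries, and the
    XSL transform of the downloaded XML are omitted.
    """
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        query = urlencode({"db": "pubmed", "retmode": "xml",
                           "id": ",".join(map(str, batch))})
        yield f"{EUTILS}/efetch.fcgi?{query}"
```

A 242,289-record pull at this batch size corresponds to 25 EFetch requests.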
B.3.2.2 Process and Timing Results
The record retrieval process from PubMed takes approximately 2 hours from start to completion. The initial part of the script retrieves the PMIDs for all human (MeSH heading) CCTs or RCTs. The second part of the script extracts the full records in XML format. The article requests are batched in groups of 10,000, per the recommendation of NLM.
A total of 38 files are created during the extraction/download process.
The import of the records into Oracle was completed using a combination of the DOM (Document Object Model) and SAX (Simple API for XML) models to parse the XML files. The SAX parser reads the individual records into memory, and Oracle's DOM parser completes the import into Oracle.
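As an illustration of the SAX half of this arrangement, a minimal Python handler that streams (PMID, ArticleTitle) pairs out of PubMed XML might look as follows. The class and field choices are ours, not the pilot's code, which handed parsed records to Oracle's DOM parser for loading.

```python
import xml.sax

class PubmedHandler(xml.sax.ContentHandler):
    """Minimal SAX handler collecting (PMID, ArticleTitle) pairs.

    A sketch of the streaming read only; nested markup inside titles
    and the remaining fields are ignored for brevity.
    """
    def __init__(self):
        super().__init__()
        self.records, self._tag, self._buf = [], None, []
        self._pmid = self._title = None

    def startElement(self, name, attrs):
        if name in ("PMID", "ArticleTitle"):
            self._tag, self._buf = name, []

    def characters(self, content):
        if self._tag:
            self._buf.append(content)

    def endElement(self, name):
        if name == "PMID" and self._pmid is None:
            self._pmid = "".join(self._buf)   # keep only the citation PMID
        elif name == "ArticleTitle":
            self._title = "".join(self._buf)
        elif name == "PubmedArticle":
            self.records.append((self._pmid, self._title))
            self._pmid = self._title = None
        self._tag = None
```

Feeding a downloaded batch to xml.sax.parseString with this handler yields one tuple per PubmedArticle element.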
Preliminary fields imported into Oracle:
1. Abstract
2. Affiliation
3. Article identifier
4. Author
5. Comment in
6. Comment on
7. Publication date
8. Grant number
9. Issue
10. ISSN
11. NLM - Unique id for journals
12. Language
13. MeSH terms
14. Pagination
15. PubMed unique identifier
16. Publication type
17. Source
18. Title
19. Volume
B.3.2.3 Initial Database Design
The initial database was created in Oracle 9.2. We used a standard database (i.e., not XML) to allow multiple export functions.
It was determined that specialized register records will be stored in a different table structure from PubMed records. This will allow faster processing of records to check for duplicates and to change records from non-PubMed to PubMed style.
The initial database design for PubMed records consists of twelve tables. The primary key was an autonumber generated by Oracle with a prefix of 55, with the sequence starting at 55000000.
Based on testing, the import of the 300,000+ records will take approximately 6 hours.
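The prefix-plus-sequence key scheme can be sketched in a few lines. The 55000000 starting point is from the design above; the 66000000 counter for non-PubMed records is purely an assumed example of a second, non-overlapping range (see also limitation 2 below).

```python
import itertools

# PubMed-derived records: prefix 55, sequence starting at 55000000,
# as in the pilot design. The non-PubMed prefix 66 is an assumption
# for illustration, not taken from the report.
pubmed_ids = itertools.count(55_000_000)
nonpubmed_ids = itertools.count(66_000_000)

def next_id(source: str) -> int:
    """Return the next surrogate key for a record from the given source."""
    return next(pubmed_ids if source == "pubmed" else nonpubmed_ids)
```

Because each source draws from its own starting point, the two ranges can never collide, which is the property the new ID system is meant to guarantee.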
B.3.3 Elimination and importing of non-PubMed records
There is overlap between the specialized registers and the PubMed data. Due to data integrity issues in the specialized register files themselves, all records needed to be compared to the newly downloaded PubMed records.
PL/SQL scripts were created to compare each record (PubMed download vs. specialized register) on author and title. If the author and title matched an imported PubMed record, the register record was skipped. If the record did not match, its ID field was examined: the value of that field, and whether the dbtype field was populated, determined the unique ID of the record. The record was then imported into the table.
PubMed tables:
1. PubMed
2. MeSH heading
3. Chemical list
4. Abstracts
Non-PubMed tables:
1. Non-PubMed
2. MeSH headings
3. Abstracts
B.3.4 Handsearching Database
B.3.4.1 Data Extraction
The July 2004 handsearching database (Procite) is being investigated to determine the best way to remove any duplicates and apply consistent characteristics to each of the datatypes. The handsearching records (108,719) were all exported using the three workforms associated with the database. Each extraction was then imported into MS Access. Each column was predefined to provide ample space for large columns (abstracts); the import function in MS Access allows the user to place the contents into a new or custom table. We chose to import:
1. Author
2. Title
3. Medium
4. Journal title
5. Date of publication
6. Volume
7. Issue
8. Location in work (pagination)
9. Author role
10. Call number
11. Location/URL
12. Abstract
13. Keywords
14. CODEN
15. Availability
16. Notes
17. Author, monographic
18. Proceedings title
19. ISBN
20. Place of meeting
21. Place of publication
22. Editor
B.3.4.2 Cleanup and Combining of data
Each field of the handsearching database was mapped to the initial PubMed database design. Additional tables/fields will need to be added to accommodate the conference proceedings records.
Since there is overlap between the handsearching database and the PubMed records, all records that contain an ID will be compared to the PubMed import. If there is a numeric match, a comparison of the author and title will be completed; if there is an exact match, the record from the handsearching database will be deleted. An additional author and title comparison will be completed for all records. All non-duplicates will be imported in the same format as the PubMed records, with a different unique ID (based on an algorithm).
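The two-stage test described above (a numeric ID match confirmed on author and title, followed by a blanket author/title comparison) might be sketched like this. The function name and dict-based index are illustrative, not the production PL/SQL.

```python
def is_duplicate(hand_rec, pubmed_index):
    """Two-stage duplicate test for a handsearch record against PubMed.

    hand_rec: dict with "id", "author", "title" keys (illustrative shape).
    pubmed_index: dict mapping PMID -> (author, title).
    """
    norm = lambda s: " ".join((s or "").lower().split())
    pmid = hand_rec.get("id")
    if pmid and pmid in pubmed_index:        # stage 1: numeric ID match
        a, t = pubmed_index[pmid]
        if norm(a) == norm(hand_rec["author"]) and \
           norm(t) == norm(hand_rec["title"]):
            return True                      # stage 2 confirmed: delete
    # fallback: author + title comparison against all PubMed records
    pair = (norm(hand_rec["author"]), norm(hand_rec["title"]))
    return any((norm(a), norm(t)) == pair for a, t in pubmed_index.values())
```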
C. Known Limitations of the Pilot
As with any pilot, not all possible issues can be addressed. We recognize that certain limitations exist, but do not believe that any are insurmountable or will change the overall process.
1. The pilot was based on two existing Procite files from the CEVG. The main issue is determining the proper export format to ensure that all data are exported with the associated record properly. Each group within the Cochrane Collaboration may use a different software package (e.g., MeerKat, Reference Manager, Procite, EndNote). From our preliminary findings with the Procite files, and with a basic understanding of MeerKat, we are confident that all existing records from each specialized register can be exported completely and converted into a new database.
2. New ID system required - We cannot base our ID numbering system on the IDs granted by PubMed or any other database. A new ID numbering system will need to be designed to ensure non-duplication of IDs and to assure administrators that data are being stored properly. We have investigated and propose using a unique starting numeric sequence depending on whether the information is PubMed or non-PubMed. These IDs will not overlap. We also anticipate records that have no ID, which we will need to review to see whether they are already included in a bibliographic source (e.g., PubMed) and can be downloaded.