INDEX
Section | Heading | Para
A | INTRODUCTION |
  | General. | 1
  | The Combined Oral Contraceptive. | 6
  | The Regulatory History. | 11
  | The Issues in the Litigation. | 20
B | THE APPROACH TO DECIDING THE FIRST ISSUE |
  | Cohort Studies. | 26
  | Case Control Studies. | 27
  | Database Studies. | 29
  | Expert Evidence | 33
  | Point Estimates and Confidence Intervals. | 36
  | Aggregation of COC Products. | 45
C | THE WHO STUDY |
  | An a Priori Hypothesis? | 59
  | All Centres or Oxford? | 64
  | Hospital or GP Controls? | 68
  | Conclusion. | 78
D | THE TNS: THE FIRST TWO STUDIES |
  | The Origins of the Study. | 81
  | The Progress of the Study. | 84
  | TNS1 | 91
  | TNS2 | 95
  | The Mercilon Anomaly. | 98
  | Duration of Use. | 106
  | Suissa's Splines. | 115
E | TNS3 AND THE COX REGRESSION ANALYSIS | 121
  | The Pill Calendar Data. | 127
  | The Attack on Cox by Walker. | 133
  | MacRae's Response. | 142
  | Walker's Rebuttal. | 148
  | MacRae's Reply. | 152
  | Walker's Separate Algebraic Attack. | 158
  | Conclusion on Cox. | 159
F | THE JICK v FARMER DEBATE | 164
  | The UK GPRD. | 165
  | The Methods and Findings of the Studies. | 171
  | The Development of the Issues. | 185
  | Jick V: The Attack on Farmer's Controls. | 194
  | Conclusions. | 209
G | THE OTHER STUDIES |
  | Leiden 1995. | 225
  | Herings. | 231
  | Parkin. | 234
  | Lidegaard. | 238
  | UK Meditel. | 240
  | German Mediplus. | 243
  | UK Mediplus. | 244
  | Wyeth-Ayerst. | 248
  | Farmer 2000 "Pill Scare". | 255
H | BIAS AND CONFOUNDING | 258
  | Prescriber Bias. | 262
  | Diagnostic/Referral Bias. | 279
  | Conclusions on Prescriber, Diagnostic and Referral Bias. | 286
  | Hidden Bias and Confounding. | 287
  | Conclusions on Bias and Confounding. | 288
  | Industry Funding Bias. | 298
I | CAUSALITY |
  | The Bradford Hill Criteria. | 302
  | Haematology. | 303
J | OVERVIEW OF THE STUDIES | 309
K | CONCLUSIONS ON THE FIRST ISSUE | 339
Mr Justice Mackay:
SECTION A: INTRODUCTION
- General.
This is the trial of seven lead claims in group litigation against three drug companies in respect of their products. There were at the last count 99 claims currently in being. 40 are brought against Schering Health Care Ltd (Schering), 46 against Organon Laboratories Ltd (Organon) and 13 against John Wyeth and Brother Ltd (Wyeth). All the Claimants took on prescription different brands of the Combined Oral Contraceptive (COC) and say they have suffered various cardio-vascular injuries as a result. Their injuries come under the collective description of Venous-thromboembolism (VTE). The commonest forms of this are Deep Vein Thrombosis (DVT) and Pulmonary Embolism (PE). Some claimants have suffered Cerebral Venous Thrombosis (CVT) and some have suffered from strokes.
- All claimants say that the products they took were defective under the provisions of the Consumer Protection Act 1987 and/or the Product Liability Directive 85/374/EEC 25th July 1985. I set out below the seven lead claimants and the conditions from which they suffer. The distinction between the different products which each claimant took will have to be considered in detail below.
- These claims follow one of the biggest pill scares which has occurred in the history of oral contraception in this country; "pill scare" is a phrase used as a convenient shorthand by commentators on this issue which I will adopt, but its use is in no way to be taken as minimising the seriousness of the questions this event raised. On the 18th October 1995 the UK Committee on Safety of Medicines (CSM) wrote a "Dear Doctor" letter to all relevant prescribers stating that three unpublished studies into the safety of COCs in relation to VTE had indicated "around a two-fold increase in the risk" of such conditions as against the preceding generation of COCs. The Claimants say that a proper consideration of those studies in their published form and of other studies linked to them supports their claim. The Defendants say that there is no increased risk associated with their products, that the warning from the CSM was misjudged and that it should never have been given. The CSM's letter engendered a heated debate among those practising in this field.
- The lead Claimants in this group litigation with very brief details of their claims are as follows:
(i) Carol Townsend. She was born on the 28th December 1970 and was first prescribed Schering's COC Femodene by her General Practitioner in November 1991. She was then just under 20. On the 20th August 1993 she suffered a DVT. She has continued to suffer leg pain and loss of mobility and has needed help in the home from her sister.
(ii) Debra Jones was born on 23rd August 1970 and was first prescribed Femodene in August 1991 when she was 21. She suffered a CVT on the 4th December 1994 when she was just over 24. She has suffered from very severe headaches, despite repeated lumbar punctures, with nausea and giddiness. Her working ability and social activities are both restricted.
(iii) Andrea Massey was born on the 24th September 1976 and was first prescribed Femodene in June 1995 when she was 18¾. On the 23rd July 1995 she suffered a stroke as a result of a paradoxical embolism a very short time after starting on the product. I do not as yet have full details of her current symptoms.
(iv) Karen Roberts was born on the 9th June 1962 and was first prescribed Femodene in January 1995 when she was 32. On the 8th August 1995 she suffered a DVT seven months after that first prescription. She has a swollen left leg, her walking is limited and she has an increased risk of recurrence of VTE.
(v) Jacqueline Diplock-Webb was born on the 2nd January 1958 and was prescribed Organon's product Marvelon in March 1993 having also been prescribed it in 1983 and 1985. On 29th August 1993 when she was just over 35½ she suffered a DVT. She has a constant ache in her leg made worse by walking and climbing stairs.
(vi) Nicola Moores was born on 14th February 1967 and was prescribed Organon's product Mercilon in September 1990 when she was 23. In November 1993 and again in August 1995 she suffered episodes of PE she then being 26 and 28. She has had episodes of chest pain and dyspnoea. She has suffered a severe loss of confidence and is worried by the thought of a recurrence of PE.
(vii) Ellen Silcock was born on the 16th December 1977 and was prescribed Wyeth's product Minulet in July 1993 when she was aged 15. On the 4th October 1995 when just under 18 she suffered a PE. She had similar symptoms. She is anxious and has sleep disturbance. She lost her job and cannot play sports.
- All these women, as will be seen, have suffered very significant health problems as a result of their conditions. They have been selected as broadly representative of the range of injuries said to have been caused by the defective nature of the products at the heart of this case. Several of the non-lead cases are more serious in degree and include some fatal cases.
- The Combined Oral Contraceptive.
Oral contraceptives began to be prescribed for women in the United Kingdom in the early 1960s. "The Pill" as it came commonly, almost affectionately, to be called rapidly achieved a significant share of the contraceptive market. Few if any pharmaceutical products have made such an impact on the lives of the persons for whom they were prescribed and the society in which they lived. It is no part of this judgment's task to analyse let alone evaluate the impact of the Pill on modern life. This much, however, has to be observed from the outset. As a pharmaceutical product it is almost unique in being prescribed for and taken by women who are, by and large, healthy and who choose to take it for the purpose of regulating their own fertility. There exist alternatives to oral contraception. Almost from the outset it was recognised as carrying risks accompanying the benefits it brought to those who took it. There has been a more or less continuous debate for 40 years in medical circles as to the nature and degree of those risks.
- Combined Oral Contraceptives, as their name suggests, contain in combination two distinct constituents both of which are synthetic hormones. Over the years the nature and dose of these constituent parts have both changed. The first constituent is a synthetic oestrogen and in almost all COCs that now takes the form of ethinyloestradiol (EE). The second part of the combination is a synthetic progestogen. This litigation is concerned with two. Desogestrel (DSG) was developed by Organon and introduced into the UK market in 1982 in a COC under the brand name Marvelon which contained 150µg of DSG in combination with 30µg of EE. In 1989 Organon introduced Mercilon which contained an identical dose of DSG but a smaller dose (20µg) of EE. The second synthetic progestogen involved in these claims is gestodene (GSD). Schering introduced this into the market in 1987 in the UK under the name Femodene (known in Germany as Femovan) which contained 75µg of GSD and 30µg of EE. An identical product was licensed to Wyeth who marketed it under the name Minulet in 1988 and, in a triphasic version called Tri-Minulet, in 1992. All bar Tri-Minulet were monophasic COCs, that is to say the dose remained constant throughout the pill-taking weeks of the cycle; in the triphasic version the dose varied. In this case no significance attaches to this distinction, or to the fact that the GSD and DSG based COCs contained different doses of their respective progestogens.
- It is important to bear in mind from the outset that DSG and GSD are chemically distinct products, albeit designed to achieve the same biological result. Despite this they have been frequently referred to in the literature collectively as "third generation" progestogens and the COCs of which they were constituent parts have been referred to as "third generation COCs". I shall deal below with the extent to which it is legitimate to lump together or aggregate these products. I shall use the abbreviation "COC3" to stand for "third generation COCs" by which I mean "combined oral contraceptives with an EE content equal to or less than 30µg and with as a synthetic progestogen either DSG or GSD".
- For the purposes of this case it is necessary to understand only this much about the action of a COC. During pregnancy a woman's body produces high levels of oestrogen and progesterone. This has the effect of suppressing ovulation. A COC is taken daily for 21 days of the menstrual cycle followed by a seven day break. When a woman takes a COC the plasma levels of oestrogen and progestogen in her body are increased by the synthetic hormones in the COC which mimics the natural process that occurs during pregnancy. Thus her body is in effect deceived in such a way as to prevent the maturation and subsequent release of an egg. Additionally the synthetic progestogen acts in two other ways which inhibit conception. A thickened mucus is produced which forms a physical barrier at the entrance to the uterus so preventing the passage of sperm. The progestogen also modifies the natural development of the endometrium (the lining of the womb) which otherwise occurs naturally in the course of the menstrual cycle when the effects of natural oestrogen are unopposed. This thickening of the endometrium is necessary to permit the implantation of a fertilised egg. The presence of progestogen throughout the cycle prevents this development. In this way the COC acts to form a virtually 100% effective method of contraception. Its introduction, as I have said, allowed women complete control and autonomy over their fertility and opened the door to a society in which, theoretically at least, every pregnancy could be both planned and wanted.
- In the early 1960s the so-called first generation of COCs all contained high doses (150µg and sometimes more) of EE in combination with various progestogens, principally norethisterone (NRT). Studies in the 1960s and early 1970s had detected an increased risk of VTE arising from COC use and the finger of suspicion pointed at the EE component. The result of this was that EE doses were reduced dramatically over the years. COCs with an EE dose of the order of 50µg or less and one or other of three progestogens namely NRT, levonorgestrel (LNG) and norgestimate (NRG) have in the literature been called "second generation COCs". Again this definition has to be watched with caution. COCs with NRG have commuted from time to time between the second and third generation, since it effectively metabolises to LNG, but for the purposes of this judgment I will place NRG in the second generation except where otherwise stated. I will therefore use the abbreviation "COC2" to refer to "a COC with an EE dose of no more than 50µg and a progestogen comprising NRT, LNG or NRG". Where it is necessary to refer to a particular COC2 I will do so simply by using the initials of its progestogen content.
- The Regulatory History.
On the 18th October 1995 the CSM circulated a warning to all prescribers of COCs. This was a key event in the history of oral contraception in this country. The letter was addressed to doctors and pharmacists and was headed "Combined Oral Contraceptives and Thromboembolism". In its relevant parts it read as follows:-
"The Committee on Safety of Medicines has recently become aware of the results of three (as yet unpublished) epidemiological studies on the safety of oral contraceptives in relation to [VTE]
The three new studies all indicate that combined oral contraceptives containing [DSG] and [GSD] are associated with around a twofold increase in the risk of [VTE], compared with those containing other progestogens
in the light of this new evidence and in consultation with experts in family planning the Committee advises the following:-
(1) The recent evidence does not suggest any new additional risks with oral contraceptives containing levonorgestrel norethisterone or ethynodiol. Women taking these oral contraceptives should be reassured that there is no need for them to change their pill.
(2) Women taking oral contraceptives that include [DSG] or [GSD] should be strongly urged to complete their current cycle. The risks associated with unwanted pregnancies are far greater than the risks of continuing these pills.
(3) [COCs] containing [GSD] or [DSG] should not be used by women with risk factors for [VTE] including obesity, varicose veins or a previous history of thrombosis from any cause.
(4) [COCs] containing [GSD] or [DSG] should only be used by women who are:
- Intolerant of other [COCs], and
- Prepared to accept an increased risk of thromboembolism"
The chairman of the Committee concluded with these words:
"I thought that it was important that you should be made aware of this matter at the earliest opportunity. However I am acutely aware that this new information will worry women and impose a substantial burden on doctors and pharmacists".
- The Claimants say that this action by the CSM was timely, responsible and justified by the full studies in which the unpublished material referred to in the Chairman's letter was subsequently reported, as well as by later studies. Their case is that what is stated in the letter has grown into mainstream opinion in this area of medicine. The warnings that the CSM gave in 1995 are those which they say should have been and still should be attached to COC3s.
- The Defendants say this letter was precipitate and ill considered and that it has led to what some have called a public health disaster. One of the experts who gave evidence to me was on the sub-committee which had advised the CSM, and was telephoned by a colleague, whom he respected, and accused of causing havoc in women's lives. As predicted by the Chairman grave concern was caused among users of COCs, and there was a flight from COC3s and from COCs generally. I will look below at the way the regulatory history of the matter developed, since the Claimants set great store by it and argue that it shows that the Defendants' collective stance, that there is no increased risk or if there is it is well below the levels suggested by the regulators, is frankly eccentric and not endorsed by the responsible expert members of the pharmaco-vigilance community.
- Whatever the judgment on the CSM's letter, its immediate consequences are clear. In 1997 the Office for National Statistics estimated that in the eight months alone following the pronouncement there were 30,000 conceptions in the United Kingdom beyond the number that might have been projected from trends in place before the announcement. 10,000 of these were terminated by abortion. There has also been acute, sometimes acrimonious controversy ever since among epidemiologists as to whether the "twofold increase in the risk" is borne out by the evidence.
- The equivalent European body, the CPMP, a Committee which advises the European Agency for the Evaluation of Medicinal Products, took a different approach from the CSM. On 27th October 1996 in a statement it noted the position and asked the three companies to provide more data. It said it was not withdrawing COC3s on the basis of the evidence available. After consideration of certain material submitted to it, it reached a "preliminary conclusion" on 3rd December 1996 that users of COC3s should be warned that there was an excess absolute risk of about 10 cases per 100,000 woman years as against LNG. In view of this, it said, only women without recognised risk factors for VTE and who might be intolerant of other COCs and are prepared to accept an increased risk of VTE should take them.
- In the UK the companies continued to make their representations. On 20th January 1999 the Licensing Authority of the Medicines Control Agency (MCA) accepted the advice of the Medicines Commission that the Summary of Product Characteristics (SPCs), previously called data sheets, for COC3s should be amended. The Commission said that it had considered the companies' information and arguments, and that these had not changed the earlier view of the CSM that there was an increased relative risk "of about 1.7-1.8 after adjustment", which was "not fully explained by bias or confounding" [emphasis added]. In the light of this new warnings were required to appear on the SPCs.
- These new warnings, summarised, said that an increased risk of VTE associated with COCs generally was well established but was smaller than that associated with pregnancy (60 cases per 100,000 pregnancies). In healthy non-pregnant women who were not taking any COC it was about 5 cases per 100,000 woman years, in those taking COC2s about 15, and for COC3s about 25. All these risks increase with age, and other known VTE risk factors such as obesity. The Commission took this step, as it put it,
"since the re-analyses and new studies did not provide convincing evidence that the differences found ... could be explained fully by bias or confounding".
It pointed out that the risk was small and that so long as prescribers and patients had access to full and understandable information there was no justification for restricting COC3s to those intolerant of COC2s, i.e. so-called "second line" prescription only. This is the UK regulatory authorities' final position as of today.
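The arithmetic implied by the figures in the amended warnings can be set out in a short sketch. It uses only the three approximate rates quoted above (5, 15 and 25 cases per 100,000 woman years); the variable names are illustrative assumptions and the ratio is a crude calculation, not an adjusted estimate from any study.

```python
# Approximate VTE rates per 100,000 woman years, as quoted in the warnings.
rates = {"non-users": 5, "COC2": 15, "COC3": 25}

# Crude ratio of the COC3 rate to the COC2 rate.
ratio_coc3_vs_coc2 = rates["COC3"] / rates["COC2"]

# Excess absolute risk of COC3 over COC2, per 100,000 woman years.
excess_coc3_vs_coc2 = rates["COC3"] - rates["COC2"]

print(f"COC3 v COC2 rate ratio: {ratio_coc3_vs_coc2:.2f}")
print(f"Excess cases per 100,000 woman years: {excess_coc3_vs_coc2}")
print(f"Ratio exceeds 2: {ratio_coc3_vs_coc2 > 2}")
```

On these rounded figures the ratio falls between one and a half and two, which is consistent with the range of about 1.7 to 1.8 that the Commission itself cited.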
- To return to Europe and complete the picture, the CPMP completed its assessment on 28th September 2001. Its view was that COC2 risk was about 20 per 100,000 women years, and that the "best estimate of the magnitude of the increased risk [of COC3 v. 2] is in the range of 1.5-2.0". They thought it was highest in the first year of use and that this information should be taken into account by first time users. This view was incorporated in proposed changes to the SPCs for the relevant products and the European equivalent of a "Dear Doctor" letter. The CPMP's view was that there was no reason for a woman using any brand of COC to stop taking it on the basis of its findings.
- I must stress at the outset that this action is in no sense a review of or appeal against the CSM's letter of 18th October 1995 or the MCA's decision of 20th January 1999. Each body has a function to discharge which is entirely different from mine in this litigation. Regulators have the difficult task of deciding at what point in the apparent accumulation of evidence it is prudent to act as though a given association were causal rather than to continue to assume it is not. The point at which that decision forms and crystallises depends on many complex factors, medical and social, including such balancing considerations as the relative consequences of the alternatives of doing something and doing nothing. My task is a simpler one in this sense. I have to decide on the balance of probabilities on the evidence presented before me whether each of these Claimants has established her case that these products were defective, that the defective nature of them caused or contributed to her injury, and therefore that she is entitled to damages as a result. I note with respect and interest the conclusions of the regulators who looked into this matter, but am in no way bound by them, nor indeed am I strongly influenced by them in the end. If I were to adopt their conclusions, at least if I were to do so literally, the Claimants would lose this litigation in any event, for a reason with which I will shortly deal.
- The issues in the litigation.
The following issues have been identified as those requiring decision in the case, and in the order in which I set them out.
(1) Have the Claimants proved that COC3s, alternatively DSG/GSD, carry a true excess risk of VTE which is more than twice that carried by COC2s containing LNG? There is agreement that this is the essence of the issue, though the parties do not agree how the drugs at issue should be categorised for this purpose. But all agree that if the Claimants fail to prove this the action should go no further as it could not succeed.
(2) If Yes, are the relevant products defective within the meaning of section 3 of the Consumer Protection Act 1987, i.e. was their safety "not such as persons generally [were] entitled to expect", which includes consideration of any instructions or warnings associated with the product? It is agreed that if the Claimants succeed on the first issue they must also succeed on the second. Realistically the Defendants accept that if the true risk of VTE was more than doubled, even if the overall risk was very low in absolute terms, women and their prescribers were entitled to be told of this before making their decisions or giving advice, and they were not.
(3) If Yes, have the Defendants proved, the onus being on them to do so, that they are entitled to the benefit of the Development Risks Defence under section 4 of the same Act? This applies where the:
"state of scientific and technical knowledge at the relevant time was not such that a producer of products might be expected to have discovered the defect".
This will turn to a great extent on the true extent and limits of that defence, and the key question of whether it is restricted to what was capable of being discovered about the defect according to the state of technical knowledge prevailing at the time, or does it extend to what the Defendants might reasonably have discovered.
(4) If No, would the Claimants have been prescribed COC3s but for the defect? This issue includes consideration of the potential benefits in respect of matters other than VTE which they either brought or were perceived by prescribers to bring, relating to such questions as reduced androgenicity and better arterial cardiovascular protection.
(5) If Yes, in the case of each Claimant viewed separately, did her exposure to a COC3 cause or contribute to the adverse event which she suffered and but for the defect would she have avoided the injury of which she now complains?
- The reason why the Claimants accept through Lord Brennan QC that this first issue is capable of disposing of the claims should be set out. It is not because an increase of less than 2 would fail to render the product defective within the meaning of the Act, though the Defendants would so argue if they had to. It is for reasons of causation that he accepts this burden, correctly in my view. If factor X increases the risk of condition Y by more than 2 when compared with factor Z it can then be said, of a group of say 100 with both exposure to factor X and the condition, that as a matter of probability more than 50 would not have suffered Y without being exposed to X. If medical science cannot identify the members of the group who would and who would not have suffered Y, it can nevertheless be said of each member that she was more likely than not to have avoided Y had she not been exposed to X. There is a statistical formula which expresses this concept, but in this case I intend to resort to words where I can, both from a preference for language over symbols and because I am conscious that this judgment should be accessible, above all to the women who bring these claims.
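The judgment does not set out the formula it alludes to. One common way of expressing the reasoning, assumed here for illustration only, is the attributable fraction among the exposed, (RR - 1) / RR: the share of cases in the exposed group that would not have occurred without the exposure. The sketch below shows that this share passes one half only when the relative risk exceeds 2. The function name is hypothetical.

```python
def attributable_fraction(relative_risk: float) -> float:
    """Share of exposed cases attributable to the exposure: (RR - 1) / RR.

    Assumes the relative risk is a true, unbiased figure; a value of 1 or
    less means no excess cases, so the share is floored at zero.
    """
    if relative_risk <= 0:
        raise ValueError("relative risk must be positive")
    return max(0.0, (relative_risk - 1.0) / relative_risk)


# Out of a group of 100 exposed women with the condition, how many would,
# as a matter of probability, have avoided it without the exposure?
for rr in (1.5, 1.9, 2.0, 2.2, 3.0):
    share = attributable_fraction(rr)
    print(f"RR {rr}: about {share * 100:.0f} in 100; "
          f"more likely than not: {share > 0.5}")
```

At a relative risk of exactly 2 the share is one half, which is why the threshold is framed as a risk "more than" doubled.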
- The trial of the first issue occupied some 42 days, including submissions in relation to it. The trial of the remaining issues was estimated to last about as long again and to add a seven figure sum to the total costs of the litigation. In the circumstances I thought it right to give this judgment on the first issue without hearing the remainder of the evidence.
SECTION B: THE APPROACH TO DECIDING THE FIRST ISSUE
- Epidemiology.
As the evidence relied on by the Claimants and Defendants in this first issue in the case is almost entirely epidemiological in its nature it is necessary before going to the studies to state in simple terms the principles of epidemiology which apply.
- Epidemiology is the study of the occurrence and distribution of events (such as disease) over human populations. It seeks to determine whether statistical associations between these events and supposed determinants can be demonstrated. Whether those associations if proved demonstrate an underlying biological causal relationship is a further and different question from the question of statistical association on which epidemiology is initially engaged.
- The best way to prove whether exposure to a drug can cause an outcome is to conduct an experimental study in which, on a double-blind basis, one group of patients is given the drug and another group a harmless substitute. When these two groups are followed their reactions can be noted. For reasons which are all too obvious in the case of the COC that type of study would be impossible. It is therefore necessary for epidemiology to resort to observational studies of which three kinds have featured in this case.
- Cohort Studies.
In these a group of people is studied over a period of time. The experience of the members of the group who have the relevant exposure (here taking a COC3) is compared with that of those without such exposure and the incidence of the condition in question (here VTE) of the two groups is compared. Based on the proportion in which the two groups suffer the condition in question the relative risk (RR) of the one group against the other can be calculated. These studies are often the method of choice for observational epidemiological studies. They are however not well-suited to studies of the incidence of an outcome which is itself a rare condition since it will be necessary to recruit very large numbers of subjects in order to acquire a sufficient number of events to achieve statistical significance. They can also take a long time.
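The calculation described above can be sketched as follows. The counts are hypothetical, chosen only to show the arithmetic, and the function name is an assumption rather than anything drawn from the studies.

```python
def relative_risk(exposed_cases: int, exposed_total: int,
                  unexposed_cases: int, unexposed_total: int) -> float:
    """Ratio of the incidence in the exposed group to that in the unexposed group."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


# Hypothetical cohort: 30 events among 100,000 exposed women and
# 15 events among 100,000 unexposed women.
rr = relative_risk(30, 100_000, 15, 100_000)
print(f"Relative risk: {rr:.2f}")
```

The example also shows why rarity is a difficulty: even with 200,000 subjects, the comparison rests on only 45 events.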
- Case-Control Studies.
These studies are commonly used where the outcome under scrutiny is rare. The epidemiologist finds a number of "cases" who are people who have suffered the condition under examination. At the same time he recruits a larger number of "controls" who have not. These will usually be up to three or four times as many as the cases. Relevant data is then obtained about the exposure of both groups together with other risk factors. A comparison is then done between the proportion of exposed and non-exposed persons in the cases and the same proportion in the controls. From that is calculated the odds ratio (OR). For the purposes of this case there is little distinction between an OR and a RR and the two are often used, as I will use them, interchangeably albeit inaccurately. A case-control study is not capable of determining the absolute risk of the event in question since the incidence of the event in the population cannot be derived from it. It is therefore not capable of yielding the relative risk, unlike the Cohort Study.
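The comparison of proportions described above is conventionally computed as a cross-product of the four cells of a two-by-two table. The sketch below uses hypothetical counts with four controls per case; the function name and figures are illustrative assumptions.

```python
def odds_ratio(exposed_cases: int, unexposed_cases: int,
               exposed_controls: int, unexposed_controls: int) -> float:
    """Odds of exposure among cases divided by odds of exposure among controls."""
    odds_in_cases = exposed_cases / unexposed_cases
    odds_in_controls = exposed_controls / unexposed_controls
    return odds_in_cases / odds_in_controls


# Hypothetical study: 100 cases (60 exposed, 40 not) and
# 400 controls (150 exposed, 250 not).
or_estimate = odds_ratio(60, 40, 150, 250)
print(f"Odds ratio: {or_estimate:.2f}")
```

Because the controls are sampled rather than counted out of a whole population, nothing in these four numbers reveals how common the condition is, which is why no absolute risk can be derived from them.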
- It is of great importance in a case-control study that the cases and controls be as similar to each other as possible in all respects other than the fact that the cases will have and the controls will not have the condition in question. Put more accurately, the controls must be as representative as possible not of the cases but of the population at risk from which the cases have come. The task of the epidemiologist is to achieve this as closely as he or she can. Case-control studies can be carried out on the data obtained within a cohort study or a database study in which case they are called "nested" case-control studies. To ensure that the sample taken by such a study is representative and not skewed by confounding factors or bias (which will be discussed in more detail below) steps are usually taken by matching the cases and controls carefully and/or by adjusting the data in the process of analysis. But bias and confounding remain the case-control study's greatest enemy and their potential for distortion has been a constant feature throughout this case.
- Database Studies.
A database such as the General Practice Research Database (GPRD) in the United Kingdom, which will be encountered in more detail below, constitutes a very large source of medical information on a large number of patients. Epidemiologists who use it are spared the need to recruit participants in a field study, and database studies are thus quicker and less costly. They can make use of the database to look back on data already accrued over a long period of time. These data can then be used to perform what amounts to a retrospective cohort study yielding a calculated RR or a nested case-control study yielding a calculated OR.
- Whatever the type of study, what will emerge from it is a "point estimate" or a single figure representing the RR or OR which has resulted from an application of the recognised formula. Epidemiologists however conventionally seek to indicate the reliance that can be placed upon that figure by determining 95% confidence limits or intervals (CI) around it. The spread of these will depend on the numbers involved in the study. They constitute a range within which the reader can be reasonably confident that the result is unlikely to have arisen as a result of the "luck of the draw" or sampling error. Generally the higher the numbers in the study the closer together or "tighter" the confidence limits will be. The CI means that the observed difference would occur by the operation of chance less than 5 times out of 100. The result is described as "statistically significant" where the CI does not include 1.0.
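How the limits tighten as numbers grow can be illustrated with a short sketch. It assumes one standard textbook method (the logarithmic, or Woolf, approximation for an odds ratio), which is not necessarily the method used in any study before the court; the counts and the function name are hypothetical.

```python
import math


def odds_ratio_with_ci(exposed_cases: int, unexposed_cases: int,
                       exposed_controls: int, unexposed_controls: int,
                       z: float = 1.96) -> tuple[float, float, float]:
    """Point estimate and approximate 95% limits for an odds ratio.

    Uses the log-scale approximation: the standard error of ln(OR) is the
    square root of the sum of the reciprocals of the four cell counts.
    """
    estimate = (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)
    std_error = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                          + 1 / exposed_controls + 1 / unexposed_controls)
    lower = math.exp(math.log(estimate) - z * std_error)
    upper = math.exp(math.log(estimate) + z * std_error)
    return estimate, lower, upper


# The same hypothetical proportions at two study sizes.
for scale in (1, 10):
    est, lo, hi = odds_ratio_with_ci(60 * scale, 40 * scale, 150 * scale, 250 * scale)
    significant = lo > 1.0 or hi < 1.0  # interval excludes 1.0
    print(f"n x{scale}: OR {est:.2f}, 95% CI {lo:.2f} to {hi:.2f}, "
          f"statistically significant: {significant}")
```

The point estimate is identical in both runs; only the width of the interval changes with the size of the study.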
- By using these tools epidemiology has been in the van of developing medical knowledge. For more than 40 years mainstream medical and non-medical opinion has, for example, accepted the association between cigarette smoking and lung cancer, or light asbestos exposure and mesothelioma, even though the mechanisms by which these conditions occur at a microbiological level are not even now fully understood. Many other examples of this type could be cited. The condition under consideration in this case is on the face of it classically suited for an epidemiological investigation. Such are the complexities of the processes which form the delicate state of balance in the blood between coagulating and anti-coagulating factors, all acting in tension or competition with each other, that haematology is very far from reaching a full understanding of what causes blood clots to form in the venous system. Pending fuller biological understanding being reached, epidemiology should be able to light the way to an understanding of factors or exposures which may cause such conditions to occur.
- But controversy has raged in this case, where put at their highest the statistical associations relied on by the Claimants are at a low level, as to the reliance to be placed on the numbers which emerge from the studies, and particularly whether they can be translated into or viewed as proof of biologically determined causation. This argument focuses on the inherent risks or pitfalls of epidemiology as a discipline. The Claimants say that the most realistic overall view of the studies indicates an increased relative risk of something a little more than 2. According to their main witness Professor Alexander M. Walker, who attempted to assess or derive a single figure from a selection of the studies now available, the true relative risk is 2.2. Their second witness Dr Margaret Thorogood puts it at 2.1. Professor Klim McPherson puts it in various ways, but his overall view of all the relevant studies gives a figure of 1.9 or so. So it will be seen that the Claimants have at the outset a perilously low margin of error given their acceptance of the level of the threshold over which they have to climb, namely a "true" relative risk figure measurably in excess of 2. The Defendants say that the right statistical analysis is that there is no differential at all, or at most one well under 2; but, if they are wrong on this, they say that in all studies where low level associations are shown great circumspection has to be shown before construing the figure as a "true" one evidencing an underlying biological causal relationship, because of the possible presence of confounding factors and/or bias. These will have to be considered in the context of the individual studies considered below.
- Expert Evidence.
I have heard from ten epidemiological experts in this case and they remain as they have been for the last several years, largely irreconcilable in their differences. Such were those differences that before trial I was persuaded, against my initial judgement, by all the very experienced counsel in this case that no purpose would be served by requiring them to meet to resolve or narrow the issues. Having heard and seen them I can now understand why that was. The debate between them has been unyielding, at times almost rancorous in tone, and with a few honourable exceptions with which I will deal as I go through the issues, devoid of willingness to countenance that there may be two sides to the question. So, science has failed to give women clear advice spoken with one voice. I cannot simply accept the Claimants' categorisation of the Defendants' witnesses as eccentric mercenaries isolated from the rest of scientific opinion on this issue. From the literature I have read there are others who seem to share their views.
- What is the role of the Judge in such a dispute? First he must select the issues which appear to matter; I cannot decide every dispute between experts in this case or between counsel in their extensive final submissions. Secondly he cannot transform himself into some form of super-scientist with access to a level of expertise superior to those who have given the evidence. Rather his role, and my role here, is to evaluate the witnesses and decide after 42 days of evidence and submissions, undoubtedly a more extensive debate on this topic than has ever been carried out to date, which parts of the evidence are sound and reliable and which are not. I rely in this context on the important advice of Stuart-Smith LJ in Loveday v. Renton and another [1990] 1 MLR 117 at 125 where he said:-
"The court has to evaluate the witness and the soundness of his opinion. Most importantly this involves an examination of the reasons given for his opinions and the extent to which they are supported by the evidence. The judge also has to decide what weight to attach to a witness's opinion by examining the internal consistency and logic of his evidence; the care with which he has considered the subject and presented his evidence; his precision and accuracy of thought as demonstrated by his answers; how he responds to searching and informed cross-examination and in particular the extent to which a witness faces up to and accepts the logic of a proposition put in cross-examination or is prepared to concede points that are seen to be correct; the extent to which a witness has conceived an opinion and is reluctant to re-examine it in the light of later evidence, or demonstrates a flexibility of mind which may involve changing or modifying opinions previously held; whether or not a witness is biased or lacks independence ... There is one further aspect of a witness's evidence that is often important; that is his demeanour in the witness box. As in most cases where the court is evaluating expert evidence, I have placed less weight on this factor in reaching my assessment. But it is not wholly unimportant; and particularly in those instances where criticisms have been made of a witness on the grounds of bias or lack of independence, which in my view are not justified, the witness's demeanour has been a factor that I have taken into account"
That there is an advantage to be gained from the judge's impression of the expert witness notwithstanding that the issues may be technical in nature was stressed by Lord Bridge in Wilsher v Essex AHA [1988] 1 AC 1074 at 1091G.
- I include a brief résumé of each expert's relevant experience in this field in Appendix 2 to this judgment. As for my assessment of each as a witness, that will be found later in the text at what I feel to be an appropriate point. Having introduced each expert at first encounter by his or her full title I hope they will forgive the use of surnames thereafter for brevity's sake.
- Point Estimates and Confidence Intervals.
The studies at the heart of this case have all calculated a point estimate, usually expressed to one decimal place. Almost invariably that is followed in brackets by two other numbers which constitute the lower and upper limits of the confidence interval. These are in all relevant places calculated within 95% confidence limits. Wherever in this judgment I follow a point estimate with two bracketed figures they will be the lower and upper limits of a 95% CI. A question is raised as to the relevance of these limits to the task I have to carry out. The Defendants say that to establish causation in the individual, and therefore a RR which is greater than 2, there must be seen not just a point estimate but also a lower confidence limit which is greater than 2 in order for the result to be significantly different from 2.
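The link between causation in the individual and a RR greater than 2 is not spelt out in this passage. A sketch of the conventional reasoning is given below. It rests on assumptions not stated in the text: that the RR is free of bias and confounding, and that the excess cases among users are the only ones attributable to the exposure. The symbols R_0 (risk in the comparison group) and R_1 (risk in the exposed group) are introduced for illustration only.

```latex
% Assumed notation: R_0 = risk among comparator users, R_1 = risk among exposed users.
\[
  RR = \frac{R_1}{R_0},
  \qquad
  \text{fraction of exposed cases attributable to the exposure}
  = \frac{R_1 - R_0}{R_1}
  = \frac{RR - 1}{RR}.
\]
\[
  \frac{RR - 1}{RR} > \frac{1}{2}
  \;\Longleftrightarrow\;
  RR > 2 .
\]
```

On those assumptions a RR of exactly 2 corresponds to a fraction of one half, which is why the threshold is put as a figure in excess of 2.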
- The confidence interval gives a range of "true" RRs from which the point estimate derived from the study does not differ significantly or in other words it says that it could as well have been thrown up by a "true" RR anywhere within that range. If the 95% CI excludes 1 the RR is statistically different from 1 at the 0.05 level of confidence. Put technically, this concept describes the probability of the data if the "true" RR equals 1, not the probability that the "true" RR equals 1 given the result shown by the data (which Professor Kenneth MacRae describes as the "fallacy of the transposed conditional").
- In a single study it is agreed that a confidence interval which straddles 1 "fails to reject the null hypothesis", the premise of all such studies, namely the proposition that there is no difference in risk between the drugs under consideration. This indicates that whatever the point estimate of that study is it is not statistically significantly different from 1 or indicates a state of no true RR. This is because the play of chance has not been eliminated to a degree which most epidemiologists would by convention consider satisfactory. If therefore the Claimants' case here stood or fell on a single study the Defendants would be entitled, as it seems to me, to argue that I should be looking not at its point estimate but its lower confidence limit. It also seems to me that that would be the case if the issue was not whether the RR was greater than unity (i.e. whether the null hypothesis had been rejected) but rather whether as here the RR was greater than 2. The next question to be answered would be this: If there were a series of studies which were truly combinable in a formal Meta-analysis (with which I will deal later in the Overview section) would the same principle apply? In a narrow sense I believe it would. If such studies were formally combinable so that they became a single mega-study with a single point estimate and a single CI in theory I see no reason why the Defendants' argument would not then still apply.
- However the position I face is altogether more complex and less coherent than that which I have so far described. In this case we have a series of studies with different point estimates and, largely, overlapping CIs. Time after time in the case experts have had their attention drawn to point estimates from studies which appear, to the layman's eye, to be very different. Almost invariably they have dismissed those apparent differences by reference to the overlapping CIs, saying the figures are statistically compatible and there is no significant difference. Three examples illustrate this:-
(1) Walker in his report described the WHO All Centres figure 2.7 (1.6-4.6) and the TNS1 figure 1.5 (1.1-2.1) as "closely corroborative" of each other.
(2) Dr. Hershel Jick was taken to 2 RR estimates in his 1995 study report namely 2.7 (1.2-6.3) and 2.0 (0.9-4.8). His reaction was "these 2 estimates are not anywhere near statistically significantly different from each other. It is a flip of the coin sort of thing. One happened to be 2.7, one happened to be 2."
(3) Dr. Nicholas Dunn was taken to a range of ORs in a report which he had carried out, which will be encountered in the section on bias, which ran from 0.7 to 0.9 as against those in a parallel paper by Professor Lothar Heinemann which ran from 1.4 to 1.6. Because of overlapping confidence intervals he described these as not statistically significantly different and said it would not even be appropriate to describe these different examinations of the same phenomenon as pointing in different directions.
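The comparisons in examples (1) and (2) can be approximated from the published figures alone. The sketch below recovers a standard error for each estimate from its reported 95% limits on the log scale and then forms a z statistic for the difference. It assumes the two estimates are independent and that each interval is symmetric on the log scale. Neither assumption is confirmed by the text, and independence is doubtful for example (2), where both figures come from one study report. The function names are illustrative.

```python
import math

def se_from_ci(lower, upper, z=1.96):
    """Approximate standard error of log(RR) implied by a reported 95% interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def z_for_difference(est1, ci1, est2, ci2):
    """z statistic for the difference between two log relative risks.

    Assumes the two estimates are statistically independent.
    """
    se1 = se_from_ci(*ci1)
    se2 = se_from_ci(*ci2)
    return (math.log(est1) - math.log(est2)) / math.sqrt(se1 ** 2 + se2 ** 2)

# Example (1): WHO All Centres 2.7 (1.6-4.6) against TNS1 1.5 (1.1-2.1).
z_example_1 = z_for_difference(2.7, (1.6, 4.6), 1.5, (1.1, 2.1))

# Example (2): the two Jick estimates, 2.7 (1.2-6.3) against 2.0 (0.9-4.8).
z_example_2 = z_for_difference(2.7, (1.2, 6.3), 2.0, (0.9, 4.8))

for label, z in (("example 1", z_example_1), ("example 2", z_example_2)):
    # |z| below 1.96 means the difference is not significant at the two-sided 5% level.
    print(f"{label}: z = {z:.2f}, significant at 5% = {abs(z) > 1.96}")
```

On this approximation neither pair differs significantly at the conventional 5% level, though the first pair comes much closer to doing so than the second.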
- There are many other such examples in this case, notably the diversity in the ORs for GSD and DSG when considered separately, which is dismissed by many experts as insignificant. As Professor Samuel Shapiro put it a RR of 1.5 or 1.4 or even 2 is not materially different from one of 1. Truly, as more than one witness said, this is an inexact science for all its appearance of precision.
- But the Claimants say that the statistician's view of a confidence interval has a limited role to play in this case. The Court's task is not to derive some likely point estimate from a synthesis of all the evidence and then erect confidence intervals around it seeing whether the lower interval exceeded 2. This is not an exercise the Court would be able to do, apart from anything else. The Court's task is to decide whether the "true" RR is more than doubled. That conclusion will have to be rooted in the statistical evidence, for there is no other soil in which it can grow. But there must come a time in this case when, having received all the assistance that epidemiology has to offer me, I part company with it and embark on the final judicial journey of assessing, as best I can, on all the evidence and using my judgement of it whether as a matter of probability the "true" RR is indeed more than doubled. Both sides' final submissions envisage this two stage process as being necessary and appropriate.
- In this second stage of my journey I will be armed with all the assistance that I can acquire from epidemiology but I will no longer be bound, as it seems to me, by its rules and conventions. In particular I will not be constrained to use the language of "confidence" with all that it implies. I will have emerged from that forest into broader more open country where the simpler concept of the balance of probabilities rules. That which is more probable than not, where as here past events are under consideration, is treated by the law as being certain. Minority doubts are submerged in majority probabilities for sound pragmatic reasons, and in that sense the winner takes all. No legal judgment needs to be surrounded by an expression of the degree of confidence with which the Court pronounces it.
- For these reasons all the evidence given on this topic, particularly by Walker, in which the relationship between the position of a point estimate within its CI on the one hand and the probability or likelihood of that point estimate being the "most probable" result of that study on the other was explored is, as it seems to me, potentially dangerous. To be fair to Walker he was reluctantly led into this evidence, sometimes by Lord Brennan QC and sometimes by me. The point estimate can be said to be the best estimate as a single figure to come from the individual study to which it attaches, the best figure that is internal to that study. It is dangerous, it seems to me, to move from that proposition to say that a single study's point estimate, particularly at low levels such as these, says anything about the probability of the "true" RR. The Claimants say that I should look at all the evidence in the case, both that which is and that which is not statistically significant taken on its own, and form my judgment from it. I believe that submission is right.
- I draw support and comfort from what was said by Kennedy LJ in Elvicta Wood Engineering Limited v Huxley [2000] WL 664536 at page 3 when he said:
"We merely have to decide whether, on the material presented to him, as interpreted by the witnesses, the judge was entitled to conclude as he did, namely that on the balance of probabilities the cause or a significant cause of this Claimant's SCC was his wrongful exposure to large amounts of wood dust over many years. That does not mean that we are entitled to ignore known limitations on the value of statistical material or anything of that sort, but what it does mean is that we do not have to search for medical certainty, and even if we do uphold the judge's decision we can contemplate with equanimity the possibility that at some time in the future it may be shown, on a balance of probabilities or perhaps to an even higher standard, that the judge was wrong".
- Aggregation of COC Products.
The lead claims in this litigation involve different brands of COC. The pleaded cases assert that each of these products is defective by virtue of it being not at least as safe as COC2s and conferring an increased risk of VTE in comparison with them. Because of the way the case is put in the pleadings and the Defendants' belief in the heterogeneity of their respective products the two principal progestogens have in effect been separately represented throughout this trial.
- At the outset of the trial the Claimants said that the comparison group that they sought to use was COCs containing LNG and not all COC2s. This was on the grounds that LNG had the lion's share of the COC2 market, all bar some 15% or so. The Defendants have not objected to this, accepting that the Claimants are at liberty to choose the comparator they prefer.
- More contentious is the Claimants' treatment of the products under attack. They seek in their final submissions to aggregate or roll up DSG and GSD and treat them as a single pharmaceutical entity. Though it is not disputed that the products are chemically distinct there is no suggestion made that they are different in terms of thrombogenicity. Indeed the agreed haemostatic evidence is that "there is no distinction to be drawn between GSD and DSG as to their effect on the haemostatic factors of relevance to this case". The reason for this desire to aggregate is a pragmatic one and not theoretically based. The advantages are entirely practical namely that so doing will increase the power of the relevant studies. The Defendants object that this ought not to be the warrant for aggregating, but that the only respectable basis for doing so is if there are shown to be no clinically important differences between the preparations. This they say is the very object of the studies. Therefore to aggregate in the course of carrying them out or considering them is to assume that which the studies set out to prove.
- Appendix 3 sets out MacRae's descriptive summary of all the various RRs/ORs found in the various studies and sub-studies and there is no advantage in setting them out again here. It seems to me to be a fair summary to say that in terms of point estimates GSD attracts values of 0.9 to 3.1 and DSG of 1.1 to 2.8. There is no consistent pattern as to which comes out higher than the other. Broadly speaking there is an equal division of supremacy between them. All the point estimates are surrounded by CIs which overlap.
- It is right to say that the WHO scientific group stipulated that ideally the effect of each OC formulation should be considered separately, and certainly at the outset of this controversy that was the correct position to take. McPherson agreed that certainly one should look at the individual products separately. But he added that one must always look at the overall aggregate effect, and in a way that was more important. If your epidemiology shows that they are the same you can aggregate them as long as you know what you are doing, as he put it. I think this was a valid description of the position. Shapiro was unhappy, saying that one should not merely aggregate for the sake of power as one may blur the true position. The Defendants also contend that the Claimants in arguing for this are acting in a way that is slightly inconsistent, having opted to dis-aggregate LNG from among the other COC2s for use as the comparator.
- McPherson said that a prescriber in practice would not prescribe a third generation COC but that he/she would probably prescribe a particular product. But he did say that an advantage of aggregation would be to send a clearer message to prescribers; that is perhaps best shown by the Dear Doctor Letter itself which did not mention particular products in its warning but referred to "third generation COCs with 30 µg of EE" as carrying a small increased risk.
- I am content for the moment to adopt aggregation, largely it must be said for reasons of convenience, since the label COC3 is and has throughout the trial been used as a handy shorthand for a wider concept. I also suspect there is no way in which one of these progestogens could be found defective and the other escape liability. I must however be alive to the risks involved in taking this course. The absence of haematological evidence that they are different in their effect may tell us more about the limits of present haematological knowledge in this area than the respective actions of the two progestogens. It should also not be confused with evidence that they are in truth the same. If at a later stage (and I particularly have in mind the Jick v Farmer debate see below) aggregation operates so as to influence the result of a particular study in a way that I consider inappropriate I must be alive to that risk and take steps to allow for it. With these principles in mind I must now turn to the main studies on this issue.
SECTION C: THE WORLD HEALTH ORGANISATION STUDY
- By 1990 it was estimated that such was the popularity of COCs that 73 million women worldwide were users. Low dosages of oestrogens had been introduced in response to a succession of studies which, without commanding universal acceptance, strongly suggested an association between COC use and 3 cardio-vascular (CV) diseases namely acute myocardial infarction (AMI), cerebral infarction and VTE. There had however been no rigorous large scale epidemiological evaluation of this proposition nor had there been large studies into COCs in developing countries as opposed to Europe and North America.
- Professor (then Doctor) Neil Poulter explained how the WHO study took shape, with him becoming its central co-ordinator and co-principal investigator. In the mid 1980s the UNDP/UNFPA/WHO/World Bank Special Programme of Research Development and Research Training in Human Reproduction sponsored and funded the study which was to become known as the WHO Study. This sponsorship was quasi-political in its motivation; since all major studies up to that point had been based in Europe or North America it was thought inappropriate to extrapolate their results to all population groups, especially those in the developing world, when results derived from a study with a wider geographic base might command greater respect. Feasibility studies were carried out in October 1985 and potential collaborators identified. The draft protocol was prepared and a pilot study carried out in 1987-88. The protocol was finalised in January 1990. Data were collected from 21 centres in 17 countries in Asia, Africa, Europe and Latin America. 3,792 cases and 10,281 controls were recruited over a period from 1st February 1989 to 31st May 1993. The methods used were described in a paper published in 1995 (1) (see Appendix 4 for full references for this and all other papers).
- This study was the biggest of its kind in this field. Consideration of the data, commonly called "data cleaning", took place over about a two year period. Preliminary findings were presented orally at a meeting in Geneva in July 1995, and before the study's results were published the CSM decided to write the Dear Doctor Letter of 18th October 1995. This was the first of the unpublished studies to which it referred.
- On 16th December 1995 two WHO papers were published in the same edition of the Lancet. The first gave the results of the main study(2). From the 21 centres worldwide 1,143 VTE cases and 2,998 controls matched by age were examined. The result was a finding of an increased risk attached to COC use (regardless of generation) of 4.15 (3.09 - 5.57) in the European centres and 3.25 (2.59 - 4.08) in the non-European countries. Overall the conclusion was that there was an association between COC use and VTE, although interestingly these risk estimates were lower than those demonstrated by most previous studies.
- However, what the writers of the article called an "unexpected finding" had emerged. ORs for COC3s were higher than those observed in the case of COC2s. This phenomenon was the subject of the second article in the same journal, to which I will refer as "the WHO Study"(3). For this purpose a sub-group of 769 cases from 9 countries was compared with 1,979 age matched hospital controls and (at one centre only, Oxford) to 246 community controls matched on age and general practice. These 9 were the only centres from the original 21 with any data at all on COC3 use. This no doubt reflects the low penetration of the COC market in the developing world achieved at this time by COC3s. Oxford had 48 out of the 71 cases exposed to COC3s (68%) and 48 out of the 56 exposed controls (86%). The other 8 contributing centres made very modest contributions. The numbers of cases of VTE exposed to COC3s ranged from 0 to 8 (the latter figure applying to Colombia only). In the remaining centres the range was between 1 and 5 exposed cases. In some centres the number of exposed controls was even smaller. In 4 centres no controls had been exposed to COC3s and in the remaining 4 the numbers ranged from 1 to 3. The details of these numbers were set out in Table 1 in the study.
- Oxford had both hospital based (360) and community based (246) controls. At 4 other centres attempts had been made to recruit community controls but with difficulty, so much so that 1 of the centres had abandoned the exercise altogether and in the other 3 the resultant data were so sparse as to result in their being jettisoned for all purposes of analysis.
- A summary of the relevant findings in the WHO Study is set out immediately below (all comparisons being against LNG). The shaded cells are the key figures for the purpose of my judgment.
| Centre | Type | RR (Adjusted) |
|---|---|---|
| All Centres | DSG | 2.4 (1.3-4.6) |
| All Centres | GSD | 3.1 (1.6-5.9) |
| All Centres | Both | 2.7 (1.6-5.9) |
| All Centres excluding Oxford | DSG | 4.8 (0.5-43.4) |
| All Centres excluding Oxford | GSD | 5.3 (1.8-15.5) |
| All Centres excluding Oxford | Both | 5.2 (2.0-13.7) |
| Oxford hospital controls | DSG | 2.3 (1.1-4.9) |
| Oxford hospital controls | GSD | 2.0 (0.8-4.7) |
| Oxford hospital controls | Both | 2.2 (1.1-4.2) |
| Oxford GP controls | DSG | 1.8 (0.7-4.8) |
| Oxford GP controls | GSD | 0.9 (0.3-2.8) |
| Oxford GP controls | Both | 1.4 (0.6-3.1) |
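The comparison of confidence limits against the values 1 and 2, as discussed in the section on point estimates and confidence intervals, can be applied mechanically to the figures in the summary above. The sketch below does so. The figures are transcribed as printed; the function name and the strict inequalities (a limit equal to the threshold counts as straddling it) are choices made for illustration and are not a rule taken from the text.

```python
# Figures transcribed from the summary above: (centre, type, RR, lower, upper).
WHO_ROWS = [
    ("All Centres", "DSG", 2.4, 1.3, 4.6),
    ("All Centres", "GSD", 3.1, 1.6, 5.9),
    ("All Centres", "Both", 2.7, 1.6, 5.9),
    ("All Centres excluding Oxford", "DSG", 4.8, 0.5, 43.4),
    ("All Centres excluding Oxford", "GSD", 5.3, 1.8, 15.5),
    ("All Centres excluding Oxford", "Both", 5.2, 2.0, 13.7),
    ("Oxford hospital controls", "DSG", 2.3, 1.1, 4.9),
    ("Oxford hospital controls", "GSD", 2.0, 0.8, 4.7),
    ("Oxford hospital controls", "Both", 2.2, 1.1, 4.2),
    ("Oxford GP controls", "DSG", 1.8, 0.7, 4.8),
    ("Oxford GP controls", "GSD", 0.9, 0.3, 2.8),
    ("Oxford GP controls", "Both", 1.4, 0.6, 3.1),
]

def position(lower, upper, threshold):
    """Say whether an interval lies wholly above, wholly below, or straddles a threshold."""
    if lower > threshold:
        return "above"
    if upper < threshold:
        return "below"
    return "straddles"

# For each row: position relative to 1 (the null) and relative to 2.
summary = [
    (centre, kind, position(lo, hi, 1.0), position(lo, hi, 2.0))
    for centre, kind, _rr, lo, hi in WHO_ROWS
]

for centre, kind, versus_one, versus_two in summary:
    print(f"{centre:30s} {kind:5s} vs 1: {versus_one:10s} vs 2: {versus_two}")
```

On the figures as printed, seven of the twelve intervals lie wholly above 1 and none lies wholly above 2. The interval for All Centres excluding Oxford (Both) begins at exactly 2.0 and so counts as straddling under the strict comparison used here.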
- Three main issues arise for resolution in respect of this very important study.
Was it an "a priori hypothesis" of the WHO Study to make a "head to head" comparison between COC3 and COC2 risks in relation to VTE?
- The importance of this issue is that it is common ground that results do not command the same confidence when they emerge from a study without having been an a priori or planned objective of that study specified in advance. This is accepted by both sides as a cardinal principle of epidemiology. At best such a study will reach a tentative conclusion and raise a question for further analysis by other studies. As to what the authors themselves said in the text of their paper they included these words.
"although not pre-specified in the study protocol, the secondary objective was to determine whether the risk of VTE ... varied with OC composition or duration of use." (And later) "these observations are based on an analysis of a secondary study objective, and the possibility that they are due to chance, confounding or bias or a combination of these cannot be excluded entirely. They must be confirmed by independent epidemiological studies ...".
- Consideration of the protocol does not provide an obvious answer. In its original form it set out a subsidiary objective as being to determine "if the risk [of VTE] varies with the different composition of [COC] used (i.e. different levels of oestrogen and progestogen content) and with the duration of use." The revised protocol in the methods paper did not contain the parenthesis in round brackets.
- Heinemann, who had the role of Principal Investigator in East Germany for the study, said in his litigation report that he was surprised when he heard and read that the WHO had carried out a head to head comparison. This had never been discussed with him at any stage nor was he aware that the study was interested in a comparison of different types of progestogen. Thorogood who was a member of the Publications Advisory Committee thought that the study was not designed to test the matter on a head to head basis; indeed in her answers to a written request following her report in this case she said she was not aware it had ever been claimed otherwise. She confirmed this in her evidence. Poulter relied on the wording of the original protocol and the presence of the noun "content" after progestogen, which he construed as meaning that the different types of progestogen were to be made the subject of comparisons at this early stage. Certainly "content" could not refer to the dose of progestogen, which varies considerably in COC3s (DSG has twice that of GSD), but which has never been considered significant in any study I have been shown.
- This evidence of Poulter (he gave evidence as a factual witness only) was given in my judgement with a degree of conviction which the wording of the WHO Study itself did not bear out. The authors of the paper plainly viewed this aspect of the findings as hypothesis generating and not as a priori analysis and that is what in my judgement it was. This is not to say that it is not an important piece of evidence in this case, for it plainly is. If later studies confirm its finding then it in turn becomes a piece of the evidence in the case. But as I stated at the outset of this section of the judgment the confidence to be reposed in it is somewhat lessened by this fact.
- Which is the right point estimate, all centres or Oxford only?
The Claimants' case here is that the all centres figure of 2.7 is the appropriate point estimate which should represent the findings of the WHO Study. Alternatively they say I should take a figure somewhere between 2.2 and 2.7. The Defendants say that the all centres data should be effectively discarded. I have already described the sparsity of the non-Oxford data as set out at Table 1 of the study. As will be seen from the table above, when analysed apart from the Oxford data the other centres gave ORs surrounded by very wide CIs indeed which would never be capable of standing independently as reliable evidence. The text of the WHO Study gives no direct assistance as to which figures it thought were the best estimate of the relative risk in this case. In the summary at the head of the paper the authors cite both all centres and Oxford figures.
- The Claimants mainly rely on the evidence of Poulter who regarded the non-Oxford centres as "consistent in terms of the results", without which quality he could envisage an argument being made for ignoring them. I have set out at paragraph 58 above what the non-Oxford centres yielded in terms of figures, in effect a doubling of the Oxford point estimates if hospital controls are used or quadrupling if community controls are used. Save in the narrowest of statistical senses it is difficult to see how "consistent" properly describes this comparison. They seem in plain English to be much higher. Also in a statistical sense they are unstable with very wide confidence intervals. Walker thought that they contributed "in a small way" to the overall picture or result and could not be dismissed as "negligible".
- Thorogood took what she described as a conservative view and selected the Oxford figure as the "headline OR" from the WHO Study. The all centres figure she thought involved using countries which had very low overall use of COC3s and their estimates would have been less reliable in that there were just one or two controls. She said the only really meaningful data in her view were the Oxford data. Heinemann plainly was of the same view. Shapiro who was not involved in the study unlike the other two described the non-Oxford data as "statistically unstable and fragile" and described the findings from the non-Oxford centres as "uninterpretable". For what it is worth the Defendants point to the fact that the two published meta-analyses of Kemmeren and Hennessey, whose works both include some elements of quality review, seemingly adopt the Oxford figure.
- I believe that Thorogood was right to take the course she took. The WHO authors do not express any great enthusiasm in the text of their study for the higher figure. I consider that the arguments in favour of the Oxford centre are strong and accept them.
- Hospital or GP-based controls?
The study authors expressed a clear view on this issue. They said hospital controls were the preferred comparison. While they accepted that "where there is homogeneous coverage of health services" GP-based controls may be more representative, their use in this study gave a markedly lower RR particularly for GSD which fell after adjustment to 0.9. They pointed to the high non-response rate among women approached as potential GP-based controls particularly in the younger aged groups. They said that the 18 orphaned controls for whom no GP-based controls could be identified were atypical and their exclusion "attenuated the higher risk associated with GSD compared with LNG". They gave no details of the numbers underlying these findings.
- A later report published by the WHO Scientific Group in 1998 (4) added further information:-
"when the authors examined cases for whom both hospital and community based controls were available they found similar RRs associated with the use of COCs containing DSG or GSD, regardless of which control group was used in the analysis."
- The group gave no details of this finding nor does it appear anywhere in the paper. I can only deduce that the point estimates were lower but not by much.
- Consideration of the protocol strongly suggests that other things being equal the WHO team would have preferred community based controls. They said:-
"whilst ideally the aim of the control group is to provide an estimate of [COC] usage among women aged 20-44 who come from the same population as the cases but who do not have or have not had any of the three diseases under investigation, previous studies have met with major logistical problems in recruiting community based controls. Hence in order that a standardised method of control recruitment can be used in all centres, hospital based controls are to be used".
- The reason for this is plain to see. A control should represent the population from which the case came i.e. the group of women at risk of becoming cases but who have not done so. The population of pill users are, by and large, healthy women living in the same circumstances as the cases. Use of a GP-based control system will ensure that they come from the same type of community as the cases. Use of hospital controls will mean that the controls are different in one important respect, namely that they are unwell; how unwell they were is in my judgement important. Those selected as hospital based controls would not have been day patients or short term visitors to the hospitals, or they would not have been caught in the net of the study. It would have been necessary for them, said Poulter, to have been in-patients for weeks rather than days, therefore they are all women with significant even serious illness. Additionally, said Thorogood, it is a familiar epidemiological concept that hospital inpatients differ from the "outside" population in other material respects.
- Walker was concerned about the lack of co-operation shown by the Oxford controls. He would want to see something in the order of 90 to 100% response rate. Poulter said 75 to 80% but that 61%, the rate achieved, was at the bottom end of what was acceptable. Thorogood said that in the United Kingdom 75% would be the rate she would be satisfied with and 60% (a figure she had attained in a different study of her own), was at the "lower end of what is acceptable". There is in fact some room for doubt as to what the true response rate was in this study; some of those listed as non-responders could well have been women who more accurately should have been described as people who had moved away and whom the invitation had never reached. Additionally as letters had to be routed through GPs there is the further possibility that the practices were not efficient at passing them on, when such an occurrence could be listed as a non-response. I note these possibilities but cannot quantify their effect. At all events the 61% response rate may have understated the true position.
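The arithmetic behind that last point can be sketched briefly. This is an illustration only: the effect could not be quantified on the evidence, so the counts below are hypothetical and were chosen simply to reproduce the 61% headline rate. The function name and the figure of 10 undelivered invitations are assumptions introduced here.

```python
def response_rate(responders: int, approached: int, never_reached: int = 0) -> float:
    """Share of the women actually reached who agreed to take part."""
    reached = approached - never_reached
    if reached <= 0:
        raise ValueError("no women were reached")
    return responders / reached

# Hypothetical counts chosen only to reproduce the 61% headline rate.
approached = 100
responders = 61

crude = response_rate(responders, approached)
# Assume, purely for illustration, that 10 invitations never arrived
# (women who had moved away, or letters not passed on by the practice).
adjusted = response_rate(responders, approached, never_reached=10)

print(f"crude rate:    {crude:.1%}")
print(f"adjusted rate: {adjusted:.1%}")
```

Any invitation that never reached its addressee shrinks the denominator, so the adjusted rate can only be equal to or higher than the crude one.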
- Walker's evidence, I believe, overstated the problem here considerably. No study, certainly not one concerned as this was with the developing world, can possibly have set out in the belief that it would elicit the sort of response rate that he was stipulating as being necessary. One can readily envisage how in some of the developing world centres the search for community controls would have been a vain exercise.
- The point about the 18 orphaned cases is more difficult, although it is to be noted that it did not seem to concern Thorogood who laid no emphasis on it. But the Claimants say this is a significant body of evidence which should not be lost unless for very good reasons. Their exclusion is said to have led to an attenuation of the RR associated with GSD as against LNG. The Defendants say this evidence is incomplete since the figure is not given and the extent of the attenuation is not transparent.
- Shapiro was in no doubt as to what the best methodology was. He thought that, especially in a country such as the UK, the use of GP-based controls was preferable by far. He thought that it meant a use of controls more representative of the study base population and enabled control of confounders. Heinemann described their use as "state of the art". It is of course to be noted that the TNS used mixed hospital and GP controls, the latter predominating, and found that when they used hospital controls only, the RR was elevated.
- The Defendants do not urge me simply to take the GP-based point estimate as representing the WHO Study but rather to take the higher figure and discount it downwards to reflect the GP control results. The Claimants argue that it would be wrong in principle to attempt any form of combined summary estimate for the Oxford region since this would break the matching of the case-control sets which can have unpredictable consequences.
- Conclusion.
I do not see my choice as limited to the three main point estimates, 2.7, 2.2 or 1.4 in deciding what the best assessment of the WHO data is. Lord Brennan QC pragmatically invites me, if I have misgivings about the all centres figure, to fix on a value between 2.2 and 2.7. The Defendants say I should discount the 2.2 figure by a small amount to reflect the force of the argument in favour of the GP controls.
- I am concerned that the WHO as owners of the data have never allowed any further access to it to enable further analyses to be done, or, for example, to establish what was meant by "similar RRs" or the attenuation caused by excluding the 18 orphan cases. To everyone's surprise Poulter volunteered in evidence that he retained a copy of the dataset in London and I made an order for its production. In the event no use could be made of it as the person who would have been needed to help with its operation was unavailable.
- In my judgement the RR which best represents this study is 2, no higher and possibly a fraction lower. The guarded conclusions of the study authors as set out above and the tentative nature of the study's results seem to me to permit no higher value to be ascribed to it.
SECTION D: THE TRANSNATIONAL STUDY (TNS) - THE FIRST TWO STUDIES.
- The origins of the Study.
In October 1990 Professor Walter Spitzer of McGill University Canada was approached by the First Defendant's German parent company in the wake of the controversy following the findings of Professor Kuhl of Frankfurt. Kuhl had sought to demonstrate significant differences in the plasma levels of EE in women who took GSD as compared with those who took DSG and this had attracted much publicity and concern in Germany. Spitzer agreed to be Schering's consultant and advisor and set about structuring the response to this problem. He took steps to secure academic integrity by appointing a distinguished working group of international epidemiologists, doctors and biostatisticians and a Scientific Reference Board (SRB) to supervise all scientific activities. Its members were not to be accountable either to McGill or Schering. He liaised with the German Regulator, the BGA.
- The working group met in January 1991 and completed its report on the 25th February of that year. Its overall conclusions were that there were serious questions about the validity, reliability and clinical relevance of the Kuhl studies and that there were no grounds to suppose significant differences in action as between the two types of COC3 which Kuhl considered. That is the general view now held about Kuhl's work. The working group however went further and found that the data on VTE which were then available in relation to GSD, principally found in a large Phase IV trial by Schering, were unreliable and uninterpretable. It recommended that no further action was needed to examine the Kuhl findings but that there was an identifiable need for studies on the safety of all low dose oral contraceptives, which it saw as an international priority. Fairly soon after this report Schering agreed in principle to fund such research and detailed work on a protocol was done by Spitzer, Thorogood, Heinemann and others. Schering were warned that the pilot and feasibility studies alone could cost up to $1.4m and the entire project not less than $10m. I have no evidence as to the final cost of this study but it was plainly very high. The average annual funding grant was stated to have been between DM 2 and 4 million. The TNS in short was, in terms of its VTE data, the biggest and most expensive study of those that I am considering in this case.
- Although the working group considered the use of the GPRD it rejected it, apparently on the grounds that it contained no data on German women, was not consistent as to the data it included on BMI (Body Mass Index a measure of obesity) and smoking, and could not supply data on oral contraceptive experience going back far enough. The decision was therefore taken to conduct concurrent case-control studies initially in three European Countries emulating as closely as possible the methodology of the WHO study, with a possible view to later data aggregation, as well as to facilitate comparisons between the two. For example, the questionnaire for study subjects on past COC use was in a format which closely followed that used by the WHO. It was to be a very large study with 200, 500 and 500 cases of AMI, stroke and VTE respectively, with 4 controls per case, at least one from hospital and 2-3 community based. It required the interviewing of some 25,000 women. These proposals were endorsed by the SRB and the study embarked on, initially under the name of the Trinational study in Germany, the UK (at centres in Southampton, Manchester and Glasgow) and France.
- The progress of the Study.
Field work began in late 1991, but unfortunately Spitzer the principal investigator suffered serious ill health through much of 1992 which introduced some delay in the execution of the study. It had been agreed that the protocol would be published in a peer-reviewed Journal to permit scrutiny of it prior to the commencement of the study. That eventually was done and the protocol appeared under the names of Spitzer, Thorogood and Heinemann in January 1993(5). A time consuming part of such a study is the recruitment of cases and controls and this proceeded through 1993 and 1994. It became apparent that the initial aim of completing the study by the end of 1995 would not be met and therefore 2 further countries were added to the study in mid 1994, the name of the study being changed from Trinational to Transnational.
- At this stage I should refer to the objectives of the study as formally defined in the protocols. The original protocol set as a primary objective the investigation of the relative risk among current users of any OC as against non-users in respect of 3 specific conditions namely AMI, stroke and VTE. Additionally the study set out to determine the relative risk of those conditions among current users of specific OCs again comparing current use with no use. The secondary objectives were to compare, as against no use, current use of GSD, DSG and any contraceptive with 10% or more of the market share of each in at least one of the countries in the study.
- This protocol was revised in a process which began in early 1995 and culminated in a paper submitted for publication on 13th November 1995 and published early in 1996 (6). The Scientific Review Board in mid February 1995 had recommended that the protocol be revised to state that a comparison of risks be made between users of GSD as against current users of other low dose OCs. That recommendation was accepted by the data management committee a few days later, despite strong opposition from, among others, Thorogood. It is plain to me that her experience in the TNS was not a happy time for her, though the precise reasons are not so clear and may not matter. She struck me as still feeling rather wounded by her treatment. The UK operation had had problems in collecting data from its three centres and forwarding them to Potsdam. She said this was because of under-funding. In her evidence she said the change in protocol, which was published in a paper of which she was not an author, was "clearly made after the data collection was completed ... either after or very near the end". She put this forward as some evidence of a weakness on the study's part. Data collection on VTE cases had stopped in mid October since there was no value in collecting data after the scare which would have introduced all sorts of bias. When shown the contemporary documents she accepted that the change to a head to head comparison was initiated much earlier. In this respect I found her a less than dispassionate witness, in marked contrast to the rest of her evidence where I found her to be highly expert, helpful and impressive. She only heard the results of the study shortly before the meeting with the CSM in October 1995. She then spent 2 days writing up the results, and appears as one of the authors of what has come to be called TNS1. Thereafter she played no part in the further analyses of the study.
- While therefore the main objectives remained, viz. the comparison of the relative risk of AMI, stroke and DVT generally on the basis of a comparison between any OC and no current use, the secondary objectives were significantly revised. The study was now also seeking to establish the relative risks, for each of the 3 categories of disease, of current use of GSD-containing OCs, DSG-containing OCs and any OCs with 10% or more of the market, all as compared with other low dose OCs. In this way the "head to head" comparison was set up.
- The case-control ratio was reduced from 1:4 to 1:3, with at least one hospital and one community control per cluster. The TNS team were put under a new timetable, namely to complete data collection by the end of August and complete data "cleaning" and analysis, on which the WHO had spent some 2 years, in 9 weeks only before presenting their preliminary results to the regulatory authorities in October. This they did, and the fact that they did so impresses me with their competence as epidemiologists, especially Dr Michael Lewis who bore much of the burden in these latter stages. At the meeting they expressed the preliminary conclusion that the data showed a RR of COC3s for VTE as against COC2s of 1.8. This was the second of the 2 unpublished studies to which the CSM referred in its letter.
- Spitzer felt he had left the regulators on 10th October reassured that there was no real problem of safety in relation to VTE. He believed he had a powerful study behind his figure. So far as cases of VTE were concerned the TNS had based its calculations on 127 cases exposed to COC3s as against the WHO's 71 (more than two thirds of whom had come from one centre, Oxford). The TNS figures had not by then been adjusted for duration of use. Lewis says that the data were analysed prematurely, and the analysis was incomplete as not all data had been considered. The pill calendars (see below) had not been entered and could not be taken into account at all. Though case accrual for VTE events was halted in October 1995, data processing of cases and controls already accrued went on until completed in late 1996.
- Unlike the WHO data, those obtained by the TNS have been subjected to further analyses since the oral report of October 1995. It is necessary to look at the principal studies in which these appear.
- The first published paper TNS 1.
This was written in November 1995 by Spitzer, Lewis, Heinemann, Thorogood and MacRae, all of whom are witnesses in this action. It was published in the January 1996 issue of the BMJ, having been accepted for publication on 13th December 1995 (7). By this stage 471 cases of VTE were included and 1,772 controls matched for age within 5 year age bands and by hospital or community setting (789 and 983 respectively). The relevant findings were the UK and German data only, on the basis that the other 3 countries' data were considered fragile; Lewis said that these three, the so-called southern rim countries, were at this stage "in what one could call a run in or pilot status and were not yet in a position to deliver adequate data". Though as something of a political gesture a figure was given for the results from all 5 countries, the entire focus of this paper, rightly in my judgement, was on the UK and German data, which I find gave the only meaningful information from which conclusions of any sort could be drawn. After this paper the southern rim countries disappear entirely from later deliberations by the TNS. Late in the case the Claimants accepted the force of this, and there is therefore really no issue about TNS 1, which stands on any view as a very powerful and influential study into this problem. It will be seen that it was based on 127 cases of VTE and 249 controls, compared with the WHO which had 71 cases, 48 of which came from Oxford.
- The UK/German data gave adjusted ORs for VTE as follows:-
| | Comparison | COC3 v COC2 | GSD v COC2 | DSG v COC2 |
|---|---|---|---|---|
| UK | Cases exposed | 98 | 45 | 53 |
| | Controls exposed | 197 | 101 | 96 |
| | OR | 1.5 (1.0-2.2) | 1.4 (0.9-2.3) | 1.6 (1.0-2.5) |
| Germany | Cases exposed | 29 | 10 | 12 |
| | Controls exposed | 52 | 11 | 25 |
| | OR | 1.8 (1.0-3.3) | 2.6 (1.0-7.2) | 1.5 (0.8-3.1) |
| Combined | Cases exposed | 127 | 55 | 72 |
| | Controls exposed | 249 | 112 | 137 |
| | OR | 1.5 (1.1-2.1) | 1.5 (1.0-2.2) | 1.5 (1.1-2.2) |
- The adjustment carried out to produce these figures included adjustment for "duration of exposure to OCs used before current OC". When further adjustment was made for "duration of lifetime use of OCs preceding the current use of OC and for length of use of the most recent OC" the point estimate for COC3 v COC2 declined by 12% to 1.4. The authors thought the best adjusted estimate from this study was 1.5. If the comparator used is LNG not COC2s, as the Claimants seek, the OR rounded to one decimal place is 1.7 (1.1-2.8) as calculated by MacRae in the course of trial.
- From this analysis there was seen to be a weak association when all COC3s were taken together, with a point estimate which if it reflects the true relative risk would arguably be something about which women ought to be informed but which would not be enough for the Claimants in this litigation. The authors expressed the view that this modest association "must be taken seriously even if it is not certain that the relation is causal", likening the apparent increased risk to the threat posed by smoking 10 cigarettes a year in terms of death from cancer and heart disease. The UK data and German data each considered in isolation yielded no association of statistical significance, and the German data particularly for GSD were sparse, perhaps the legacy of the Kuhl scare. The combined data for both countries produced no statistically significant association for GSD but for COC3s and DSG the association was just significant. The study authors considered the possibility of the operation of various biases, as will be seen later, and found no evidence for them in their data. They raised a further possible bias which they called the "attrition of susceptibles" which will require further consideration.
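The way significance is read off the combined figures can be set out in a short sketch. It is an illustration under stated assumptions, not the authors' own method: it assumes the printed intervals are 95% confidence intervals, it works from the rounded published bounds, and the helper names are introduced here for convenience. A printed lower bound of exactly 1.0 is treated as not excluding unity, which matches the reading of the GSD result given above.

```python
import math

# Combined UK/German adjusted ORs for VTE from TNS1, as printed
# (point estimate, lower bound, upper bound), rounded to one decimal place.
TNS1_COMBINED = {
    "COC3 v COC2": (1.5, 1.1, 2.1),
    "GSD v COC2": (1.5, 1.0, 2.2),
    "DSG v COC2": (1.5, 1.1, 2.2),
}

def excludes_unity(lower: float, upper: float) -> bool:
    """True when the interval lies wholly above or wholly below 1."""
    return lower > 1.0 or upper < 1.0

def approx_se_log_or(lower: float, upper: float, z: float = 1.96) -> float:
    """Approximate standard error of ln(OR) recovered from the interval.

    Assumes a 95% interval that is symmetric on the log scale.
    """
    return (math.log(upper) - math.log(lower)) / (2 * z)

significant = {}
for name, (odds_ratio, lower, upper) in TNS1_COMBINED.items():
    significant[name] = excludes_unity(lower, upper)
    se = approx_se_log_or(lower, upper)
    print(f"{name}: OR {odds_ratio}, excludes 1: {significant[name]}, SE(ln OR) ~ {se:.2f}")
```

Because the inputs are rounded, a lower bound printed as 1.0 or 1.1 sits very close to the line, which is why the associations for COC3s and DSG are described as only just significant.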
- The second published paper TNS2.
The lead author of this paper was Lewis and it was submitted for publication on 8th March 1996 (8). Cases of VTE had now grown to 505 (though no new cases were recruited after October 1995, additional cases accrued from those already in the pipeline at that date), and controls increased to 1,877. This sought to complete the analysis of the TNS project at least in relation to VTE. It found ORs for VTE for all COC3s as against COC2s of 1.5 (1.1-2.0), or 1.6 (1.2-2.2) when NRG was included as a COC3. The figures for GSD as against COC2s were 1.7 (1.1-2.6), for DSG 1.8 (1.2-2.6) and NRG 1.9 (1.0-3.6). The authors called these "very weak associations" and therefore looked at issues of bias which might explain them. Specifically they looked at diagnostic and referral bias as well as a concept which they called the "attrition of susceptibles". The authors pointed to the fact that for women in the 25-44 year age band the year of COC brand introduction seemed to be of great importance, with the higher ORs attaching to the brand most recently introduced. This phenomenon even showed something which, viewed intuitively, seemed inexplicable and which also cannot be explained by the agreed evidence of the haematologists: Mercilon containing DSG and 20 µg of EE, a brand introduced in 1992, exhibited a much higher OR as against LNG (2.8) than its stable-mate Marvelon, containing the same DSG but a 50% higher dose of EE (30 µg) and whose OR was 1.5. Furthermore NRG, introduced in 1986-92, and so in chronological terms a COC3, had an OR of 2.4 as against LNG, even though in chemical terms it was more properly to be classified as a COC2 as it metabolised to LNG.
- In his report and evidence Lewis said there were really 4 matters which triggered his decision to proceed with a third analysis of the TNS data.
(1) The 12% reduction in the point estimate in TNS 1 when a fuller duration of use element was introduced into the model;
(2) The distribution of duration of use by age and product for the controls showed that among those who had used LNG in the 25-44 age group 20% had done so for more than 96 months as against younger women where the figure was 2%. With GSD no such effect was evident.
(3) Stratifying the data by dividing the study subjects into women aged 25-44 and 16-24 he found in the former group but not the latter a clear trend of increasing ORs related to recency of product introduction, the highest attaching to the most recent so that Mercilon with only 20 µg of EE showed the highest OR. This phenomenon is not explicable in haematological terms, according to the agreed evidence in that discipline.
(4) In a review paper in the Journal of Human Reproduction he published a finding from these data that the RR COC3 v. COC2 for 16-24 year olds was 1.36 (0.82-2.26) but for 25-34 it was 1.19 (0.74-1.98) and for 35-44 3.22 (1.47-7.06).
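The figures in item (4) can be laid side by side in a short sketch. It is offered as an illustration only: the `bound_ratio` helper and the dictionary layout are assumptions introduced here, not anything used by Lewis. It flags which interval excludes 1 and uses the ratio of the upper to the lower bound as a rough gauge of how precise each estimate is.

```python
# Age-stratified RRs (COC3 v COC2) as quoted above:
# (point estimate, lower bound, upper bound)
STRATA = {
    "16-24": (1.36, 0.82, 2.26),
    "25-34": (1.19, 0.74, 1.98),
    "35-44": (3.22, 1.47, 7.06),
}

def bound_ratio(lower: float, upper: float) -> float:
    """Upper bound divided by lower bound; a larger value means a less precise estimate."""
    return upper / lower

summary = {}
for band, (rr, lower, upper) in STRATA.items():
    summary[band] = {
        "excludes_1": lower > 1.0 or upper < 1.0,
        "bound_ratio": round(bound_ratio(lower, upper), 2),
    }
    print(band, rr, summary[band])
```

On these figures only the 35-44 interval excludes 1, and it is also the widest of the three.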
- Much controversy surrounded these matters. Lewis was heavily criticised for excluding 16-24 year olds from the analysis that yielded evidence of this apparent bias in favour of healthy users being over represented on the older pill type. His answer is that the effect could not be expected to show in this age group, since they would not have had the chance to qualify as long term users. Shapiro thought his subdivision of the data quite reasonable and said that Lewis' Figure 5.2 in his Report which showed the recency of introduction effect was the most striking single piece of evidence in the studies he had seen. The Claimants say that what Lewis did was "data dredging", making illegitimate sub-analyses of the original dataset to achieve a desired a priori result. They accuse him of "dressing up" the crucial graphs which show the effect by excluding first generation COCs and progestogen only pills, and altering the very presentation of the graphic itself, as well as suppressing in his report mention of adverse scientific commentary on his paper. These were serious charges and I should deal with them. I do not believe it would have been sensible to have included in Figure 5.2, in a study looking at low oestrogen pills, a bar to represent the position of high oestrogen pills or pills with no oestrogen at all. The difference in the vertical axes as between Figures 5.1 and 5.2 is visible but not significant and, I am satisfied, not the result of manipulation by Lewis. He was not the sort of man to do that in my view. He was a thorough and careful scientist, a little stolid and unimaginative in his presentation, but not the type of person prepared to manipulate data in this way. His report at 167 pages long could hardly be described as terse; his omission was not venial in my view. 
That TNS 1 achieved the high level of acceptance it did (there are no significant attacks on it, bar some unconvincing suspicions from McPherson about some of its CIs) is in large measure due to him.
- The Mercilon Anomaly.
Organon launched its COC3 Marvelon in 1982; it contained DSG with 30 µg EE. In 1989 it launched Mercilon whose composition was identical except that the EE component was a third lower, 20 µg. In several studies Mercilon has paradoxically emerged with a higher OR for VTE. This is counter-intuitive in the sense that the received wisdom has for some time been that the risk of VTE is correlated with the dose of EE.
- The Defendants argue that this phenomenon cannot be mechanically explained and therefore provides in a microcosm a product-specific demonstration of some bias or biases which are in operation more generally and affect adversely comparisons made between COC3s and LNG. The haematological evidence provides no explanation for it, rather the reverse. The Claimants argue that it is only present in the older sub-group of women and otherwise there is no good evidence for its existence.
- The studies which support the existence of the anomaly are as follows (I will not lengthen the judgment by setting out the figures in each case, save to note that most of them are either not statistically significant or are based on low number observations); WHO (compared with non-use), Wyeth-Ayerst, Herings, TNS2 (25-44 age group), Jick 1995, Mediplus 1997, Leiden 1999. Several of these studies are of doubtful quality when viewed overall as will be apparent in the later section where I deal with them in more detail.
- Studies which do not support the phenomenon are Farmer 2000, Jick 2000, TNS1 (all women) and Mediplus 1999.
- Walker's first report in this case was written in August 2001 and proved to be to a very large extent based on an article he had written in a Journal in 1998 (9). It is not unfair to describe the report as a "cut and paste" re-presentation of that article (plus a 1999 memorandum relating to Cox, which I will be referring to later). In the article he noted the existence of the conundrum as he called it, pointed out that the results were based on too few patients and controls to be persuasive but said that the aggregate was difficult to ignore. In fairness to him the 2000 studies into the GPRD were not available when he wrote that. He concluded "however, until a plausible mechanistic explanation is forthcoming, the anomalous result for these patients is an unanswered argument in support of the proposition that there is an uncontrolled residual bias, of uncertain impact". When he "lifted" the Mercilon section from his 1998 article he omitted that last sentence.
- His explanation for this was that he had read an article by Rosendaal in 2001 and that in writing his own article he had not considered the possibility:
"that one should consider the interactions of the [DSG] with the [EE] which was the point brought forward by the Rosendaal article".
In fact when one looks at that Rosendaal article no such interactions are described as affording an explanation for the phenomenon. Rosendaal made the point that it was questionable whether the safety gains achieved in reducing EE progressively from 100-150 µg EE down to 50 or less could be continued by reductions from 50 µg down to 30 µg or 20 µg. In no sense did the article address the Mercilon phenomenon as Walker was obliged to concede. I was left with the impression that he had been rather selective in the parts of his original article that he chose to bring forward for presentation in this case.
- In due course Walker accepted in his evidence that bias would be "a candidate explanation" for the Mercilon anomaly, which he corrected in re-examination to a "candidate indicator" whatever the difference between these phrases might be.
- Evidence for the existence of a Mercilon phenomenon is patchy but it is there. It is only unequivocal among the older age group. It is not explained, indeed it is contra-indicated, by the haematological evidence. At its lowest therefore it raises reasonable suspicions, as it did in the minds of Lewis and MacRae, which contributed to their decision to deploy the Cox Regression Analysis. When that analysis was used the anomaly disappeared.
- Duration of use.
When first articulated in TNS1 the theory of the "attrition of susceptible subjects" was put this way. It was said to be something which occurred:-
"...because those patients susceptible to side effects tend to drop out of the corresponding user group at an early stage or are switched by their doctor to another product. In contrast if a product is well tolerated prudent doctors and safety conscious patients tend to continue to use it. So that patients who will have been taking a product for a long time would be expected to be at lower risk than first time users of any brand"
In TNS2, after noting the recency of introduction phenomenon the authors said:-
"the underlying phenomenon here is "attrition of susceptibles" or what may be termed a healthy user effect".
They supposed that such a group developed over time and that on the introduction of a new product the healthy cohort would ignore it and the newly introduced drug would be preferentially used by new users or those who did not tolerate the predecessor drug well. Therefore, the argument runs, there is a discernible population effect with short duration of use of COCs associated with higher risk and the depletion of susceptible users from the ranks of longer duration users. This has been caused by a differential introduction into the market of products rather than any inherent difference in effect of the generations of product and the apparently elevated RR visible in the 1995 studies is due to this.
- The causal mechanism by which this may have occurred as a phenomenon is highly controversial. The arguments against it are best expressed by Timothy Farley in a 1999 paper adopted by Walker and by the Claimants. He put forward three possible mechanisms.
(1) A withdrawal of susceptible women from the pool of LNG users as a result of VTE events. He argued that the incidence of VTE is too rare for this to cause sufficient attrition to account for the difference. Shapiro having reconsidered the matter agreed with this criticism.
(2) More women at high VTE risk might have withdrawn from the LNG user pool compared with COC3 users. For that process to cause a twofold difference, he argued, those who withdrew must have been a large proportion of high risk users at a substantially higher VTE risk than those who continued. Shapiro agreed with this proposition but said "... we are talking about categories. Some have withdrawn."
(3) Thirdly, Farley dealt with the hypothesis that high risk women could switch from LNG to COC3s thus artificially enriching the latter group with more susceptible users. Against this he said that there had been no reports indicating a high relative risk in recent switchers. As to this Shapiro thought that it was common sense to say that high risk women would switch from LNG to COC3s if their doctors believed that the most recent product was the safest. He did not think that any of these factors operated independently but in combination. The Claimants say there is an insufficiency of evidence supporting a switching effect. Shapiro, it has to be said, was not impressed by Heinemann's "factor X" as a cause of switching (or for that matter, of anything else).
- Faced with these arguments Shapiro developed a second argument for a decline not in the ranks of susceptible women but in the susceptibility of women with duration of use. He was obliged to abandon that on what he called a reconsideration of the haematological evidence which as will be seen later is now to the effect that the thrombogenic effect of a COC is produced within one treatment cycle, and certainly by the end of the third, and that there is no continuing increase in thrombogenicity thereafter. He described this second view as "speculation, probably one I should not have made". This was characteristic of Shapiro as a witness who was the least afraid of all the witnesses who gave evidence before me to change his mind and recognise the difficulty inherent in many of the concepts in this complex case. In this complex area I regard that as a strength not a weakness.
- Shapiro's final position was therefore this, basing himself on a paper by Rosendaal which made a very late appearance in the case (10). The thrombosis threshold in a population of women will increase as that population ages due to what might be called intrinsic risks (genetic factors such as Leiden Factor V present in 5-10 % of women, for example). Secondly the pill acts as an extrinsic risk in an unknown part of that population so that some of them would have VTE events as a result of a combination of that extrinsic and their own intrinsic risk. Thirdly this whole phenomenon is superimposed upon a baseline of risk which rises with age and this might be what is underlying the recency of introduction phenomenon, seen as it is in the 25 to 44 year sub-group. None of this was advanced dogmatically by Shapiro who agreed readily that it was a "most complex" problem which required "years of thought". The Claimants object to this argument not only on the grounds that it has evolved over the course of Shapiro's consideration of this case but that it simply evokes an individualised thrombosis potential at any given moment of time without explaining why short duration of use against longer duration should confer in population terms a higher risk of VTE.
- The Defendants for their part point to the fact that whatever the uncertainties are it is quite clear that there are very significant differences between the two populations, that is LNG users and COC3 users, during the relevant period and there are different profiles of those users both as regards length and pattern of use.
- In the first place COC3 users had much higher proportions of both starters and re-starters and neither Walker nor McPherson challenged this notion. There is nothing surprising about this phenomenon since the COC3 products were newer and generally increasing their market share. Plainly the mean duration of use (both as to their current COC and as to their lifetime use) of COC3 users was less than that of COC2 users as apparent from the evidence in TNS 3 Table 3, for example, and Lewis figure 6.2 in his first report. Thirdly, again as would be expected with a newer product, there was a higher proportion of switchers among the COC3 population. These differences, argue the Defendants, are all associated with differences in risk whatever the type of pill used. But all witnesses agreed there was a starter effect on any pill, and re-starters should also experience such an effect. This will be most apparent in the first period of use, however that is defined, as a "spike" in the incidence of VTE but it does not disappear immediately thereafter. The evidence as to switchers is more controversial but McPherson accepted that Suissa 2 (11) suggested that they were at an increased risk (though some of these may have been re-starters properly so called). If therefore the COC3 population contains a higher proportion of higher risk women than the COC2 population in any of these respects then there will necessarily be produced an appearance of higher risk associated with COC3 use.
- Much of this was accepted on a theoretical level by Walker, although the quantum of, for example, the starter effect he rightly viewed as a separate point. McPherson too was prepared to accept that the so called attrition of susceptibles could be a plausible explanation for what was going on. His point was, however:-
"we need evidence to support it. I think the evidence could come from a really straightforward and simple analysis of the TNS study without having to resort to all sorts of Cox regression or splines or particular kinds of sub-set analysis as were done."
It is right to say that Lewis was asked whether he had attempted such types of analyses and he said he had, by performing several different types of stratification of the data. He was taken to task for not retaining or producing the results of this work, and I am invited to assume that his results were concealed as being unhelpful to the Defendants' case. I am conscious that Lewis has a direct commercial interest in obtaining work from these Defendants and to that extent producing results that are pleasing to them. All researchers in these fields are as Shapiro said under all sorts of pressures from industry funders, from their academic institutions and from publishers. I do not believe Lewis would go so far as to suppress unhelpful material. In TNS1 itself he had disclosed in plain terms that additional stratified analyses had been done. It has been open to the Claimants through their experts since February to run the type of further analysis McPherson was envisaging to refute this part of the Defendants' case.
- Pausing here, it seems therefore to me that two statements can be made, whatever the other controversies are. The first is that the two populations namely COC3 and LNG users are populations at different stages of their development and maturity and thus have different characteristics in certain significant respects. Secondly this whole area, as Shapiro was at pains to point out, is one where medical understanding of the position is incomplete and developing. In that position the absence of an understanding of a causal mechanism for the phenomenon, and I think there is such an absence in the present state of knowledge, need not operate to defeat attempts to investigate it if those attempts are properly carried out. There is after all nothing inherently improbable about the proposition that, when one is looking at the risks attached to the use of a long term drug of this kind, the question whether the past history of a person's use of it is relevant to that enquiry is at least worthy of investigation.
- At all events, and whatever the criticisms of Lewis' views expressed in TNS 2, what matters in my view is not so much why Lewis proceeded as he did as what he did. That study in which he and MacRae (who had until May 1995 been a member of the TNS Scientific Reference Board) subjected the dataset to yet further analysis, if it is good, is accepted as capable of answering this question in a conclusive manner. If it is not good that means that his suspicions were indeed ill-founded, and his reasons for embarking on the further study are not themselves such, in those circumstances, as to make me doubt the point estimates in the 1995 main studies. Therefore TNS 3 will have to be considered in some detail, since it involved a form of data analysis which is potentially capable, if the claims made for it are correct, of unlocking this whole problem even in the absence of a satisfactory explanation for the mechanism by which the alleged duration of use effect has influenced the RRs in this case, if indeed it has. Before I do so I should first consider another attempt to re-analyse the TNS data.
- Professor Suissa's splines.
Professor Samy Suissa in 1997 described an analysis he had carried out on a sub-set of the TNS data (12). He looked at 105 cases and 422 controls who were first time users of COC2s and COC3s or never users of OCs. This sub-set was a little under a quarter of the whole dataset. By applying logistic regression and quadratic spline modelling he found that for first time users the adjusted rate ratio of VTE as a function of the duration of OC use was:
"essentially identical for second and third generation pills relative to never users. This rate ratio increases to around ten in the first year of use and decreases to around two after two years of use remaining at this risk level thereafter for both second and third generations agents".
The use of quadratic spline models on epidemiological data was both novel and controversial. The spline is a "best fit curve", usually deployed in economics, as a complex method of fitting a smooth curve to data as they evolve over time, in which the curvature is allowed to change at different points of the analysis which are called knots. Therefore the conclusion which Suissa stated in his paper derives from the appearance of these curves, which for each of the two generations followed a more or less identical shape or course.
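By way of illustration only, the idea of a quadratic spline whose curvature may change at knots can be sketched as follows. This is a minimal sketch in the common truncated-power form; the knot positions and coefficients are hypothetical and are not taken from Suissa's paper, whose exact specification is not reproduced here.

```python
def quadratic_spline(t, coefs, knots):
    """Evaluate a quadratic spline at duration t (truncated-power form).

    coefs = (b0, b1, b2, c_1, ..., c_k). Each knot k_j contributes
    c_j * max(t - k_j, 0)**2, so the curvature may change at the knot
    while the curve and its slope remain continuous there.
    """
    b0, b1, b2, *knot_coefs = coefs
    value = b0 + b1 * t + b2 * t ** 2
    for c, k in zip(knot_coefs, knots):
        value += c * max(t - k, 0.0) ** 2
    return value


# Hypothetical knots (in years of use) and coefficients, for illustration only.
knots = [1.0, 2.0]
coefs = (0.0, 2.0, -1.0, 1.5, -0.5)
curve = [quadratic_spline(t, coefs, knots) for t in (0.0, 0.5, 1.0, 2.0, 3.0)]
```

The fitted curve in a real analysis would come from estimating the coefficients from the data; here they are simply assumed in order to show the shape of the calculation.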
- The following year this paper came in for heavy criticism by Farley of the WHO team (13). He argued that Suissa's model did not fit the data adequately and was inappropriate to be used for such a task and he sought to illustrate this by applying the same technique to his WHO data. At the heart of this criticism lay the question of whether it was right to "constrain" the curves at time zero or leave them unconstrained and allow them to fit the data as best they could on that basis. Suissa's response in essence was that Farley's comments were influenced by the fact that the WHO data differed from the TNS data in that there was a differential risk at time zero between the WHO first users and never users.
- Walker, broadly speaking, adopted Farley's criticisms of Suissa. He thought that the splines did not appear to capture the data, so far as he could see them from Suissa's tables which set out, on a stratified basis, the rate ratios for first time users of the two generations of pills. He particularly focused on the stratum whose duration of use was less than one year.
- In order to understand what had been done Walker, reasonably in my view, asked the Claimants' solicitors to administer formal questions of Suissa (who was featuring in the case as a witness of fact, though one whose status verged closely on that of expert witness). One of these questions asked him to identify the exact interval between the commencement of COC use and the VTE events in the different contraceptive user groups. It was clearly a valid question and crucial information in the light of Farley's point about constraining the starting point of the spline to make it begin at an RR of one. This information was a very long time in coming and when it did come it came in my judgement much too late. On Day 30 of the trial I rejected an application by the Defendants to introduce a second witness statement from Suissa in which, at last, he broke down the first time users by duration of use and showed at what point in the first twelve months of use the 11 COC2 and 16 COC3 cases occurred. For reasons given in that same ruling I also ruled that it would be wrong for Suissa to give any evidence in support of his paper, which was therefore left to speak, so far as it could, for itself.
- The net result of this is that I do not derive any assistance from this work by Suissa. Both sides of the argument are well delineated. Suissa's justification was that the baseline risk after adjusting for confounders in a never user should be the same as that in a new user on the day that she first starts taking OCs and therefore it is right to constrain the RR at time zero to a value of one. Only after the first pill is taken by the new user should the rate ratio be allowed to rise. But I see much force in the Farley/Walker argument that by forcing the spline to be continuous at zero duration of use the model cannot reflect the sharply increased risk of VTE in the first months of use and am impressed by Farley's demonstration in his letter of how, at least on the WHO data, a much more faithful representation of the data appears with his unconstrained model.
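The difference between a curve constrained to an RR of one at time zero and an unconstrained curve can be sketched as follows. This is a hedged illustration only: the simple log-quadratic form and every number in it are assumptions, chosen to show why a constrained curve can only approach an early elevated risk gradually, and they do not reproduce either Suissa's or Farley's model.

```python
import math


def rate_ratio(duration_years, jump, slope, curvature):
    """Hypothetical model: log RR = jump + slope*t + curvature*t**2.

    jump = 0 forces RR = 1 at zero duration (the constrained case);
    a non-zero jump lets the RR step up as soon as use begins.
    """
    log_rr = jump + slope * duration_years + curvature * duration_years ** 2
    return math.exp(log_rr)


one_month = 1.0 / 12.0

# Constrained: the curve must pass through RR = 1 at time zero.
constrained_start = rate_ratio(0.0, jump=0.0, slope=3.0, curvature=-1.0)
constrained_month = rate_ratio(one_month, jump=0.0, slope=3.0, curvature=-1.0)

# Unconstrained: an immediate step (here an assumed fourfold one) is allowed.
free_start = rate_ratio(0.0, jump=math.log(4.0), slope=3.0, curvature=-1.0)
free_month = rate_ratio(one_month, jump=math.log(4.0), slope=3.0, curvature=-1.0)
```

On these assumed figures the constrained curve is still close to one after a month of use, whereas the unconstrained curve is already well above it.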
- The new information which Suissa was prevented by my ruling from introducing may or may not have advanced this debate. Without it, it seems to me, it is impossible for me to say even as a matter of probability that this unusual and highly sophisticated sub-analysis of the data has successfully proved the point which its authors claim it has proved. On the evidence before me it is an unconvincing exercise. That being the case I leave it out of my consideration in this matter altogether.
SECTION E: TNS 3 AND THE COX REGRESSION ANALYSIS
- In a paper submitted for publication on 13th August 1998, and published the following year the TNS data were further analysed (14). This study had been funded by Organon. The TNS dataset had grown to comprise 502 cases and 1,864 controls.
- This was not so much a re-analysis of previously analysed data as a consideration of new material. Its main feature was that it included full or lifetime COC exposure history for over 90% of the subjects involved, based on the pill calendars whose form and content were specified at the outset of the TNS and which identified COC use month by month since menarche. Ordinarily cases and controls in such studies are captured within a study period of, say, 3 years and their exposures are assessed as at the time during which the event occurred. This gives no scope for considering any effect from prior exposures occurring before the study period (the problem sometimes called "left censoring"). These exposures were believed by the TNS team to be potentially significant in the light of the conclusions suggested by their theory of the attrition of susceptibles. So this information as to subjects' lifetime pill use was entered as an addition to the dataset and the resultant data were adjusted by a Cox regression model.
- Three results can be extracted from Table IV. After adjustment it was found that in relation to VTE the OR for COC3 as against COC2 was 0.79 (0.5-1.26), for DSG as against LNG 1.07 (0.59-1.96) and for GSD on the same comparison 0.58 (0.32-1.03). If reliable therefore, and in plain English, this study found no association between COC3s and any increased risk of VTE when compared with their predecessor products, if anything the reverse.
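As a simple check on how these three results read, the following sketch tests whether each bracketed interval spans a ratio of one. It assumes that the bracketed figures are the lower and upper confidence limits; the labels and the helper function are introduced here for illustration only.

```python
# Point estimates and intervals as quoted above from Table IV.
# Assumption: each bracketed pair is (lower limit, upper limit).
table_iv = {
    "COC3 vs COC2": (0.79, 0.50, 1.26),
    "DSG vs LNG": (1.07, 0.59, 1.96),
    "GSD vs LNG": (0.58, 0.32, 1.03),
}


def interval_includes_unity(lower, upper):
    """True if the interval is compatible with no difference in risk."""
    return lower <= 1.0 <= upper


includes_unity = {
    label: interval_includes_unity(lower, upper)
    for label, (_, lower, upper) in table_iv.items()
}
```

On that reading each of the three intervals includes one, which is consistent with the plain English summary that no increased risk was found.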
- This model was devised in the 1970s by the well known bio-statistician Sir David Cox. It is a survival analysis in which among 2 or more groups the time of each group to reach a point event, called a failure time, is measured and compared. It can be used for example to examine the reliable life of machine tool components in industry. Classically in a medical context it is used in a study, for example a drug therapy trial, where a group of subjects treated under two different therapy regimes are followed and their outcomes compared. It looks at the time taken for an event to occur, sometimes referred to as "survival time", and estimates a hazard rate as a function of time. It enables duration of exposure to the drug in question to be adjusted for in a way that traditional logistic regression cannot. It allows, for example, for those who enter the trial at later dates, or leave it early, or those who have concomitant drug therapies or whose therapy is discontinued, reduced or otherwise altered for varying periods to be included and used in the trial as a whole. The original Cox model was called a proportional hazards model; this assumed that the hazard rates of the groups being compared remain in constant proportion to one another over time. The later version, which MacRae used, was called the Cox regression analysis with time-dependent covariates, which I will from here on simply call "Cox". This model can compare hazard rates which fluctuate or vary over time in response to changing circumstances or conditions. In the machine tool case, therefore, it could take into account the effect on time to failure of the key component of the machine of, say, changes in operating temperature, levels of maintenance or the type of material being worked on, assuming such data were to hand.
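For reference, the two versions of the model can be written in the standard textbook notation below. The symbols are the conventional ones and are not taken from the TNS 3 paper: h_0(t) is an unspecified baseline hazard, x a vector of co-variates and beta the coefficients to be estimated.

```latex
% Proportional hazards model with fixed co-variates
\[ h(t \mid x) = h_0(t)\,\exp\left(\beta^{\top} x\right) \]

% Extension with time-dependent co-variates x(t)
\[ h(t \mid x(t)) = h_0(t)\,\exp\left(\beta^{\top} x(t)\right) \]

% Hazard ratio for a single binary co-variate (exposed against unexposed)
\[ \mathrm{HR} = \frac{h(t \mid x = 1)}{h(t \mid x = 0)} = e^{\beta} \]
```

In the first form the baseline hazard may itself vary with time; what the model holds fixed is the ratio between the hazards of two subjects whose co-variates do not change. In the second form a subject's co-variates, and hence her hazard relative to baseline, may change from one period to the next.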
- TNS 3, by using the pill calendar data for the first time, purported to establish whether there was any, and if so what, effect on the TNS 1 and 2 results when all prior periods of pill exposure and non-exposure (in the jargon the "time dependent co-variates") were taken into account. What it produces is not strictly an OR, as a case-control study would produce, but a hazard ratio, that is a comparison of the different hazard rates of different forms of treatment or types of drug. For current purposes the distinction between the two is not of great moment.
- A fierce debate has taken place both as to whether Cox was an appropriate tool to deploy in a case such as this, its use being admittedly without direct precedent in a case-control study such as this, as well as to whether the TNS team has applied it correctly. The issues are highly technical and will require an enquiry into the "black box" whereby analysis of data is done. But this is very important to this case, as Cox is the high water mark of the Defendants' case, showing if it is right that there is no enhanced relative risk at all as between the two generations of pill.
- The Pill Calendar data.
For all the cases and controls in the TNS study Lewis had available a month by month history of OC use from the age of 9 (the earliest date at which any study subject was prescribed an OC). If this was reliable and reasonably accurate it was something of a mine of information about the infinitely variable patterns of pill use and non use these women had followed. The calendars were described in the protocol for the TNS, which itself followed the WHO protocol closely, to this extent. Appendix I outlined the content of Form B which the protocol stated "will be used for the main interview of cases and controls". That content included:
"Contraceptive history in the 10 weeks before illness leading to current hospitalisation [or date of interview in the case of a community control] (the same questions also asked about lifetime use). This will include use of
. Name or type of contraceptive".
The Claimants argue that this is thin support for the argument that it was an advance intention of the TNS study to use this calendar in the way it was in fact used.
- Anita Assmann was a trained epidemiologist who worked with Heinemann and who had been involved in the WHO studies co-ordinating several centres in the former GDR. She was also project co-ordinator on the TNS. She had experience in training field study interviewers and that was part of her role in the TNS. In her unchallenged evidence she said that she always understood the pill calendar to be an important part of the study data and trained her interviewers to ensure it was as complete as possible, to go back over it and check it thoroughly with the subject. She trained them to work through it by reference to major life events in the interviewee's life, building round these to complete the picture. She trained the UK interviewers at the various centres in this way; typically these were nurses, ex-nurses or health workers. The form and sequence of the questions was laid down in a precise manner and she stressed the need to stick to the format and obtain answers that were as full as possible. Her evidence describes a thorough and professional approach to the task, which included certain quality control checks.
- Questions 23 (a)-(d) related to the type of pill the interviewee was using at the date of interview, how long she had used it for, and what her previous pill use was. The pill calendar itself took the form of a simple grid in which the first column of the first row is reserved for the entry of the relevant year, the second for the subject's age at the start of that year and then 12 columns dividing the table up into cells each of which represents a month of that year. There is a final column for the type of pill used and any comments. The previous year's entries would then begin on the next row. The technique was for the woman to be asked to recall such memorable dates as her menarche and the dates of her pregnancies. These would be entered in the appropriate cells, and the subject could then fill in the gaps accordingly. The interviewer would be equipped with colour photographs of all pill packs in circulation over the relevant years to jog the woman's memory as to which type of pill she was using at various dates.
- Two objections are taken to Lewis' use of this information. First it is said that it was never intended to be computerised, but was merely an aide-memoire to enable each woman to recall her past pill use. Though Thorogood said this was so it is a part of her evidence I feel unable to accept. The evidence of Assmann is against this. Furthermore, I cannot see the point of going to the lengths that were gone to with 2,300 women if all that it was intended to be was an aide-memoire to prompt the answers to question 23. That could have been achieved in a much less laborious manner. Secondly it is argued that due to the problems at the UK end of the TNS these data were not physically put into the database in the UK as should have happened but were sent to Germany where they were entered much later (probably 3 or 4 years later) by persons who will have been at a disadvantage in that they were not the original interviewers, may not have been able to make sense of any marginal notes or comments made, and were not native English speakers.
- I accept Lewis' evidence on all this, which fits with my feeling for the way the world works. He said in his report that contraceptive decisions such as the decision to take a particular OC will be "associated with decisive events in the life of a woman" and thus memorable to most if not all; the decision to start to become sexually active and consequently "go on the pill", as it would universally be described, is a key rite of passage in any woman's life, as are (usually) the acquisition of a new partner or a change of partners, an unplanned pregnancy scare due to failure of other methods of contraception, the decision to "try for" a baby, pregnancy and childbirth. All these are matters so closely and intimately connected to a woman's essential self, her very sense of being a woman, as to be likely in most cases, with the help of an experienced interviewer, to be capable of being reconstructed with a good degree of accuracy, albeit necessarily falling short of infallibility. Lewis ran a tight ship in Potsdam, in my judgement of him, and ensured that satisfactory data entry procedures were in place and followed. In over 90% of cases the calendar information was complete. That, of course, is not the same as saying that it was also 90% accurate. But overall the result was, as I find, that a body of good quality evidence as to total contraceptive history was accumulated by the use of this device, and was there for Lewis to use, as he did.
- As to the fact that the use of Cox was novel, Lewis and MacRae say this is because a case-control study will rarely gather lifetime data of the type which the pill calendar gave to the TNS team. They rely also on the fact that Sir David Cox himself, after the exercise had been completed, described the use of his model as interesting if unusual in the case of a retrospective study, but said he could see no theoretical objection to it.
- The attack on Cox by Walker.
Walker's criticisms are these:
(1) Its application has meant that women who are destined to become future cases are included in the comparison group or "risk set" for preceding cases. Because of the relatively low numbers of subjects in a case-control study the "future cases" as they might be termed will feature proportionately much more prominently in the study than they do in the general population. To the extent therefore that COC3 use is a predictor of future case status it will be over-represented in the comparator pool as against the general population and will serve to depress the relative risk.
(2) The TNS team made a specification error in the form of a faulty definition of co-variates by attaching the current exposure at the time of the VTE event (for a case) or index date (for a control) to all the earlier time periods for each study subject.
(3) In addition to this, in circumstances where, as he understood it, penetration of the COC market by COC3s was growing, and could be expected to continue to grow after the earlier case suffered her event, the result was that what was called the "current exposure variable" would become systematically enriched by increasing numbers of COC3 users among the controls in the risk set, or by the addition of what amounted to bogus exposed controls, which would have the effect of depressing the relative risk estimate. This last criticism, which was not central to Walker's thesis, does not survive the evidence of Dr. Hans Rekers which shows that the market shares of second and third generation pills were more or less stable from 1993-1995.
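The notion of a "risk set" on which the first of these criticisms turns can be sketched as follows. This is a simplified illustration with four hypothetical subjects and a hypothetical time scale; it is not the TNS team's code and says nothing about how their risk sets were in fact constructed.

```python
# Hypothetical subjects: (identifier, time of event or index date, is_case).
subjects = [
    ("A", 2, True),   # case whose event occurs at time 2
    ("B", 5, True),   # case whose event occurs later, at time 5
    ("C", 5, False),  # control with index date at time 5
    ("D", 7, False),  # control with index date at time 7
]


def risk_set(subjects, event_time):
    """Everyone still under observation at event_time.

    In a Cox analysis each case is compared with all subjects who have
    not yet had their event or index date when that case occurs.
    """
    return [sid for sid, time, _ in subjects if time >= event_time]


risk_set_for_a = risk_set(subjects, 2)
risk_set_for_b = risk_set(subjects, 5)
```

Here subject B, who is destined to become a case at time 5, sits in the comparison group for the earlier case A. That is the feature of which Walker complains at (1).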
- Apart therefore from objections in principle to the use of Cox in a study such as this Walker also criticises the way in which the TNS team applied it to the study. They included the future COC exposure of those who were either controls or "cases which had yet to happen" in the risk set for the purpose of considering the earlier occurring case. This became known, from a set of slides by which he illustrated the point, as a "Slide 26 error". He also believes, having examined their data, that in applying Cox to the TNS material Lewis and MacRae introduced a variable called "current exposure", meaning that type of pill exposure which the subject said at interview was her pill use at the date the case became a case, and treated it as if it were an invariant, something which had remained the same throughout the woman's COC taking history. To reach the adjusted OR of 0.8, says Walker, the TNS used this flawed definition of current exposure.
- Using the same data Walker calculated that 1.52 was the figure for "exposure at episode" or the contemporaneous exposure at the time of the event. When Walker omitted all variables relating to historical use, the figure changed very little. Therefore, he argued, it is not these variables which produced the lower figure, rather it must be something about the Cox analysis per se. He re-ran the calculation ignoring future exposure at interview and only taking account of current exposure at episode, and the figure went back up to 1.33. This he said is the right figure (subject only to the depressing effect of enrichment mentioned above) which emerges from a correct application of the Cox model to this problem. So the influence of duration of use is shown as being of no significance; he believed that the relative risk of COC3s as against COC2s, whatever it is, is "probably the same over different durations of use".
- In cross-examination he confirmed he had written one paper on the Cox model among the 204 papers he has published over the last 25 years or so (15). It was suggested that in it he was not critical of other authors who had earlier asserted that Cox could usefully be applied to case-control studies, albeit the thrust of his article was not directed at this point but a different one, the practical consequences of stratification for matched follow up studies. He denied that the article approved the type of use of Cox which he understood to have been deployed on the TNS material. He was proposing a use of Cox on cohort studies which, as he put it, made them "look very much like case-control studies". In so far as I can extract a principle of relevance to this case from this very dense and technical article I believe Walker was right. At all events I do not read it as expressing any views which embarrass him in his present stance on this issue. But he did concede that Cox was "a form of analysis which with proper manipulation [and] attention can be brought to bear on case-control studies".
- The real thrust of Mr Spencer QC's questioning was not to suggest that what Walker had understood the TNS to have done was defensible, but rather that he had simply misunderstood what they had done. I start this passage of the judgment by saying that I do not believe Walker was deeply familiar with the Cox model, nor (though this is less important) with the STATA software on which it ran, for all his high qualifications as an epidemiologist. His initial comments on the use of it in his main report (paragraphs 67-73) and the answers he gave to questions under it were not clearly expressed; indeed they were and still are more than a little difficult to understand. They turn out to have been in substance a re-presentation (in large part verbatim) of a note he had written in December 1999 on the instructions of the German regulatory authorities who had asked him to comment on this question. He did not maintain any of these arguments in the hearing before me.
- I do not believe this was because he had not by then been allowed access to the data. The conceptual objections he now raises are hard to detect in any of his writings on this subject until one reaches his third report dated 26th February 2002. Walker's readiness to discredit the use of Cox, at least from the end of 1999, is plain to see.
- There are certain crucial co-variates in the Cox model which tell it what to look for in order to carry out its task. When these are described in words, rather than being given their exact labels, mistakes and misunderstandings can occur. One clear example of this is in paragraph 35 of Walker's third report in which he describes how, given access to the model, he tested the TNS 3 results by disabling or "turning off" those variables which he thought, from what he had been told, adjusted for historical factors. He found that doing so made no difference to the result. The variables he disabled were called "curd 1, 2 and 3", which related to duration of use of the current OC type by generation, "d 1, 2 and 3" which related to duration of OC use prior to the current use, again by generation, and "gennsw 11, 12, 13, 22, 23, 32, 33 and 99", a series of variables designed to address switching OC use between different generations of OC and/or non-use. In his relevant report he described these codes collectively as "all the duration terms and all the exposure at period terms". In chief he corrected "exposure at period" to "switching". This was plainly a mistake due to his unfamiliarity with this model.
- It was put to him that his understanding of the variables was simply wrong; that the variable "cgennw" was current exposure at the time of event, and "ocgennw" was the variable representing time-dependent exposures prior to the exposure at event. When "cgennw" is enabled it prevents "ocgennw" from looking at exposure at the time of event, merely prior periods of exposure or non-exposure to OCs. The "curd" and "d" variables Walker took out of play were merely fixed co-variates, such as would be found in a logistic regression, not time-dependent; therefore in taking them out while leaving in "cgennw" and "ocgennw" he was merely instructing the Cox model to run again. He maintained his position but in effect something of an impasse was reached. Due to his other commitments Walker had to return to the USA before completing his evidence on this topic.
- Time had played its part in the problem. Walker's access to the TNS 3 dataset was delayed by a dispute over the terms on which he could examine it, it being intellectual property of some value, whose owner, Heinemann, was in Germany whereas Walker was in the United States; neither of these is an area in which this court's jurisdiction runs, and concern was expressed as to the security of the data. It was not until 20th February 2002 that he gained access to it in London, and he first had a telephone conversation with Lewis and MacRae, to be "walked through" the data before examining it. A note of this conversation was made, recording Walker's understanding of the variables in the programme. He understood "episode" as used by the TNS to mean a continuous period characterised by a single pill exposure status in a single person, the information coming from the relevant pill calendar. "Exposure at episode" meant OC exposure in the episode, and "current exposure" meant the OC status of the subject for the episode which terminates at the date of the event (for a case) or the corresponding date (for a control). He was told that the TNS team had used calendar data to define time, did not employ matching as per the original case-control sets, and that other analyses had used age as "the measure of time" and had yielded substantially similar results. This record of his understanding was confirmed in correspondence.
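The relationship between a month by month calendar and the "episodes" and "current exposure" described in that note can be sketched as follows. This is an illustration only, assuming a calendar held as a chronological list of monthly labels; the labels, the function name and the example calendar are hypothetical and do not reflect the TNS database layout.

```python
from itertools import groupby


def calendar_to_episodes(months):
    """Collapse per-month exposure labels into episodes.

    An episode is a continuous run of months with a single exposure
    status. Returns (label, start_month_index, length_in_months) tuples.
    """
    episodes = []
    start = 0
    for label, run in groupby(months):
        length = len(list(run))
        episodes.append((label, start, length))
        start += length
    return episodes


# Hypothetical calendar ending at the date of the event or index date.
calendar = ["none"] * 3 + ["COC2"] * 5 + ["none"] * 2 + ["COC3"] * 4
episodes = calendar_to_episodes(calendar)

# "Current exposure": the status of the episode ending at the event date.
current_exposure = episodes[-1][0]
```

On this sketch the exposure attached to each earlier episode is that episode's own label, while "current exposure" is a property of the final episode alone.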
- MacRae's response.
The use of Cox had been MacRae's idea, and it was reasonable for him to undertake the burden of seeking to justify it, he being a professional biostatistician. Lewis later deferred to MacRae, despite the fact that his own report carried over 10 pages of detailed treatment of the subject and I therefore had to consider whether he as lead author of TNS 3 was seeking to avoid questions he ought to have answered. He was a cautious and conservative witness and was not a statistician; I believe it was for such reasons that he took the course he took. It was also consistent with my trial philosophy, which was so far as possible to have one expert per topic where there were discrete issues such as this.
- Taking the same final dataset for the United Kingdom and Germany, MacRae subjected it step by step to three forms of analysis or regression, namely unconditional logistic regression, conditional logistic regression and Cox regression with time-dependent co-variates. The first two of these were of the type that any case-control study might carry out. The unconditional logistic regression using unmatched data yielded a crude OR for COC3 against LNG of 1.75 and, when adjusted for age, smoking, BMI, alcohol, duration of past and current use and switching, gave a figure of 1.83. The conditional matched regression yielded a crude figure of 1.82 and an adjusted figure of 1.72. He produced STATA logs which showed how the computer had done this and explained the various instructions or commands. So far as these calculations were concerned the model had been instructed by means of the indicator variable or command "type", which told it that what was being asked for was a case-control comparison, without consideration of the pill calendar data. The subsequent pages show progressively the introduction of adjustment for the various factors I have listed above.
- MacRae then ran Cox on the final dataset. That yielded an OR for COC3 against LNG of 0.85 which he then progressively adjusted, as he had done with his two logistic regression exercises, for the various factors (age, smoking etc) finishing up with an OR of 0.86. He produced the STATA logs for this exercise and took me through them in detail. The all-important instruction "stset" was in operation which told the programme to set the data to survival time and he then in place of "type" entered the instruction "stcox" which started the survival time model for Cox regression. The indicator variables "cgennw" and "ocgennw" were both activated. He explained the second of these as:
"the variable which will take on various values over time as the persons starts, stops, switches and so on different pills",
in other words it was in the jargon a "time-dependent covariate", indeed the only such variable.
- To illustrate what "ocgennw" might involve he produced a printout for one particular study subject (a control whose identity number was 3140055) showing her pill calendar history, that is to say the information that "ocgennw" was designed to pick up in her case. Her first period of observation started on the 16th April 1967 when she was 9 years old and ran till 31st March 1975 when she was 16 and for the whole of that period she was not on the pill. The next period started on 1st April 1975 and ran to 31st August 1977 when she was between the ages of 16 and 19 and appears to have been exposed to a first generation pill. So her history went on over 17 different exposures or non-exposures (she plainly had a more complicated OC history than the average woman in the study) until the 31st October 1993 when she was 35 and reached her index date. At all events with those instructions, and comparing COC3 and COC2, with both variables in operation the model showed a hazard ratio of 0.84 (later on the same exercise was done comparing COC3 and LNG and the figure was 0.85). What Cox was doing, he said, was comparing not cases and controls but different exposure groups, here users of different generations of pill and their respective hazard rates. Continuing with the model he then progressively added instructions to the model to adjust for various factors such as smoking, alcohol, age, BMI and duration of current and past use and the figures changed slightly. Interestingly when the "curd 1-2-3" and the "d 1-2-3" variables were added these (fixed) variables made little or no difference to the calculations. The other important point to note is that throughout these calculations the corresponding hazard ratio for COC1s remained consistently raised. MacRae pointed to this as refuting the proposition that use of the Cox model intrinsically biases hazard ratios towards the null. 
His view therefore was that what was being seen in these calculations was the way in which prior exposures, properly accounted for, affected measurement of the risk for women in the current exposure group.
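The episode structure described above can be pictured with a short sketch. It is illustrative only and makes several assumptions: the `Episode` record, the exposure labels and the third episode are hypothetical and are not taken from the TNS dataset or the STATA logs. Only the first two date ranges and the index date come from the printout described for this subject.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Episode:
    """One continuous period with a single pill exposure status."""
    start: date
    end: date
    exposure: str  # hypothetical labels: "none", "COC1", "COC2", "COC3"


calendar = [
    Episode(date(1967, 4, 16), date(1975, 3, 31), "none"),
    Episode(date(1975, 4, 1), date(1977, 8, 31), "COC1"),
    # Hypothetical placeholder standing in for the remaining episodes
    # of this subject's history, which the text does not set out.
    Episode(date(1977, 9, 1), date(1993, 10, 31), "COC2"),
]
index_date = date(1993, 10, 31)


def current_exposure(episodes, index_date):
    """Exposure in the episode that terminates at the index date."""
    for ep in episodes:
        if ep.end == index_date:
            return ep.exposure
    raise ValueError("no episode ends at the index date")


def prior_exposures(episodes, index_date):
    """Exposure statuses of all episodes ending before the index date."""
    return [ep.exposure for ep in episodes if ep.end < index_date]
```

On this picture "current exposure" is a single value per subject, whereas the prior exposures form a sequence that changes over time, which is the sense in which the second variable was described as time-dependent.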
- MacRae then turned to the printouts of the work Walker had done. He said that when Walker claimed on his sheet No. 3 (tab 4 page 22 of what became known as the Claimants' Black STATA File) to have removed "all terms relating to past history" from the original Cox analysis he had carried out, and found that the hazard ratios were essentially unchanged, the explanation was that he had failed to remove all terms relating to past history, had failed to understand that "ocgennw" dealt with prior exposure episodes and that the terms he removed had minimal effects on the model. This he attributed to a complete misunderstanding of how Cox worked on Walker's part.
- When on sheet No. 12 (4.31 of the Claimants' Black STATA File) Walker had included "ocgennw" but not "cgennw" and obtained a hazard ratio which was only slightly reduced as compared with the logistic regression figure, he had failed to understand that he had merely conflated current and past exposures by this step. MacRae rejected Walker's imputation of a Slide 26 error and said that when he and Lewis discussed their work with Sir David Cox in Oxford, Sir David had not commented on that or any model specification error made by them. Cox had given them the best part of a day of his time in Oxford, having had prior sight of the logs and some preliminary explanations from Lewis by e-mail. Sir David's overall view of what they had done was:-
".. an interesting application of a proportional hazards representation to a retrospective study ... This is an unusual analysis but I can see no theoretical objection whatever to it; in a sense it is like applying a lot of logistic regressions to such data [he referred to the Prentice paper] ... It seems your analysis provides one convincing explanation for what is going on".
Sir David had allowed that opinion to go before the MCA appeal hearing in November 1998. He had not specifically been asked (and should have been, in my view) whether it could be used in this case, and had declined an invitation to attend as a witness.
- Walker's rebuttal.
The next evidence given, before cross-examination of MacRae, was by Walker who returned to the witness box to be re-examined. He re-asserted his view that the function of the Cox model was to look at serial cross sectional data, the cross sections being identified at the time of case occurrence, and to examine exposures which are defined as of those moments. The "risk set" in a Cox analysis he thought consisted of all individuals at risk of becoming a case on the date that the case actually became a case. Survival time was used "for indexing the comparison" as he put it but was not itself what was being predicted. To make good his point he then proceeded from the witness box to produce four pages of detailed algebraic calculations based on a formula which he had found in Lewis' report as Figure 5.10. Lewis had cited this formula, an algebraic expression of the maximum likelihood function, to show the similarities and the differences between Cox and ordinary logistic regression.
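For orientation, the textbook form of the Cox model with time-dependent covariates, and of its partial likelihood, may be written as follows. This is the standard expression only; it is not put forward as the formula at Figure 5.10 of Lewis' report, and the symbols are assumptions introduced here for illustration.

```latex
h\bigl(t \mid x(t)\bigr) = h_0(t)\,\exp\{\beta^{\top} x(t)\},
\qquad
L(\beta) = \prod_{i \in D}
  \frac{\exp\{\beta^{\top} x_i(t_i)\}}
       {\sum_{j \in R(t_i)} \exp\{\beta^{\top} x_j(t_i)\}}
```

Here $D$ is the set of subjects who experience the event, $t_i$ is the event time of subject $i$, $x_j(t_i)$ is the covariate value of subject $j$ at that time, and $R(t_i)$ is the risk set, that is, those still under observation and at risk immediately before $t_i$. The disagreement recorded in this section concerns how the risk sets and the time-dependent covariate are to be understood, not the form of this expression.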
- At the outset of this exercise I warned him and Lord Brennan QC that my own familiarity with algebra lay in the past and at an elementary level. Quite undeterred Walker proceeded over 8 pages of transcript, aided by several pages of a flip-chart, to go through a series of calculations at some speed which purported to support what he had just said. I considered the matter overnight with the benefit of the transcript. I concluded I did not understand this evidence so I thought it right the next morning to say so in open court. I said that if this evidence was being relied on as a means of resolving this dispute I would need to consider the appointment of a judicial assessor under Section 70 of the Supreme Court Act 1981 to assist me. There was also a narrow technical objection that this material might not strictly have been re-examination, but that ought not to have been the basis for deciding its admissibility. This problem was left with Lord Brennan QC who later in the trial told me that the Claimants had decided not to persist with this evidence. This episode in the evidence was, I have to say, typical of Professor Walker as a witness. It was done with immense panache and every appearance of authority; it made no concessions to the tribunal and did not constitute, in my judgement, a serious attempt to assist the court. If it was relevant to this issue, the place to have set it out was in his supplementary report after he had seen Lewis and MacRae's evidence and well before trial, or at latest in his evidence in chief, and he did neither. It was plainly an afterthought and its introduction at this stage and in this way did little to enhance my view of the value of Walker as a witness in this case.
- More helpfully, Walker sought to demonstrate by an exercise on the data carried out that same morning, data to which he had had access since February, that his understanding of the function of the key variables was correct. If he was right he said he should be able to destroy all the information in "ocgennw" prior to the beginning of the study and that should have no effect on the analysis. This was a sensible approach to the problem. It will, I fear, be necessary to refer to this and other logs in this section of the judgment. I will do so by means of a tab and page number and these will be references to what became known as the Claimants' Black STATA File.
- The exercise Walker did was to create a new variable which was "no past history" and to equate that to "ocgennw" giving the latter a value which was plainly incorrect. He gave it a value of zero for all episodes of pill use during or prior to December 1992. Thus, he claimed, he was destroying all past history, the exercise he had previously done as he thought by removing the "curd" and "d" variables. On page 5.42 the hazard ratio appeared as .84, identical to that which he had obtained before making the substitution. Thus he said he had demonstrated that the lowered OR was not the result of control for any kinds of details of past history which preceded January 1993.
- MacRae's reply.
MacRae maintained his view as to what were the "risk sets" or exposure groups, namely the individuals in the different strata or sub-sets of the co-variates specified in the model, whether fixed (e.g. BMI) or time-dependent (e.g. prior OC use). He said that Walker's exercise, which he had evidently carried out very shortly before going into the witness box on Day 14, was flawed. He believed that Walker had failed to disable "ocgennw" at least in its most important part, that is to say more recent exposures. The Defendants submit that it is plain that Walker's exercise left in account the most recent pill history for study subjects, namely that pertaining to the years 1993, 1994 and 1995. It was therefore, they argue, not surprising that the 0.84 figure Walker had derived was not dissimilar to the TNS Cox figure of 0.79. MacRae's evidence was that the matter was best tested by disabling "ocgennw" completely. He said he had done so on the previous evening and the resultant hazard ratio went up to "nearly 1.2". Though he offered to produce those logs he was not taken up on that during his evidence.
- His evidence in turn was interrupted and could not be completed until Day 21. He corrected a mistake he had made due to a misunderstanding of what Walker had done in his exercise. He had failed to appreciate that Walker was using months from September 1957, that is a calendar time axis, on which to base his disabling of "ocgennw". That apart he stood by his evidence. In answer to a question from me he said that if one got rid of "ocgennw" completely unadjusted hazard ratios of 1.3 or 1.4 emerged.
- MacRae finished his re-examination shortly before noon on the 5th April 2002. Later that day, as had been suggested to him, he sent to Mr. Spencer QC a number of further STATA logs. Mr. Spencer in turn properly disclosed these to the Claimants. They included logs for what MacRae had done the evening before his cross-examination on the 25th March 2002, those appearing at page 7.50. He first replicated Walker's methodology, with the same results. He then introduced a new variable which amounted to an instruction to the model to assign a nil value to all periods of prior pill exposure, in other words a complete obliteration of the "ocgennw" variable. That produced a hazard ratio of 1.167, the figure he had referred to as "nearly 1.2" in evidence.
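The difference between the two exercises can be sketched in a few lines. This is a toy illustration under stated assumptions: the episode list, the numeric exposure codes and both function names are hypothetical, and nothing here reproduces the STATA commands actually used by either expert.

```python
from datetime import date

# Hypothetical episodes for one subject: (start, end, exposure code).
# Code 0 means no OC use; the last episode ends at the index date.
episodes = [
    (date(1985, 1, 1), date(1990, 6, 30), 2),
    (date(1990, 7, 1), date(1992, 12, 31), 0),
    (date(1993, 1, 1), date(1994, 5, 31), 3),
    (date(1994, 6, 1), date(1995, 3, 31), 3),
]
CUTOFF = date(1992, 12, 31)


def zero_until_cutoff(eps, cutoff):
    """Partial removal: zero the exposure of episodes ending on or before the cutoff."""
    return [(s, e, 0 if e <= cutoff else x) for s, e, x in eps]


def zero_all_prior(eps):
    """Complete removal: zero every episode except the final (index) episode."""
    return [(s, e, 0) for s, e, _ in eps[:-1]] + [eps[-1]]


partial = [x for _, _, x in zero_until_cutoff(episodes, CUTOFF)]
complete = [x for _, _, x in zero_all_prior(episodes)]
```

In the toy data the partial removal still leaves the 1993 to 1994 episode carrying its exposure code, whereas the complete removal leaves only the index episode, which is the distinction the Defendants draw between the two approaches.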
- A further series of logs in Tab 8 related to work he had done on the 27th March (that is between cross-examination and re-examination). He altered the time axis from calendar time to age, ran the model omitting "ocgennw" and obtained an HR of 1.73. He then included "ocgennw", repeated the exercise and obtained an HR of .77. The corresponding figure when he had arranged time on the calendar time axis had been .84. This difference, say the Defendants, is not surprising and does not undermine his central point. Then over pages 59 to 62 he conducted an exercise whereby progressively more and more of the past history was included, an exercise he had described in re-examination. That showed that as this process continued the hazard ratio moved from 2.2 to 1.57 to .81 as less and less of the past history was excluded.
- Finally at Tab 9 were a series of printouts generated on the afternoon of the 5th April, after the end of his re-examination. That evening he had sent these to Mr. Spencer who had tried to contact him without success. Either that night or the following morning Professor MacRae died suddenly and unexpectedly at the age of 59. In this final analysis he reset time to the calendar axis and ran Cox without and with "ocgennw" obtaining the expected results, and then appeared again to have replicated Walker's exercise.
- So on the basis of this evidence the Defendants submit that, while Walker was right to use calendar time for his calculations (which is what TNS3 had said had been used), if on the calendar time basis he obliterated only part of the history then only a partial picture could be expected to result. Properly to see the effect of differential removal of past history the exercise needs to be carried out in the context of an age axis such as MacRae used. That the numeric values for the key variables differ depending upon whether the time axis is calendar time or age, a criticism made by the Claimants, is not something which undermines the TNS team's position. In an e-mail to Sir David Cox prior to their meeting Lewis had pointed out that this was the case.
- Walker's separate algebraic attack.
Separate from the passage in his evidence I have referred to above Walker in the Appendix to his third report calculated algebraically that it would have been necessary for COC2 users to have used OCs for about 6 ½ years longer than typical periods of use of COC3s to produce the effect shown by Cox, which was both implausible and unsupported by any observation reported. He had based this calculation on Lewis' Appendix XIV from which he took the apparent effect of the "curd" variable of 0.992. The Defendants answer this by saying that this figure comes from a calculation in which "ocgennw" has already done its work and achieved the main adjustment for past history. Therefore the premise on which this argument is based, a supposed decline in risk of 0.992 per month or 9.2% per annum is a false one. I believe the Defendants' answer to this point to be right.
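The conversion from the monthly to the annual figure referred to above is a matter of simple compounding, which can be checked as follows. The variable names are chosen here for illustration; the sketch verifies only the arithmetic and says nothing about whether the premise is sound.

```python
# A monthly factor of 0.992 compounded over twelve months.
monthly_factor = 0.992
annual_factor = monthly_factor ** 12

# Percentage decline over a year implied by that monthly factor.
annual_decline_pct = (1 - annual_factor) * 100
```

Rounded to one decimal place the annual decline is 9.2%, matching the figure quoted.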
- Conclusion on Cox.
In my judgement the key calculation is that culminating at 7.54 of the Claimants' Black STATA File in which MacRae unequivocally removed any form of adjustment for past history and obtained a Hazard Ratio which increased in the manner he would have expected it to.
- Walker was in my judgement not fully conversant, despite his high intelligence and ability, with what the Cox model was doing. His first attempts to use it showed this. Unlike MacRae he was not a career statistician. His initial attack was conceptual entirely. His first attempt to remove all terms relating to past history showed a misunderstanding of the variables. His exercise on the 25th March was not an effective support for the point he was making. As an expert Walker had a formidable and high intellect and held positions of considerable eminence in this field. Overall I gained the impression that he had not quite devoted the time and original thought to this case that it perhaps warranted. There were surprising inaccuracies, many but not all of them minor, in his written work which he called "inattention to detail". His re-use of old material had contributed to this. His speed of thought led him into trouble; for example his re-examination material gave every impression of a hasty conception.
- I have to decide whether, on balance of probability, the TNS team's use of Cox was appropriate in these circumstances and if so whether it was vitiated by any specification error on their part. It is common ground that if Walker's attacks on their use of it fail then, as Walker himself accepted, this study gave the best estimate for the TNS, itself the most impressive dataset in all the studies. That the causal mechanism underlying the duration of use effect is controversial and unclear does not seem to me to bar the conclusion that somehow the duration and pattern of past use has a significant confounding influence on the measurement of this particular risk.
- My conclusion is that Cox was an appropriate model to apply to this dataset, was applied to it correctly, that it did (uniquely among the studies) make effective adjustment for the effect of lifetime duration of COC use, and therefore as a matter of probability there is no true relative risk of VTE attaching to COC3s as against COC2s with LNG. As the Claimants say in their closing submissions if the Cox analysis was correct that is the end of their case. I believe it is, and for the following main reasons:-
(1) Walker was not familiar with this model. He did not, for example, really know what the instruction "stset", fundamental to the survival time analysis, meant. He misunderstood the respective roles and inter-relationship of "cgennw" and "ocgennw". He thought that the "curd" and "d" variables controlled all past history. To a great extent this was compounded by the late date at which he was able to get access to the data, through no fault of his I should stress.
(2) His first report attacked Cox in a superficial and misdirected way. He did not say, as he should have, that he needed to see the data before commenting. It contained no hint of what were later to become his views. His re-examination evidence was in both its main areas a new line of attack, both evidently having occurred to him shortly before his return to the witness box. His difficulties were understandable, but it was always open to him to decline the role of Claimants' protagonist in this complex area of the case.
(3) MacRae in his reports and oral evidence gave a clear and confident account of the model's operation, its use having been his idea. He was an experienced biostatistician, whose generic evidence on statistics was barely challenged. He had used Cox many times before, though never in such a case as this. He cited, in my view correctly, and as Walker had not, the Prentice & Breslow paper (16) as authority for the theoretical appropriateness of its use in a case-control study. He impressed me as having a good command of the subject.
(4) MacRae's calculation culminating at page 7.54 when he removed "ocgennw" entirely seems a highly persuasive demonstration that the model worked in the way he said it did and that he had not introduced any specification error into its operation.
(5) The unique nature of the exercise meant it had to be approached with circumspection, and I confess to some scepticism when I first read about it. But I accept that the lengthy longitudinal data of good quality made this study entirely appropriate for the deployment of this model. No equivalent allowance for past use and its effects could be achieved by stratification or other methods of regression, since the complexities of 2,300 odd detailed histories with 17,500 odd discrete episodes of use would have been such as to make such a course impossible. If there had been no effect from duration of use I am satisfied no violence would have been done to the original TNS figures by the use of Cox.
(6) Cross-sectional risk-sets as envisaged by Walker's Slide 26 error would have been entirely inconsistent with the essential concept underlying Cox (in Chapter 1 of his 1984 book) of survival time to failure, namely a longitudinal consideration of the data.
(7) If enrichment was a problem, as it would indeed have been if Walker was right, the whole virtue of cross-sectional studies being that they enrich the dataset with cases as compared with follow up studies, then its tendency to bias the results to the null should be apparent in COC1s, and there was no such effect either in MacRae's or Walker's further work on Cox.
(8) Sir David Cox would I am sure have paid attention to the use of his model in such a controversial area and is likely to have noticed if its use was inappropriate in any way; I must not place too high a value on this as he did not give evidence before me, but I cannot close my eyes to his views.
(9) As to the problem of "time", since the "risk sets" or exposure groups were followed to failure on a survival time basis MacRae was right to say that whether the time axis was calendar or age was a "distinction without a difference" since the function of the axis was to measure the risk or time to failure, not to organise the "risk sets".
(10) Past use of COC1s appeared from the calculations to be associated with a reduced risk; this would be expected where women who had survived a period of COC1 use changed to less thrombogenic pills.
- This finding means, as the Claimants' counsel accept in their final submissions, that the Claimants' case does not survive the first issue and therefore fails. However, in view of the very substantial evidence I have heard on the other parts of the first issue, and against the possibility that a higher court in future might take a different view from me on Cox, I think it sensible to go on to reach conclusions on the other points outstanding, despite my finding on Cox.
SECTION F: THE JICK V FARMER DEBATE
- One of the most difficult and contentious areas of this case has been the profound disagreement between Jick and Professor Richard Farmer, each of whom has carried out studies based on the same data source with markedly differing results. It will be necessary therefore to start with a consideration of the data source they both used.
- The UK General Practice Research Database (GPRD).
In about 1988 a commercial company named Value Added Medical Products Limited (VAMP) which had supplied computer systems to general practitioners for some years set up a research databank based on those systems. The idea was that those doctors participating would receive financial incentives to do so. They in turn agreed to follow a protocol for the recording of clinical data relating to their patients and to transfer those data suitably anonymised to VAMP on a periodic basis. VAMP would then commercially exploit this database by licensing research bodies to use it.
- By the beginning of 1993 some 521 practices in the UK had agreed to participate in this scheme and between them they had some 3.4 million patients. The participating doctors agreed to keep the data in a particular form and to enter all significant morbidity events in the records. Every 6 weeks or so the practice would send to VAMP a disk or disks with the new data which it had accumulated. VAMP assessed the quality of those data and stored them, and the database itself was updated annually. According to a document produced in 1994, data from each practice were examined monthly for a quality assessment and practices whose data persistently failed to reach research standards did not have their data entered on to the database. In this way a very large body of evidence about a significant proportion of the UK population was accumulated. This constituted a valuable source of data for would-be researchers into health issues.
- It is necessary however to continue with the history of this database. Despite its market penetration the GPRD was not commercially successful. In November 1993 the business of VAMP was acquired by Reuters, who were not interested in the research database and intended merely to shut it down. Dr. Alan Dean, whose brainchild the GPRD was, had a very short time to save the database. He did so by persuading the UK Government to take it over on a non-profit basis. The idea was that the payment for licences would cover the cost of running the scheme. Although the Department of Health has owned the database since November 1993 it has been managed on its behalf initially by the Office of Population Censuses and Surveys, later the Office for National Statistics (ONS), and from April 1999 by the Medicines Control Agency (MCA). Since its acquisition by the Department of Health a non-profit company named EPIC, set up by Dean, has been licensed to supply data for research from the GPRD, as has the Boston Collaborative Drug Surveillance Programme (BCDSP) run by Jick in Boston. The problems in the autumn of 1993 were acute and the threat to the continued existence of the GPRD was a real one but was eventually averted in the way set out above. By January 1994 525 practices were participating with a total of fractionally under 3.5 million patients.
- The advantages of an automated database for research of this nature are plain. Provided the data recording is of sufficient quality (and this is a most important proviso) it affords the researcher a very large source of information which no field study can attempt to emulate. Studies based on it can also be produced much more speedily than conventional field studies as will be seen when Jick's two principal studies are considered.
- The disadvantages stem from the fact that the data concerned are not assembled by the investigator; he uses data originally collected by others for a different purpose, namely the running of primary care medical practices. A study based on the database assumes that when a drug has been prescribed the patient had it dispensed and took it. Thorogood described this assumption as "one of the big weaknesses of the GPRD". Further, all database studies suffer from the problem of "left censoring". The GP practice will not record any drug history for the patient before she joins that practice. Such studies are therefore not able to identify such features as first time use or switching between COCs. They are dependent on the GP entering the prescription correctly, and experience of more old fashioned forms of records kept by GPs suggests that not infrequently problems occur.
- One of the most intractable problems in this area of the case has been the difference between Jick and Farmer in their approach to this source of data. Farmer used all practices which were considered up to standard by the GPRD itself following its routine quality assessment checks, which I am satisfied it carried out. Jick had carried out 2 separate studies in 1991 and 1992 (not in relation to OCs) in which the quality of the database was checked and generally found to be good (17)(18). Jick used much more restricted sub-sets of the practices considered by the GPRD to be up to standard and the basis upon which he selected these sub-sets will have to be considered below. Jick, a strong believer in automated databases (he has done almost all of his recent work on them) sounded a timely warning; he said that if the information is complete and of high quality they represent a "monumental revolution in the study of drug effects". On the other hand if the information is misused or people use databases of low quality then the possibility exists of the production of work which is invalid.
- The methods and findings of the relevant studies.
Jick, prior to 1995, had had no involvement in any scientific consideration of COC3s, indeed he had never heard of the names of the products concerned, although his group had done a number of studies in the 1970s and 80s on VTE and COCs generally. He approached the matter entirely afresh, as I accept. In 1995 he carried out 2 studies. The first which has been called his mortality study was in response to a message from the ONS about young women who had died while taking a COC containing GSD. The findings of that study, while interesting, do not bear on the issues in this case. The second study which was to be a morbidity study, and which I will call Jick 1995, came in response to an approach at about the end of June or possibly the beginning of July when the so-called unexpected findings of the WHO were rising to the surface. Someone "fairly high up in the MCA" as he put it and another person who worked for Organon had got wind of this pending announcement and asked him to use the GPRD to do a study on the topic. Both studies were eventually published in one paper in the same issue of the Lancet on 16th December 1995 (19) as that which contained the two WHO studies and one from the "Leiden Group" in Holland, with which I will deal later.
- Jick 1995, unlike the mortality study (which had drawn data from 470 practices), derived its data from 370 practices from which 80 cases of non-fatal VTE were drawn from a cohort of 238,000 women. Incidence rates of VTE were calculated by means of a cohort analysis. There was also a nested case-control analysis carried out which resulted in RR estimates for DSG and GSD respectively as compared with LNG of 2.2 (1.1-4.4) and 2.1 (1.0-4.4).
- The methods were described in the paper. The study looked at women under 40 who had received one or more prescriptions for OCs with less than 35µg of EE with either LNG, DSG or GSD after 1st January 1991. Women with any history of VTE/AMI and other specified conditions were excluded. Questionnaires on cases were sent to the GP and requests were made for a copy of any referral or hospital discharge letter. The questionnaires sought histories of contraceptive use, hospital admissions and anticoagulant therapy. Discharge letters were reviewed by 3 independent specialists who were blind to OC exposure and the cases were divided into "confirmed" and "possible" categories. A confirmed case was a woman who had presented signs and symptoms of VTE with a clinical diagnosis of idiopathic VTE who had been admitted to hospital and anticoagulated and whose discharge letter stated that certain specified tests supported the diagnosis. A possible case was a woman with the same definition but for whom the diagnostic tests gave equivocal results or were absent from the information obtained.
- In the case-control analysis for each case 4 controls with the same exclusion criteria were identified on a random basis and matched to cases by general practice and by age within 2 years (i.e. in 5 year bands). In other words a case born in 1985 could be matched to a control born in years 1983-4-5-6-7. In 15% of the cases it was necessary to expand the age criterion to "within 5 years of the age of the case". Smoking history was obtained. Three categories of BMI, an index of obesity, were laid down and in 31% of cases BMI data was unavailable. Conditional logistic regression was used to analyse the results and adjust for smoking and BMI.
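The matching rule just described can be sketched as follows. This is a minimal illustration under assumptions: the record layout, the function names, the fixed random seed and the rule for when to widen from 2 to 5 years (here, whenever fewer than four candidates fall in the narrower band) are hypothetical and are not taken from the published paper.

```python
import random


def eligible(case, candidates, tolerance):
    """Candidates from the same practice born within `tolerance` years of the case."""
    return [
        c for c in candidates
        if c["practice"] == case["practice"]
        and abs(c["birth_year"] - case["birth_year"]) <= tolerance
    ]


def pick_controls(case, candidates, n=4, seed=0):
    """Pick up to n controls at random, widening from 2 to 5 years if needed."""
    pool = eligible(case, candidates, 2)
    if len(pool) < n:
        pool = eligible(case, candidates, 5)
    return random.Random(seed).sample(pool, min(n, len(pool)))


# Hypothetical data: one case and candidate controls born 1980 to 1990.
case = {"practice": "P1", "birth_year": 1985}
candidates = [{"practice": "P1", "birth_year": y} for y in range(1980, 1991)]
candidates.append({"practice": "P2", "birth_year": 1985})

band = sorted(c["birth_year"] for c in eligible(case, candidates, 2))
controls = pick_controls(case, candidates)
```

With a tolerance of 2 years the eligible birth years for a case born in 1985 are 1983 to 1987 inclusive, which is the 5 year band referred to in the text.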
- Based therefore on 75 cases (excluding 5 who were past users) the case-control study reached the RR estimate set out above. The authors concluded:-
"Age, practice and calendar time were closely controlled by matching. While both smoking and BMI were independently associated with an increased risk of VTE, controlling for them only slightly changed the RR estimates ... confounding by age, practice, calendar time, smoking and BMI is thus unlikely to explain the associations found. The inclusion of only otherwise healthy women minimised the possibility of selection bias for OC prescribing. The high quality and completeness of the recorded clinical information had been previously demonstrated, and there is little reason to suspect that the results were due to a bias in recording either exposure or outcome. Nevertheless an observational study such as this one cannot rule out additional biases particularly selection bias, which may provide a noncausal explanation for the observed association".
This study had started sometime in July and was completed and sent prior to publication to the MCA some weeks before the Dear Doctor Letter of 18th October, that is to say within about 2 months or so of its start. It was the third unpublished study referred to in the Dear Doctor Letter. The final conclusions of the authors are interesting. They said:-
"This study and the WHO one provide evidence that the risk of non-fatal VTE among recipients of the new generation OCs ... may be about twice that for older OCs .... the new generation preparations may lower the risk of arterial thrombosis. In view of the modest increased risk for non-fatal VTE noted for new generation OC's in the available studies, and the absence of a substantial difference in risk for cardiovascular deaths in the current study, it may be premature to conclude that third generation OC's compared with older OC's confer an increased risk for cardiovascular illnesses as a group".
- In 2000 Farmer published the results of his first studies based on the GPRD(20). He had experienced delay in publication caused by the need to obtain permission from the scientific and ethical advisory group of the MCA who operated the database. His paper was received by the Journal in June 1999.
- The study used all 618 practices deemed by the Management Committee of the GPRD to be providing data which was up to standard for research purposes and it looked at events and exposures occurring between January 1992 and June 1997. In principle the study was based on the methodology of Jick 1995 but there were differences which will be summarised later. Cases for the purposes of this study were all women who had a diagnosis of DVT, pulmonary embolism or VTE, a record of a prescription of a COC and evidence of treatment with an anticoagulant. Admission to hospital was not adopted as a criterion. Women aged between 15 and 49 were studied. There were 6 exclusionary disease conditions. Women were also excluded if there were less than 6 months of research standard data available in the record prior to the event. The study included fatal cases.
- In the nested case-control study 4 controls subject to the same exclusionary criteria were randomly matched to each case, the matching being by practice and year of birth. They were women exposed to a COC on the day of the event of the case. A second study used a group of controls again matched by practice but this time matched within 5 year age bands (meaning that a woman who was aged between 15 and 19 was matched to other women in that 5 year band). By these means 296 cases were identified who met the inclusionary criteria and were not excluded, including 9 fatal cases. All cases were successfully matched to controls save that 1 case was "orphaned" in the 5 year banded study and 10 further cases in the year of birth matched study. 10 different types of COC were considered with monophasic LNG used as the reference product. When year of birth controls were studied 139 cases were exposed to DSG or GSD and when 5 year banded controls were used the figure was 146.
- The nested case-control study using year of birth controls showed no significant difference in risk between the major COC formulations and slight increases in ORs when 5 year banded controls were used. BMI over 35, current smoking and asthma were each shown to have a significant association with idiopathic VTE as was general ill health, based on the proxy evidence of 3 or more prescriptions.
- So far, therefore, the differences between Jick 95 and Farmer 2000 in terms of methodology were these:
a) The periods covered by each study were different.
b) Jick based his figures on 75 cases (52 exposed to COC3s) drawn from 370 practices; Farmer used 296 cases (139/146 exposed to COC3s for year of birth/5 year controls) drawn from 618 practices.
c) Jick restricted his study to users of DSG, GSD or LNG. Farmer included women using any COC on the relevant date.
d) Jick excluded fatalities; Farmer included them.
e) Farmer considered women from 40 to 49; Jick excluded them.
f) Jick obtained case exposure data from hospital records and control exposure data from the database. Farmer obtained these data from the same source namely the database.
g) Farmer had 5 categories of BMI the highest being 35 plus; Jick had 3 the highest being 25.
h) Farmer adjusted for asthma; and Jick did not.
i) Farmer matched controls to cases by year of birth as well as 5 year banding; Jick used 5 year banding.
- There appeared on the face of Jick's report to be a further important difference namely that he had required evidence of hospital admission for cases but in his evidence he explained that that was not the case. A final difference was that Farmer operated under the supervision of an independent advisory board who had access to the data and approved his methodology. Jick was not impressed by its membership, though McPherson agreed they were distinguished. He himself had no such board. Farmer said his board met four times, actively examined his methods and data and were "quite powerful, sticky" in their requirements. Farmer's conclusion was:-
"in our study the ORs for individual products did not differ significantly when cases were compared with the exact year of birth match controls. On the basis of the current analysis we do not believe there are differences in risk between COCs"
- Farmer 2000 was followed by a second paper Farmer 2000 "Pill Scare" (see paragraphs 255-7 below) which was published in late August of that year. There is no doubt at all in my judgement that its appearance incensed Jick. In examination in chief he said that it was his judgement that the paper did not contribute anything useful to the issue but he was alarmed by the publicity surrounding it. The credence in terms of the evidence that it deserved, as he put it, was "zero". In cross-examination he described the surrounding publicity as "frankly obscene". He said "we felt it was our duty to the public health to actually give the relevant facts so that people could decide on their own and doctors could decide whether there was an issue or problem here and whether it was all hocus pocus and spurious". He described the reception accorded to the "Pill Scare" paper as "hullabaloo". He therefore embarked on a study which we came to describe as Jick 2000 as a matter of urgency and at great speed (21). In order to achieve that speed he compromised the design of his earlier study, as he said because there was a serious public health risk. In my judgement that risk, which however one quantifies it was a very small one indeed, was over and dealt with some years before, and one report from Farmer was never going to reactivate it. Having seen Jick I am entirely satisfied that he wanted to produce evidence to refute Farmer's views at the first available opportunity and that he set out to do this new study with that aim in his mind, prepared if necessary to compromise what he had previously regarded as full scientific rigour to achieve his purpose.
- The methodology chosen by him was with a few exceptions, one of them important, the same as before. Cohort and case-control analyses based on the GPRD were again deployed, this time considering data from a smaller subset of 288 practices over 2 study periods falling either side of the pill scare namely January 1993 to October 1995 and January 1996 to December 1999. 106 cases were identified based on the same criteria as used in his 1995 study, save that there was now introduced the additional requirement that they should not have less than 1 year's information recorded on the practice computer. Other significant differences in methodology were that the previous method of case verification, namely obtaining medical case histories, full referrals and records from admission to hospital was abandoned because "we had insufficient time". 6 controls rather than 4 as previously were matched to the cases by year of age and practice. The case-control analysis yielded ORs for DSG and GSD rolled up as a single entity called Third Generation OCs of 2.2 (1.1-4.3) for the pre-pill scare period and 2.8 (1.1-7.3) for the second period after the scare. The overall figure was 2.3 (1.3-3.9). The conclusion of the study was that its findings were consistent with those previously reported which had suggested an elevated risk for COC3s around twice that associated with LNG. It is to be noted that this study took about 6 weeks from start to finish.
- At this stage therefore it may be helpful to set out in tabular form the results of the relevant findings of these 3 main GPRD studies.
| STUDY (and controls) | PRODUCT | COC3 CASES (n) | RR (Adjusted) |
| --- | --- | --- | --- |
| Jick 1995 (5 year) | DSG | 30 | 2.2 (1.1-4.4) |
| | GSD | 22 | 2.1 (1.0-4.4) |
| Farmer 2000 (year of birth) | DSG + EE 30 | 62 | 1.0 (0.6-1.6) |
| | GSD | 60 | 1.3 (0.8-2.1) |
| (5 year) | DSG + EE 30 | 65 | 1.4 (0.9-2.1) |
| | GSD | 63 | 1.5 (0.9-2.3) |
| (See Jick V p.3) | ALL COC3s | 140 | 1.2 (0.9-1.7) |
| Jick 2000 (year of age) | | | |
| Pre-scare period | COC3 | 54 | 2.2 (1.1-4.3) |
| Post-scare period | COC3 | 10 | 2.8 (1.1-7.3) |
| Both periods | COC3 | 64 | 2.3 (1.3-3.9) |
| Jick 1995 + 2000 (mixed) | COC3 | 104 | 2.2 (1.5-3.3) |
| Cases in Common | | | |
| With Jick controls | COC3 | 68 | 1.8 (1.1-2.9) |
| With Farmer controls | COC3 | 68 | 1.3 (0.7-2.2) |
Notes: 1. All comparisons are with LNG.
2. None of Farmer's results is statistically significant.
3. All of Jick's are, bar 1995 GSD.
4. All the results are statistically compatible in that their C.I.s overlap.
5. For cases in common, see later.
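Notes 2 to 4 can be checked mechanically against the figures in the table. What follows is a minimal sketch in Python and forms no part of the evidence. The names used are illustrative only, the rows are limited to the main study results (the cases in common are left out), and the interval for the combined Jick studies is taken as 1.5-3.3, the figure given where the combined studies are described in the text.

```python
# Adjusted RR estimates against LNG, as (point estimate, lower, upper).
# Figures transcribed from the table of the three main GPRD studies.
RESULTS = {
    "Jick 1995 DSG": (2.2, 1.1, 4.4),
    "Jick 1995 GSD": (2.1, 1.0, 4.4),
    "Farmer 2000 DSG (year of birth)": (1.0, 0.6, 1.6),
    "Farmer 2000 GSD (year of birth)": (1.3, 0.8, 2.1),
    "Farmer 2000 DSG (5 year)": (1.4, 0.9, 2.1),
    "Farmer 2000 GSD (5 year)": (1.5, 0.9, 2.3),
    "Farmer 2000 all COC3s": (1.2, 0.9, 1.7),
    "Jick 2000 pre-scare": (2.2, 1.1, 4.3),
    "Jick 2000 post-scare": (2.8, 1.1, 7.3),
    "Jick 2000 both periods": (2.3, 1.3, 3.9),
    "Jick 1995 + 2000": (2.2, 1.5, 3.3),  # upper bound assumed to be 3.3
}


def is_significant(lower, upper):
    """Conventional reading: the interval excludes the null value of 1.0."""
    return lower > 1.0 or upper < 1.0


def all_overlap(rows):
    """Intervals on a line all overlap one another exactly when the largest
    lower bound does not exceed the smallest upper bound."""
    rows = list(rows)
    return max(lo for _, lo, _ in rows) <= min(hi for _, _, hi in rows)


for name, (point, lo, hi) in RESULTS.items():
    verdict = "significant" if is_significant(lo, hi) else "not significant"
    print(f"{name}: {point} ({lo}-{hi}) {verdict}")

print("All intervals overlap:", all_overlap(RESULTS.values()))
```

On this reading an interval whose lower bound is exactly 1.0, as with Jick 1995 GSD, is treated as not excluding the null value.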
- The development of the issues.
In his first report dated 26th July 2001 having described his own studies Jick dealt with Farmer's criticisms, as he then understood them to be (the two had met, corresponded and disagreed before the start of this litigation). He described Farmer's comments on the width of his age matching as naive and a product of his inexperience in pharmacoepidemiology. So far as Farmer's studies were concerned his criticisms were these. He said that Farmer had used "hundreds of practices that the BCDSP has found to be of unsatisfactory quality and completeness", that he used 10 OCs' exposures in his matched analysis rather than COC3s against LNG "which were the only preparations evaluated in our published study"; that the numbers provided in at least one of the critical tables in the Farmer study were incorrect; and that many non-idiopathic subjects were not excluded who should have been.
- Farmer's first report dated 21st September 2001 was a much more detailed affair. He initially focused on the fact that Jick used a sub-set of the available practices in his 1995 report (365 out of the 550 then available) whereas the parallel mortality study he had carried out drew from 470 practices. He questioned the criteria on the basis of which Jick had removed practices from the dataset as well as Jick's exclusion of fatal VTE cases. In general the thrust of his criticism of the 1995 report was that it appeared to under-identify cases to a significant degree. So far as Jick 2000 was concerned he pointed out that a different sub-set of practices appeared to have been used. He was critical of the change of study design, in the sense that the authors of the study abandoned their previous methodology to the extent that they did not obtain medical case histories for referrals and records of admission to hospital due to "insufficient time". He considered that the Jick team had failed to identify all the cases of VTE that occurred amongst COC users which had led to an underestimation of VTE rates, and was critical of the omission from consideration of COCs other than those containing GSD and DSG.
- In his supplementary report dated 28th November 2001 Jick described his criteria for exclusion of practices where no data had been recorded for a significant period of time, where there had been marked under-recording of data or where the practices were uncommunicative in that they failed to respond to his request for information. These procedures were more fully described in a witness statement of Dean McLaughlin who was a long term employee of the BCDSP with a responsibility for this aspect of their work and who described and gave examples of their quality control procedures. He said in his statement, and I find, that "the most common reason to remove a practice was the refusal of that practice to correspond with the BCDSP to verify diagnosis". Jick accepted in cross-examination that failure by a practice to respond (something which so far as I can see from the 1994 document the GPs were not contractually obliged to do) did not necessarily mean that the data it kept were unreliable.
- Returning to Jick's supplementary report he explained the different sub-sets he had used for the different studies. He defended the 1995 study against Farmer's criticisms for asymmetrical ascertainment of exposure data on cases and controls. So far as the criticism of his change of design on the 2000 study he thought that the failure to obtain case histories would have led at most to the inclusion of an additional 10% of VTE cases which might otherwise have been excluded had a more exacting protocol been applied. He criticised Farmer's failure to reduce his VTE cases down to verified idiopathic cases, which he described as crucial, which failure was not redeemed in his eyes by the validation process Farmer had subsequently carried out and described in a later paper (22). He criticised him for including women between the ages of 40 and 49 in his study group. He said that the prescription of anti-coagulants is indicative but not conclusive of a VTE case. He said that the fact that Farmer had used twice the number of practices used by his group indicated that he had used "far from perfect data".
- Farmer's supplementary report dated 10th December 2001 added nothing new to this debate. He defended his decision to use close age matching and reiterated his previous criticisms of Jick.
- On the basis of the dispute as it then stood, and by consent, an order was made by me on the 21st December 2001 that Jick and Farmer exchange data regarding their respective studies. Jick was ordered to produce data concerning his practices, cases and controls in both his 1995 and 2000 studies and Farmer was ordered to produce data regarding his practices used in his 2000 study and the cases he had extracted from them for that study together with a list of those excluded from the study and the reasons for their exclusion. The witness statements in support of either side's position showed that a considerable head of steam had already developed between these two experts. Importantly, no order was made relating to Farmer's controls, I can only presume because none was sought, which in turn must have been because they were not then thought either by Jick or those representing the Claimants to be relevant to the dispute.
- Armed with that information Jick prepared his third report dated 14th February 2002. He submitted the cases in all the studies to close scrutiny and considered that, looking at the over-lapping practices common to both sides' studies, some 59 cases which should, in his opinion, have been included were excluded by Farmer and their exclusion served primarily to exclude COC3s. He also thought that Farmer had included many cases which should have been excluded as being non-idiopathic, listing in a table some 16 such cases and expressing the view that this would have diluted any real effects of comparison. He repeated his former criticism that Farmer had used "many practices which we documented as unsatisfactory for research".
- On the 26th February 2002, with a week to go to the start of the trial, the Claimants' solicitors asked for the first time for Farmer's controls. On the 3rd March 2002 Farmer produced his third report in which he concentrated in great detail on the contentious matters already raised about case exclusion and inclusion. So the battle lines were drawn when the trial started and in his opening submissions Lord Brennan QC promised Farmer that the material relating to exclusion and inclusion of cases incorrectly and preferentially would be:
"gone through in considerable detail, because they cannot both be right, using the same database".
The Claimants' opening note contained a helpful tabular analysis of the issues in this area of the case which made no reference to controls at all. For his part Mr. Underhill opening the defence on this issue put in a 5 page note based on Farmer's work and said to have been approved by him (which contained an egregious error by Farmer which I will have to deal with later) but which concentrated entirely on the question of case inclusion and exclusion.
- On the 15th day of trial the whole tenor of this debate changed dramatically with the service of Jick's fifth report dated 25th March 2002 (Jick V). This turned the spotlight on Farmer's controls. While maintaining the view that Farmer had omitted many cases which should have been included and included many which should have been excluded, in both instances preferentially in favour of third generation comparisons, he added further reasons. He suggested that the inclusion of "other COCs" had both unnecessarily complicated Farmer's analysis and "somewhat depressed his resulting RR"; his decision to narrow his age matching to year of birth had had a dramatic effect on the number of controls available to him; this had led to the creation of "a comparatively large number of concordant sets"; that there was "a particular effect" in that the larger the controls set the larger the relative risk; and that "for some reason" the controls Farmer had used overall were:
"significantly skewed towards third generation OC users ... the effect of this is to significantly depress the RR".
This report though only some 15 pages long was extremely dense and included 15 detailed tables in which the controls used by Farmer were closely analysed and dissected. Thereafter Farmer responded to it with a series of notes and Jick himself provided a short addendum to his fifth report on 27th March 2002.
- Jick V and the attack on Farmer's controls.
Two main issues emerged on the question of controls which came to dominate this debate. The first was age matching and the second the inclusion in Farmer's study design of COCs other than GSD and DSG.
- It is common ground that age is or can be a confounder in studies of COC use, either directly or as some form of proxy for other confounding factors. Farmer said that if that is the case one ought to match cases and controls not only by practice, which ensures that the controls represent as closely as possible the population of women from whom the cases came, but by age as closely as possible. The relative abundance of controls in a database study facilitates this step. Thorogood accepted as a matter of principle that it was preferable to have year of birth controls, that is controls that were as close as possible to the cases to which they were matched, except for the practical issue of not being able to find enough controls for the cases. McPherson too accepted this as a general proposition. Year of birth was in fact as close as one could get with the GPRD, even if still finer matching had been required, since it only recorded the year and not the month or day of any patient's birth. Jick said that 5 year age matching had been accepted as a satisfactory practice for decades; this was certainly true if he meant to refer to field studies of an observational kind (witness the WHO and TNS). He said that year of birth matching was "fine" except for the risk of orphan cases. Therefore, as a matter of theory all the experts who commented on this issue could supply no theoretical or in principle objection to year of birth matching, simply the practical problem of finding enough controls if one did. No other example was adduced of a study in which their use had distorted the data or biased the findings.
- In the event, so far as Farmer 2000 was concerned, 264 of the 296 cases (89%) were matched to 4 year of birth controls, 11 cases were orphaned (4%) and the balance of 21 cases were equally divided between 1, 2 or 3 controls. At first sight therefore the problem associated with fine age matching does not seem to have had any great impact on his study.
- There are therefore two questions which fall for consideration. First, has Farmer's finer age matching produced a more reliable point estimate than Jick's 2 year or 5 year bands? Farmer conceded that 5 year bands were unlikely to lead to a change in the underlying risk, by which he meant a risk caused by or attached to some biological difference between women residing in such a band, such as a 20 year old as against a 24 year old. But he maintained it could be a proxy for other confounding features. The second and opposite question is, has Farmer's fine age matching had an adverse effect on his point estimate by skewing the population of controls so as to make it weighted in favour of COC3 use? The Claimants' case is that if question 1 is not proved the answer to question 2 must be in the affirmative. This they say must follow as a matter of logic. Somehow by artefactual means fine age matching has depressed the RR and they say they do not have to prove how this has happened. Though Farmer accepted the logic of this syllogism I am not sure he was right to do so. There may be other reasons unconnected with age matching which account for the difference between Jick and Farmer; Jick's point estimate may itself be open to question because of his practice selection, for example.
- The Claimants' two propositions are that to account for a reduced RR fine age matching has to remove a persistent or substantial difference of age between cases and controls which would be present with wider age matching and further that the available controls for wider age matching must have had a different COC exposure when compared with those residing in the same year of birth as the cases. The Defendants accept these propositions as not controversial but do not accept that they in any way undermine the use of fine age matching. They say that where plausible causes of confounding are in play these propositions are in effect a statement of the reasons which are, at least potentially, those which favour the use of fine age matching, which is something which can be achieved better in a database study than studies of a different type. The Claimants validly point to the fact that at least comparing Jick 2000 and Farmer 2000 the distance between cases and controls will vary between a maximum of 364 days with Farmer 2000 and a maximum of 1 year 364 days with Jick 2000. The distance of course is greater with Jick 95. Could this make a difference, in either direction, of the type we see in these studies?
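The maximum distances referred to above follow from the fact that matching is on calendar year of birth rather than on date of birth. A minimal sketch in Python, offered purely as an illustration: the function name is hypothetical and the base year 1973 is an arbitrary assumption, chosen only because no leap day falls within the years used.

```python
from datetime import date


def max_gap_days(year_difference, base_year=1973):
    """Largest possible gap in days between a case and a control whose
    calendar years of birth differ by at most `year_difference`.

    The extreme is a case born on 1st January of the earlier year and a
    control born on 31st December of the later year.
    """
    earliest = date(base_year, 1, 1)
    latest = date(base_year + year_difference, 12, 31)
    return (latest - earliest).days


for label, difference in [
    ("same year of birth", 0),
    ("within 1 year of birth", 1),
    ("within 2 years of birth", 2),
]:
    days = max_gap_days(difference)
    print(f"{label}: {days // 365} year(s) {days % 365} days")
```

The first two rows reproduce the 364 days and the 1 year 364 days mentioned in the text. The third row is the corresponding figure for matching within 2 years.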
- The Defendants say that at worst there is no warrant for linking the Claimants' criticisms of fine age matching, even if made out, to the question of whether Farmer's controls are skewed or unrepresentative. No link between the two was made in Jick V or in Jick's evidence. It is to Jick V itself that I must now turn.
- This report is predicated on two features. First it combines both Jick studies, omitting only the 15 cases common to both so as to avoid double counting. It treats the combined studies as if they were a single study with 166 cases giving an RR of 2.2 (1.5-3.3). This process is itself controversial. As Farmer pointed out it combines studies which had different criteria for the selection of cases and controls, different numbers of controls matched to the cases and different age matching criteria. Secondly, the calculations in Jick V are predicated on a deconstruction or dissection of Farmer's data to which process Mr. Underhill QC attached the unattractive but useful name "informatisation". Jick eliminated all the other products that Farmer investigated in his study leaving only LNG, DSG and GSD, then aggregated the last two and treated them as a single product, COC3. By this means he reduced Farmer's 285 cases to 235, which he described as "informative", because as a result of informatisation 50 cases had become concordant or non-informative.
- Two examples will illustrate the way in which this process could alter Farmer's dataset:-
Example 1

| | Case | Control 1 | Control 2 | Control 3 | Control 4 |
| --- | --- | --- | --- | --- | --- |
| Original (Informative) Set | DSG | DSG | GSD | NRG | NRG |
| Becomes | COC3 | COC3 | COC3 | ---- | ---- |

(which is concordant, non-informative, and therefore ceases to contribute to the analysis)

Example 2

| | Case | Control 1 | Control 2 | Control 3 | Control 4 |
| --- | --- | --- | --- | --- | --- |
| Original set | DSG | LNG | LNG | GSD | GSD |
| Becomes | COC3 | LNG | LNG | COC3 | COC3 |

(which is still discordant and therefore informative but is now reduced to a 2 control set from a full set).
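The recoding described as "informatisation" can be expressed in a few lines. The following is a sketch only, under stated assumptions: the function and variable names are hypothetical, and the treatment of a set whose case was using one of the eliminated products (here labelled "dropped") is an assumption, since the text does not say how such sets were handled.

```python
THIRD_GENERATION = {"DSG", "GSD"}


def recode(product):
    """Aggregate DSG and GSD into COC3, keep LNG, eliminate anything else."""
    if product in THIRD_GENERATION:
        return "COC3"
    if product == "LNG":
        return "LNG"
    return None  # any other product is removed from the set


def informatise(case, controls):
    """Return (recoded case, surviving recoded controls, status) for one set."""
    new_case = recode(case)
    new_controls = [r for r in map(recode, controls) if r is not None]
    if new_case is None or not new_controls:
        status = "dropped"  # assumption: nothing left to compare
    elif all(control == new_case for control in new_controls):
        status = "concordant"  # non-informative
    else:
        status = "discordant"  # informative
    return new_case, new_controls, status


# The two sets used as examples in the text.
example_1 = informatise("DSG", ["DSG", "GSD", "NRG", "NRG"])
example_2 = informatise("DSG", ["LNG", "LNG", "GSD", "GSD"])
print("Example 1:", example_1[2])
print("Example 2:", example_2[2])
```

A set is concordant when, after recoding, every surviving control has the same exposure as the case. Such a set cannot discriminate between the products and so drops out of a matched analysis.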
Jick then proceeded in a series of tables to set out what he considered flowed from this procedure. He showed that where there was a lower number of controls per set LNG exposure was more prevalent than in full 4 control sets. Example 2 above shows a possible illustration of how this might happen. But Jick in evidence merely said there was:
"some kind of peculiar mechanism, bias, whatever you want to call it, which produces different results depending on the number of people in your set. I have to admit that that was a surprise to us because that does not normally happen."
When challenged in cross-examination to explain how it could be related to fine age matching he in effect declined to address the question.
- In Table 5 he stratified the case-control sets according to the number of controls and demonstrated an increasing RR as the number of controls increased. Interestingly the stratum of 4 control sets (of which there were 108), and in which the LNG/COC3 distribution was 34% : 66%, showed a RR of 1.5 (0.9-2.5). The increase in RRs was not statistically significant and the overlap of CIs was substantial. Jick did not offer an explanation for this, emphasised he was not a statistician but said that he employed excellent statisticians and trusted them. This was a thread running through this part of his evidence which became of greater concern to me as it continued. The same phenomenon as Table 5 had demonstrated was observed in a sub-set of Farmer's informative sets namely those where his cases were common to the Jick study. Jick found more concordant sets in Farmer's sets using the common cases than were apparent in his own and in those concordant sets found a higher prevalence of COC3 cases. He accepted that "in some part" this was a result of the informatisation procedure.
- This led in to the heart of this report. In Table 12 the report conducted an analysis of all controls available to be matched by practice to the cases common to both studies. Those cases were now further reduced from the 105 previously considered to 91, as 14 of the 105 went back to a time in the early 1990s and were no longer on Jick's database. This showed that there were available for exact year of birth, within one year of birth and 5 year age band matching respectively totals of 1114, 3326 and 3982 controls (though in passing I should note that I shared Farmer's mild surprise that the available 5 year controls were not more like 5 times the year of birth figure). But the point was that this table showed, broadly speaking, that the exposure distribution of LNG and COC3 was steady in all 3 groups at approximately 40% : 60%.
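The surprise mentioned in passing can be put in figures. This is a rough sketch in Python resting on one stated assumption, namely that women are spread evenly across years of birth, so that a window of three birth years would yield about three times the year of birth total and a window of five birth years about five times. The variable names are illustrative.

```python
# Totals of available controls from Table 12 of Jick V, as given in the text.
available = {
    "exact year of birth": 1114,
    "within one year of birth": 3326,
    "5 year age band": 3982,
}
# Birth years spanned by each window (assumption: evenly spread cohorts).
birth_years_spanned = {
    "exact year of birth": 1,
    "within one year of birth": 3,
    "5 year age band": 5,
}

base = available["exact year of birth"]
ratios = {scheme: total / base for scheme, total in available.items()}

for scheme, ratio in ratios.items():
    expected = birth_years_spanned[scheme]
    print(f"{scheme}: {ratio:.2f} times the year of birth figure "
          f"(about {expected} expected)")
```

On that assumption the within one year total is almost exactly as expected, while the 5 year total falls well short of five times the year of birth figure.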
- In Tables 13 and 14 Jick looked at the same sub-set of 91 of the 105 common cases, the controls matched to them by Farmer and Jick and compared them with all available controls, matched first within 1 year and then by exact year of birth. He produced a "control-exposure odds ratio", described as the odds of sampling a COC3 control given their distribution in all available controls. He described this in evidence as having been done looking at each matched set separately, set by set and resulting in what he described as "in a sense a weighted RR estimate". The result was that the respective ORs were 1.36 (1.02-1.81) and 1.32 (0.95-1.82). These suggested, said Jick, that Farmer's controls were 36% and 32% more likely to be COC3s than a comparison with the generally available controls would suggest "and thus were not randomly chosen". It was never the Claimants' case that Farmer deliberately selected controls on a preferential basis, as these words on first reading might have suggested. In both Farmer's and Jick's studies the controls were selected in exactly the same way, that is on a randomised basis driven by a computer. No explanation for this phenomenon is offered but Jick says that that is what his dissection of Farmer's dataset shows, and if that is so that is all that matters. This skewed selection of controls is what he puts forward to explain the reduced risk, as he considers it, in Farmer's study.
- Jick accepted as a possibility that in such a calculation the exposure distribution of "all available controls" will be primarily determined by the exposure distribution of controls available to be matched to those cases in the period of maximum pill use. Therefore in a matched analysis, it was put to him, one could expect exposure distribution of the 4 or 6 controls actually selected and matched to the cases not necessarily to correspond to the exposure distribution of the general population of all available controls. There seemed to me to be force in that point. Be that as it may Jick was not unreasonably asked for the workings which led to Tables 13 and 14 which he did not have to hand and agreed to furnish. In re-examination he confirmed that this work was done by a colleague within his organisation but he did not know any more about the method used other than that it was "part of a statistical package. It is in principle either very similar or identical to the method one uses in other relative risk estimates". Elsewhere in his evidence he had, as I have said above, stressed that he himself was not a statistician. After Jick's evidence was completed a sheet of a computer print-out was produced leading to these figures. Farmer said that this did not explain to him the process by which these results were achieved, it did not show the commands that had been used to programme the computer to produce them, and it was not an exercise he had seen done before on any other data.
- For his part Farmer in Note 2 of his notes produced in response to Jick V analysed the distribution of controls by the number of years they were distant from the year of birth of the case to which they were matched. In Jick 95 and the 5 year banded section of Farmer 2000 he showed that in these studies 81% and 77% respectively of the controls lay within plus or minus 2 years of their case. Having established that, he proceeded to mimic as closely as he could Jick's methods while using his own 5 year controls. He adjusted for 3 categories of BMI as Jick had, eliminated OCs other than LNG, GSD and DSG and confined himself to cases in the period 1992 to 1994 common to both studies. This exercise yielded ORs of 1.8 (1.1-2.9) and 1.8 (1.1-3.1) for DSG and GSD respectively which immediately strikes one as being very close indeed to that found in Table 6 Jick V, the 105 common cases using Jick controls, which was 1.8 (1.1-2.9).
- In a letter to the Lancet of the 8th March 1997 (23) Jick said that in his first paper he had achieved age matching to within 2 years of the case in 85% of cases. He said he had re-analysed his data restricting the exercise to cases and controls lying within 12 months of each other, that is to say year of birth or 1 year either side, so that the maximum distance the case and control could be apart would be 1 year 364 days. The resultant OR was 2.3 (1.0-5.5). Jick dismissed Farmer's reply in the same journal, couched as it was in cautious even moderate tones, as "the kind of statement people make in church".
- There was one final analysis of interest, namely Farmer's Note 6. He had produced this over a weekend in advance of receiving Jick's workings on Tables 13 and 14. For each case he identified all women in the practice with which the case was registered who were born in the same year and were using an OC on the day the case occurred, identifying the OC being used. He then performed a conditional (matched) logistic regression using all the women identified as controls. He was not in the time that he had able to adopt exclusion criteria because some 6200 women were involved in this exercise. Therefore with some reservations this exercise can be called an interesting one. The result was that as compared with his published study the use of all available controls on a year of birth basis yielded broadly similar figures which were only very slightly elevated if at all.
- Conclusions.
The Claimants submit that this debate should be resolved by me in favour of Jick's studies and that Farmer's findings should be omitted entirely from my final considerations as being of no value at all. The Defendants contend that the Farmer study is preferable, because it was bigger and therefore better, but that in the alternative I should look at the area where there was agreement between the experts as to case selection and use those as being the basis for my evaluation of this group of studies based on the GPRD. The Claimants adopt Walker's criticism of this compromise approach as "cutting the baby in half". Before embarking on this stage of my judgement it is necessary and appropriate to set out my evaluation of the protagonists as experts and as witnesses.
- Jick is a veteran pharmacoepidemiologist with a distinguished record of achievement and enjoys a high reputation. His group has published extensively in this field. Dean (himself not an epidemiologist) held him in high regard. McPherson thought Jick's reputation on its own was a sufficient criterion to prefer his work to Farmer's. The fact that the MCA chose him to carry out his 1995 study speaks for itself in terms of his standing. He was a strong advocate of database studies, referring almost dismissively to field studies as "paper studies". As a witness he suffered from the fact that quite evidently, from the impression that I formed of him throughout his days in the witness box, he could not understand why this case was being fought at all and regarded the issue as closed. As a result he was at times a testy, flippant and even cantankerous witness, evidently impatient of the courteous and well informed cross-examination to which he was subjected by Mr. Underhill QC. His contempt for Farmer was palpable; when I suggested at one stage on Day 19 that he and Farmer should meet to discuss and if possible resolve the issue about the inclusion and exclusion of cases, halfway through my sentence he made a grimace of disgust which spoke volumes. His criticisms of Farmer were almost corrosive in their quality. He said that Farmer displayed a "virtually complete lack of understanding of what is involved in this kind of research".
- Much of this may have been the product of Jick's personality which ought not of itself to detract from the value of his evidence as an expert in a field where he is plainly highly eminent and qualified. However when it is combined with the late development of the real case he wished to propose as an explanation for the intriguing difference between his results and Farmer's it is bound to weaken the strength of his opinion.
- Farmer for his part was also a flawed witness in some respects. Certainly when working under pressure he was prone to alarming errors. That said his written work prior to trial, particularly his lengthy first report, largely survived attack. Mr Oppenheim served him with a clip of extracts from his work foreshadowing cross-examination to the effect that his work was full of arithmetical errors. By the very high standards of the rest of Mr Oppenheim's cross-examination it proved to be a rather damp squib. But Farmer made two very bad mistakes. When the question of case inclusion was the main issue he said through Counsel that the data he had sent Jick pursuant to the Order of the 21st December included the data for a second 2000 study of his, which was referred to in the trial as "Farmer 2000 Pill-Scare". This was wrong and caused much confusion and waste of time as well as embarrassment to Counsel who announced that he proposed to use it as a basis for jumping on Jick in respect of part of his evidence. Secondly in the second of his six notes prepared for trial, which was work produced at short notice during the trial, he said:
"In observational studies it is important to take account of the fact that spurious associations may be found if there is a factor that is either associated with an exposure or the outcome or both. These are called confounding factors".
- This, of course, was a "howler", a complete mis-articulation of a confounding factor which he and everyone else accepts is a variable factor associated with both the incidence of the disease and the usage of the study drug. It may not even have been an isolated mis-articulation since it is possible to construe a sentence in his 2000 report, and an extract from a report of his in another case as making the same mistake, although less clearly. Farmer has written a textbook on the subject and there is no doubt that he does not believe in the definition of a confounder that he put forward and he indeed said so. He thought the mis-definition in Note 2 could be saved by re-punctuation but I doubt it.
- A look at Farmer's CV shows that up until about 1995 his work had lain in very different areas of epidemiology. Since 1995 he has done no work other than on COCs and has published quite prolifically, some 25 or so studies and letters, all based on commissions by drug companies. Even on his own case some of his early work was of patchy quality as will be seen when later in this judgment other database studies are considered.
- Farmer was subjected to the strongest cross-examination directed at any witness in this case. All of it was proper and fair and not all of it was he able to deal with convincingly. As a witness he was inclined at times to hold on to positions he would have been better advised to abandon. He was one of the two witnesses in this case who, as I was told under the rose, suffered from significant long-term physical ill health. In Farmer's case unlike that of the other witness this plainly showed and I therefore feel justified in breaking the degree of confidence in which I was given the information. He worked under great pressure in this trial and produced a number of complex notes in response to Jick V in circumstances which cannot have been easy. He did suffer physically at times in the witness box in my judgement. For the last day of his cross-examination he suffered the additional distress of the fact that the funeral of his colleague and close friend MacRae was taking place that same day; given the option he had elected to finish his cross-examination rather than attend. It seems to me right to make allowances for these factors when assessing him as a witness.
- Generally I was impressed by the rigour with which he approached his 2000 study and his report for the litigation, much more than I was by his earlier work. Certainly I am not able to accept Jick's estimate of him as someone whose work is simply of no value. His mistake on the confounding definition is not something which, even if it had been operative in his mind when doing the study (and I am sure it was not), would have affected either the design of the study or the analysis of the data produced by it. At worst it caused him to take an inappropriate position when justifying those results. His approach to age matching was I believe principled and appropriate. His Pill Scare study (see Section G paragraph 255) was an intelligent attempt to approach this problem from a different direction, albeit one that failed. In short his work deserves proper consideration on its merits in my judgement.
- Before dealing with the vexed question of controls, in general terms neither witness thought that differences in methodology could explain the whole, at least, of the difference between their results. The main overall differences appear (not listed in any particular order) to be these:
a) Jick used fewer practices (in Jick 95 he used 365 and in Jick 2000 288; Farmer 2000 used 618).
b) Jick's studies were based on fewer VTE cases (75 and 106 as against Farmer's 285) and fewer of those exposed to the study drugs; in Jick 95 when he divided COC3 into DSG and GSD his cases were 30 and 22 respectively when Farmer's were 83 and 63; in Jick 2000 when he aggregated the study drugs as COC3s he identified 64 cases whereas the same figure for Farmer 2000 would be 140 (per Jick V Table 2).
c) Both Jick studies were executed in some haste, Jick 2000 particularly so and for the reasons I have stated above; this was not the case with Farmer's work.
d) The basis on which Jick selected his sub-set of practices, as I find, was primarily to confine himself to those who were compliant with his requests for further information on study subjects. He accepted that was a reason which had "nothing to do with the quality of the data provided by that practice to the GPRD". Farmer used all practices accepted by the GPRD as up to standard. I do not accept Jick's evidence that the GPRD did not exercise proper surveillance of practices from that point of view.
e) There remains an unresolved dispute as to whether Jick was over exclusionary in his approach to the admission of cases and/or whether Farmer included incorrectly cases which were not true cases and/or which were not idiopathic cases of VTE.
f) Jick had no advisory group. Farmer's advisory group met 4 times and played an active role.
g) Jick had no preconceived ideas prior to Jick 95 and came fresh to the subject; conversely his motivation in Jick 2000 was to refute Farmer's case. Farmer's track record is much less satisfactory and, there is no doubt, he saw his academic future as largely consisting of executing studies for drug companies in this field.
h) Jick had space problems on his own computers; Dean said, reliably as I find, that he was always keen to "lose" practices to make space; to some extent Jick confirmed the problems that the size of these data could cause in his own evidence on Day 19.
i) The asymmetrical ascertainment of case and control exposure in Jick 1995 was described by Thorogood as "not the best design" and something which could have led to selection bias, though it should be said she remained a believer in the superiority of this study. Shapiro was also critical of this feature.
- Subject however to the argument about controls it would not in my view be right or possible on the basis of the issues set out above to say as a matter of probability that one of these studies should be accepted in total as being right and the other rejected in total as being wrong. The only fair and proper approach would be to acknowledge the impossibility of that task, to remind oneself that the respective results are all statistically compatible, could conceivably be the result of the play of chance (unlikely though that seems at first sight) and look for the common ground between the studies.
- However the issue as to controls has first to be resolved. In practice that issue reduces to the questions of fine-age matching and Farmer's inclusion of 10 different relevant OCs; this is as against Jick's approach which in 1995 was to look at DSG and GSD, rolling up all products containing those progestogens but considering each on a separate basis, and his approach in Jick 2000 where he simply aggregated them as COC3s.
- So far as age matching is concerned, in most cases the age differences were unspectacular, as the Claimants correctly argued. The second Table in Jick VI showed that in Farmer's informative 5 year age band sets the mean ages of cases and controls in each such band from 15-19 through to 45-49 were closely similar. Farmer objected to this exercise by saying that what mattered was the deviation within matched sets not over the control population as a whole. Yet his own note 2 page 8 showed that a comparison of the Jick 1995 5-year banded controls with his own 5-year banded controls revealed no very marked differences so far as the distances between cases and controls were concerned.
- I am not satisfied that either by direct or proxy effect fine-age matching has exerted an influence on the results of these studies such as would entitle me to say that one or other study is right and the other wrong. In my judgement Farmer was right to set out to match for age as closely as he could and his approach was in theory an orthodox and appropriate one. The only adverse consequence of its use was that he lost a small number of orphans, which did not much matter given the size of his overall catch of cases. While I have not found myself satisfied that his results are preferable to Jick's for that reason, conversely I am not at all satisfied that his choice of this method has undermined his results by in some way skewing his controls in favour of COC3s, if indeed that is the case. 5 year banding may legitimately have been more attractive to Jick in 1995 because he knew that he had a relatively small number of practices to draw from, and therefore that the loss of statistical power which would stem from finer matching would be a problem for him. He paid appropriate homage to the concept of fine-age matching by achieving it within 2 years in the great majority of his controls.
- As to whether Farmer's control selection was in truth in some way distorted I remain entirely unconvinced. I accept that they were chosen by a correct random selection method. The deconstruction of his database on which this argument is dependent strikes me as a hazardous and illegitimate example of data manipulation. If Farmer himself had made similar alterations to the original integrity of his dataset and thereby achieved some different study results I could readily imagine the round terms in which he would have been criticised. The counter-argument for the Claimants is that Jick in doing this has done no more than treat Farmer's controls in the same way Jick treated his own, but that does not allay my fears. This radical approach to Farmer's data came very late in the day and more importantly was never fully transparent, certainly not in the important areas of Jick V Tables 13 and 14. In many cases it is legitimate for the expert giving the evidence to rely on the work of junior colleagues within his organisation who have done detailed calculations, and the court is familiar with this as a feature of litigation. This was not an appropriate area of the case for evidence to be given in this way. Much later in the trial the Claimants applied to introduce a seventh report from Dr. Jick and I refused permission for them to do so. I did not look at that report though I was invited to. I make no assumptions as to what it might have shown.
- I am left unconvinced by the attack on Farmer on this basis and therefore am thrown back on a wider consideration of the evidence of the two experts. The obvious way to resolve it is to have regard to the 105 cases that were common to both studies coming as they did from practices they both considered and a period which they both sampled. The range there is 1.3 (0.7-2.2) if those cases in common are matched to Farmer's year of birth controls or 1.8 (1.1-2.9) if Jick's controls are used. In my judgement the higher of these two figures represents the point estimate that should be regarded as the upper limit of the range for the most likely value emerging from the interrogations of the GPRD at the relevant time. It is supported by other findings disclosed in the course of trial. Farmer's Note 2 page 9 mimicking as closely as he could Jick's methodology, using 5 year banded controls, reached this figure. In Jick V Table 5 and Table 8, after Farmer's controls were subjected to the informatisation process, when his "full" case-control sets i.e. those containing 4 controls were considered on a stratified basis, point estimates of 1.5 and 1.6 emerged.
- Therefore in any overall consideration of this case neither Jick nor Farmer should be considered to the exclusion of the other when I come to any overview exercise. The right value to give the GPRD studies collectively is one falling in the area between 1.5 and 1.8.
SECTION G: THE OTHER STUDIES
- Leiden 1995.
The editor of the Lancet procured a fourth paper for inclusion in his December 1995 edition to go with the two WHO papers and Jick 1995 (24). This came from the so called "Leiden Group" whose research into thrombophilia has been of the highest quality and is internationally recognised as such. The paper identified a RR for DSG as against LNG of 2.2 (0.9-5.4). There were no data in this study relating to GSD at all. The RR for DSG based on 37 cases and 15 controls was 8.7 (3.9-19.3) and for LNG based on 20 cases and 18 controls 3.8 (1.7-8.4). In purely statistical terms therefore the relative risk between DSG and LNG was not statistically significant and the confidence intervals involved were wide reflecting the low number of cases and, particularly, controls.
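By way of illustration only, the statistical point made in the preceding paragraph can be sketched as follows. The figures are those quoted above; the function names are hypothetical, the reference group for the two larger RRs is assumed (it is not stated in the paragraph) to be non-use, and the division of one point estimate by the other is merely a rough cross-check, since the source does not say how the published 2.2 was derived.

```python
# Illustrative sketch only: figures are those quoted in the paragraph above.

def excludes_one(lower, upper):
    """True when the confidence interval for a ratio does not contain 1,
    i.e. the result would conventionally be called statistically significant."""
    return lower > 1.0 or upper < 1.0

def width_ratio(lower, upper):
    """Upper limit divided by lower limit: a rough measure of imprecision."""
    return upper / lower

# (point estimate, lower limit, upper limit)
estimates = {
    "DSG v LNG": (2.2, 0.9, 5.4),
    "DSG (reference group assumed to be non-use)": (8.7, 3.9, 19.3),
    "LNG (reference group assumed to be non-use)": (3.8, 1.7, 8.4),
}

for label, (point, lower, upper) in estimates.items():
    print(label, point,
          "significant" if excludes_one(lower, upper) else "not significant",
          "upper/lower =", round(width_ratio(lower, upper), 1))

# Rough cross-check only: ratio of the two point estimates.
crude_ratio = 8.7 / 3.8
print("crude ratio of point estimates:", round(crude_ratio, 1))
```

The interval for DSG as against LNG spans 1, which is why the comparison is described as not statistically significant, while the crude ratio of the two point estimates is of the same order as, though not identical to, the published figure.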
- There is a remarkable divergence of expert opinion on this study. Walker, whose views are generally supported by Thorogood, called it "work of the highest quality" and in his original overview exercise accorded Leiden 1995 the highest possible mark in terms of its design quality, higher even than the WHO or TNS. Shapiro for his part described it as work so flawed that a student producing it to him would have been failed.
- It is necessary to look at the origins of this study which had opportunistically used data originally collected for a different purpose. The origins can be seen in earlier papers in 1993 and 1994 (25)(26). 301 patients of both sexes were considered, they having been diagnosed at three Dutch specialist thrombosis centres with first episodes of DVT between 1988 and 1992. These cases or patients found controls from among their friends and where necessary partners. They were interviewed about risk factors including OC use and type between 6 and 48 months after the event. The conclusion of the 1993 paper was the identification of an hereditary abnormality in the coagulation system associated with familial thrombophilia which was later given the name of "Factor V Leiden". This was a discovery of major importance in the study of thrombosis.
- Walker's views of the merits of this study were based on three factors. First, he considered the study achieved an excellent level of case ascertainment whereby all true cases of VTE were found thus eliminating arguments about selection bias. This seems to me to be a point in favour of the study. Secondly, he pointed to the excellence of diagnosis and verification of cases (the same point as it appeared to me) in that no true cases were allowed to fall out of the study. The Defendants argue cogently in response that there is no reason to believe that the three Dutch centres as a matter of fact achieved any more accurate diagnostic strike rate than any other hospital addressing the same diagnostic tasks, the techniques being likely to be the same in all cases. Thirdly, Walker was impressed by the high quality of data on exposure (albeit this had not been stressed in his first report). The Defendants argued that there was a significant delay in ascertainment of exposure and the context was a study which was not one of OC use still less of the differences of OC types, since at the time of the interviews (1990 to 1993) no one had any particular reason to view second and third generation OCs in any different light so far as VTE risk was concerned.
- Shapiro's criticisms were that there was no exclusion of non-idiopathic VTE cases. The Claimants responded that if this had any effect it would be to depress the OR and Shapiro ultimately accepted that it could do so. He made the obvious point that the data were collected for a different original purpose. He thought that the use of friend controls while acceptable for the original study purpose would have been entirely unacceptable in a case-control study about VTE risks arising from different OCs since friend volunteers might have been influenced in their decisions to participate by reference to whether or not they used OCs and a desire to help the person selecting. He pointed to the asymmetry in the manner in which information about exposure was gathered from cases (hospital discharge letters) and controls (interview). The Claimants accept that this information should be gathered in the same way. They argue that it is clear that this exercise was done at the same time for cases and controls; I do not consider that that is clear. Shapiro also pointed to the fact that 50% of the patients who were interviewed as to their OC use were asked for that information between 18 months and 4 years after the VTE occurred; inevitably he thought recall bias would operate in these circumstances. The Claimants say that would not have been differential as to OC type, but would have introduced non-differential misclassification which would have tended to reduce the OR. As to the failure to control for alleged confounders I believe that the influence of this was marginal at best.
- The study is plainly not one which is to be dismissed as not contributing any information to the overall question. I believe that both Walker and Shapiro have overstated their cases in regard to it. Of the two views I find myself closer to that of Shapiro while unable to fully accept his conclusion. It is a statistically weak study and its design (because of its evolution) is very far from sound. It contributes information about DSG only. It is based on very small numbers. It therefore goes into the reckoning and I will consider its position in that process later in this Judgment.
- Herings.
In a research letter in the Lancet in 1999 Herings and others published research based on an automated database in the Netherlands into women between the ages of 15 and 49 who used COCs between 1986 and 1995 (27). The study is not very fully described. There was no nested case-control study. As originally published it looked at what it described as new users of COC2 and COC3 products as between whom the text said there was an RR of 3.5 (1.4-8.8) although the table gave a crude figure of 3.7 (1.5-9.1) and an adjusted RR of 4.2 (1.7-10.2), the adjustment being for year and age. Thorogood and Shapiro both took the 3.5 point estimate as being the conclusion of this study. Walker and McPherson took 4.2. In a second letter to the same journal (28) they extended the study to what they called "recurrent users" of whom they had found 78 cases and for whom they gave a crude RR of 2.3 (1.5-3.7).
- The accuracy of the first paper was not impressive. The authors published some significant corrections to it. McPherson in his evidence found other uncorrected errors and "began to be worried about the accuracy overall". The first paper had included the sentence, "the adjusted RR of VTE in women who had used oral contraceptives already for a longer period was 1.7 (0.98-3.1)" which Shapiro described as incomprehensible and which no one has been able to explain. McPherson was also worried as was Shapiro about the exclusion of predisposing conditions. When the figures were adjusted for absence of disease they included "psychiatric problems" which no one can understand in this context, and McPherson agreed that this area of the study was "a bit of a dog's breakfast". The concluding sentence of the first letter is also hard to justify reading as it does, "Alternatively, our data points to an interaction between types of OCs and an unidentified susceptibility factor that might be a prothrombotic mutation". Shapiro points out that this is not supported by any evidence apparent from the rest of the research letter and in that criticism he seems to be right.
- In the circumstances described above, this being a relatively small study for a database enquiry (78 cases), and without a nested case-control study, it must be placed in the category of very weak evidence in the light of the criticism outlined above.
- Parkin.
This was a New Zealand based case-control study of fatal pulmonary embolism in 36 women aged between 15 and 49 (29). The main purpose for which it was cited as being important by the Claimants' experts Walker, Thorogood and McPherson, albeit without great emphasis, was that being a study of fatalities it was one in which selection bias would not have been in operation. I agree that that was the case and that Shapiro's criticisms on this point are far-fetched and not well founded.
- The study found an RR as against non-use (9 cases being non-users) in respect of LNG (3 cases exposed) of 5.1 (1.2-21.4) and for DSG (7 cases exposed) and GSD (5 cases exposed) of 14.9 (3.5-64.3). The statistical weakness of the study is therefore immediately apparent in that these results are not statistically significantly different and their confidence intervals are extremely wide.
- The study did not do a head to head comparison for LNG against COC3s. That has been carried out after the event by Hennessy and McPherson who each find a RR of 2.9. They differ remarkably as to the CIs they calculated, Hennessy's being 0.5-16.0 and McPherson's 1.05-8.09; therefore the RR is at best barely statistically significant, again with very wide confidence intervals, which McPherson said were difficult to calculate in the circumstances.
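The contrast between the two sets of confidence intervals quoted above can be made concrete with a short sketch. It assumes, purely for illustration, that each interval is a conventional 95% interval symmetric on the logarithmic scale; neither Hennessy's nor McPherson's actual method is described in the text, and with numbers this small an exact method may well have been used. On that assumption the geometric mean of the limits recovers the centre of the interval and the log-width gives an implied standard error. The function name is hypothetical.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval (assumed method)

def log_scale_summary(lower, upper):
    """Return (centre, implied standard error of the log ratio) for an
    interval assumed to be symmetric on the logarithmic scale."""
    centre = math.sqrt(lower * upper)                         # geometric mean of the limits
    se_log = (math.log(upper) - math.log(lower)) / (2 * Z95)  # half-width in z units
    return centre, se_log

hennessy = log_scale_summary(0.5, 16.0)
mcpherson = log_scale_summary(1.05, 8.09)

print("Hennessy:  centre %.2f, implied SE(log RR) %.2f" % hennessy)
print("McPherson: centre %.2f, implied SE(log RR) %.2f" % mcpherson)
```

Both intervals are centred close to the common point estimate of 2.9; the difference between them lies in the implied standard error, Hennessy's being materially the larger, which is why only McPherson's interval (just) excludes 1.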
- Although it was intended to carry out a matched analysis on this data it was impossible to do so, the authors said, because of the sparsity of data. They adjusted such data as there were for 4 confounding factors which in Shapiro's view, which I accept, makes the results more difficult to interpret. The conclusion by the study authors, that:
"The high mortality in New Zealand may partly reflect the extensive use of third-generation oral contraceptives, which seem to carry a higher risk of VTE than older contraceptives".
was not one which Thorogood would have included in the study had she been responsible for it and Shapiro agreed it was not justifiable. More sophisticated arguments were addressed on the likely effect of the loss or exclusion of controls but I am not greatly impressed by that line of attack. The Claimants put this study forward tentatively. Walker did not include it in either of his meta-analysis procedures. At best it seems to me it represents a study from another country from which can be deduced an elevated relative risk consistent with other studies. However the weight to be attached to it in statistical terms is in my judgement extremely small.
- Lidegaard.
Lidegaard and others carried out two studies in Denmark in 1998 (30) and 2002 (31), the second of these being published during the course of the hearing. The first was a study of Danish women aged 15 to 45. Cases were identified from the Danish National Patient Register and controls were women interviewed in a study on the risk of stroke in relation to OC use and who had originally been matched by age to cases of stroke. The choice of controls, it is accepted by the Defendants, vitiates this first study and renders it of no value, albeit it appeared to find no significant evidence of an increase in risk for COC3 use as against COC2; after adjustment for duration of use the RR produced by the study was 1.44 (0.83-2.5). The controls used were of a different age profile, enrolled at different times and there was asymmetrical information concerning potential confounding factors all of which Shapiro in his written report rightly described as "unacceptable".
- The second study added a further 3,013 controls and 612 cases gathered in the years 1996 to 1998 and aggregated them, in effect, with the first study. The conclusion was, after adjustment for duration of use, that the RR for COC3 v COC2 was 1.3 (1.0-1.8). The continued use of the inappropriate controls from the first study remained a problem. Shapiro's evidence in chief was that Lidegaard had partly atoned for his sins but not completely: he had not described how the matching was done, he had not accounted for the ratio of matched controls to cases and he continued to be open to the accusation that there was asymmetrical ascertainment of exposure in the cases and controls which he thought, as elsewhere in his evidence in connection with other studies, was a "serious flaw". All this was a shame since the study was a large one (333 COC3 cases and 735 controls) but he concluded that "it missed the boat" and did not carry very much weight. In cross-examination he went further and said it was of no particular significance. Although in cross-examination McPherson was persuaded to accept that it was a useful contribution to the literature, Shapiro's evidence and the fact that Walker was deprived of the opportunity to comment on it at all, since its publication came after the conclusion of his evidence, make me inclined to accord no significant weight to this study in the overall scheme of things. This is a pity, since a great opportunity seems to me to have been lost here. Denmark's National Register made for an ideal source for such a study, and the Danish regulators had not followed the line of the CSM or CPMP, rather the reverse, so that post 1995 data could validly be considered. But my conclusion must stand.
- UK Meditel.
This study was Farmer's first foray into COC / VTE research. He was approached by Organon in early 1991 who commissioned a study at a cost to them of £88,000 based on the Meditel database. This was a privately owned software system which had been sold to certain general practices. The data were collected by the owner of the business and supplied in crude form to researchers. Of the projected price the cost of data was some £55,000, and most of the rest went on a computer programmer.
- The study started in mid 1992 and the data were not in a clean state when supplied. The report was in draft form by June 1993 and was presented in November 1994 at a conference in Montreal by Farmer. Schering asked him to make a further presentation and he responded by saying he thought the data were thin or fragile and not of good quality. An internal document from Wyeth-Ayerst, an American company associated with the Third Defendants, was critical of the study, pointing to its failure to control for smoking, describing the database as sub-optimal, and pointing out that cases and controls had not been actively validated. Generally their fear was that it lacked scientific rigour and would not command acceptance. They approached Farmer at the end of 1994 with a view to his doing a new larger study based on the successor to the Meditel system but decided in the event not to go ahead and embarked on what became known as the Wyeth-Ayerst GPRD Study.
- The paper itself was published in 1995 having been submitted for publication late in the previous year (32). It has major flaws all of which Farmer accepts, and it is not a paper for which he has ever made any extravagant claims. The raw data were as he accepted "very messy" which explains why it took so long, the study was not restricted to idiopathic cases, there was no adjustment for confounding variables, women were included as being exposed even if their last prescription had run out 6 weeks before. At one point in the study an expression of standard deviation was given within a confidence interval; it was put to Farmer that this was "epidemiological gibberish", a criticism which he accepted. In short he accepted readily that this was a study of no value. He was with some justification criticised for a letter in the Lancet the following year in which he said that his data had not supported an increased risk for COC3s. The Claimants suggest that the fact that notwithstanding the unimpressive nature of this study the Defendants were interested in commissioning Farmer to produce further and more expensive studies indicates, in effect, that they were looking to find someone from whom they thought they would get the right answers. I do not believe this is the right analysis. This was Farmer's first attempt in this field and it was not impressive. I am sure he was keen to get more commissions of this type, and know him to be a hard working and now a competent operator in this field. He was, I notice, sufficiently well thought of in this field to have been recruited as a temporary advisor to the WHO Scientific Group, which is not an appointment that would have been made, in my judgement, had he not enjoyed the confidence of the wider epidemiological community at this stage. I can draw no assistance in either direction from the Meditel Study.
- German Mediplus.
I will not lengthen this judgment by dealing with this at any length. The Defendants in their closing submissions lay no emphasis on it as likely to assist me. Farmer himself described it as "not pivotal" nor was it corroborative in his view but merely "in line with" other results he had found. I ignore it.
- UK Mediplus.
Farmer carried out two studies based on this database which, unlike Meditel, was a research standard source of data of this kind.
- The first study (Mediplus 1997) covered the period September 1991 to September 1995 and considered 83 cases of VTE matched in a nested case-control study to 313 controls (33). The adjusted OR for COC3 against COC2 was 1.34 (0.74-2.39), for GSD v LNG 0.87 (0.41-1.83) and for DSG v LNG 0.84 (0.38-1.85). No adjustment was made for smoking and blood pressure. Data on BMI was substantially incomplete.
- The second study (Mediplus 99) covered the period January 1992 to March 1997 and there was considerable overlap, the second study capturing 62% of the cases from the first (34). In the second study 99 cases were considered and in a case-control study matched to 366 controls. The adjusted ORs for DSG and GSD against LNG were respectively 1.4 (0.7-2.8) and 1.3 (0.7-2.7). Whereas in Mediplus 1997 no validation exercise was carried out, in the second paper records were obtained to validate the sub-set of the cases from 1994 to 1997 and it was found that 23% of the 40 cases so reviewed were not confirmed. This must cast considerable doubt on the non-validated Mediplus data. In both studies there were serious data gaps that are particularly noticeable in relation to the important factor of BMI. 35% and 36% of cases respectively had unknown BMI and an imputed figure had to be used. This is a much higher percentage than any of the GPRD studies; in Farmer 2000 the unknown BMI percentage was 17.9.
- The incompatibility of the point estimates between these two studies is of concern as is particularly the non-validation of the database prior to Mediplus 1997. Walker and Thorogood have had regard to these studies in reaching their overall conclusions and were I believe right to do so but I note that Walker did it "with some trepidation" whereas McPherson was inclined to exclude it on the grounds of the heterogeneity of the two studies. In my judgement the weight to be attached to the Mediplus studies is small.
- Wyeth-Ayerst Study.
This is a curious study which was never published and exists in draft form only (35). It is huge, and runs for some 85 pages. It is signed at the end of 1997 by four employees of Wyeth-Ayerst, an American company associated with the Third Defendants. Three other employees of the same company signed it as reviewing and/or approving it. Dean and Thorogood are both listed as consultants to the study, a status which both explicitly and strongly rejected in evidence.
- The study collected data from the VAMP/GPRD database between 1988 and 1995. Unlike Jick and Farmer it did not appear to use the latest edition of that database on a "closed" basis but rather periodically searched the database between those years; it will therefore have used practices later rejected as substandard as a matter of likelihood.
- The study considered 155 cases who were diagnosed as definite or probable VTE events and 629 controls matched to the cases by age to within one year but not matched by practice (again unlike Jick or Farmer). The primary objective of the study was to estimate the RR for first time occurrences of stroke for current users of GSD and other COC3s compared with LNG. Similar analyses were to be carried out into AMI, all CV events and all other VTE events.
- The essential finding was that for VTE and the comparison of COC3 v LNG the OR was 1.69 (1.10-2.60); for GSD it was 2.19 (1.33-3.61) and for DSG 1.31 (0.79-2.18).
- There were certain significant differences between this study and the other GPRD studies. Apart from the question of practice utilisation the most important seems to me the abandonment of matching controls to cases by practice. This was contrary to what was envisaged in the original protocol for the study and contrary to what both Jick and Farmer had done. Though Thorogood recited the arguments in favour of not matching by practice, namely the concern about over-matching (and thus concordant sets) due to a particular practice possibly having a tendency to prescribe in a uniform way, I feel if that had been a valid principle it would have been acted on by Jick and Farmer and was not. Furthermore cases where prescription had occurred 6 months before the index event were included, and this is accepted as a weakness of the study since some of these may not have been in truth exposed at the relevant date. Table 22 of the study stratified the case-control sets based on the presence or absence of stated risk factors. Thorogood pointed out that these were not adjusted for age which in my judgement undermines their usefulness and validity.
- Dean expressed concern about interference by Wyeth-Ayerst personnel (an example of which is visible in the supporting documentation) in the matter of case verification which ought according to the structure of the study to have been a matter entirely for the outside experts appointed for that purpose. As Dean pointed out if this had happened on a more systematic basis with the intention to favour Wyeth's product GSD it singularly failed.
- Much was made by the Claimants of the fact that this study was not published although it is the case that it was plainly sent to the UK regulatory authorities as can be seen by reference to the transcript of the 1999 Medicines Commission Hearing. There is no doubt that it is entirely unpublishable in its present form. It would need to be reduced considerably from its present length and, as Thorogood said, turned into three distinct studies. The faults in its methodology which I have pointed to are not the only holes that can be picked in it and, I have to add, in the many studies I have looked at in this case I have not seen one accepted for publication in a peer-reviewed journal which was exclusively written by employees of the drug company which manufactured the drug in question. Be all that as it may the question of whether or not Wyeth wanted to bury it does not help me answer the question of what weight should be attached to it in any overview of the scientific work into this question. The Claimants say for all its flaws it is of value, that its faults are such that could have been corrected and overall this is capable of being relied upon. I believe that broadly they are right to say so, and that this is some evidence for an RR of about 1.7.
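By way of arithmetic illustration only, the odds ratios and intervals reported for this study can be checked for internal consistency. The sketch below assumes, which the draft study is not quoted as stating, that the bracketed figures are 95% confidence intervals constructed symmetrically on the logarithmic scale; the function name and the multiplier of 1.96 are illustrative assumptions, not anything drawn from the study itself.

```python
import math


def log_scale_summary(or_point, ci_low, ci_high, z=1.96):
    """Back out log-scale quantities from a reported OR and its interval.

    Assumption: the interval is a 95% Wald-type interval, symmetric in log(OR).
    """
    # Under that assumption the geometric mean of the limits
    # should reproduce the reported point estimate.
    implied_point = math.sqrt(ci_low * ci_high)
    # Standard error of log(OR) implied by the width of the interval.
    se_log = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    # The interval excludes 1 when both limits lie on the same side of it.
    excludes_one = ci_low > 1.0 or ci_high < 1.0
    return implied_point, se_log, excludes_one


# Figures as reported for VTE in the Wyeth-Ayerst draft.
reported = {
    "COC3 v LNG": (1.69, 1.10, 2.60),
    "GSD v LNG": (2.19, 1.33, 3.61),
    "DSG v LNG": (1.31, 0.79, 2.18),
}

summary = {name: log_scale_summary(*vals) for name, vals in reported.items()}

for name, (implied, se_log, excludes_one) in summary.items():
    print(f"{name}: implied OR {implied:.2f}, SE(log OR) {se_log:.3f}, "
          f"interval excludes 1: {excludes_one}")
```

On that assumption each reported point estimate sits at the geometric centre of its interval, and the intervals for COC3 and GSD exclude 1 while that for DSG does not. The check says nothing about the methodological criticisms set out above.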
- Farmer 2000 "Pill Scare".
This paper was an attempt by Farmer to test the CSM view of a doubled risk in a way that any non-epidemiologist would regard as eminently sensible (36). It has received a critical reception which I do not think it entirely deserved. His thinking was that if COC3s did indeed carry a doubled risk, and when their use had declined from 53% to 14% almost overnight, then one should be able to see an effect. The CSM's intervention was after all designed to achieve an effect in public health terms. The presence or absence of that effect should be the best way to resolve the issue.
- The paper identified 35.9 cases of VTE per 100 000 women years of exposure in the post scare period between November 1995 and December 1998, as against 34.5 for the pre-scare period of January 1993 to October 1995. This apparent lack of effect could be said to be at the least not corroborative of a doubled risk. Jick 2000 contained a restricted version of this exercise finding 35 cases in a period in which he would have expected 44. When I first read this case this evidence interested me a lot.
- Ignoring the technical objections to how Farmer went about this study, none of which strike me as compelling, I am driven by a consensus of experts to find that no real weight can be put on these findings. Shapiro said that when a dramatic cure is introduced for a disease like polio such that it is made to disappear overnight by intervention in the form of universal vaccination then a secular trends study such as this will show that effect. Otherwise when the disease as here is rare the weapon is too blunt and there are too many other things happening at the same time to allow the comparison to be made. Thorogood for example led evidence of a dramatic increase in the BMI figures for UK women in this period, though they were not exclusively related to those of pill-taking age. There were, we know, many tens of thousands of additional pregnancies and terminations which would have served to increase the VTE incidence rate. Depressingly McPherson from a regulator's point of view confirmed this; when the CSM took this step the probable negative consequences of its action were readily foreseeable, whereas they had no means of knowing whether any countervailing benefit had been achieved. That is still the position today and it always will be, it seems. For my part this paper gives me no assistance one way or the other.
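The arithmetic behind Farmer's thinking can be sketched as follows. This is a deliberately crude illustration: it assumes that the 53% and 14% figures are COC3 shares of all COC use, that every other user carried the baseline (COC2) risk, and that nothing else changed between the two periods. The last assumption is exactly what the expert consensus described above rejects. The function name is hypothetical.

```python
def expected_rate_ratio(share_before, share_after, relative_risk):
    """Expected ratio of overall VTE incidence (after / before) when the
    share of users on the higher-risk product changes.

    Assumptions: users not on the higher-risk product carry baseline
    risk 1, and no other determinant of incidence changes.
    """
    before = (1 - share_before) + share_before * relative_risk
    after = (1 - share_after) + share_after * relative_risk
    return after / before


# Doubled risk, COC3 share falling from 53% to 14%.
expected = expected_rate_ratio(0.53, 0.14, 2.0)

# Rates per 100 000 women years reported in Farmer 2000.
pre_scare_rate = 34.5
post_scare_rate = 35.9
observed = post_scare_rate / pre_scare_rate

print(f"expected ratio under a doubled risk: {expected:.3f}")
print(f"observed ratio: {observed:.3f}")
print(f"post-scare rate implied by a doubled risk: {pre_scare_rate * expected:.1f}")
```

On these simplified assumptions a doubled risk would imply a fall of roughly a quarter in the overall rate, whereas the reported rates rose slightly. The concurrent changes identified by the experts (BMI, pregnancies and terminations) are the reason no real weight can be placed on that comparison.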
SECTION H: BIAS AND CONFOUNDING
- Bias is the epidemiologist's ever-present nightmare. If it operates it will distort the results of any study unless the design takes account of it in advance of the study starting its work, or unless proper adjustment can be made for it when the data are analysed; with most forms of bias, unlike confounding (the two concepts do sometimes overlap) adjustment is not possible. A paper in 1979 identified 35 different types of bias (37), and in my view there is no reason to believe the categories of bias are closed. In essence bias embraces anything which will distort the study by making the sampling process on which it is based unrepresentative or skewed in favour of or against a particular side of the equation. Almost all the main studies in this case and the expert commentaries on them have acknowledged the possibility that biases of various sorts might have affected the results; the controversial question is whether it has to be shown that they have in fact done so, and in that context whether it is for the Claimants to prove that they have not or for the Defendants to show that they have.
- The WHO main study having made its "unexpected finding" about COC3s immediately went on to consider possible sources of bias. It dealt with the possibility of investigation bias (differential diagnostic effort made in the case of OC users leading to an overestimate of the risk); bias introduced by the use of community controls who made differential use of hospital services compared with cases; proxy information bias; also what they saw as a bias introduced by the under-representation of sub-clinical cases of VTE in the study. None of these suspected biases impacted on the issue as between COC2s and COC3s. But they illustrate how, as Walker said, any good epidemiologist will look critically at his data at the end of the analysis and ask the question "Can this be right?". This is an area where calculation and statistics take a back seat and judgement comes to the fore.
- The WHO Study on the differential risk itself warned that its observations were based on an analysis of a secondary study objective and:
"The possibility that they are due to chance, confounding, or bias or a combination of these cannot be excluded entirely".
- In TNS 1 the authors said that before they evaluated the public health implications of the "weak association" they had found they needed to consider the effects of potential biases "that might have distorted the risk estimates". They looked at diagnostic bias, which might have occurred if doctors had been more likely to investigate and thus diagnose those on the newer OCs. They felt that this was answered by the elevated OR in PE cases where thorough investigation would have been uniformly deployed regardless of OC use. Secondly they looked at referral bias which could have arisen if women on COC3s were more likely to have been admitted to hospital; on their data they could not test for or eliminate this bias. Thirdly they considered prescribing bias, which would have arisen if doctors recommended newer products, advertised as safer, for women with higher risk profiles. They thought this had been met by adjustment for such risk factors as age, obesity and smoking (the last of these being a debatable risk factor for VTE on the evidence in this case), but noted they had no data on family history of VTE. In a carefully worded sentence they concluded "The existence of prescribing bias is not corroborated in our data". The fourth bias they considered was the much more complicated question of the attrition of susceptibles, elsewhere called the healthy user effect or bias, which in effect enriched the class of COC2 users in the study with long term and low risk users of COCs. The 12% reduction in the point estimate when they adjusted rather crudely for "duration of lifetime use of OCs preceding the current use" was advanced seemingly as a possible indicator of that bias. Certain of these biases can be looked at as separate concepts. But it has to be borne in mind that they are all interdependent to a high degree.
- Prescriber bias.
Dunn was of the view that this occurs where a doctor might preferentially prescribe COC3s to those at an elevated risk of VTE. This might in my judgement happen because COC3s were or were perceived to be safer in thrombotic terms either because of the prescribers' understanding of the marketing thrust behind them, or because in a more loose way he/she thought that they were newer therefore they must be safer. In fact the claims made for COC3s, properly considered and construed, were that they offered increased safety in relation to arterial disease as opposed to problems on the venous circuit. The predictors for arterial disease (hyperlipidaemia, hypertension and smoking) are not, with the possible exception of the last, predictors for VTE. At all events in Dunn's view, and Walker agreed with him, at least where known risk factors for VTE drive the differential prescription this is not a true bias but rather a form of "confounding by indication", in the jargon term, and can be allowed for when the statistical analysis is adjusted for the various risk factors which drive it. Heinemann accepted this proposition in studies where the data gathered covered all risk factors, and I accept it too as it is plainly a logical proposition.
- It may be helpful to look first at some studies which cover this issue. Two parallel studies were carried out after the pill scare to see if bias was or might have been in play in the 1995 studies (38)(39). They were devised by Heinemann, who carried out the German arm of the enquiry, and the UK arm was carried out by Dunn of the Drug Safety Research Unit in Southampton. I found both these experts to be good, straightforward and helpful witnesses but in very different ways. Dunn was not the most highly qualified or academically high-powered witness in the case, but was a quietly impressive and careful witness. He had the additional advantage of experience in the past and present as a practising part-time GP. Heinemann was an engaging and candid witness, imaginative and lateral in his thinking. There was a slight but not damaging language problem in his evidence. He was entirely honest about the fact that the foundation of his evidence rested in the last resort on what was put to him as being speculation, but appeared unabashed by that. He had been involved on both the WHO and TNS studies, and therefore has a sound pedigree in these matters. He had in my judgement an enquiring mind which was ready to think the unthinkable in response to a problem which evidently caused him genuine intellectual concern.
- Returning to the studies, there is a curious conflict both between them and within the UK study itself. The common concept was that prescribing doctors in the two countries should be asked about their perception of other doctors' prescribing attitudes where women with VTE risk factors were concerned in the years before the pill scare, as well as those other doctors' likely response to complaints presenting to them which were suspiciously suggestive of VTE. Specifically they were asked how likely such doctors would have been to refer such women for further investigation. The studies were done in some haste, probably with a view to their use in regulatory hearings which were upcoming. Each study comprised a survey of doctors' attitudes to prescribing and referral in the "situation 3-5 years ago" (roughly the calendar years 1991-3) as well as a survey of their records to see how these fitted with the attitudes reported.
- As to the first part of this exercise, a striking feature is that the doctors were not being asked to cast their minds back over the divide of the pill scare to recall their own prescribing and referral practices, but rather to reconstruct those of "doctors [sc. in general]" or "physicians in general" respectively, to use the words of the UK version of the questionnaire. How were they to do this? Were they to put forward their personal practice as some form of proxy for that of others? Or were they to reflect as best they could the diversity that results from clinical discretion, and if so how were they to do this? Dunn said his interviewers were trained to answer any questions on these lines by saying that what was being sought was the practice of GPs in general so far as that was known to the interviewee. Furthermore it is a curious feature that the "risk" being considered is not itself defined (whether venous or arterial) and in the second questionnaire (Q2) 3 of the 4 risk factors mentioned are arterial only. There is a strange mismatch in two other ways. While the GPs were being asked to recall other doctors' prescribing habits from 1991-93, the survey of records was of those relating to the interviewee's own patients, and at a later point in time (late 1994).
- The UK attitudinal survey, at all events, revealed an extremely high preference for prescribing COC3s in those years for all women with perceived risk factors, and even some factors whose inclusion as such might now be considered debatable (alcohol abuse and a standing work position). Where more than one risk factor was present doctors were said to be no less than 59 times more likely to prescribe third than second, and all the other preference ratios were in double figures. Well known and recognised risk factors such as smoking and obesity (lumped together) and family history of DVT showed preference ratios of 27 and 23 on their own respectively. The strength of this apparent enthusiasm for the third generation was not matched by the German attitudinal survey. While German doctors also reported clear preferences for prescribing COC3s in the presence of risk factors this was at levels which were a fraction of those recorded in the UK (in the case of the numbers given immediately above the German figures were 3.5, 2.9 and 3.0 respectively).
- When the patients' records were studied in Germany a picture emerged which was broadly consistent with the expressed preferences, even though none of the ORs was statistically significant, and the CIs were quite wide. In the UK a different picture emerged when the patient notes were studied, either in the cases of patients using OCs in 1994 before the scare or in February 1996 after it. In the pre-scare period only women with a family or personal history of DVT were more likely to be on the third generation, and in the 1996 survey the picture was the same. In the case of all other risk factors the indication was that women with known risk factors were more likely to be on COC2s. Again all but 2 of the ORs involved were not statistically significant. Notwithstanding the publicity given to the scare and the huge loss of market share that followed it the 1996 survey implausibly suggested very little change in prescribing habits from what had happened before October 1995. It suggested that the odds against an obese woman being prescribed a COC3 4 months after the scare, for example, were almost exactly the same as before. I find that very hard to believe.
- It seems to me therefore necessary to look at what may have caused this curious result. The 106 doctors surveyed by Dunn comprised 83 GPs and 23 who worked in family planning clinics in the West Country. The GP section of this group was recruited from and through members of a group, with which Dunn himself was associated professionally, called the Wessex Research Network whose members numbered about 120-130 and who were doctors with an interest in medical research, a fact which on its own makes them atypical of the generality of UK GPs. This was the group being asked to recall across the divide of the scare, and about 4 months after it, what other doctors did 3 to 5 years before. GPs work in practices of various sizes, they will see their partners' patients fairly regularly and they also see evidence of other GPs' work when for example patients move practices. Beyond that they operate in a fairly self-sufficient way, though no doubt they talk to colleagues. The attitudinal survey yielded figures which are, I have to say, unbelievably high and cannot command any credence. As to the records-based survey that too is hard to believe, given the expected impact of the Dear Doctor letter; the best thing to say about it is probably that all but two of the ORs fail to achieve statistical significance therefore the whole of that part of the survey may simply not be reliable. I am satisfied Dunn collected this data conscientiously and on a random basis as he says. I just feel the results have to be approached with great caution.
- The German study by contrast is easier to accept. The attitudinal survey is less extreme in the preference ratios which it shows while evidencing a clearly discernible preference for COC3s in perceived risk cases. Again the prescription picture yields mildly raised ORs most of which fail to reach statistical significance except in the "any risk" category.
- Heinemann's unhappiness about the UK study and his attempts to stop its publication are understandable in my judgement; they were not the result of his having an a priori view on the matter, but were because it was in many ways (mostly due to his own design of it, the hurried nature of the task, and not for any failure on the part of the conscientious and careful Dr Dunn) an unsatisfactory piece of epidemiology from which no safe conclusion could be drawn.
- The conclusion I reach from these studies is that while they do not serve to negative the existence of prescriber bias in the 1995 studies, they give no clue as to how great its effect, if present, might be. The measured conclusion of Heinemann's study was that "Prescribing behaviour as oriented by risk as perceived by the treating physician is differential and may contribute to an increased association between COC3 and VTE. Similarly perception of increased risk leads physicians to intensify their diagnostic behaviour in the presence of mild symptoms of DVT". When later Heinemann concluded that this behaviour alone could have entirely explained the point estimates in the 1995 studies he was making in my judgement an unwarranted claim.
- Van Lunsen in 1996 (40) surveyed, by means of telephone interviews carried out in November and December 1995, 306 prescribers from the UK, Germany, Sweden and the Netherlands in relation to their prescribing habits and beliefs before and after the scare. In the UK the sample was 65 GPs and 15 family planning practitioners. 79% of all surveyed prescribed COC3s before the scare as their first choice. The results of the survey were said strongly to suggest selective prescribing of COC3s to women with risk factors for CV disease.
- Jamin in 1996 (41) carried out another telephone survey, this time of 120 French gynaecologists and 262 GPs. It was carried out just after the pill scare but before the papers were published. Of those who exclusively prescribed COCs there were clear preferences expressed for COC3s where risk factors were present, slightly more so in the case of the specialists than the GPs. Dunn fairly pointed out that there was no evidence that these preferences translated into actions on the prescribers' parts.
- Lidegaard in 1997 (42) carried out an analysis of 1200 women who had been randomly selected as intended community controls in an ongoing Danish case-control study. This was done with a view to identifying how different risk factors were associated with specific types of COC in the period covered which was October 1994 to May 1995. 27% of them used second and 57% third generation. COC3 use was notably more prevalent among younger women. 17.9% of COC2 users fell in the 15-24 year old bracket as against 30.5% of COC3 users. It also appeared that in terms of length of use COC3 users had used their pill for relatively shorter durations than the women on COC2s. There was statistically significant evidence for preferential prescribing where there was a family history of thrombosis.
- Returning to the evidence in this case, Heinemann did not accept that prescriber bias in the strict sense of confounding by indication was the only type of bias at play in this area. He believes that in the real world of practice doctors may be ignorant about the science involved, or not up to date, (and one could add in the UK context, unlike in Germany, hard pressed with 6 minutes to devote to each patient). Such doctors may, he feels, confuse concepts which should ideally be kept distinct. They will have a perception of circulatory risks in a general way that does not discriminate as it should between the two circuits of the system. There may be many factors which to the doctor will paint a risky picture: first ever use, old fashioned concepts such as lower limb stasis, women who frequently switch pills, those who are not happy on their present pill for vague or diffuse reasons. He agreed that what came to be called Factor X, one of these diffuse symptoms or conditions which may one day turn out to have a VTE connection when we know more about the body than we do now, was a "nebulous" concept and accepted that he could not say how it worked but believed that there "might be things in this risk array that lead to a risk of VTE". Its importance to him was that it led to the woman so prescribed being in receipt of more attention in terms of referral, making it more likely that she would finish up in hospital when she complained of suspicious symptoms.
- It was argued in response to this that there was no evidence for a pattern of preferentially prescribing COC3s to malcontents. I have to say this point does not trouble me. The usage figures show the onward march of COC3s up to the mid 1990s. I can readily accept that if an OC user presented to her doctor with some of the diffuse or nebulous complaints in this array he/she would react by moving her to a COC3 on a basis which might be no more scientific in its reasoning than "Try these, they're newer and probably better". She would have no realistic alternative, short of advising her patient against OC use altogether.
- But it has to be accepted that there is at present no evidence other than of an entirely speculative kind to support the proposition that any of these factors could in fact be associated with VTE and thus operate as a confounder for which no adjustment could be made. So the force of Heinemann's view is merely this, that what he calls prescriber bias is not a true confounder, but is a phenomenon which lines up COC3 users on course for the pathway of suspicion in a preferential way. As he put it "it is a nebulous concept, an initiation of a series of other steps". He could not say how it worked.
- Shapiro, who was far from being intellectually averse to the notion of unseen biases distorting studies, was not able to go along with Heinemann on Factor X. He said, with characteristic rigour, that he would need to see something which connected it with VTE, and that without that it was "very woolly stuff ... I want something more than just a supposition to go on on this". That seemed to me to sound the death knell for prescriber bias as a distinct feature in this case. To be fair to him, by the end of his evidence Heinemann was indicating that there was not much in prescriber bias and that it was "only a small piece".
- Diagnostic and/or Referral bias.
That leaves for consideration another distinct form of bias which could be called diagnostic or referral bias. Heinemann of course saw them as all part of a continuum. Referral bias if present is a true bias in everyone's view, and is in reality closely connected with prescriber bias and diagnostic bias. It must be considered against the background that it is common ground that VTE is not an easy condition to diagnose. Its initial symptoms, typically leg pain and swelling, can easily be mistaken for other conditions. It may also resolve spontaneously and leave no lasting sequelae. Throughout the studies one can sense an awareness of this problem.
- In logic it has to be the case that the differential referral or "direct referral bias" as it was sometimes called in the trial must be unlikely, as the Defendants concede, since COC3s were generally perceived as safer therefore their users would not be more likely to fall under suspicion of VTE in ambiguous diagnostic situations. But Heinemann, having accepted this, argued for a form of "indirect referral bias", where women with his diffuse risk factors or perceived risk factors would have a greater chance of referral and diagnosis and thus entry into the case series because of their risk factors and would for the same reasons be on the third generation pill. It was put to him that if this argument is right it must also apply in a case-control study to the controls, who must be representative as we have seen, of the population from which the cases come. There will thus be a corresponding distortion in that control population and the bias will therefore not exist. Dunn in cross-examination did not think that in the context of such studies as the WHO and TNS the referral bias of the type which Heinemann describes would affect the controls as much as the cases, and so cancel itself out, since the hospital controls had by virtue of the study design been referred to hospital for conditions which were selected as having no possible link to VTE and the community controls similarly were selected in such a way as to exclude their having any VTE connection.
- Farley and others in a 1999 study (43) considered this form of bias and expressed the view that for it to be of concern the preferential referral and diagnosis must have been completely independent of all factors recorded in the studies, otherwise some effect of adjustment would have been noted. Walker and Dunn agreed with this view. Heinemann described this as a proposition which would be true "in the best of all worlds", but in the real world all risk factors are not recorded in the study. In Annex 2 to his report he had set out the risk factors which were and were not collected as data by the various studies. It is also his view, as I understand his evidence, that this is an area of medicine where all true risk factors are not yet identified, and knowledge of them is still developing, for example the recent emergence of long haul travel. He cited the problem of residual confounding as apparent in the paradoxical findings of Miettinen in which Warfarin was found after adjustment to be associated with about a 4-fold increase in the risk of thrombosis when all biologically based reasoning would suggest the opposite. In effect he falls back on the argument that at these low levels of association epidemiology is an imprecise and vulnerable science, and its microscope has insufficient resolution to distinguish confounding, causation and bias.
- A case-control study by some of the Leiden group of doctors (Leiden 1999)(44) was carried out on patients referred to 2 Dutch diagnostic centres with suspected DVT. Those objectively diagnosed as having DVT were used as cases, and those who were established as not having it were the controls. The controls were thus a phenocopy or "negative print" of the cases, in that they were drawn from the same body of women seeking care with similar complaints and who had all been part of the same referral process. The ORs which resulted were elevated and were in line with other studies. As between COC3s and LNG the adjusted OR was 1.9 (0.8-4.5). The authors took their results as compatible with the 1995 studies and thus as showing that diagnostic suspicion and referral bias had not played an important role in those earlier studies since their study had patients and controls who had been subject to the same referral and diagnostic procedures.
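By way of illustration only, the relationship between the point estimate and the confidence interval quoted above can be sketched as follows. The sketch assumes, and this is an assumption not stated in the passage, that the interval is a conventional 95% interval that is symmetric on the logarithmic scale; the function names are hypothetical and are not drawn from the study.

```python
import math

Z_95 = 1.96  # assumption: two-sided 95% interval under a normal approximation


def log_se_from_ci(lower, upper, z=Z_95):
    """Standard error of ln(OR) implied by an interval symmetric on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def geometric_midpoint(lower, upper):
    """Point estimate implied by the interval if it is symmetric on the log scale."""
    return math.exp((math.log(lower) + math.log(upper)) / 2)


def ci_includes_unity(lower, upper):
    """True if the interval is compatible with no difference in risk (OR = 1)."""
    return lower <= 1.0 <= upper


# Figures quoted above for Leiden 1999 (COC3s as against LNG): OR 1.9 (0.8-4.5)
se = log_se_from_ci(0.8, 4.5)
midpoint = geometric_midpoint(0.8, 4.5)
spans_one = ci_includes_unity(0.8, 4.5)
```

On these assumptions the geometric midpoint of the interval sits close to the quoted OR of 1.9, and the interval spans 1, so that on its own the estimate would not exclude the play of chance at the conventional level.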
- Walker in his first report described the design of this study as "risky" in that the investigator seeks to neutralise a feared bias of unknown magnitude (here preferential referral) by introducing a counter-balancing bias (symptomatic controls) which she hopes will be of similar size. Though he included its OR in his first "meta-analysis" he excluded it from the second, in a proper response to other views he had read about it. But its relevance is that it is said to show that even when the design of the study is such as to exclude entirely referral or diagnostic bias, in that both cases and controls come down the same pathway in order to enter the study, an elevated OR emerges. If Walker is right in saying that the OR from this study is unreliable, then it seems to me it is unreliable for all purposes, including that of supporting the conclusion that there is no referral bias, which itself depends on the OR being right.
- A variation on this study was carried out by Heinemann and others (Heinemann 2000)(45). The exercise performed in Leiden 1999 was repeated, on a larger scale, and cross-referred to a parallel and traditional case-control study. Out of 1068 women aged 15-49 suspected of having VTE 606 were classified as having the condition (346 being classed as "definite", 49 "probable", 103 "possible" and 211 "equivocal") leaving 462 proven non-cases. At the same time 2942 population controls were selected. The cases were then subjected both to a case/non-case analysis and to a case/control analysis. The thinking behind the study was that if there were no referral or diagnostic bias the risks emerging from these two studies using the same cases should be the same, but if there was, then use of the non-case controls should yield a lower risk than use of the population controls; the former category of woman would have passed through a selection process entry to which was partly dependent on OC use and would therefore tend to have higher levels of OC exposure than the general population of women. The results of this study were such that significantly lower ORs emerged for the case/non-case analysis, most noticeably when the "definite" and "probable" categories of case were separately analysed. The effect was discernible but less marked in the categories of "possible" and "equivocal". The authors appeared to explain this difference on the basis of a referral bias being at work, which Walker in cross-examination accepted would be a fair explanation. There is nothing in this study to suggest that the referral bias was differential as between the COC generations, nor did the study set out to test that.
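The reasoning behind the comparison of the two analyses can be sketched with a small worked example. All of the counts below are hypothetical and are chosen only to show the direction of the effect; they are not the figures from Heinemann 2000, and the function name is likewise an assumption made for illustration.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Odds ratio from a 2x2 table of exposure (OC use) against case status."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)


# Hypothetical counts: the same cases are used in both analyses.
exposed_cases, unexposed_cases = 60, 40

# Hypothetical population controls: 30% exposed to OCs.
or_population = odds_ratio(exposed_cases, unexposed_cases, 30, 70)

# Hypothetical referred non-cases: if referral depends partly on OC use,
# this group carries more exposure (here 45%) than the general population.
or_non_cases = odds_ratio(exposed_cases, unexposed_cases, 45, 55)

# A lower OR against the non-cases than against the population controls
# is the pattern the study treated as a sign of referral bias.
referral_bias_pattern = or_non_cases < or_population
```

The sketch shows only that higher exposure among referred non-cases pulls the OR down; it says nothing about whether any such effect differs between the COC generations.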
- Another study relied on as undermining the possibility of diagnostic bias is Parkin (29) based on data from New Zealand, and already referred to in Section G paragraph 234 above. The point of the study was that its cases would be unlikely to be affected by diagnostic bias since "most young women who die unexpectedly are referred for necropsy". That said and acknowledged it is not of great use as establishing or refuting the existence and effect of this form of bias unless confidence can be reposed in its results, which I have already said I cannot do save to a very small extent.
- Conclusions on Prescriber, Referral and Diagnostic bias.
There is little left of Heinemann's version of prescriber bias. In so far as they are distinct entities referral and diagnostic bias are plausible forms of bias in this case, but there is no clear evidence as opposed to suspicion that they were in operation in such a way as to affect the point estimates in an upward direction. If therefore the resolution of the first issue depended on whether some overall point estimate derived from the relevant studies which was over 2 should be reduced below 2 to account for the biases discussed above, while I find myself suspicious that these or some of them may one day prove to have been in play, I would not be able to say on the evidence before me that they probably were, or that it would be right on their account to reduce the values which I consider that the studies otherwise show before arriving at a "true" RR. That leaves only for consideration the possibility of hidden multiple biases.
- Hidden bias and confounding.
It is widely accepted that observational epidemiology, especially where it is dealing with weak associations and non-infectious disease risk factors, is working at the limits of its operating range. This combined with the readiness of the scientific media to pick up and run with environmental health scare stories makes for a potentially dangerous mixture. But a thoughtful article in the American journal Science in 1995 (46), with whose fundamental message Walker was in substantial agreement, underlines this problem. The well known statistician Norman Breslow whose work has been encountered elsewhere in this case said in connection with advances in methodology in this area:-
"Today people are doing much more in the way of mathematical modelling of the results of their study, fitting of regression equations, regression analysis. But the question remains: What is the fundamental quality of the data and to what extent are there biases in the data, that cannot be controlled by statistical analysis? One of the dangers of having all these fancy mathematical techniques is people will think they have been able to control for things that are inherently not controllable".
The attainment of statistical significance does not answer this problem, since it eliminates the play of chance only. No less an authority than Sir Richard Doll of Oxford University is quoted in the same article as saying that no single study is persuasive by itself unless the lower limit of its CI falls above a threefold increase in risk. Others have put this yardstick at a fourfold risk.
- Professor Shapiro is firmly in the sceptical camp. I find him the most impressive expert witness in this case. His great experience and easy command of the material were impressive; he rarely needed to look at a study and made few mistakes. His readiness to consider the views of others and make concessions to them, criticised at times by the Claimants as an undesirable quality, and his sense of the limits of the science he practised made him almost unique among the lead experts in the case and commended him to me. At times he expressed himself forcefully, even on occasions (e.g. Leiden 1995) using hyperbole and that has to be allowed for. But I found myself increasingly reliant on his approach to this problem, albeit unable in the end to go the full distance with him. His views are based on a particular professional experience which he called his "Road to Damascus" experience.
- In 1974 when working with Jick in the BCDSP at Boston he produced a paper suggesting that there was a RR for a then commonly used anti-hypertensive drug Reserpine and breast cancer of over 3 (47). Two further studies one in Bristol (involving Sir Richard Doll) and another in Helsinki were published with it suggesting the RR was around 2. Over the next six years 12 further studies were published, most confirming the risk. In 1980 a definitive study by the WHO International Agency for Research on Cancer, using essentially the same methods as the first study but with a far larger study base, established to general satisfaction that there was no such risk. Many epidemiologists have a similar tale to tell. Heinemann gave the example of Miettinen who had had such an experience. Other examples are given in the article in Science to which I have referred.
- The problem with confounding is that it requires a keen eye to spot it. Sometimes it is obvious. The absurd but imaginary example is sometimes quoted of the study which finds an association between drinking tonic water and cirrhosis of the liver which has failed to detect the confounding effect of gin. A better example from real life was given by Shapiro in his report. A paper in 1992 (48) found that compared with non-smokers the relative risk of suicide for those who smoked was significantly increased and that the trend of increase rose according to the number of cigarettes smoked. The literal minded investigator might have deduced from the study that there was some underlying causal link between smoking and suicide. But a moment's thought reveals the confounding effect of depression which is associated both with smoking and forms of high risk behaviour. That one was quite easy to spot. Others, it is argued, may be more subtle and difficult to detect.
- The WHO Scientific Reference Group (4) said in this connection that a study might find a statistically significant association but that did not make it causal:
"This conclusion can be reached only after alternative explanations for the finding are excluded as implausible. These explanations include coincidence, bias and confounding. Observational studies can never fully eliminate the effects of bias and confounding... It is especially important to consider which of the numerous potential biases or sources of confounding are likely to have affected a particular study, and what effect these may have on any inferences from the study".
The Defendants do not pin their colours to any particular mast but say that there are here in operation a number of potential biases, in addition to those specific forms of it with which I have dealt above, which have an additive effect. Where there is a weak association by definition it will be more sensitive to the effect of bias and lead therefore to a greater need to examine the possibility.
- Heinemann spoke of a "network" of bias at work here. Much of it was in his view speculative in nature. In a study in 2002 (49) he has begun to develop an interesting concept which I could call, for want of a better term, "hospital study bias". He demonstrated a near doubling of risk when ORs are calculated from hospital cases and controls rather than from cases and controls which came from the whole community. He expressed the opinion that hospital based studies, which were by definition largely confined to idiopathic cases of VTE, that is to say cases for which other potential causes were identified and excluded, used a highly selective case population which led to a marked overstatement of risk of VTE. He further said, more controversially, that this "seems to be differential for use by generation of OC". That statement seems less well supported by his paper, in my view. But his more general proposition is that the emphasis on hospital cases, inviting as it does the use of hospital controls, may fail to capture the characteristics of the target population, that is to say healthy women in the community who take OCs. This was a line of argument that Thorogood accepted, with some reservations, as being very interesting. In my judgement it is not fully made out but is an indicator of something on which further work could profitably be done.
- Shapiro talked of multiple biases "each of which is quite small and each of which tends to operate in the same direction" as making a contribution. His general position was that as a rough rule of thumb a RR of less than 3 should be considered tentative at best without more ado.
- McPherson thought that the kinds of bias contended for in this case by the Defendants were plausible but that that was not enough to undermine confidence in the RRs produced. His argument was that the biases should be identified, their direction of operation ascertained i.e. do they increase or reduce the RR, the amount by which they do so should be ascertained and further investigation of them should be carried out by further studies preferably on different datasets. The Defendants argue that McPherson in this part of his evidence was talking more as a regulator than a pure scientist in the sense that he would not assume the effect of bias in the face of an apparently elevated RR where public health issues were in play; he was putting the burden of disproof on those who supported the drug.
- In his vigorous but respectful debate with epidemiological colleagues conducted in the pages of the American Journal of Epidemiology, Shapiro (50) had acknowledged the dilemma posed by evaluating the risk of common diseases in relation to common exposures, where even small elevations to the RR "well below 2" might have profound public health implications. It is his view that it may well not be possible to judge whether such low magnitude risks can be accounted for by bias. The focus of the epidemiological microscope, he argues, is not fine enough to detect them. He understands the need for the regulator to take a different approach and act when confronted by such a risk. But as a scientist his position was and is a sceptical one as to whether such studies indicate a causal connection.
- He was answered in an article by Dr. Irva Hertz-Picciotto (51). She accepted that the "appropriately sceptical epidemiologist" must ask the questions that Shapiro was raising but continued, in a passage with which Shapiro agreed in cross-examination:-
"In light of the above and, indeed, for any small or even moderate associations I suggest that the question is not whether we can devise some scenario that could produce the relative risks that we observed. Rather we should ask: 1. What evidence exists that upward biases are present and that they outweigh biases in the other (downward) direction? And
2. How does the evidence for such biases compare with the evidence for causality, i.e. is it more or less convincing?"
This seems to me a sensible approach to the question. Shapiro's own views on what was called "Heinemann's Factor X" are in point on this question, albeit given in the context of residual confounding by indication. He said:
"... there is a point where we are not permitted to speculate in that way, so it would not matter. May there be some factor X which is connected both to the exposure and to the outcome? My attitude is I suppose there might be, but I want something more than just a supposition to go on [on] this".
With this I agree.
- Conclusions on bias and confounding.
I cannot, as already stated, identify the evidence which I feel I need to see in this case to allow me to say that prescriber, referral or diagnostic bias together or separately have been in operation in the main studies in this case in such a way as to account wholly or to some measurable extent for the elevated point estimates in them. Nor can I find that some residual innominate form of bias or confounding has so operated. The remaining candidate is a form of confounding factor or bias introduced by the so-called duration of use effect, where again suspicion or speculation, however appealing, is not an adequate basis for judgment in this case. Cox, if right, remains the only hard evidence for it, and if it is right, as I believe it is, it is both solid and compelling evidence for its existence and the direction and magnitude of its effect. But for Cox I would be obliged to find that the main study results stand at the values I have attributed to them, and that there is no warrant for further reducing them from the figures I have indicated above to arrive at a "true" as opposed to a merely apparent RR.
- Industry funding bias.
This section is perhaps the convenient place to deal with this form of bias, said by the Claimants to operate in the opposite direction. It does not relate to the data as such, but describes a state of mind present in those conducting studies supported financially by drug companies. The word is used in a way much closer to its everyday sense.
- The Claimants are unable to point to evidence of this in operation let alone quantify its effect. One piece of possible skulduggery in the Wyeth-Ayerst study apart there is no direct evidence of it, and as to that Dean was anxious to withdraw his initial characterisation of it as falsification. The whole argument is based on the fact, and it is a fact, that the studies which "help" the Defendants here are all studies they have funded, and the "independent" studies, by and large, do not help them. As a general proposition this immediately encounters the problem of the original TNS study. This is plainly the most impressive single piece of evidence in the whole debate, as few would disagree, yet it was fully funded by Schering. The TNS "helps" the Defendants in this case in the light of the litigation-specific threshold embedded in Issue 1. In the wider sense it was far from helpful to them and was a powerful piece of evidence relied on by the UK, German and European regulators whose actions caused the bottom to drop out of Schering's market for Femodene in commercial terms.
- The hard truth is that most drug studies are paid for by the industry because governments do not like to spend money on them. Jick's group the BCDSP appears to run on industry funding, and he was at least part commissioned by Organon for Jick 95 (though I have no evidence that they paid him for it). Of course the drug companies want results which boost the value of their products and thus their share prices. But they know they will waste their money unless those studies can withstand the type of scrutiny that has been directed at them in this case. The memorandum concerning Farmer's Meditel study referred to at paragraph 241 above is evidence of that. All investigators have to submit their methods, data and conclusions to examination in a transparent way. The minor decisions that McPherson cited such as whether to categorise a case as a case or not are generally taken care of by methodology which blinds the selectors to exposure. But of course there are ways in which a person so inclined can set out to produce a false result, or manipulate data into a form to suit his paymasters.
- Just as I said when examining questions of bias operating in the other direction, so in this case I should say that there has to be evidence indicating the presence, operation and effect of this bias, not just suspicion or speculation, and that evidence I do not see. I acknowledge that there is something of a "team" look to the Defendants' array of experts; Lewis, Heinemann, MacRae and Farmer could fairly be said to be members of the McGill Potsdam Surrey axis, if I may be forgiven for calling it that. But they are all true experts in this field, albeit in Farmer's case as a late entrant, and all have operated in a transparent environment. Shapiro is entirely beyond criticism of this nature, being quite his own man in these respects. But there are pressures working the other way, pressures to publish something interesting, to be associated with the discovery of a new health risk. I doubt if they have operated on the minds of any of the Claimants' advisers, but they are there. While I am alive to its presence, I believe here too I should remain faithful to the Hertz-Picciotto precepts; I do not make any specific allowance to any of the figures which I have otherwise reached in this case to reflect the influence of the Defendants as study funders.
SECTION I: CAUSALITY
- The Bradford Hill Criteria.
The famous epidemiologist Sir Austin Bradford Hill in an address in 1965 considered the difficult question of when a statistically apparent association between A and B should be translated into the proposition that A causes B (52). He proposed nine criteria for consideration, which have been described as an attempt to systematise common sense, but which repay consideration. I set them out in his (descending) order of importance.
(1) The strength of the association. He cited the 9-10 fold difference between smokers and non-smokers in terms of the risk of cancer and the 20-30 fold increase for heavy smokers, as against the two fold increase for coronary thrombosis. Here we have weak associations at best, which on their own are not strong support for a causal link.
(2) Consistency derived from repeated observations by different persons in different places and at different times; Cox apart there is considerable consistency here in my findings in the overview section which follows, albeit the studies are all retrospective and none is prospective.
(3) Is the association specific to the exposure in question? This is an example of a question capable of being blurred by aggregation of COC3s. If they are separated out into different progestogens specificity becomes less clear.
(4) Temporality; on any view the cart comes after the horse as it should.
(5) Biological gradient; the Mercilon phenomenon weakens the argument for causality, as would a heavier death rate in light smokers had such been found.
(6) Biological plausibility; I will deal with this under Haematology below. Bradford Hill accepted that this was a feature which could not be insisted on, since to do so would be to hold up the increase of knowledge until experimental science caught up with valid advances made by observational studies.
(7) Coherence: is what is known of the relationship coherent with all other knowledge? The WHO gave the answer to this when they called their original finding "unexpected".
(8) Experimental evidence. None can be prayed in aid here, nor could there be any in this field.
(9) Analogy from other fields. No assistance comes from this.
- Haematology.
Initially this issue was due to take three weeks of the court's time and involve five expert witnesses. In marked contrast with the epidemiologists, this area of the expert evidence was dealt with in a commendably sensible and cost-saving way. That this was so is due to the strenuous efforts of counsel and the open minded and professional way in which the haematological experts assisted the court. Counsel drafted a comprehensive and clear set of questions to be addressed by the experts before trial, and the experts considered them and gave clear answers. The result was that the remaining issues were greatly narrowed and disposed of in one day rather than 15.
- There is a complex haematological "cascade" which keeps the blood in a healthy state of balance, between a tendency to thrombosis or clot formation at one extreme and excessive bleeding at the other. This is done by an interplay between the walls of the relevant vessels, circulating platelets, coagulation factors in the blood and protein inhibitors which check inappropriate activation of thrombotic activity. The measurement of changes in all these constituents of blood is complex, highly technical and an area of medicine where our state of knowledge is very much imperfect. New developments are occurring with some frequency. In the last decade, to take two examples of prime relevance to this litigation, a new thrombophilic genetic mutation present in about 10% of the population, Factor V Leiden as it is known after the group which identified it, has emerged. A new technique, called the Rosing assay, has enabled reliable measurement to be made of Endogenous Thrombin Potential (ETP), itself a measure of the Activated Protein C sensitivity ratio (APC-sr).
- There was an original consensus among the experts that COC3s have a differential effect as compared with COC2s on the following haemostatic variables.
(1) Factor VII is raised on average by 19%. Professor Machin for the Claimants believed that this had a probable causal effect increasing the RR of COC3s, but the Defendants' expert Dr Baglin did not accept that there was any evidence of a causal relationship;
(2) Free Protein S was reduced by 15-17% and either "certainly" (Machin) or "possibly" (Baglin) associated with VTE;
(3) ETP as measured by the Rosing assay was "altered significantly" by COC3 use, though it was not possible to express this as a percentage.
(4) Thrombin activatable fibrinolysis inhibitor (TAFI) was increased by some 2-4%. Machin thought this a possible contributory factor, Baglin an improbable one.
- By the close of the trial this agreement had been refined and developed to the following currently agreed position, subject to one small area of disagreement to which I will refer.
(1) The Defendants do not contend that a "true" RR of more than 2, which the Court may find proved on the basis of the totality of the epidemiological evidence, should be rejected on the basis that the evidence of a mechanism is absent or inadequate.
(2) The Claimants do not contend that, if the epidemiological evidence fails to establish such an RR, it should nevertheless be found proved because the haematological evidence establishes that there is a mechanism which accounts for such an RR.
(3) There is no biologically understood reason that would explain an epidemiological finding (if there is one) that 20 µg EE pills carry a higher RR of VTE than 30 µg pills with the same progestogen and this would not be expected to be the case on the basis of present haematological understanding.
(4) It is agreed that there is no distinction to be drawn between DSG and GSD as to their effect on the haemostatic factors of relevance to this case.
(5) The following is agreed in relation to the effect over time of COCs on haemostatic variables:-
(i) the effect is produced essentially within one treatment cycle and certainly by the third;
(ii) there are no further changes with continuing COC use;
(iii) during the pill-free week (if taken) there is a slight reduction in haemostatic effects. If the pill-free week is not taken there is no such reduction;
(iv) on discontinuance, the haemostatic effects of COC use disappear within 6 weeks and largely after 2 weeks, i.e. levels of coagulation factors return to baseline, with no past user effect.
- Therefore there is no biologically based obstacle to the proof of this case on a statistical basis, nor is the claim proved or materially assisted by this evidence.
- Dr. Baglin for the Defendants does not believe that this agreement precludes, in haematological terms, a "duration of use effect". He stresses the relationship between the risk associated with pill use, which achieves a plateau after 2 or 3 cycles, and all the other risks which are present in an individual. He argues that not only does the interaction between the pill-dependent risk and the pre-existing risk and the increasing intrinsic risk added by ageing have to be considered but that in addition, when a population is being considered, and because that population is made up of different individuals, the new risk posed by the new factor will be different in each individual within the population as will be the case for the risks already present and the interaction between the two. He relies on a model demonstrated in a 1999 paper by Rosendaal (53), which made a very late entry into the literature in this case, but which he and Machin agree is a:
"theoretical(TB)/hypothetical(SJM) model... a simplistic approach to help understanding of a complex issue, but [which is] of no direct quantitative value for individual patient's risk estimates".
There has yet to be any haematological study on a long term follow-up basis over 20 years or so such as would be needed for a full answer to be given to this. The assistance I get from Rosendaal is slight. Haematology neither advances nor precludes a "duration of use" argument; this can only be resolved by epidemiological means.
SECTION J: OVERVIEW OF THE STUDIES
- With the sole exception of TNS 3 (the Cox analysis), no single study has any plausible claim to exclusive access to the true figure; indeed no study attempts to make such a claim. But, Cox apart, there is a broad range of results, and the question arises whether one can reconcile, combine or draw an overall conclusion from these results and if so how.
- Epidemiologists have a technique which they call "meta-analysis", meaning an analysis across a range of studies, though as with so many topics in this case controversy exists as to whether its application is a legitimate exercise or something which should even be attempted. While it can be and often is used to combine the results of randomised controlled trials its deployment with observational studies whose datasets are not combinable is beset with difficulty. What is plain is that whatever technique is used a great deal of subjective judgement comes into this stage of the enquiry. Should certain studies be excluded entirely? All agree that some (e.g. Andersen) are so flawed as not to be worthy of any consideration. Should parts only of certain studies be considered for this purpose and other parts discarded? This arises as a very important question in the assessment of both the WHO and TNS studies. Do some studies carry greater weight and authority than others, and if so how should this be measured (if it can be) and reflected? How should the Database studies be dealt with? Should one alone be considered for inclusion and the others rejected entirely? Should they all go in or should some synthesis of them be attempted? One thing is clear; it cannot be the task of the Court to attempt to frame its answer to the essential question in statistical or pseudo-statistical terms, for example giving a synthesised figure of its own with confidence intervals. Is it enough for the Court to answer the question either Yes or No, to say that the true risk is or is not above 2 and say no more than that? These are among the difficult issues which arise at this stage of this judgment.
- Two so-called meta-analyses have been published in this matter namely Kemmeren (54) and Hennessy (55) who came to figures of 1.7 (1.4-2.0) and 1.7 (1.3-2.1) respectively. No expert in the case has laid any great emphasis on these and I therefore merely note them as part of the literature in the case.
- More importantly, I should consider how the various expert witnesses addressed this problem. I should stress at the outset that when forming such estimates each of them had before his or her eyes the forensic imperative of either proving or refuting a relative increase of more than 2; this knowledge cannot but have influenced their thinking, particularly in a case notable for the highly committed, sometimes even partisan tone of some of the expert evidence in this field. In a field of science where great store is set by blinding those who have to exercise professional judgements (for example assessing exposure of subjects in a trial without knowing who is a case and who a control) so as to eliminate investigator bias this is a remarkably unsatisfactory feature. In the truest sense of the word all the experts who attempted this task having previously expressed a view on the issue were biased, however hard they tried to be objective in their task.
- Thorogood took a robust and attractive approach. She selected the 7 studies which she thought were the most impressive, namely WHO (Oxford hospital controls only), TNS (UK and Germany only), Jick 1995, Mediplus, Herings (all users), and Parkin. She ran her eye across the results and gave a rough and ready estimate of the true risk of "around 2.1" without CIs, saying it was almost certainly greater than 1.7 and less than 2.5. She had seen and drawn support from Walker's first "meta-analysis" (which then stood at 2.3) when reaching this conclusion. The very simplicity of this method is refreshing, free as it is from that phenomenon of superficial but misleading precision which so besets this discipline and which Greenland tellingly called "pseudo-accuracy". The hard fact is that choices have to be made at this stage and those were the choices she made. But its lack of transparency is a problem, since at the heart of it lies a judgement which it is hard to get behind.
- Walker took a more elaborate approach, which fell short of a full-blooded meta-analysis. It was as I understood it an ad hoc procedure which he devised for this particular case. He selected six studies. He excluded studies which duplicated analysis from previous reports or which reported only subsets of the original data. He gave each a statistical weight calculated as the reciprocal of the variance of the natural logarithm of the effect measure (i.e. the OR or RR as the case might be). This gave a heavier weight to the larger studies, as one might expect. From the appearance of the table he then, more controversially, seems to have added a measure of his own which he called "study design weight" on a scale ranging from 0-50, apparently at 5 point intervals. He had already excluded entirely those studies which in his view had case ascertainment problems or a study design which was so weak that he thought they "offered no relevant information". By this method he was explicitly asserting that a weighting scheme based solely on the statistical power of the studies, such as had commonly been deployed in other meta-analyses, would as he put it "fail to account for the diversity of information sources".
- In a later report and in cross-examination he said that the process in fact worked differently. As he did not want the bigger studies to overwhelm the others, as would have happened using only a standard weighting scheme, he formed some form of overall assessment of the "evidentiary value" of the studies, set that against the calculable statistical weight "and it turned out that to a reasonable approximation, what was left over corresponded to my sense of the non-statistical reliability of the studies. I therefore re-assessed the reliability in round numbers and added the quantities back up to get to the overall score". I have to say that I found paragraphs 17-27 of his Supplementary Report, from which this comes, and the evidence he gave in cross-examination on Day 5 to be Walker at his least convincing, to put it mildly. Taking, for example, the first study in the table (set out below) he is saying that he formed the initial assessment of the overall value of the WHO at the curious figure of 59, calculated its statistical weight at 14 and was left with 45 which was assigned to the column "design weight". It also means that this process when repeated through the table yielded on every occasion round figures for the design weight ending in 5 or 0 and a predominance of ragged figures (5 out of 7) for the "combined weight" column which if Walker is right was the starting figure.
- In whatever sequence it was done one part of his adjustment was an objective or statistical calculation and another an opaque and subjective value judgement, which was combined with the objectively calculated element. This tortuous, unconvincing and poorly described process is the least impressive of the various attempts to reach overall conclusions from these studies, the problem probably stemming from Walker's firmly held belief that "anything that cannot be expressed in numbers is woolly thinking".
- He made adjustments to his original calculation when he wrote his third report just before trial, quite properly taking into account further information then available. The end result was an overall estimate of 2.2 (1.3-3.7) as appears from the table below.
| Study | PE | LCL | UCL | Statistical Weight | Design Weight | Combined Weight |
|---|---|---|---|---|---|---|
| WHO | 2.7 | 1.6 | 4.6 | 14 | 45 | 59 |
| BCDSP (JICK) | 2.2 | 1.3 | 3.6 | 15 | 40 | 55 |
| TNS (S.RIM) | 5.2 | 2 | 13.6 | 4 | 5 | 10 |
| TNS (UK+GER) | 1.73 | 1.15 | 2.38 | 29 | 45 | 74 |
| UK MEDIPLUS 97 | 1.34 | 0.74 | 2.39 | 11 | 15 | 26 |
| LEIDEN 95 | 2.5 | 1.2 | 5.2 | 7 | 50 | 57 |
| HERINGS | 2.3 | 1.5 | 3.7 | 19 | 20 | 39 |
| SUMMARY | 2.2 | 1.3 | 3.7 | | | |
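The statistical weights in Walker's table can be checked against the published confidence limits. What follows is a minimal sketch and not Walker's own calculation. It assumes that each interval is a 95% interval symmetric on the log scale, so that the standard error of ln(RR) is (ln UCL - ln LCL) / (2 x 1.96), and that the summary point estimate is a weighted mean of ln(PE). The names `STUDIES`, `statistical_weight` and `pooled_estimate` are illustrative only.

```python
import math

# (study, PE, LCL, UCL, design weight, combined weight) as set out in the table
STUDIES = [
    ("WHO",            2.7,  1.6,  4.6,  45, 59),
    ("BCDSP (JICK)",   2.2,  1.3,  3.6,  40, 55),
    ("TNS (S.RIM)",    5.2,  2.0,  13.6, 5,  10),
    ("TNS (UK+GER)",   1.73, 1.15, 2.38, 45, 74),
    ("UK MEDIPLUS 97", 1.34, 0.74, 2.39, 15, 26),
    ("LEIDEN 95",      2.5,  1.2,  5.2,  50, 57),
    ("HERINGS",        2.3,  1.5,  3.7,  20, 39),
]

Z95 = 1.96  # assumption: the published limits are 95% limits


def statistical_weight(lcl, ucl):
    """Reciprocal of the variance of ln(RR), with the variance backed out of the CI."""
    se = (math.log(ucl) - math.log(lcl)) / (2 * Z95)
    return 1.0 / se ** 2


def pooled_estimate(points, weights):
    """Weighted mean of ln(point estimate), returned on the ratio scale."""
    total = sum(weights)
    mean_log = sum(w * math.log(pe) for pe, w in zip(points, weights)) / total
    return math.exp(mean_log)


points = [row[1] for row in STUDIES]
stat_weights = [statistical_weight(row[2], row[3]) for row in STUDIES]
combined_weights = [row[5] for row in STUDIES]

pooled_statistical = pooled_estimate(points, stat_weights)
pooled_combined = pooled_estimate(points, combined_weights)

for row, w in zip(STUDIES, stat_weights):
    print(f"{row[0]:15s} statistical weight {w:5.1f}")
print(f"pooled, statistical weights only: {pooled_statistical:.2f}")
print(f"pooled, combined weights:         {pooled_combined:.2f}")
```

On these assumptions the rounded statistical weights agree with the "Statistical Weight" column, and the combined weights give a point estimate that rounds to the tabulated 2.2. How the summary confidence limits were derived is not stated, so the sketch does not attempt them.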
- Methodology apart, there are a number of controversial features involved in this exercise:-
(1) It includes for the WHO study the all centres RR of 2.7, not the results for UK/Germany only; two of those linked to the study who gave evidence, namely Thorogood (on its writing committee) and Heinemann (an investigator), thought the Oxford hospital based figure of 2.2 was the best estimate.
(2) It includes one GPRD study only, Jick 1995, and rejects Jick 2000 and Farmer 2000.
(3) It includes, albeit with a very low weighting, the Southern Rim data which most others discarded, and which the Claimants themselves also now exclude.
(4) It excludes TNS 3, the Cox regression analysis based on the pill calendar data.
(5) It includes Leiden 95, and gives it a maximum score for design weight, when serious criticisms have been made as to its reliability as a relative risk study.
(6) It excludes Lidegaard entirely, rather than including it with if appropriate a reduced design weight.
- In fairness to him Walker put this forward as what he calls "a cross-check" for the Court's use when deciding this issue rather than a reliable statistical exercise and in my judgement he was right to limit the claims he made in this way.
- To show how sensitive such an exercise is to possible permutations, six spreadsheets of alternative calculations were put to Walker in cross-examination showing on a "what if" basis the effect of various changes to his table. If the subjective study design weights were removed the RR fell to 2.05. If they remained, but the TNS Southern Rim figures and Leiden 1995 were excluded, and the WHO Oxford hospital controls RR of 2.2 was used, the RR fell below 2. Inclusion of the Cox figure and Farmer brought it down to 1.5.
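The sensitivity of the summary figure to such permutations can be illustrated as follows. This sketch is not one of the six spreadsheets put to Walker, whose inputs are not reproduced here. It assumes that the summary is a weighted mean of ln(RR) using the combined weights from his table and, as a simplification, it leaves the weight for the WHO study unchanged when the Oxford hospital controls figure of 2.2 is substituted. The helper names `pooled` and `what_if` are hypothetical.

```python
import math

# study -> (point estimate, combined weight), taken from Walker's table
BASE = {
    "WHO": (2.7, 59),
    "BCDSP (JICK)": (2.2, 55),
    "TNS (S.RIM)": (5.2, 10),
    "TNS (UK+GER)": (1.73, 74),
    "UK MEDIPLUS 97": (1.34, 26),
    "LEIDEN 95": (2.5, 57),
    "HERINGS": (2.3, 39),
}


def pooled(studies):
    """Weighted mean of ln(RR) across the studies, returned as an RR."""
    total = sum(w for _, w in studies.values())
    return math.exp(sum(w * math.log(pe) for pe, w in studies.values()) / total)


def what_if(studies, drop=(), replace=None):
    """Copy the table, dropping some studies and replacing some point estimates."""
    out = {name: row for name, row in studies.items() if name not in drop}
    for name, pe in (replace or {}).items():
        out[name] = (pe, out[name][1])  # assumption: weight left unchanged
    return out


baseline = pooled(BASE)
scenario = pooled(
    what_if(BASE, drop=("TNS (S.RIM)", "LEIDEN 95"), replace={"WHO": 2.2})
)
print(f"baseline {baseline:.2f}, without S.Rim and Leiden, WHO at 2.2: {scenario:.2f}")
```

Under these assumptions the second scenario falls below 2, which is the direction of movement described in cross-examination; the exact spreadsheet figures may have differed.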
- McPherson adopted a third approach. In his disclosed report at various points he assessed the increase in risk in words as follows:
"roughly a doubling of risk
at least double
probably more
at least 1.8 fold
very likely to be in excess of 2
unequivocally in excess of [1.93]
in excess of twofold
the overall results from independent studies would indicate a relative risk of around [2.4] with confidence limits between [1.9 and 3.0]".
- In addition he purported to carry out a conventional meta-analysis while acknowledging the difficulties in using such a technique with observational studies, since any biases in the component studies will be pooled in the final result. He included both Jick studies, Leiden 95, WHO all centres, TNS 3 not TNS 1 including Southern Rim data, Mediplus 97 and 98, Lidegaard 98, Herings first time users and not all users and Parkin. These yielded a "pooled OR" as he called it of 1.87 (1.57 - 2.23).
- He then carried out the same exercise for what he called "all sponsored studies" (the TNS, Mediplus and Lidegaard) and reached a figure of 1.53 (1.23 - 1.91) and for all other studies, which he called "all independent studies", reaching a figure of 2.62 (1.98 - 3.49). There then followed exercises showing what would happen if various studies were selectively removed from the analysis and a great variety of figures emerged.
- Finally he calculated what would happen if the TNS figure used was 1.5 as per TNS 1 not 1.7, and alterations were made as to the lower confidence interval, of which he was suspicious in what can only be described as a generalised way. The broad effect of using the lower point estimate of the powerful TNS was to reduce the pooled OR in the 10 study calculation to a range of 1.63 (1.44 - 1.84) to 1.81 (1.51 - 2.18). He then excluded both Farmer studies and Lidegaard and reached a range of 1.68 (1.48 - 1.91) to 2.03 (1.65 - 2.51).
- So far, so complicated. However on entering the witness box McPherson replaced all these calculations with new ones. The key features were that his first 10 studies overall analysis became 9 studies, as he realistically replaced the two Jick studies, based of course as they were on the same database, with a composite figure excluding those cases common to both; different values for Leiden 95 were used, the TNS remained at 1.7 but with widened CIs, and Herings was reduced to the "all users" figure of 2.3. The new pooled OR became 1.934 (1.609 - 2.324). The difference between the funded and unfunded studies remained but was no longer statistically significant.
- In cross-examination he produced a third version of all this. The changes were the removal of Farmer's German Mediplus paper and its replacement by Farmer 2000 and the incorporation of Lidegaard 2002 in place of his 1998 paper and the result was a pooled OR of 1.85 (1.56-2.18), with a difference between funded at 1.39 (1.14-1.70) and independent at 2.35 (1.85-2.98) which had again achieved statistical significance. In a final table excluding all funded studies bar the TNS 1996 figure he reached 2.21 (1.79-2.74).
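Whether a difference between two pooled subgroup estimates is statistically significant can be checked from the figures quoted. The sketch below is one standard approach (a z-test on the difference of the log odds ratios) and is not necessarily the method McPherson used. It assumes that the quoted intervals are 95% intervals symmetric on the log scale and that the two subgroups share no studies. The function names are illustrative.

```python
import math

Z95 = 1.96  # assumption: the quoted limits are 95% limits


def log_se(lcl, ucl):
    """Standard error of ln(OR) backed out of a confidence interval."""
    return (math.log(ucl) - math.log(lcl)) / (2 * Z95)


def subgroup_difference(or_a, ci_a, or_b, ci_b):
    """z statistic and two-sided p-value for ln(or_b) - ln(or_a)."""
    diff = math.log(or_b) - math.log(or_a)
    se = math.hypot(log_se(*ci_a), log_se(*ci_b))  # independent subgroups assumed
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p


# Third version: funded 1.39 (1.14-1.70), independent 2.35 (1.85-2.98)
z, p = subgroup_difference(1.39, (1.14, 1.70), 2.35, (1.85, 2.98))
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

On these inputs the z statistic exceeds 1.96, consistent with the statement that the difference had again achieved statistical significance at the conventional 5% level.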
- I have not done full justice to the 21 pages of calculations, recalculations and revised calculations which McPherson has produced on this one topic, but the complexity of the process and, more importantly, the potential for almost endless adjustment is well demonstrated. Even so brief a summary as I have been able to give above stands in marked contrast to the metaphor of "an anchor" by which he sought to describe the role of meta-analysis in conflicting observational studies such as we have here. Listening to this evidence I felt that shifting sand summed it up better.
- So far as what might be called "funding bias" was concerned, McPherson acknowledged the vital role industry plays in determining the safety of drug products, and said that pharmaco-vigilance could not proceed without it. But in this case he had come to a different view. He cited evidence he had heard about apparent interference in the never published Wyeth-Ayerst Report, not itself a report which anyone has included in the final array. His view is that the answer to the relevant question has been supplied, that there is "a clear scientific consensus about the question" and therefore one has to be, as he put it, wary about interpreting the four studies which seek to cast doubt on the consensus emerging from the five he classes as independent.
- I must here give my views on McPherson as a witness. His main report to the court dated 7th September 2001 started with a summary of the relevant studies in this dispute. This was no mere scene-setting exercise by him since it led into what was the main purpose of his report namely the meta-analysis that he carried out. The commentary on the studies justified the view he took about including and excluding various of them in this final process. Cross-examination revealed mistakes and incorrect statements embedded in this summary. The Defendants argue that this shows not mere inattention on this witness's part but a tendency not to be even handed as between the two sides of the debate. The main examples are set out below.
(1) In relation to Leiden 1995 he described the study as giving RR estimates for the third against all other OC types. In fact the study considered DSG only and not all third generation. Although in cross-examination he accepted that the "buddy" controls used by this study were "not the kind you normally expect to see" this did not appear in his report nor did any statement to the effect that the numbers involved were low.
(2) He described Farmer's Meditel Study as a study based on the GPRD for 1990 to 1991. This was no slip of the pen since he made the point (wrongly) that it therefore covered some of the same data as Jick 1995. He said the risks were calculated for second generation OCs when the comparison was with first generation. He criticised Farmer's failure in this report to analyse the risks associated with particular progestogens notwithstanding the fact that this report was submitted for publication at the end of 1994 when such a debate had not yet started. He in effect accused Farmer of concealing data on the relevant risk of different progestogens and called his article "no more than polemic in the debate about the role of the particular progestogens". This was an unfair criticism in my judgement; in any event after the pill scare occurred Farmer did publish such material as he had in this category, but up to then it had not in my judgement been relevant information.
(3) In his criticism of Lidegaard 1998 he stated that it was "quite interesting" that notwithstanding what he called manipulations of data within the study "nowhere is a direct comparison with confidence limits made of third v second generation pill". In fact quite plainly there is exactly such a comparison evident on the face of the report.
(4) Dealing with Jick 2000 he described this as "clearly an excellent study." In cross-examination he said he was not aware that Jick had not verified the cases in the same way that he had in Jick 1995 though that is again plainly there to be seen on the face of the study itself.
(5) He wrongly described the Farmer Pill Scare Paper as having used the GPRD for 1993 to 1995 instead of to 1998. This was more than a slip; the whole point of the paper was to compare the positions before and after the scare. He criticised it for not attempting to examine particular effects of any pill type without apparently appreciating that that was not the purpose of the study; that attempt was made in Farmer 2000.
(6) In the Farmer Validation Paper he misread the summary stating that it used 286 cases plus 177 additional events; plainly the 177 were a sub-group of the 286.
(7) In his commentary on Suissa 1997 he states "data were presented for third generation but not for second generation pills ..."; this is plainly wrong as Suissa did give such data in a clear and obvious table.
(8) He included both Jick 1995 and Jick 2000 in his original meta-analysis as he was unaware that the two studies had overlapping cases, although again that is clearly stated in the latter paper.
- These errors are more than mere carelessness. The exercise on which he was engaged in paragraphs 14 to 73, in which these mistakes occurred, was an important part of what followed. In my judgement they showed that Professor McPherson did not approach this exercise in a careful or an even handed manner. I am therefore not confident in him as an expert witness on whom I can rely in this case.
- MacRae in his first report steered clear of meta-analysis altogether. He said it was a valuable exercise in the case of randomised clinical trials which followed the same study protocol and whose data were designed to be combinable. Where the exercise is carried out retrospectively on studies of different designs, particularly observational studies where bias and confounding may have been in play, those factors will also affect the pooling of the results. True to his beliefs he did not attempt the task.
- In his supplementary report, after having seen the Claimants' experts' views as set out above, he made the not unfair point that these three experts had between them cited 13 studies of which only 5 were common to the calculations of all of them. In those 5 in no single case did they agree as to the OR or RR which should be used for the purpose of meta-analysis. So, he said, choices were made even before any calculation started.
- He then responded to their work by an analysis of his own, in which he excluded only those studies to which nobody had ever attached any weight at all. He set out in tabular form a complete list of all studies. Following the death of MacRae, Suissa gave the Defendants' statistical evidence on this issue and essentially adopted MacRae's approach. He made minor changes only to three figures in his table to reflect developments since MacRae wrote his supplementary report, which are changes I am sure MacRae himself would have made had he lived to give this part of his evidence. I include this table as Appendix 3 of this judgment; it serves to give a purely narrative and comprehensive overview of the choices open to anyone wishing to see the range of findings made in this area between 1995 and 2002.
- MacRae then used the same technique as McPherson, namely a variance-based calculation and an additional test of homogeneity, and he set out to calculate the upper and lower limits of the meta-analyses which could be performed on this material. This yielded for all COC3s as against LNG a "best" case of 1.40 (1.15-1.69) and a "worst" case of 1.76 (1.48-2.11) so as to give a spread of possible findings. By this stage he too had made choices, as is inevitable in this area. The worst case excludes Farmer 2000 and includes the WHO all centres OR of 2.7, the TNS at an OR of 1.7 and a study weight twice as high as Herings, its nearest rival; few, certainly not I, would place Herings this high on a subjective assessment of its worth as a study. The best case includes the WHO at the Oxford GP controls figure of 1.4 but with less weight, the TNS at the Cox value of 0.79 but again with its weight nearly halved, all Jick studies excluded, and Wyeth included with a high weight for what is on any view a flawed study, effectively in draft form. Suissa followed this format but with minor variations which reduced both figures slightly.
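A variance-based pooling with a test of homogeneity can be sketched as follows. This is not a reconstruction of MacRae's best or worst case: his inputs are in Appendix 3 and are not reproduced here, so the point estimates and limits from Walker's table earlier in this section are borrowed purely for illustration. The homogeneity statistic shown is Cochran's Q, which is an assumption, since the particular test used is not named. The intervals are again assumed to be 95% intervals symmetric on the log scale.

```python
import math

Z95 = 1.96
CHI2_CRIT_DF6 = 12.592  # 5% critical value of chi-square with 6 degrees of freedom

# (PE, LCL, UCL) borrowed from Walker's table for illustration only
DATA = [
    (2.7, 1.6, 4.6),
    (2.2, 1.3, 3.6),
    (5.2, 2.0, 13.6),
    (1.73, 1.15, 2.38),
    (1.34, 0.74, 2.39),
    (2.5, 1.2, 5.2),
    (2.3, 1.5, 3.7),
]


def inverse_variance_pool(data):
    """Fixed-effect pooled ratio, its 95% CI, and Cochran's Q."""
    logs = [math.log(pe) for pe, _, _ in data]
    # weight = 1 / variance of ln(ratio), with the variance backed out of the CI
    weights = [(2 * Z95 / (math.log(u) - math.log(l))) ** 2 for _, l, u in data]
    total = sum(weights)
    mean = sum(w * y for w, y in zip(weights, logs)) / total
    se = 1.0 / math.sqrt(total)
    q = sum(w * (y - mean) ** 2 for w, y in zip(weights, logs))
    ci = (math.exp(mean - Z95 * se), math.exp(mean + Z95 * se))
    return math.exp(mean), ci, q


pooled, ci, q = inverse_variance_pool(DATA)
print(f"pooled {pooled:.2f} ({ci[0]:.2f}-{ci[1]:.2f}), Q = {q:.2f} on {len(DATA) - 1} df")
print("heterogeneity at 5% level:", q > CHI2_CRIT_DF6)
```

On these illustrative inputs Q falls below the 5% critical value, so the test would not reject homogeneity. That says nothing about bias common to the component studies, which pooling carries through into the result.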
- In a final series of tables MacRae then added his own subjective quality assessment. He identified those studies which he thought used adequate adjustment for duration of use and matching for year of birth and both best and worst case figures dropped considerably. Suissa performed these same calculations with not very significant variations.
- Shapiro took an entirely different view. Not surprisingly, given his low opinion of studies with weak or low level statistical associations, he does not subscribe to the view that pooling or meta-analysing such studies does anything to cure the defects in the form of biases and confounding which cause him to be sceptical about the worth of the individual component studies. He has expressed this view repeatedly since the mid 1990s and conducted a debate, to which I have referred, with epidemiological colleagues on the subject. In two articles he argued that any meta-analysis of the cascade of studies which followed his 1974 paper on Reserpine would almost certainly have found a summary RR significantly raised and would have in the eyes of some been evidence of a causal link which could not have been claimed on the basis of the individual studies viewed separately. This would have been entirely wrong as events later proved. He calls this the search for the Holy Grail of attaining statistically stable estimates for effects of low magnitude. With a characteristically forthright flourish he called for the use of this technique on non-experimental data to be abandoned.
- Professor Greenland responded (56) by arguing that there were two types of meta-analysis. The first he called "synthetic" and described as the mindless agglomeration of study results into a single summary estimate (with or without adjustment for random effects) and he plainly deplored. What he thought was valuable, at least potentially, was the "comparative approach" in which meta-analysis is used as an aid in critical comparison of different studies. He believed that weak studies (and perhaps all studies) are worthwhile only to the extent that they can contribute to the comparative overview that employs comparative meta-analysis. He and Dr. Petitti, well known in this field, both agreed with Shapiro that in this process "quality scoring" should be condemned.
- To complete the evidence in this area of the case, as already noted the relevant regulators have themselves plainly performed some overview calculation to reach the RR figures they now put forward as "true risk" figures, and equally plainly they have in doing so not chosen any single study, as their figures do not represent any individual finding, but rather formed a composite or overall figure of 1.5-2.0 (CPMP) or 1.7 (CSM). Beyond noting this, not much help can be derived from their figures as their reasoning, not surprisingly in a public health context where the public interest is in the conclusions only, is not stated.
SECTION K: CONCLUSIONS ON THE FIRST ISSUE
- On the evidence I have heard and seen and for the reasons given in Section E above the Cox Regression Analysis carried out by MacRae and Lewis yields the most compelling evidence in this case. Based on that evidence I find that there is not as a matter of probability any increased relative risk of VTE carried by any of the third generation oral contraceptives supplied to these Claimants by the Defendants as compared with second generation products containing Levonorgestrel.
- If I had had to decide the case without the evidence of Cox, my views on the other studies into this question are I hope already clear from the sections in which I have dealt with them. I will neither attempt any pseudo-precision of my own nor construct some form of formal "league table" of results, as neither exercise seems to me appropriate. But I have to form an overview of where I am left having read the studies and heard what the experts have had to say about them. This now is a judicial and not a statistical step in the case. For reasons stated above in Section H it would not be right to discount or increase the figure I arrive at for any impact of bias or uncontrolled confounding.
- The most powerful and impressive single piece of evidence in this case, Cox apart, is TNS 1 indicating a RR of about 1.7. Not far behind it comes the WHO study, albeit as a tentative result, at or just under 2. The Jick/Farmer GPRD studies are also good evidence for a RR in the 1.5-1.8 range. No other evidence approaches these three in terms of weight or cogency.
- Of the minor studies Leiden 1995 and Wyeth-Ayerst are both flawed in significant respects but stand in at 2.2 and 1.7 respectively and carry roughly similar weight. The Mediplus studies all indicate a lower RR, somewhere below 1.5, but are much less helpful, as are Lidegaard, Parkin and Herings, all contributing little or nothing to the final exercise. Meditel I find to be of no value at all.
- Therefore, if I am wrong about the validity of the Cox regression analysis of the TNS dataset which indicates no elevated risk, I am not satisfied that the effect of the other investigations into the third generation pill is to show, on a balance of probability, that the risk of VTE which it carries is more than twice that of the second generation products that it sought to replace. The most likely figure to represent the relative risk is around 1.7.
- As the above findings both dispose of the first issue in a way which means that the claim must fail, it is not strictly necessary for me to make a finding as to whether the RR of 1.7 itself translates into a relationship of true cause and effect or is a merely statistical appearance. If I had to do so I would incline to a finding that there is an underlying causal connection at about that level of increased risk. Though it is very weak it is based on the 1995 studies which are broadly consistent and impressive pieces of epidemiology.
- For these reasons therefore these actions fail. I am fully aware that this result will come as a serious disappointment to all the Claimants involved in this case. It may or may not be any comfort to them to know that this trial was almost certainly the most exhaustive examination that this question has yet received and that their case could not have been more effectively put forward than it was by the highly skilled and dedicated legal team who acted for them.
APPENDIX 1 INDEX TO THE JUDGMENT

| Section | Paragraph | Heading |
|---|---|---|
| A | | INTRODUCTION |
| | 1 | General. |
| | 6 | The Combined Oral Contraceptive. |
| | 11 | The Regulatory History. |
| | 20 | The Issues in the Litigation. |
| B | | THE APPROACH TO DECIDING THE FIRST ISSUE |
| | 26 | Cohort Studies. |
| | 27 | Case Control Studies. |
| | 29 | Database Studies. |
| | 33 | Expert Evidence |
| | 36 | Point Estimates and Confidence Intervals. |
| | 45 | Aggregation of COC Products. |
| C | | THE WHO STUDY |
| | 59 | An a Priori Hypothesis? |
| | 64 | All Centres or Oxford? |
| | 68 | Hospital or GP Controls? |
| | 78 | Conclusion. |
| D | | THE TNS: THE FIRST TWO STUDIES |
| | 81 | The Origins of the Study. |
| | 84 | The Progress of the Study. |
| | 91 | TNS1 |
| | 95 | TNS2 |
| | 98 | The Mercilon Anomaly. |
| | 106 | Duration of Use. |
| | 115 | Suissa's Splines. |
| E | 121 | TNS3 AND THE COX REGRESSION ANALYSIS |
| | 127 | The Pill Calendar Data. |
| | 133 | The Attack on Cox by Walker. |
| | 142 | MacRae's Response. |
| | 148 | Walker's Rebuttal. |
| | 152 | MacRae's Reply. |
| | 158 | Walker's Separate Algebraic Attack. |
| | 159 | Conclusion on Cox. |
| F | 164 | THE JICK v FARMER DEBATE |
| | 165 | The UK GPRD. |
| | 171 | The Methods and Findings of the Studies. |
| | 185 | The Development of the Issues. |
| | 194 | Jick V: The Attack on Farmer's Controls. |
| | 209 | Conclusions. |
| G | | THE OTHER STUDIES |
| | 225 | Leiden 1995. |
| | 231 | Herings. |
| | 234 | Parkin. |
| | 238 | Lidegaard. |
| | 240 | UK Meditel. |
| | 243 | German Mediplus. |
| | 244 | UK Mediplus. |
| | 248 | Wyeth-Ayerst. |
| | 255 | Farmer 2000 "Pill Scare". |
| H | 258 | BIAS AND CONFOUNDING |
| | 262 | Prescriber Bias. |
| | 279 | Diagnostic/Referral Bias. |
| | 286 | Conclusions on Prescriber, Diagnostic and Referral Bias. |
| | 287 | Hidden Bias and Confounding. |
| | 288 | Conclusions on Bias and Confounding. |
| | 298 | Industry Funding Bias. |
| I | | CAUSALITY |
| | 302 | The Bradford Hill Criteria. |
| | 303 | Haematology. |
| J | 309 | OVERVIEW OF THE STUDIES |
| K | 339 | CONCLUSIONS ON THE FIRST ISSUE |

APPENDIX 2
THE CURRICULA VITARUM OF THE PRINCIPAL EXPERT WITNESSES
The Claimants' Witnesses
Alexander M. Walker. Qualified in medicine and holding a doctorate in epidemiology. Since 1991 Professor in the Department of Epidemiology at Harvard School of Public Health. Formerly Statistical Consultant New England Journal of Medicine. Contributing Editor "The Lancet" and Co-editor "Journal of Epidemiology and Bio-statistics". Author or Joint Author of over 200 articles, 5 dealing with risks associated with oral contraceptives. Currently on leave of absence from his academic chair. Employed by Ingenix Pharmaceutical Services, a commercial concern providing services in the field of epidemiology and public health.
Margaret Thorogood. A sociology graduate. Worked under Sir Richard Doll at Oxford University and gained a PhD based on a thesis studying the risk of fatal CV disease and OC use. Has published some 30 papers studying OC use and cardiovascular disease. Became Co-principal Investigator on the Transnational Study and was a member of the Publications Advisory Committee of the WHO. Collaborated on the MICA Study by the Drugs Safety Research Unit at Southampton into AMI and OC use.
Hershel Jick. A graduate of Harvard Medical School. Practised for a while in internal medicine and then moved to research in clinical pharmacology. Since the mid 1960s has been concerned in pharmacoepidemiology and with the BCDSP of which he has been director since 1971. His name appears on over 300 papers in learned journals all of which are concerned with health implications of diverse drugs. Under his direction the BCDSP holds a licence to use the UK GPRD based on which it has published very many drug studies.
Klim McPherson. Graduated in mechanical sciences and took a doctorate in medical statistics at the London School of Hygiene and Tropical Medicine. A University Lecturer in Medical Statistics at the University of Oxford, and from 1991 to 1996 Professor of Public Health Epidemiology at the LSHTM. Now Senior Scientist with the Medical Research Council at the Bristol Department of Social Medicine. A member of the CSM. A temporary member of the Sub-Committee of the CSM that considered the evidence in relation to the "Dear Doctor" letter of 1995.
Nicolas Dunn: Senior Lecturer in primary medical care at the University of Southampton. Fourteen years a General Practitioner. Senior Research Fellow in the Drug Safety Research Unit, Southampton. MSc and Diploma in Epidemiology. An author of a study in 1997 into myocardial infarction and OC use.
The Defendants' Witnesses
Samuel Shapiro: Qualified as a doctor in South Africa and practised in clinical medicine from 1957 to 1972 since when he has worked as an Epidemiologist. He worked in the BCDSP and from 1974 to 1999 at the Drug Epidemiology Unit at Boston University, retiring as its Director. Was Professor of Epidemiology at Boston University until June 2001. Now teaching at Columbia University in New York. 24 of the 306 papers to which his name is attached deal with the health risks of oral contraceptives.
Kenneth Duncan MacRae. Since 1969 a Bio-Statistician. 1976-1998 Senior Lecturer and Reader in Medical Statistics at Charing Cross and Westminster Medical School University of London. 1999 to 2002 Professor of Medical Statistics, Post-graduate Medical School, University of Surrey. Author of many published studies on epidemiology and bio-statistics, 13 of which relate to OC use and Cardiovascular illness.
Michael Lewis. Qualified in medicine in Germany. Holds a Diploma in Epidemiology from McGill University Canada where he was Assistant Professor of the Department of Epidemiology and Bio-Statistics 1993 to 1996 and Associate Director of Potsdam Institute of Pharmaco-Epidemiology and Technology Assessment. Currently Director and part owner of EPES a German commercial entity providing services in epidemiology principally to pharmaceutical companies. Senior Investigator in the Transnational Study.
Lothar Heinemann. Qualified in medicine in Germany. Professor for Preventive Medicine at the Academy of Sciences Berlin 1982. Adjunct Professor Epidemiology and Bio-Statistics at McGill University Canada 1993. Since 1990 Director of ZEG a commercial concern providing services in epidemiology and health research based in Berlin. A member of the WHO and TNS teams. Since 1993 has published 40 papers on the relationship between OC use and cardiovascular disease.
Richard Donald Trafford Farmer. Medically qualified, a Fellow of the Faculty of Public Health Medicine at the Royal College of Physicians. Previously Professor of Public Health Medicine at Charing Cross and Westminster Medical School. Currently Professor of Epidemiology at the Post-graduate Medical School at the University of Surrey. Since 1989 his published work mainly related to the epidemiology of suicide risks on public transportation systems. His publications relating to the cardiovascular effects of OC use started in about 1995 since when they have become the main focus of his interest.
APPENDIX 3 TABLE 1B PROFESSOR MACRAE'S OVERVIEW
APPENDIX 4 REFERENCES TO THE RELEVANT SCIENTIFIC LITERATURE
(1) WHO collaborative study of cardiovascular disease and steroid hormone contraception. A multinational case-control study of cardiovascular disease and steroid hormone contraceptives. Description and validation of methods. Journal of Clinical Epidemiology 1995; 48: 1513-1547.
(2) Venous Thromboembolic disease and combined oral contraceptives: results of international multi-centre case-control study. Lancet 1995; 346: 1575-1582.
(3) Effect of different progestagens in low oestrogen oral contraceptives on venous thromboembolic disease. Lancet 1995; 346: 1582-1588.
(4) World Health Organisation. Cardiovascular disease and steroid hormone contraception. Report of WHO Scientific Group. WHO Technical Report Series 877; Geneva: 1998.
(5) Spitzer WO, Thorogood M, Heinemann L. Trinational case-control study of oral contraceptives and health. Pharmacoepidemiology and Drug Safety 1993; 2: 21-31.
(6) Lewis MA, Assmann A, Heinemann L, Spitzer WO. Interim review of the transnational case-control study of oral contraceptives and health: Approved protocol revisions through September 1995. Pharmacoepidemiology and Drug Safety 1996; 5: 43-51.
(7) Spitzer WO, Lewis MA, Heinemann LAJ, Thorogood M, MacRae KD. On behalf of the Transnational Research Group on Oral Contraceptives and the Health of Young Women. Third generation oral contraceptives and risk of venous thromboembolic disorders: An international case-control study. BMJ 1996; 312: 83-88.
(8) Lewis MA, Heinemann LAJ, MacRae KD, Bruppacher R, Spitzer WO, with the Transnational Research Group on Oral Contraceptives and the Health of Young Women. The increased risk of venous thromboembolism and the use of third generation progestogens: Role of bias in observational research. Contraception 1996; 54: 5-13.
(9) Alexander M Walker. Newer oral contraceptives and the risk of venous thromboembolism. Contraception 1998; 57: 169-181.
(10) Rosendaal FR. Venous Thrombosis: A multi-causal disease. Lancet; 353: 1167-1173.
(11) Suissa S, Spitzer WO, Rainville B, Cusson J, Lewis M and Heinemann L. Recurrent use of newer oral contraceptives and the risk of venous thromboembolism. Human Reproduction 2000; 15: 817-821.
(12) Suissa S, Blais L, Spitzer WO, Cusson J, Lewis M, Heinemann L. First time use of newer oral contraceptives and the risk of venous thromboembolism. Contraception 1997; 56: 141-146.
(13) Farley TM, Meirik O, Marmot MG, Chang CO, Poulter NR. Oral contraceptives and the risk of venous thromboembolism: Impact of duration of use. Contraception 1998; 57: 61-65.
(14) Lewis M, MacRae K, Kuhl-Habich D, Bruppacher R, Heinemann L, Spitzer WO. The differential risk of oral contraceptives: The Impact of full exposure history. Human Reproduction 1999; 14: 1493-1499.
(15) Walker AM. Efficient assessment of confounder effects in matched follow up studies. Applied Statistics 1982; 31: 293-297.
(16) Prentice RL and Breslow NE. Retrospective studies and failure time models. Biometrika 1978; 65: 153-158.
(17) Jick H, Jick S, Derby L. Validation of information recorded on General Practitioner based computerised data resource in the UK. BMJ 1991; 302: 766-768.
(18) Jick H, Terris B, Derby L, Jick S. Further validation of information recorded on a General Practitioner based computerised data resource in the UK. Pharmacoepidemiology and Drug Safety 1992; 1: 347-349.
(19) Jick H, Jick S, Gurewich V, Myers M, Vasilakis C. Risk of idiopathic cardiovascular death and nonfatal venous thromboembolism in women using oral contraceptives with differing progestogen components. Lancet 1995; 346: 1589-1593.
(20) Farmer R, Lawrenson R, Todd J, Williams T, MacRae K, Tyrer F, Leydon G. A comparison of the risk of venous thromboembolic disease in association with different combined oral contraceptives. British Journal of Clinical Pharmacology 2000; 49: 580-590.
(21) Jick H, Kaye J, Vasilakis C, Jick S. Risk of venous thromboembolism among users of third generation oral contraceptives compared with users of oral contraceptives with Levonorgestrel before and after 1995: Cohort and case-control analysis. BMJ 2000; 321: 1190-1195.
(22) Lawrenson R, Todd J, Leydon G, Williams T, Farmer R. Validation of the diagnosis of venous thromboembolism in general practice database study. British Journal of Clinical Pharmacology 2000; 49: 591-596.
(23) Jick H, Jick S, Myers M, Vasilakis C. Third generation oral contraceptives and venous thrombosis. Lancet; 349. March 8th 1997.
(24) Bloemenkamp K, Rosendaal F, Helmerhorst F, Buller H, Vandenbroucke J. Enhancement by Factor V Leiden mutation of risk of deep vein thrombosis associated with oral contraceptives containing a third generation progestogen. Lancet 1995; 346: 1593-1596.
(25) Koster T, Rosendaal F, De Ronde H, Briet E, Vandenbroucke J, Bertina R. Venous thrombosis due to poor anti-coagulant response to activated protein C: Leiden Thrombophilia Study. Lancet 1993; 342: 1503-1506.
(26) Vandenbroucke J, Koster T, Briet E, Reitsma P, Bertina R, Rosendaal F. Increased risk of venous thrombosis in oral contraceptive users who are carriers of Factor V Leiden mutation. Lancet 1994; 344: 1453-1457.
(27) Herings R, Urquhart J, Leufkens H. Venous thromboembolism among new users of different oral contraceptives. Lancet; 354: 127-128.
(28) Herings R, Urquhart J, Leufkens H. Venous thromboembolism and oral contraceptives. Lancet; 354: 1469-1470.
(29) Parkin L, Skegg D, Wilson M, Herbison G, Paul C. Oral contraceptives and fatal pulmonary embolism. Lancet; 355: 2133-4.
(30) Lidegaard O, Edstrom B, Kreiner S. Oral contraceptives and venous thromboembolism: A case control study. Contraception 1998; 57: 291-301.
(31) Lidegaard O, Edstrom B, Kreiner S. Oral contraceptives and venous thromboembolism: A five year national case control study. Contraception 2002; 65: 187-196.
(32) Farmer R and Preston T. The risk of venous thromboembolism associated with low oestrogen oral contraceptives. Journal of Obstetrics & Gynaecology 1995; 15: 195-200.
(33) Farmer R, Lawrenson R, Thompson C, Kennedy J, Hambleton IR. Population based study of risk of venous thromboembolism associated with various oral contraceptives. Lancet; 349: 83-88.
(34) Todd J, Lawrenson R, Farmer R, Williams T, Leydon G. Venous thromboembolic disease and combined oral contraceptives: A re-analysis of the MediPlus database. Human Reproduction; 14: 1500-1505.
(35) Neubauer R, Nortington R, Grubb G, Olsen A. GPRD epidemiology study on the use of oral contraceptives and the risk of cardiovascular events. Unpublished report, Wyeth-Ayerst Research. Philadelphia. 85 pages.
(36) Farmer R, Williams T, Simpson E, Nightingale A. Effect of 1995 pill scare on rates of venous thromboembolism among women taking combined oral contraceptives: Analysis of GPRD. BMJ; 321: 477-479.
(37) Sackett D. Bias in analytic research. J. Chron. Dis.; 32: 51-63.
(38) Dunn N, White I, Freemantle S, Mann R. The role of prescribing and referral bias in studies of the association between third generation oral contraceptives and increased risk of thromboembolism. Pharmacoepidemiology and Drug Safety; 7: 3-14.
(39) Heinemann L, Lewis M, Assmann A, Gravens L, Guggenmoos-Holzmann I. Could preferential prescribing and referral behaviour of Physicians explain the elevated thrombosis risk found to be associated with third generation oral contraceptives? Pharmacoepidemiology and Drug Safety; 5: 285-294.
(40) van Lunsen, H. Recent oral contraceptive use patterns in four European countries: Evidence for selective prescribing of oral contraceptives containing third generation progestogens. Eur. J. of Contraception and Rep. Health Care 1996; 1: 39-45.
(41) Jamin C, de Mouzon M. Selective prescribing of third generation oral contraceptives. Contraception 1996; 54: 55-56.
(42) Lidegaard O. The influence of thrombotic risk factors when oral contraceptives are prescribed. Acta Obstet Gynecol Scand 1997; 76: 252-260.
(43) Farley T, Meirik O, Poulter N, Chang C, Marmot M. Oral contraceptives and thrombotic diseases: impact of new epidemiological studies. Contraception 1996; 54: 193-198.
(44) Bloemenkamp K, Rosendaal F, Buller H, Helmerhorst F, Colly LP, Vandenbroucke J. Risk of venous thrombosis with the use of current low dose oral contraceptives is not explained by diagnostic suspicion and referral bias. Arch. Intern. Med. 1999; 159: 65-70.
(45) Heinemann L, Garbe E, Farmer R, Lewis M. Venous thromboembolism and oral contraceptive use: A methodological study of diagnostic suspicion and referral bias. Eur. J. of Contraception and Reproductive Health Care 2000; 5: 183-191.
(46) Taubes G. Epidemiology faces its limits. Science 1995; 269: 164-169.
(47) Boston Collaborative Drug Surveillance Programme. Reserpine and cancer. Lancet 1974; 2: 669-671.
(48) Davey Smith G, Phillips A, Neaton J. Smoking as "independent" risk factor for suicide: Illustration of an artefact from observational epidemiology? Lancet 1992; 340: 708-712.
(49) Heinemann L, Lewis M, Assmann A, Thiel C. Case control studies on venous thromboembolism: Bias due to design? A methodological study of venous thromboembolism and steroid hormone use. Contraception 2002; 65: 207-214.
(50) Shapiro S. Bias in the evaluation of low-magnitude associations: an empirical perspective. Am. J. of Epidemiology 2000; 151: 939-945.
(51) Hertz-Picciotto I. Invited commentary: Shifting the burden of proof regarding biases and low-magnitude associations. Am. J. of Epidemiology 2000; 151: 946-948.
(52) Bradford Hill, Sir A. The environment and disease, association or causation? Proceedings of the Royal Society of Medicine 14th January 1965.
(53) Rosendaal F. Venous thrombosis: A multicausal disease. Lancet; 353: 1167-1173.
(54) Kemmeren J, Algra A, Grobbee D. Third generation oral contraceptives and risk of venous thrombosis: Meta-analysis. BMJ 2001; 323: 1-9.
(55) Hennessy S, Berlin J, Kinman J, Margolis D, Marcus S, Strom B. Risk of venous thromboembolism from oral contraceptives containing gestodene and desogestrel versus levonorgestrel: A meta-analysis and formal sensitivity analysis. Contraception 2001; 64: 125-133.
(56) Greenland S. Can meta-analysis be salvaged? Am. J. Epidemiology 1994; 140: 783-787.