Exercises 01

Information Retrieval - Graduate Level

 

A Small Reference Collection

 

For this set of exercises (and for other exercises in this course), we use a small sub-collection derived directly from the Cystic Fibrosis Collection. The vocabulary and the documents of this sub-collection are listed below. Each index term has an associated IDF factor. The frequency of a term in a document is always equal to 1, as in the original Cystic Fibrosis Collection. The sub-collection also includes a set of 5 example queries. For each query, a set of relevant documents (identified by a group of specialists) is provided.

 

VOCABULARY

Id Idf Keyword Id Idf Keyword Id Idf Keyword

1 5.2 ADAPTATION-PHYSI 69 5.2 EDEMA 137 5.2 OPTIC-NEURITIS

2 5.2 ADMINISTRATION-O 70 5.2 EMULSIONS 138 5.2 PALMITATES

3 1.1 ADOLESCENCE 71 5.2 ENDOCRINE-DISEAS 139 2.9 PANCREAS

4 1.9 ADULT 72 5.2 ENERGY-METABOLIS 140 3.2 PANCREATIC-DISEA

5 5.2 AEROSOLS 73 5.2 ENZYMES 141 2.7 PANCREATIC-EXTRA

6 5.2 AGE-FACTORS 74 5.2 ERYTHROCYTES 142 5.2 PANCREATIC-JUICE

7 5.2 ALCOHOL-ETHYL 75 5.2 EXOCRINE-GLANDS 143 5.2 PANCREATIC-NEOPL

8 5.2 ALCOHOLS-BUTYL 76 5.2 FATS 144 3.7 PANCREATIN

9 5.2 AMINO-ACIDS 77 5.2 FATTY-ACIDS 145 5.2 PANCREATITIS

10 3.7 AMYLASES 78 4.2 FATTY-ACIDS-ESSE 146 5.2 PARENTERAL-FEEDI

11 3.7 ANEMIA-HEMOLYTIC 79 5.2 FATTY-ACIDS-NONE 147 5.2 PATIENTS

12 5.2 ANESTHETICS 80 5.2 FATTY-ACIDS-UNSA 148 5.2 PEDIGREE

13 5.2 ANIMAL 81 3.2 FECES 149 4.2 PEPTIDE-HYDROLAS

14 5.2 ANOREXIA 82 1.3 FEMALE 150 4.2 PHOSPHOLIPIDS

15 5.2 ANTIOXIDANTS 83 5.2 GALLBLADDER 151 4.2 PLACEBOS

16 5.2 AUTOPSY 84 4.2 GASTROINTESTINAL 152 5.2 PNEUMONIA

17 5.2 BICARBONATES 85 5.2 GASTROINTESTINAL 153 5.2 PORPHYRIA

18 5.2 BILE-ACIDS-AND-S 86 5.2 GENES 154 5.2 PREGNANCY

19 5.2 BILIARY-TRACT 87 5.2 GENITAL-DISEASES 155 4.2 PROGNOSIS

20 4.2 BIOLOGICAL-TRANS 88 5.2 GENITAL-DISEASES 156 5.2 PROTEIN-CALORIE-

21 5.2 BLOOD-COAGULATIO 89 5.2 GIARDIASIS 157 5.2 PROTEINS

22 5.2 BODY-COMPOSITION 90 3.7 GROWTH 158 5.2 PSEUDOMONAS-INFE

23 4.2 BODY-HEIGHT 91 4.2 GUANIDINES 159 5.2 PSYCHOLOGY

24 4.2 BODY-WEIGHT 92 4.2 HAIR 160 4.2 RETINOL-BINDING-

25 5.2 BROMELAINS 93 5.2 HEART-DISEASES 161 4.2 RETROLENTAL-FIBR

26 5.2 BRONCHI 94 5.2 HEMOGLOBINS 162 2.4 REVIEW

27 5.2 BRONCHITIS 95 5.2 HEMOLYSIS 163 5.2 SELENIUM

28 5.2 BRONCHODILATOR-A 96 0.0 HUMAN 164 5.2 SERUM-ALBUMIN

29 5.2 CALORIC-INTAKE 97 1.2 INFANT 165 5.2 SJOGRENS-SYNDROM

30 5.2 CARBOXYPEPTIDASE 98 2.9 INFANT-NEWBORN 166 5.2 SMELL

31 5.2 CAROTENE 99 5.2 INFANT-NUTRITION 167 5.2 SOCIAL-ADJUSTMEN

32 5.2 CARRIER-PROTEINS 100 4.2 INFANT-PREMATURE 168 5.2 SODIUM

33 4.2 CASE-REPORT 101 5.2 INFANT-PREMATURE 169 4.2 SODIUM-CHLORIDE

34 2.7 CELIAC-DISEASE 102 2.7 INTESTINAL-ABSOR 170 5.2 SPECTROMETRY-FLU

35 0.5 CHILD 103 5.2 INTESTINAL-DISEA 171 5.2 SPECTROPHOTOMETR

36 5.2 CHILD-NUTRITION 104 5.2 INTESTINAL-OBSTR 172 5.2 SPECTROPHOTOMETR

37 1.1 CHILD-PRESCHOOL 105 5.2 INTESTINAL-SECRE 173 5.2 SPINAL-CORD

38 5.2 CHLORAMPHENICOL 106 5.2 INTUSSUSCEPTION 174 5.2 SPUTUM

39 3.7 CHLORIDES 107 5.2 IRON 175 5.2 STAPHYLOCOCCUS

40 4.2 CHOLESTEROL-ESTE 108 5.2 IRRIGATION 176 5.2 STARCH

41 5.2 CHRONIC-DISEASE 109 4.2 KIDNEY-DISEASES 177 5.2 SUPPORT-U-S-GOVT

42 5.2 CHYMOTRYPSIN 110 5.2 LACTOSE-INTOLERA 178 2.4 SUPPORT-U-S-GOVT

43 4.2 CIMETIDINE 111 5.2 LINOLEIC-ACIDS 179 2.9 SWEAT

44 3.7 CLINICAL-TRIALS 112 3.7 LIPASE 180 5.2 SWEATING

45 5.2 COLITIS-ULCERATI 113 5.2 LIPID-METABOLISM 181 5.2 SYNDROME

46 3.2 COMPARATIVE-STUD 114 3.2 LIPIDS 182 3.7 TASTE

47 5.2 CORONARY-DISEASE 115 5.2 LIPOPROTEINS 183 5.2 TASTE-BUDS

48 5.2 CREATINE 116 5.2 LIVER 184 4.2 TASTE-DISORDERS

49 5.2 CREATININE 117 5.2 LIVER-CIRRHOSIS 185 5.2 TIME-FACTORS

50 5.2 CROHN-DISEASE 118 5.2 LIVER-CIRRHOSIS- 186 5.2 TRACE-ELEMENTS

51 0.0 CYSTIC-FIBROSIS 119 5.2 LUNG 187 2.4 TRIGLYCERIDES

52 5.2 DEFICIENCY-DISEA 120 4.2 LUNG-DISEASES 188 3.7 TRYPSIN

53 5.2 DIABETES-MELLITU 121 4.2 LYMPH 189 5.2 UREA

54 4.2 DIAGNOSIS-DIFFER 122 5.2 LYMPHANGIECTASIS 190 4.2 URIC-ACID

55 3.7 DIET 123 2.4 MALABSORPTION-SY 191 2.9 VITAMIN-A

56 5.2 DIET-THERAPY 124 1.2 MALE 192 4.2 VITAMIN-A-DEFICI

57 5.2 DIETARY-CARBOHYD 125 5.2 MIDDLE-AGE 193 5.2 VITAMIN-B-12-DEF

58 2.7 DIETARY-FATS 126 5.2 MONOGRAPH 194 5.2 VITAMIN-D-DEFICI

59 3.2 DIETARY-PROTEINS 127 5.2 MUCUS 195 2.1 VITAMIN-E

60 5.2 DIFFERENTIAL-THR 128 5.2 MUSCLES 196 2.2 VITAMIN-E-DEFICI

61 5.2 DIGESTION 129 5.2 MUSCULAR-DISEASE 197 5.2 VITAMIN-K-DEFICI

62 5.2 DISACCHARIDASES 130 5.2 NEOPLASMS 198 4.2 VITAMINS

63 5.2 DOSE-RESPONSE-RE 131 5.2 NERVE-DEGENERATI 199 5.2 WATER

64 5.2 DRAINAGE 132 5.2 NERVOUS-SYSTEM-D 200 5.2 WOUNDS-AND-INJUR

65 5.2 DRUG-COMBINATION 133 3.7 NITROGEN 201 5.2 XYLOSE

66 5.2 DRUG-THERAPY 134 4.2 NUTRITION 202 3.7 ZINC

67 4.2 DRUG-THERAPY-COM 135 3.7 NUTRITION-DISORD

68 5.2 DYSAUTONOMIA-FAM 136 5.2 NUTRITIONAL-REQU
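The IDF factors in the table appear consistent with idf = log2(N / n), where N = 38 is the number of documents and n is the number of documents that contain the term, rounded to one decimal. The exercise sheet does not state this formula, so the sketch below is an assumption that has only been spot-checked against a few entries. The names `idf` and `N_DOCS` are illustrative.

```python
import math

N_DOCS = 38  # number of documents in the sub-collection


def idf(doc_freq, n_docs=N_DOCS):
    """Assumed formula: log2(N / n), rounded to one decimal as in the table."""
    return round(math.log2(n_docs / doc_freq), 1)


# Spot checks; document frequencies were counted from the document vectors
# of this exercise sheet.
print(idf(38))  # HUMAN (term 96) occurs in all 38 documents
print(idf(5))   # SWEAT (term 179) occurs in 5 documents
print(idf(1))   # any term that occurs in exactly one document
```

Terms that occur only in queries, such as PATIENTS (term 147), appear in none of the 38 document vectors and are listed with 5.2; the assumed formula is undefined for them.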

 

 

DOCUMENT VECTORS

Id Keywords

Doc 1: (44, 51, 96, 128, 129, 151, 195)

Doc 2: (3, 10, 25, 35, 37, 44, 46, 51, 58, 59, 65, 81, 82, 96, 102, 112, 114, 123, 124, 141, 149, 188, 189)

Doc 3: (3, 4, 23, 24, 30, 31, 34, 35, 42, 51, 58, 81, 90, 96, 105, 112, 114, 123, 133, 134, 144, 155, 188, 201)

Doc 4: (35, 51, 82, 96, 97, 113, 123, 135, 195, 196)

Doc 5: (3, 4, 20, 32, 35, 51, 82, 96, 97, 98, 116, 121, 124, 125, 154, 162, 170, 171, 172, 191)

Doc 6: (3, 6, 34, 35, 37, 46, 51, 89, 96, 97, 195)

Doc 7: (10, 17, 35, 37, 41, 51, 82, 96, 120, 124, 152, 174, 175, 179, 188)

Doc 8: (3, 35, 37, 46, 51, 82, 96, 124, 139, 169, 178, 183, 184)

Doc 9: (4, 11, 51, 93, 96, 97, 118, 153, 195, 196)

Doc 10: (1, 3, 4, 8, 35, 51, 60, 82, 96, 124, 166, 169, 178, 182)

Doc 11: (35, 37, 51, 82, 96, 97, 102, 114, 124, 133, 141)

Doc 12: (3, 4, 5, 10, 28, 29, 35, 37, 51, 58, 59, 64, 96, 97, 108, 112, 146, 149, 162, 177, 187, 198, 199)

Doc 13: (20, 51, 75, 82, 84, 87, 88, 96, 120, 124, 127, 139, 155, 159, 162, 167, 178, 179)

Doc 14: (11, 33, 51, 55, 69, 96, 97, 98, 124, 140, 196)

Doc 15: (3, 4, 7, 35, 37, 44, 46, 51, 70, 82, 96, 97, 102, 124, 138, 191, 192)

Doc 16: (34, 35, 37, 45, 50, 51, 62, 96, 97, 98, 100, 103, 122)

Doc 17: (9, 22, 51, 53, 58, 83, 84, 85, 90, 96, 102, 104, 106, 110, 117, 126, 133, 135, 140, 162, 187, 198)

Doc 18: (12, 14, 51, 52, 66, 68, 71, 96, 109, 130, 132, 134, 165, 182, 184, 200)

Doc 19: (13, 15, 21, 34, 47, 51, 55, 58, 74, 80, 94, 96, 98, 100, 115, 121, 123, 156, 157, 161, 162, 195, 196)

Doc 20: (3, 35, 37, 40, 51, 79, 96, 97, 102, 114, 141, 150, 187)

Doc 21: (3, 4, 35, 37, 51, 82, 95, 96, 97, 102, 123, 124, 140, 187, 195, 196)

Doc 22: (24, 35, 51, 56, 57, 58, 59, 63, 96, 109, 136, 141, 187, 190)

Doc 23: (3, 16, 37, 38, 51, 96, 131, 137, 173, 178)

Doc 24: (3, 4, 35, 37, 51, 61, 96, 97, 98, 139, 140, 143, 145, 162, 176, 181)

Doc 25: (3, 35, 51, 92, 96, 160, 164, 191, 202)

Doc 26: (35, 37, 43, 51, 76, 81, 82, 91, 96, 124)

Doc 27: (35, 39, 51, 96, 139, 168, 179)

Doc 28: (18, 34, 35, 37, 51, 81, 96, 144, 187)

Doc 29: (3, 35, 37, 48, 49, 51, 67, 82, 96, 124, 191, 195)

Doc 30: (3, 27, 35, 37, 39, 51, 54, 82, 86, 96, 124, 148, 158, 178, 179)

Doc 31: (51, 54, 96)

Doc 32: (3, 4, 35, 51, 55, 90, 92, 96, 151, 160, 182, 185, 191, 202)

Doc 33: (3, 4, 34, 35, 43, 51, 67, 82, 91, 96, 123, 124, 141, 144, 178)

Doc 34: (51, 96, 141, 190)

Doc 35: (33, 35, 39, 51, 96, 97, 124, 139, 179)

Doc 36: (51, 59, 72, 78, 96, 97, 107, 111, 135, 162, 163, 178, 186, 192, 193, 194, 196, 197, 202)

Doc 37: (2, 11, 19, 26, 35, 36, 37, 51, 96, 97, 99, 101, 119, 123, 161, 195, 196)

Doc 38: (3, 23, 35, 37, 40, 51, 77, 78, 82, 96, 97, 124, 150, 187, 195, 196)
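Because every term frequency is 1, the tf-idf weight of a term in a document reduces to its IDF factor. A minimal sketch of a cosine ranking under that observation follows, for Query 4, keywords (147, 179, 180), over three of the documents above. Weighting the query terms by their IDF factors is an assumption, as are the helper names `norm` and `cosine`; this illustrates one possible setup, not the expected solution.

```python
import math

# IDF factors copied from the vocabulary table (only the terms needed here).
IDF = {33: 4.2, 35: 0.5, 39: 3.7, 51: 0.0, 54: 4.2, 96: 0.0, 97: 1.2,
       124: 1.2, 139: 2.9, 147: 5.2, 168: 5.2, 179: 2.9, 180: 5.2}

# Three document vectors copied from the list above (not the full collection).
DOCS = {
    27: (35, 39, 51, 96, 139, 168, 179),
    31: (51, 54, 96),
    35: (33, 35, 39, 51, 96, 97, 124, 139, 179),
}

QUERY_4 = (147, 179, 180)


def norm(terms):
    """Euclidean norm of a vector whose weights are the IDF factors (tf = 1)."""
    return math.sqrt(sum(IDF[t] ** 2 for t in terms))


def cosine(query, doc):
    """Cosine similarity; only terms shared by query and document contribute."""
    shared = set(query) & set(doc)
    dot = sum(IDF[t] ** 2 for t in shared)
    denom = norm(query) * norm(doc)
    return dot / denom if denom else 0.0


# Rank the three documents by decreasing similarity to Query 4.
ranking = sorted(DOCS, key=lambda d: cosine(QUERY_4, DOCS[d]), reverse=True)
for d in ranking:
    print(d, round(cosine(QUERY_4, DOCS[d]), 3))
```

Doc 31 shares no keyword with Query 4 and therefore scores zero, even though the specialists judged it relevant.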

 

 

QUERIES

 

Sets of Keywords

Query 1: (72, 117, 191)

Query 2: (147, 195, 196)

Query 3: (55, 56, 73, 139, 141, 142, 147)

Query 4: (147, 179, 180)

Query 5: (147, 182, 184)

 

Natural Language

Query 1: What is the association between liver disease (cirrhosis) and vitamin A metabolism in CF?

Keywords: (ENERGY-METABOLISM, LIVER-CIRRHOSIS, VITAMIN-A)

 

Query 2: What is the role of Vitamin E in the therapy of patients with CF?

Keywords: (PATIENTS, VITAMIN-E, VITAMIN-E-DEFICIENCY)

 

Query 3: What is the most effective regimen for the use of pancreatic enzyme supplements in the treatment of CF patients?

Keywords: (DIET, DIET-THERAPY, ENZYMES, PANCREAS, PANCREATIC-EXTRACTS, PANCREATIC-JUICE, PATIENTS)

 

Query 4: Has any CF patient been found to have consistently normal sweat tests?

Keywords: (PATIENTS, SWEAT, SWEATING)

 

Query 5: Are there abnormalities of taste in CF patients?

Keywords: (PATIENTS, TASTE, TASTE-DISORDERS)

 

Relevant Documents According to Specialists

Query 1: (5, 15, 25, 36)

Query 2: (1, 4, 6, 9, 14, 17, 19, 21, 23, 29, 36, 37, 38)

Query 3: (2, 3, 11, 12, 13, 16, 20, 22, 24, 26, 28, 33, 34)

Query 4: (7, 27, 30, 31, 35)

Query 5: (8, 10, 18, 32)
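An answer set can be compared with the specialists' judgments through precision and recall. The sketch below does this for Query 4 under two readings of the keyword set, conjunctive (AND) and disjunctive (OR); the exercise sheet does not say which reading is intended, so both are assumptions. Only a subset of the document vectors is copied in, chosen so that it contains every document that includes a Query 4 keyword. The function names are illustrative.

```python
# Subset of the document vectors listed earlier (not the full collection).
DOCS = {
    7: (10, 17, 35, 37, 41, 51, 82, 96, 120, 124, 152, 174, 175, 179, 188),
    13: (20, 51, 75, 82, 84, 87, 88, 96, 120, 124, 127, 139, 155, 159, 162,
         167, 178, 179),
    27: (35, 39, 51, 96, 139, 168, 179),
    30: (3, 27, 35, 37, 39, 51, 54, 82, 86, 96, 124, 148, 158, 178, 179),
    31: (51, 54, 96),
    34: (51, 96, 141, 190),
    35: (33, 35, 39, 51, 96, 97, 124, 139, 179),
}

QUERY_4 = {147, 179, 180}
RELEVANT_4 = {7, 27, 30, 31, 35}  # specialists' judgments for Query 4


def boolean_and(query, docs):
    """Documents that contain every query keyword."""
    return {d for d, terms in docs.items() if query <= set(terms)}


def boolean_or(query, docs):
    """Documents that contain at least one query keyword."""
    return {d for d, terms in docs.items() if query & set(terms)}


def precision_recall(answer, relevant):
    """Precision and recall of an answer set; 0.0 when a denominator is empty."""
    hits = len(answer & relevant)
    precision = hits / len(answer) if answer else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


print(boolean_and(QUERY_4, DOCS))
print(sorted(boolean_or(QUERY_4, DOCS)))
print(precision_recall(boolean_or(QUERY_4, DOCS), RELEVANT_4))
```

The conjunctive reading returns nothing here because no listed document contains keyword 147 (PATIENTS) or 180 (SWEATING), which shows how sensitive the Boolean answer set is to the chosen query expression.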

 

 

 

 

Problems

 

1. For the queries 1 to 5, do as follows. (a) Compute the answer sets generated by the Boolean model. (b) Compare with the set of relevant documents given by the specialists.

2. For the queries 1 to 5, do as follows. (a) Compute the ranking generated by the Vector model and show your computations. (b) Compare with the set of relevant documents given by the specialists. (c) Compare with the answer sets generated by the Boolean model. What key distinctions do you notice?

3. For the queries 1 to 5, do as follows. (a) Compute the ranking generated by the Probabilistic model and show your computations. Use a single iteration for the computations. (b) Compare with the set of relevant documents given by the specialists. (c) Compare with the answer sets generated by the Vector model and comment.

4. For the queries 1 to 5, do as follows. (a) Compute the ranking generated by the Probabilistic model using several iterations and show your computations. (b) Compare with the probabilistic ranking computed using a single iteration. Does the ranking improve with the number of iterations? Comment.

5. For the queries 1 to 5, do as follows. (a) After the first iteration in the computation of a probabilistic ranking, consider that a perfect user selects (from the ranking computed) which documents are relevant. This can be simulated by looking at the set of relevant documents for each query. Repeat the computation of the probabilistic ranking considering the presence of this perfect user. (b) Compare this probabilistic ranking with the probabilistic ranking of problem 4.

6. One important practical distinction between the probabilistic and the vector rankings is that the former does not include any information about tf-idf weights. Since there is plenty of empirical evidence for the usefulness of tf-idf weights, including them in the ranking is important. (a) Modify the probabilistic ranking (through an extension to the model) to include information on tf-idf weights. Justify your extension.

7. For the queries 1 to 5, do as follows. (a) Compute the ranking generated by the modified Probabilistic model you defined in problem 6 (the one using tf-idf weights). (b) Compare with the ranking generated by the classic Probabilistic model (in problem 4) and comment.

8. One complaint with regard to the probabilistic model is that it is based solely on probabilities that have an epistemological nature. For instance, not even a sample space is clearly defined. Using a metaphor of your choice, devise a probabilistic model based on probabilities that have a frequentist nature. Demonstrate how a ranking could be computed. Hint: A suggestion is to consider a model based on sets and subsets. For instance, terms could be balls, while documents and queries are urns (which contain the balls).

9. For the queries 1 to 5, do as follows. (a) Compute the ranking generated by the frequentist-based Probabilistic model you defined in problem 8. (b) Compare with the answer sets generated by the classic Probabilistic model (in problem 4) and comment.

10. One frequent argument nowadays is that no single ranking strategy can provide a very good ranking by itself. Thus, one alternative that seems promising is to combine two or more rankings into a final ranking. To illustrate, there are Web search engines that use this type of combined ranking. Devise a strategy of your liking to combine the rankings generated by the classic Vector and Probabilistic models. Comment on your framework.

11. For the queries 1 to 5, do as follows. (a) Compute the ranking generated by the combination model you defined in problem 10. (b) Compare with the rankings generated by both the Vector and the Probabilistic models (the rankings computed in problems 2 and 4). (c) Is it the case that the combination ranking is always better? Why?
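As a starting point for the probabilistic problems, the first iteration can be sketched as follows. The sketch assumes the usual initial estimates of the classic probabilistic model, P(k|R) = 0.5 and P(k|not R) = n/N, which this exercise sheet does not spell out; check them against the course material. It ranks four of the documents for Query 2, keywords (147, 195, 196). The document frequencies were counted from the 38 document vectors, and the names `term_weight` and `score` are illustrative.

```python
import math

N_DOCS = 38  # number of documents in the sub-collection

# Number of documents containing each Query 2 keyword, counted from the
# document vectors of this exercise sheet.
DOC_FREQ = {147: 0, 195: 9, 196: 8}

# Four document vectors copied from the collection (not the full collection).
DOCS = {
    1: (44, 51, 96, 128, 129, 151, 195),
    4: (35, 51, 82, 96, 97, 113, 123, 135, 195, 196),
    14: (11, 33, 51, 55, 69, 96, 97, 98, 124, 140, 196),
    31: (51, 54, 96),
}

QUERY_2 = (147, 195, 196)


def term_weight(term, p_rel=0.5):
    """log(p / (1 - p)) + log((1 - q) / q) with q = n / N (assumed estimates)."""
    q = DOC_FREQ[term] / N_DOCS
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - q) / q)


def score(query, doc):
    """Sum the weights of the keywords shared by the query and the document."""
    # A keyword with document frequency 0 is never shared, so q > 0 here.
    return sum(term_weight(t) for t in set(query) & set(doc))


ranking = sorted(DOCS, key=lambda d: score(QUERY_2, DOCS[d]), reverse=True)
for d in ranking:
    print(d, round(score(QUERY_2, DOCS[d]), 3))
```

Later iterations would replace the two initial estimates with values derived from the top-ranked documents (problem 4) or from the documents a user marks as relevant (problem 5).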