Tuesday, July 10, 2018

T8 : International Journal 2

Title: Implementation of a Web-Based Drug Inventory Data Search Application Using the K-Means Method

WEBQUAL: A MEASURE OF WEBSITE QUALITY

ABSTRACT 
The paper presents WebQual™, a Web site quality measure with 12 dimensions. Development was based on an extensive literature review, and interviews with Web designers and Web visitors. The instrument was refined using two successive samples, and the validity of the final instrument was tested with a third confirmatory sample. 

INTRODUCTION 
No comprehensive instrument exists that is specifically designed to capture the consumer's perception of Web site quality in the context of predicting reuse of the site. This paper seeks to address that gap, using as a general underlying model the Theory of Reasoned Action (TRA) (Ajzen & Fishbein 1980; Fishbein & Ajzen 1975), and particularly the TRA as applied to information technology utilization in the Technology Acceptance Model (TAM) (Davis et al., 1989).

The Theory of Reasoned Action (Ajzen & Fishbein 1980; Fishbein & Ajzen 1975) states that individuals' behavior (in this case, revisiting or purchasing from a Web site) can be predicted from their intentions, which in turn can be predicted from their attitudes toward the behavior and from subjective norms. Following the chain of prediction further back, attitudes can be predicted from an individual's beliefs about the consequences of the behavior. Subjective norms can be predicted by knowing whether significant other individuals think the behavior should or should not be performed. TRA is a very general theory and, as such, does not specify which specific beliefs are pertinent in a particular situation.

Davis (1989) applied TRA to a class of behaviors that can be loosely defined as “using computer technologies,” and produced a Technology Acceptance Model (TAM). Davis argues that for the behavior of “using computer technologies,” two particular beliefs are predominant in predicting behavior: perceived ease of use and perceived usefulness. We also believe that there may be multiple distinct dimensions of ease of use and of usefulness, as well as other categories of beliefs such as “entertainment” that together predict intentions to reuse a Web site. Determining the relevant specific dimensions of WebQual, and developing an effective instrument to measure those is the subject of the rest of the paper. 

There are many frameworks for thinking about measurement validity. The framework of Bagozzi (1980) and Bagozzi and Phillips (1982) is used in this paper because of its comprehensive coverage of six key components of validity, which are explained along with the process used to develop WebQual.

INSTRUMENT DEVELOPMENT PROCESS 
The goal was to develop a valid measure of Web site quality that would predict Web site reuse. The overall process included four stages, which together addressed each of Bagozzi’s validity concerns. 

1. We moved beyond and within the two constructs of ease of use and usefulness. That is, we examined whether there are other categories of beliefs that also need to be considered, and whether there are distinct dimensions of “ease of use” and “usefulness” that should be considered separately. 

2. We developed questions for each of the dimensions of WebQual identified in Stage 1. The initial result was an eighty-eight-item instrument that measured 13 distinct beliefs about a Web site. 

3. The instrument was refined by administration to two different samples (N = 511 and N = 336). After each administration, the measurement validity of the constructs was analyzed; problem questions were pruned, revised, or replaced, and redundant dimensions were collapsed. This resulted in an instrument with 36 questions measuring 12 dimensions of WebQual. 

4. A confirmatory analysis of the overall measurement validity of the final instrument was conducted using a new sample of 307 subjects. The instrument demonstrated strong measurement validity for the four validity issues that can be empirically assessed.

STAGE 1: DEFINING THE DIMENSIONS OF WEBQUAL 
In order to reveal the pertinent dimensions of Web site quality and establish content validity, a four-pronged effort was employed more or less simultaneously. First, a review of the MIS and marketing literature revealed existing constructs related to quality and customer satisfaction. In parallel with this effort, we conducted three exploratory research projects to ensure the comprehensiveness of the constructs. These included soliciting criteria from Web surfers, interviewing Web designers, and studying a large organization's standards for Web site design. 

Initial dimensions of Web site quality 
Five general categories of Web site quality arose from a literature review and exploratory research: ease of use, usefulness, entertainment, complementary relationship, and customer service. These can be further broken down into 14 distinct dimensions of Web site quality as shown in Table 1.

STAGE 2: DEVELOPING THE ITEMS 
Scale development can be either deductive or inductive (Hinkin 1998). We incorporated both approaches, through an extensive literature review (deductive) and an exploratory research phase (inductive). 

An initial set of 142 candidate items was developed based on 13 constructs arising out of the literature review and our exploratory studies. This list of items was then refined based on an approach used by Davis (1989) in his pretest of measures for the TAM. Mindful of the cognitive complexity of handling all 13 constructs (Miller 1956), we opted to reduce the difficulty of this initial screening. Twenty experienced Web users from a large southeastern university (5 graduate and 15 undergraduate students) rated the items on how well they corresponded to the four high-level categories of Web site quality (ease of use, usefulness, entertainment, and complementary relationships). 

A non-statistical cluster analysis (similar to Davis, 1989) was performed by incorporating an item into one of the four high-level constructs if at least fifty percent (10 out of 20) of the subjects ranked the item as one of the top three for these particular constructs. This resulted in an initial WebQual instrument of 88 items covering a possible 13 constructs. 

STAGE 3: REFINEMENT 
In order to prevent item-order bias, two random-order versions of the initial instrument were created. To avoid inflating reliabilities through artificially high correlations that arise when subjects answer adjacent questions using anchoring and adjustment, items pertaining to the same construct were separated from one another. Items were measured on a seven-point Likert scale. In addition, reverse-scored items were included to ensure respondents remained alert while completing the survey and to reduce response bias (Hensley 1998; Spector 1992). 

The instrument was refined by examining its reliability and discriminant validity after each of two distinct administrations. With 88 questions in the initial questionnaire and a rule of thumb for factor analysis of at least five times as many observations as there are variables to be analyzed (Hair et al. 1998), at least 440 subjects were required. Data were collected from 510 undergraduates in round 1.

Subjects were given a context (e.g., “Imagine it is your friend's birthday and you are searching for a good gift, a book.”) and told to explore a designated Web site as if they were considering which book to buy for their friend, and then to complete the questionnaire. Three of each of four different types of Web sites were used (12 in all). The sites were chosen for their variability in quality, based on rankings of specific Web sites generated by subjects in the exploratory research phase. To control for time-of-day bias, the time of day at which the sites were visited was varied.

Item assessment and purification: round 1 
Data analysis and purification consisted of the following three steps. First, the Cronbach's alpha for each measure of the 13 target constructs was calculated. Items that were determined to decrease the reliability (alpha) of a construct’s measure were deleted and the process continued until no item’s removal increased a construct’s overall alpha. The end result was the removal of eleven items. 
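This alpha-driven pruning can be sketched as follows (a minimal illustration; `cronbach_alpha` and `prune_items` are hypothetical helper names, not from the paper):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_subjects, n_items) response matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def prune_items(items: np.ndarray) -> list:
    """Greedily drop the item whose removal most increases alpha,
    stopping when no removal improves it (mirroring the paper's rule)."""
    keep = list(range(items.shape[1]))
    while len(keep) > 2:
        alpha = cronbach_alpha(items[:, keep])
        gains = [(cronbach_alpha(items[:, [j for j in keep if j != i]]) - alpha, i)
                 for i in keep]
        best_gain, worst = max(gains)
        if best_gain <= 0:
            break
        keep.remove(worst)
    return keep
```

With a few items tracking a common latent trait plus one pure-noise item, the noise item is the one removed.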

As a second means to identify internal consistency problems, those items found to possess low correlations with similar traits (i.e., less than .40) were removed from the instrument. A total of 13 items were deleted during this phase, while one was simply modified in order to clarify its meaning. 

The final step consisted of removing items that appeared to have discriminant validity problems. Items were removed if they correlated more highly with items measuring different constructs than they did with items in their intended construct (Campbell et al. 1959; Goodhue 1998). Under these criteria, eight items were deleted. 

After the deletions, each construct was reviewed to ensure that at least five items per dimension remained. (This permitted us to drop up to two items for each dimension, if we discovered measurement validity problems, and still have at least three items per dimension.) For those dimensions that were underrepresented, additional items closely related to the remaining items were added. Twenty-seven items were added, resulting in an 83-item instrument.

STAGE 4: FINAL ITEM SELECTION AND ASSESSMENT OF MEASUREMENT VALIDITY 
A second round of data collection allowed testing of the measurement validity of the second version of the instrument. Data were collected from 336 undergraduate students. A two-step process was employed to select the subset of the questions to be included in the final version of WebQual.

Discriminant validity: round 2 
Discriminant validity for the second version of the questionnaire was first assessed using exploratory factor analysis (EFA). Five of the 13 constructs (information quality, fit-to-task, interactivity, innovativeness, and business processes) appeared to have some possible discriminant validity problems. All other constructs loaded on separate factors. 

To explicitly test for discriminant validity of the five problematic constructs, we used confirmatory factor analysis (CFA) and chi-squared difference tests. The results revealed that the information quality and fit-to-task constructs should be combined. Thus, the outcome of discriminant validity analysis was to reduce the WebQual measure from 13 to 12 constructs. 

Validity of final instrument 
Once the 12 key constructs of WebQual were identified, the three items loading most strongly on each factor were chosen for the final questionnaire. Since a construct should have at least three items (Cronbach & Meehl 1955) and we sought a parsimonious instrument (lengthy questionnaires typically have a lower response rate) (Babbie 1998), three items were selected for each construct. This final version of 36 items was used to assess the empirically testable validity concerns (Bagozzi's last four concerns from Table 1).

Confirmatory factor analysis 
A final confirmatory factor analysis was run using a sample of 307 undergraduate students. The results indicate strong support for the overall fit of the model. Both the RMSEA (0.060) and SRMR (0.053) are within conservative limits of acceptable error. Further the RNI (.92) and NNFI (.91) confirm good model fit. 

Internal consistency (reliability) 
The reliability of the final questionnaire (12 constructs with 3 questions each) was calculated using Cronbach's alpha (Cronbach 1951). The alphas of the twelve constructs ranged from .72 to .93, with 10 of the 12 constructs having an alpha greater than .80. Of the two remaining constructs, one had an alpha of .79, and one had an alpha of .72. 

Discriminant validity 
As a final check of discriminant validity, we tested all possible pairs of the 12 remaining constructs to see if fit was improved when any pair was collapsed into a single construct. The two most highly correlated constructs are information fit to task and interactivity, correlated at .90. Even though these two are highly correlated, the discriminant analysis confirms that all 12 constructs are separate dimensions of a Web site's quality. 

Convergent validity 
Following the example of the development of SERVQUAL (Parasuraman et al. 1988), subjects were asked one additional question on overall Web site quality: “Overall, how would you rate the quality of this Web site?” (a 1-to-7 scale anchored on “poor” and “excellent”). The total WebQual score (36 items) was computed and compared to this overall quality question. The two correlated at .78 (p < .001), indicating convergent validity. 

Nomological validity/predictive validity 
Confidence in the measure increases if it behaves as expected in relation to other accepted constructs (Bagozzi 1979; Bagozzi 1980). In the case of WebQual, predictive validity is demonstrated by testing the ability of the instrument to predict a Web visitor's intention to purchase from or revisit a Web site. The correlations between the composite WebQual measure and intention to purchase and intention to revisit were .56 and .53, respectively (both with p < .0001). This is good confirmation of nomological validity.

Adequacy of model fit 
Four recommended fit indices (Vandenberg et al. 2000) indicate quite acceptable fit for the 12 construct, 36 question model (see Table 2). These indices provide consistent and reinforcing indications of the overall adequacy of WebQual.

DISCUSSION 
This research makes two contributions. First, it provides practitioners and researchers with a validated, reliable measure of Web site quality. The term “electronic commerce” may soon be redundant, since nearly all commerce may be electronic (Porter, 2001). Even with slightly less optimistic growth of electronic commerce, firms will increasingly need a means of assessing the quality of a Web site. Web sites will in many cases shape the customer's view of the firm (Watson, 1998) and could have an important impact on performance. Second, this study adds to our understanding of TAM, a widely used MIS instrument, by revealing the components of ease of use and usefulness. Thus, it provides the basis for refining TAM to increase its diagnostic power.

LIMITATIONS 
WebQual’s development was based on the responses of undergraduate business students to a selected group of Web sites. While these subjects are typical of a substantial body of Web users, they are not a representative sample of all users. Furthermore, many of the subjects were not ongoing customers of the sites selected for assessment. These important limitations are typical of those facing most instrument developers because such work often needs to start in an environment where many subjects are readily and repeatedly available. Further confirmatory research needs to be done with broad samples of on-going customers of a range of Web sites.

CONCLUSION 
In the age of the Internet and electronic commerce, MIS and marketing need a means of assessing the effectiveness of a Web site. Our efforts have produced, we believe, a valid and reliable instrument for measuring Web site quality. WebQual should be able to support a range of important MIS and marketing studies as researchers attempt to understand what contributes to success in the electronic marketspace. The research presents the beginning of a cumulative research program by the authors, and we hope others, to understand the nature and characteristics of high quality Web sites.

Source: https://www.researchgate.net/publication/313001568_WEBQUAL_A_measure_of_website_quality_2002_Marketing_Educators

T8 : International Journal 1

Title: Implementation of a Web-Based Drug Inventory Data Search Application Using the K-Means Method

Constrained K-means Clustering with Background Knowledge 

Abstract 
Clustering is traditionally viewed as an unsupervised method for data analysis. However, in some cases information about the problem domain is available in addition to the data instances themselves. In this paper, we demonstrate how the popular k-means clustering algorithm can be profitably modified to make use of this information. In experiments with artificial constraints on six data sets, we observe improvements in clustering accuracy. We also apply this method to the real-world problem of automatically detecting road lanes from GPS data and observe dramatic increases in performance.

1. Introduction 
Clustering algorithms are generally used in an unsupervised fashion. They are presented with a set of data instances that must be grouped according to some notion of similarity. The algorithm has access only to the set of features describing each object; it is not given any information (e.g., labels) as to where each of the instances should be placed within the partition. 

However, in real application domains, it is often the case that the experimenter possesses some background knowledge (about the domain or the data set) that could be useful in clustering the data. Traditional clustering algorithms have no way to take advantage of this information even when it does exist. 

We are therefore interested in ways to integrate background information into clustering algorithms. We have previously had success with a modified version of COBWEB (Fisher, 1987) that uses background information about pairs of instances to constrain their cluster placement (Wagstaff & Cardie, 2000). K-means is another popular clustering algorithm that has been used in a variety of application domains, such as image segmentation (Marroquin & Girosi, 1993) and information retrieval (Bellot & El-Beze, 1999). Due to its widespread use, we believe that developing a modified version that can make use of background knowledge can be of significant use to the clustering community. 

The major contributions of the current work are twofold. First, we have developed a k-means variant that can incorporate background knowledge in the form of instance-level constraints, thus demonstrating that this approach is not limited to a single clustering algorithm. In particular, we present our modifications to the k-means algorithm and demonstrate its performance on six data sets. 

Second, while our previous work with COBWEB was restricted to testing with random constraints, we demonstrate the power of this method applied to a significant real-world problem (see Section 6). 

In the next section, we provide some background on the k-means algorithm. Section 3 examines in detail the constraints we propose using and presents our modified k-means algorithm. Next, we describe our evaluation methods in Section 4. We present experimental results in Sections 5 and 6. Finally, Section 7 compares our work to related research and Section 8 summarizes our contributions.

2. K-means Clustering 
K-means clustering (MacQueen, 1967) is a method commonly used to automatically partition a data set into k groups. It proceeds by selecting k initial cluster centers and then iteratively refining them as follows:

1. Each instance d_i is assigned to its closest cluster center. 

2. Each cluster center C_j is updated to be the mean of its constituent instances. 

The algorithm converges when there is no further change in assignment of instances to clusters. 

In this work, we initialize the clusters using instances chosen at random from the data set. The data sets we used are composed solely of either numeric features or symbolic features. For numeric features, we use a Euclidean distance metric; for symbolic features, we compute the Hamming distance. 
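The basic loop above can be sketched as follows for numeric features with Euclidean distance (a minimal illustration with random-instance initialization; the `kmeans` helper name and details are assumptions, not the authors' code):

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """Plain k-means: assign each point to its nearest center, then
    update each center to the mean of its members, until stable."""
    rng = np.random.default_rng(seed)
    # initialize centers with k distinct instances chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev = None
    for _ in range(n_iter):
        # step 1: assign each instance to its closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        if prev is not None and np.array_equal(assign, prev):
            break  # converged: no change in assignments
        prev = assign
        # step 2: update each center to the mean of its members
        for j in range(k):
            members = X[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return assign, centers
```

For symbolic features, the distance computation would be swapped for Hamming distance, as the text notes.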

The final issue is how to choose k. For data sets where the optimal value of k is already known (i.e., all of the UCI data sets), we make use of it; for the real-world problem of finding lanes in GPS data, we use a wrapper search to locate the best value of k. More details can be found in Section 6.

3. Constrained K-means Clustering 
We now proceed to a discussion of our modifications to the k-means algorithm. In this work, we focus on background knowledge that can be expressed as a set of instance-level constraints on the clustering process. After a discussion of the kind of constraints we are using, we describe the constrained k-means clustering algorithm. 

3.1 The Constraints 
In the context of partitioning algorithms, instance-level constraints are a useful way to express a priori knowledge about which instances should or should not be grouped together. Consequently, we consider two types of pairwise constraints: 

      • Must-link constraints specify that two instances have to be in the same cluster. 
      • Cannot-link constraints specify that two instances must not be placed in the same cluster. 

The must-link constraints define a transitive binary relation over the instances. Consequently, when making use of a set of constraints (of both kinds), we take a transitive closure over the constraints. The full set of derived constraints is then presented to the clustering algorithm. In general, constraints may be derived from partially labeled data (cf. Section 5) or from background knowledge about the domain or data set (cf. Section 6).
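One way to sketch the transitive-closure step (a hypothetical `closure` helper using union-find; pairs inside a must-link component become must-links, and a cannot-link between two points separates their whole components):

```python
from itertools import combinations, product

def closure(n, must, cannot):
    """Derive the full constraint set implied by transitivity of must-links."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in must:
        parent[find(a)] = find(b)  # union the two components
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    # every pair inside a component is must-linked
    must_out = {frozenset(p) for c in comps.values() for p in combinations(c, 2)}
    # a cannot-link between two points separates their entire components
    cannot_out = {frozenset((x, y))
                  for a, b in cannot
                  for x, y in product(comps[find(a)], comps[find(b)])}
    return must_out, cannot_out
```

For example, must-links (0,1) and (1,2) plus cannot-link (2,3) imply that 0 and 1 also cannot be grouped with 3.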
3.2 The Constrained Algorithm 
Table 1 contains the modified k-means algorithm (COP-KMEANS) with our changes in bold. The algorithm takes in a data set (D), a set of must-link constraints (Con=), and a set of cannot-link constraints (Con≠). It returns a partition of the instances in D that satisfies all specified constraints. 

The major modification is that, when updating cluster assignments, we ensure that none of the specified constraints is violated. We attempt to assign each point d_i to its closest cluster C_j. This will succeed unless a constraint would be violated: if there is another point that must be linked to d_i but is already in some other cluster, or another point that cannot be linked with d_i but is already in C_j, then d_i cannot be placed in C_j. We continue down the sorted list of clusters until we find one that can legally host d_i. Constraints are never broken; if a legal cluster cannot be found for d_i, the empty partition ({}) is returned. An interactive demo of this algorithm can be found at http://www.cs.cornell.edu/home/wkiri/cop-kmeans/.
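The constrained assignment step can be sketched as follows (hypothetical `violates` and `constrained_assign` helpers; `assign` maps already-placed instances to cluster indices):

```python
def violates(d, cluster, assign, must, cannot):
    """True if placing instance d into `cluster` would break a constraint."""
    for a, b in must:
        other = b if a == d else a if b == d else None
        if other is not None and assign.get(other) is not None \
                and assign[other] != cluster:
            return True  # must-link partner already sits elsewhere
    for a, b in cannot:
        other = b if a == d else a if b == d else None
        if other is not None and assign.get(other) == cluster:
            return True  # cannot-link partner already sits here
    return False

def constrained_assign(d, clusters_by_distance, assign, must, cannot):
    """Try clusters nearest-first; return the first legal one, else None
    (None corresponding to COP-KMEANS failing with the empty partition)."""
    for c in clusters_by_distance:
        if not violates(d, c, assign, must, cannot):
            return c
    return None
```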

4. Evaluation Method 
The data sets used for the evaluation include a “correct answer” or label for each data instance. We use the labels in a post-processing step for evaluating performance. 

To calculate agreement between our results and the correct labels, we make use of the Rand index (Rand, 1971). This provides a measure of agreement between two partitions, P1 and P2, of the same data set D. Each partition is viewed as a collection of n(n−1)/2 pairwise decisions, where n is the size of D. For each pair of points d_i and d_j in D, each partition either assigns them to the same cluster or to different clusters. Let a be the number of decisions where d_i and d_j are in the same cluster in both P1 and P2. Let b be the number of decisions where the two instances are placed in different clusters in both partitions. Total agreement can then be calculated as Rand(P1, P2) = (a + b) / (n(n−1)/2).

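The Rand index can be computed directly from this pairwise-decision definition (a minimal sketch, taking each partition as a list of cluster labels):

```python
from itertools import combinations

def rand_index(p1, p2):
    """Fraction of pairwise same/different-cluster decisions on which
    two labelings of the same data set agree."""
    n = len(p1)
    agree = sum((p1[i] == p1[j]) == (p2[i] == p2[j])
                for i, j in combinations(range(n), 2))
    return agree / (n * (n - 1) / 2)
```

Note that the index is invariant to relabeling: two partitions with identical groupings but swapped cluster names score 1.0.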
We used this measure to calculate accuracy for all of our experiments. We were also interested in testing our hypothesis that constraint information can boost performance even on unconstrained instances. Consequently, we present two sets of numbers: the overall accuracy for the entire data set, and accuracy on a held-out test set (a subset of the data set composed of instances that are not directly or transitively affected by the constraints). This is achieved via 10-fold cross-validation; we generate constraints on nine folds and evaluate performance on the tenth. This enables a true measurement of improvements in learning, since any improvements on the held-out test set indicate that the algorithm was able to generalize the constraint information to the unconstrained instances as well.

5. Experimental Results Using Artificial Constraints 
In this section, we report on experiments using six well-known data sets in conjunction with artificially generated constraints. Each graph demonstrates the change in accuracy as more constraints are made available to the algorithm. The true value of k is known for these data sets, and we provided it as input to our algorithm. 

The constraints were generated as follows: for each constraint, we randomly picked two instances from the data set and checked their labels (which are available for evaluation purposes but not visible to the clustering algorithm). If they had the same label, we generated a must-link constraint. Otherwise, we generated a cannot-link constraint. We conducted 100 trials on each data set (where a trial is one 10-fold cross validation run) and averaged the results. 
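The constraint-generation procedure is simple to sketch (hypothetical helper name; labels are visible only to the generator, not to the clustering algorithm):

```python
import random

def random_constraints(labels, n_constraints, seed=0):
    """Pick random instance pairs; same label -> must-link,
    different labels -> cannot-link."""
    rng = random.Random(seed)
    must, cannot = [], []
    for _ in range(n_constraints):
        i, j = rng.sample(range(len(labels)), 2)  # two distinct instances
        (must if labels[i] == labels[j] else cannot).append((i, j))
    return must, cannot
```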

In our previous work with COP-COBWEB, a constrained partitioning variant of COBWEB, we made use of three UCI data sets (soybean, mushroom, and tic-tac-toe) and one “real-world” data set that involved part of speech data (Wagstaff & Cardie, 2000). In this work, we replicated our COP-COBWEB experiments for the purpose of comparison with COP-KMEANS. COBWEB is an incremental algorithm, while k-means is a batch algorithm. Despite their significant algorithmic differences, we found that both algorithms improved almost identically when supplied with the same amount of background information.
Figure 1. COP-KMEANS results on soybean 

The first data set of interest is soybean, which has 47 instances and 35 attributes. Four classes are represented in the data. Without any constraints, the k-means algorithm achieves an accuracy of 87% (see Figure 1). Overall accuracy steadily increases with the incorporation of constraints, reaching 99% after 100 random constraints. 

We applied the Rand index to the set of constraints vs. the true partition. Because the Rand index views a partition as a set of pairwise decisions, this allowed us to calculate how many of those decisions were 'known' by the set of constraints. For this data set, 100 random constraints achieve an average accuracy of 48%. We can therefore see that combining the power of clustering with background information achieves better performance than either in isolation.

Held-out accuracy also improves, achieving 98% with 100 constraints. This represents a held-out improvement of 11% over the baseline (no constraints). Similarly, COP-COBWEB starts at 85% accuracy with no constraints, reaching a held-out accuracy of 96% with 100 random constraints.
Figure 2. COP-KMEANS results on mushroom 

We next turn to the mushroom data set, with 50 instances and 21 attributes. It contains two classes. In the absence of constraints, the k-means algorithm achieves an accuracy of 69% (Figure 2). After incorporating 100 random constraints, overall accuracy improves to 96%. In this case, 100 random constraints achieve 73% accuracy before any clustering occurs. Held-out accuracy climbs to 82%, yielding an improvement of 13% over the baseline. COP-COBWEB starts at 67% accuracy with no constraints, with held-out accuracy reaching 83% with 100 constraints. 

The third data set under consideration is the part-of-speech data set (Cardie, 1993). A subset of the full data set, it contains 50 instances, each described by 28 attributes. There are three classes in this data set. Without constraints, the k-means algorithm achieves an accuracy of 58%. (We have omitted the graph for this data set, as COP-KMEANS and COP-COBWEB show very similar performance, just as in the previous two figures.) After incorporating 100 random constraints, overall accuracy improves to 87%. Here, 100 random constraints attain 56% accuracy. Held-out accuracy climbs to 70%, yielding an improvement of 12% over the baseline. Likewise, COP-COBWEB starts at 56% accuracy with no constraints, reaching a held-out accuracy of 72% with 100 constraints.
Figure 3. COP-KMEANS results on tic-tac-toe 

Finally, we focus on the tic-tac-toe data set. There are 100 instances in this data set, each described by 9 attributes. There are two classes in this data set. Without constraints, the k-means algorithm achieves an accuracy of 51% (Figure 3). After incorporating 500 random constraints, overall accuracy is 92%. This set of constraints achieves 80% accuracy in isolation. Held-out accuracy reaches 56%, achieving a 5% increase in accuracy. 

COP-COBWEB behaves somewhat worse on this data set, with held-out performance staying roughly at the 49% mark. We believe that this data set is particularly challenging because the classification of a board as a win or a loss for the X player requires extracting relational information between the attributes — information not contained in our instance-level constraints. 

In contrast to the COP-COBWEB experiments, which made use of data sets with symbolic (categorical) attributes, we also experimented with COP-KMEANS on two UCI data sets with numeric (continuous) attributes. On the iris data set (150 instances, four attributes, three classes), incorporating 400 random constraints yielded a 7% increase in held-out accuracy. Overall accuracy climbed from 84% to 98%, and the set of constraints achieved 66% accuracy. Behavior on the wine data set (178 instances, 13 attributes, three classes) was similar to that of the tic-tac-toe data set, with only marginal held-out improvement (although overall accuracy, as usual, increased dramatically, from 71% to 94%). The constraint set achieved 68% in isolation. 

What we can conclude from this section is that even randomly-generated constraints can improve clustering accuracy. As one might expect, the improvement obtained depends on the data set under consideration. If the constraints are generalizable to the full data set, improvements can be observed even on unconstrained instances.

6. Experimental Results on GPS Lane Finding 
In all of the above experiments, the constraints we experimented with were randomly generated from the true data labels. To demonstrate the utility of constrained clustering with real domain knowledge, we applied COP-KMEANS to the problem of lane finding in GPS data. In this section, we report on the results of these experiments. More details can be found in Schroedl et al. (2001), which focuses specifically on the problem of map refinement and lane finding. 

As we will show, the unconstrained k-means algorithm performs abysmally compared to COP-KMEANS, which has access to additional domain knowledge about the problem. Section 6.2 describes how we transformed this domain knowledge into a useful set of instance-level constraints. 

6.1 Lane Finding Explained 
Digital road maps currently exist that enable several applications, such as generating personalized driving directions. However, these maps contain only coarse information about the location of a road. By refining maps down to the lane level, we enable a host of more sophisticated applications such as alerting a driver who drifts from the current lane. 

Our approach to this problem is based on the observation that drivers tend to drive within lane boundaries. Over time, lanes should correspond to “densely traveled” regions (in contrast to the lane boundaries, which should be “sparsely traveled”). Consequently, we hypothesized that it would be possible to collect data about the location of cars as they drive along a given road and then cluster that data to automatically determine where the individual lanes are located. 

We collected data approximately once per second from several drivers using GPS receivers affixed to the top of the vehicle being driven. Each data point is represented by two features: its distance along the road segment and its perpendicular offset from the road centerline. For evaluation purposes, we asked the drivers to indicate which lane they occupied and any lane changes. This allowed us to label each data point with its correct lane.

6.2 Background Knowledge as Constraints 
For the problem of automatic lane detection, we focused on two domain-specific heuristics for generating constraints: trace contiguity and maximum separation. These represent knowledge about the domain that can be encoded as instance-level constraints. 

Trace contiguity means that, in the absence of lane changes, all of the points generated from the same vehicle in a single pass over a road segment should end up in the same lane. 

Maximum separation refers to a limit on how far apart two points can be (perpendicular to the centerline) while still being in the same lane. If two points are separated by at least four meters, then we generate a constraint that will prevent those two points from being placed in the same cluster. 
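The two heuristics above can be sketched as simple constraint generators. A minimal illustration in Python, assuming each point is a (trace_id, along, offset) tuple; the names and data format are our assumptions, not the paper's:

```python
from itertools import combinations

MAX_SEPARATION = 4.0  # meters, perpendicular to the centerline

def generate_constraints(points):
    """Build must-link pairs from trace contiguity and cannot-link pairs
    from the maximum-separation heuristic (assumes no lane changes)."""
    must_link, cannot_link = set(), set()
    for i, j in combinations(range(len(points)), 2):
        trace_i, _, offset_i = points[i]
        trace_j, _, offset_j = points[j]
        if trace_i == trace_j:
            # trace contiguity: same vehicle, same pass => same lane
            must_link.add((i, j))
        elif abs(offset_i - offset_j) >= MAX_SEPARATION:
            # too far apart perpendicular to the centerline => different lanes
            cannot_link.add((i, j))
    return must_link, cannot_link

points = [("a", 0.0, 0.1), ("a", 1.0, 0.2), ("b", 0.5, 4.5)]
ml, cl = generate_constraints(points)
print(sorted(ml))  # [(0, 1)]
print(sorted(cl))  # [(0, 2), (1, 2)]
```

Note that in practice the trace-contiguity rule must be applied per lane-change-free subsegment of a trace, which the labels collected above make possible.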

To better analyze performance in this domain, we modified the cluster center representation. The usual way to compute the center of a cluster is to average all of its constituent points. There are two significant drawbacks of this representation. First, the center of a lane is a point halfway along its extent, which commonly means that points inside the lane but at the far ends of the road appear to be extremely far from the cluster center. Second, applications that make use of the clustering results need more than a single point to define a lane. 

Consequently, we instead represented each lane cluster with a line segment parallel to the centerline. This more accurately models what we conceptualize as “the center of the lane”, provides a better basis for measuring the distance from a point to its lane cluster center, and provides useful output for other applications. Both the basic k-means algorithm and COP-KMEANS make use of this lane representation (for this problem). 
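A rough sketch of why the segment representation helps: the distance from a point to a segment parallel to the centerline clamps the along-road coordinate to the segment's extent, so points at the far ends of a long lane are no longer penalized. This is our own illustration; the paper does not give an explicit formula:

```python
import math

def segment_distance(point, segment):
    """Distance from a (along, offset) point to a lane segment parallel to
    the centerline, given as (start_along, end_along, offset)."""
    along, offset = point
    start, end, seg_offset = segment
    nearest_along = min(max(along, start), end)  # clamp to the segment's extent
    return math.hypot(along - nearest_along, offset - seg_offset)

# A point midway along a 100 m lane at offset 3.0 is judged by its offset alone:
print(segment_distance((50.0, 1.0), (0.0, 100.0, 3.0)))  # 2.0
# With a single centroid at (50, 3.0), a point at the lane's end would look far away:
print(math.hypot(0.0 - 50.0, 3.0 - 3.0))  # 50.0
```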

6.3 Experiment 1: Comparison with K-means 
Table 2 presents accuracy results for both algorithms over 20 road segments. The number of data points for each road segment is also indicated. These data sets are much larger than the UCI data sets, providing a chance to test the algorithms' scaling abilities.

In these experiments, the algorithms were required to select the best value for the number of clusters, k. To this end, we used a second measure that trades off cluster cohesiveness against simplicity (i.e., the number of clusters). Note that this measure differs from the objective function used by k-means and COP-KMEANS while clustering. In the language of Jain and Dubes (1988), the former is a relative criterion, while the latter is an internal criterion.

In the lane finding domain, the problem of selecting k is particularly challenging due to the large amount of noise in the GPS data. Each algorithm performed 30 randomly-initialized trials with each value of k (from 1 to 5). COP-KMEANS selected the correct value of k for all but one road segment, but k-means never chose the correct value (even though it used the same method for selecting k). As shown in Table 2, COP-KMEANS consistently outperformed the unconstrained k-means algorithm, attaining 100% accuracy on all but three data sets and averaging 98.6% overall. (More precisely, the selection measure calculates the average squared distance from each point to its assigned cluster center and penalizes the complexity of the solution: (Σ_i dist(d_i, d_i.cluster)² / n) · k². The goal is to minimize this value.) The unconstrained version performed much worse, averaging 58.0% accuracy. The clusters it produces often span multiple lanes and never cover the entire road segment lengthwise. Lane clusters have a very specific shape: they are greatly elongated and usually oriented horizontally (with respect to the road centerline). Yet even when the cluster center is a line rather than a point, k-means seeks compact, usually spherical clusters. Consequently, it does a very poor job of locating the true lanes in the data.
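The relative criterion used for selecting k can be sketched in a few lines. We assume here that the penalty takes the form (mean squared point-to-center distance) · k², minimized over candidate values of k:

```python
def selection_score(distances, k):
    """Relative criterion for choosing k: mean squared point-to-center
    distance, penalized by k^2 (assumed form; smaller is better)."""
    n = len(distances)
    return (sum(d * d for d in distances) / n) * k ** 2

# Tighter clusters lower the first factor; more clusters raise the penalty.
print(selection_score([1.0, 1.0, 1.0, 1.0], k=2))  # 4.0
print(selection_score([0.5, 0.5, 0.5, 0.5], k=4))  # 4.0
```

The two calls illustrate the trade-off: halving every distance exactly offsets the penalty of doubling k, so the criterion is indifferent between the two solutions.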

For example, Figure 4 shows the output of the regular k-means algorithm for data set 6. The horizontal axis is the distance along the road (in meters) and the vertical axis is the centerline offset. There are four true lanes. The points for each of the four clusters found by k-means are represented by different symbols. Clearly, these lanes do not correspond to the true lanes.

Figure 4. K-means output for data set 6, k=4 

The final column in Table 2 is a measure of how much is known after generating the constraints and before doing any clustering. It shows that an average accuracy of 50.4% can be achieved using the background information alone. What this demonstrates is that neither general similarity information (k-means clustering) nor domain-specific information (constraints) alone perform very well, but that combining the two sources of information effectively (COP-KMEANS) can produce excellent results. 

An analysis of the errors made by COP-KMEANS on the lane-finding data sets showed that each mistake arose for a different reason. For data set 12, the algorithm incorrectly included part of a trace from lane 4 in lane 3. This appears to have been caused by noise in the GPS points in question: they are significantly closer to lane 3 than to lane 4. On data set 16, COP-KMEANS chose the wrong value for k (it decided on three lanes rather than four). This road segment contains very few traces, which possibly contributed to the difficulty. Since COP-KMEANS made so few errors on these data sets, it is not possible to provide a more general characterization of their causes.

It might be argued that k-means is simply a poor choice of algorithm for this problem. However, the marked improvements we observed with COP-KMEANS suggest another advantage of this method: the choice of algorithm may matter less when constraints based on domain knowledge are available. For this task, even a poorly-performing algorithm was boosted to extremely high accuracy. In essence, it appears that domain knowledge can make performance less sensitive to which algorithm is chosen.

6.4 Experiment 2: Comparison with agglom 
Rogers et al. (1999) previously experimented with a clustering approach that viewed lane finding as a one-dimensional problem. Their algorithm (agglom) only made use of the centerline offset of each point. They used a hierarchical agglomerative clustering algorithm that terminated when the two closest clusters were more than a given distance apart (which represented the maximum width of a lane).

This approach is quite effective when there are no lane merges or splits across a segment, i.e., each lane continues horizontally from left to right with no interruptions. For the data sets listed in Table 2, their algorithm obtains an average accuracy of 99.4%.
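The one-dimensional procedure can be approximated in a few lines. This is a hypothetical sketch (Rogers et al.'s exact linkage and stopping details may differ): sort the offsets, then repeatedly merge the closest adjacent clusters until the smallest gap exceeds the maximum lane width:

```python
def agglom_1d(offsets, max_lane_width):
    """Cluster 1-D centerline offsets agglomeratively; stop when the
    nearest clusters are farther apart than max_lane_width."""
    clusters = [[o] for o in sorted(offsets)]
    while len(clusters) > 1:
        # with sorted 1-D data, only adjacent clusters can be the closest pair
        gap, i = min((clusters[j + 1][0] - clusters[j][-1], j)
                     for j in range(len(clusters) - 1))
        if gap > max_lane_width:
            break
        clusters[i] += clusters.pop(i + 1)
    return clusters

print(agglom_1d([0.1, 3.8, 0.3, 7.2, 3.5], max_lane_width=2.0))
# [[0.1, 0.3], [3.5, 3.8], [7.2]]
```

Because the procedure ignores the along-road coordinate entirely, it has no way to represent where a lane begins or ends, which is exactly what makes merges and splits problematic for it.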

However, all of these data sets were taken from a freeway, where the number of lanes is constant over the entirety of each road segment. In cases where there are lane merges or splits, the one-dimensional approach is inadequate because it cannot represent the extent of a lane along the road segment. We are currently in the process of obtaining data for a larger variety of roads, including segments with lane merges and splits, which we expect will illustrate this difference more clearly. 

7. Related Work 
A lot of work on certain varieties of constrained clustering has been done in the statistical literature (Gordon, 1973; Ferligoj & Batagelj, 1983; Lefkovitch, 1980). In general, this work focuses exclusively on agglomerative clustering algorithms and contiguity constraints (similar to the above trace contiguity constraint). In particular, no accommodation is provided for constraints that dictate the separation of two items. In addition, the contiguity relation is assumed to cover all data items. This contrasts with our approach, which can easily handle partial constraint relations that only cover a subset of the instances. 

In the machine learning literature, Thompson and Langley (1991) performed experiments with providing an initial “priming” concept hierarchy to several incremental unsupervised clustering systems. The algorithms were then free to modify the hierarchy as appropriate. In contrast to these soft constraints, our approach focuses on hard, instance-level constraints. 

Additionally, Talavera and Béjar incorporated domain knowledge into an agglomerative algorithm, ISAAC (Talavera & Béjar, 1999). It is difficult to classify ISAAC’s constraints as uniformly hard or soft. The final output is a partition (from some level of the hierarchy), but the algorithm decides at which level of the hierarchy each constraint will be satisfied. Consequently, a given constraint may or may not be satisfied by the output.

It is possible for the k-means algorithm to evolve empty clusters in the course of its iterations. This is undesirable, since it can produce a result with fewer than k clusters. Bradley et al. (2000) developed a method to ensure that this would never happen by imposing a minimum size on each cluster. Effectively, these act as cluster-level constraints. Like our instance-level constraints, they can be used to incorporate domain knowledge about the problem. For example, we know that road lanes must be separated by some minimum distance. However, we have not yet incorporated this type of constraint as an input to the clustering algorithm; rather, we simply discard solutions that contain lanes that are deemed too close together. We are interested in exploring these cluster-level constraints and integrating them more closely with the clustering algorithm itself.

8. Conclusions and Future Directions 
We have developed a general method for incorporating background knowledge in the form of instance-level constraints into the k-means clustering algorithm. In experiments with random constraints on six data sets, we have shown significant improvements in accuracy. Interestingly, the results obtained with COP-KMEANS are very similar to those obtained with COP-COBWEB. In addition, we have demonstrated how background information can be utilized in a real domain, GPS lane finding, and reported on impressive gains in accuracy.

We see several avenues for improvements and future work. The use of constraints while clustering means that, unlike the regular k-means algorithm, the assignment of instances to clusters can be order-sensitive. If a poor decision is made early on, the algorithm may later encounter an instance di that has no possible valid cluster (e.g., it has a cannot-link constraint to at least one item in each of the k clusters). This occasionally occurred in our experiments (for some of the random data orderings). Ideally, the algorithm would be able to backtrack, rearranging some of the instances so that di could then be validly assigned to a cluster. 
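The dead-end situation can be seen in the constraint check that COP-KMEANS performs at assignment time. A minimal sketch (function and variable names are ours, not the paper's):

```python
def violates_constraints(i, cluster, assignment, must_link, cannot_link):
    """True if assigning point i to cluster breaks any constraint, given
    the (possibly partial) assignment built so far."""
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            # a must-linked partner already placed in a different cluster
            if other in assignment and assignment[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            # a cannot-linked partner already placed in this cluster
            if assignment.get(other) == cluster:
                return True
    return False

# With k = 2, point 2 cannot-links to a member of each cluster: there is no
# valid cluster left, and without backtracking the algorithm must fail.
assignment = {0: 0, 1: 1}
cannot_link = [(0, 2), (1, 2)]
stuck = all(violates_constraints(2, c, assignment, [], cannot_link) for c in (0, 1))
print(stuck)  # True
```

A backtracking variant would respond to `stuck` by reassigning one of the earlier points instead of aborting.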

Additionally, we are interested in extending the constrained clustering approach to include hierarchical algorithms. COP-KMEANS and COP-COBWEB both generate a partition of the data and therefore are well-situated to take advantage of hard, instance-level constraints. A different constraint formulation will be required for hierarchical algorithms.

Finally, we would like to explore an alternative to the hard constraints presented here. Often domain knowledge is heuristic rather than exact, and it is possible that it would be better expressed by a "soft" constraint.

Acknowledgements 
We thank Pat Langley for his assistance and suggestions and for access to the GPS data sets. We would especially like to thank Westley Weimer for advice and suggestions on the work as it progressed. We also thank Marie desJardins for her insightful comments.

sumber: https://web.cse.msu.edu/~cse802/notes/ConstrainedKmeans.pdf

Rabu, 04 Juli 2018

T7: Exercise about Coordinate Connectors

Exercise about Coordinate Connectors
1.          A spacecraft is freed from friction ___ launched into space.
A.  It
B.  It is
C.  After is
D. After it is
The clause above is adverb clause, “after” is the connector and followed by the subject “it”
2.         ___ with their surroundings, or they hide in crevices for protection.
A.        Lobsters
B.        Lobsters blend
C.        Lobsters blending
D.       Because lobsters blend
The clause above uses the coordinate connector "or", followed by the subject "they"
3.         _____ a ball-and-socket joint, the elbow is a simple hinge joint.
A.        While the shoulder
B.        While the shoulder is
C.        The shoulder is
D.       The shoulder
The clause above is using adverb of time, “while” is the connector and it’s followed by the subject “the shoulder”
4.         A car has several sections with moving parts, ___ of those parts is essential.
A.                 Good lubrication
B.                 Well lubricated
C.                 And good lubrication
D.                And well lubricated
The clause above uses the coordinate connector "and", followed by the subject "good lubrication"
5.         Bears cannot see well ___ small eyes.
A.        Bears have
B.        Because having
C.        Because they have
D.       Because of bears
The clause above is adverb clause, “because” is the connector and followed by the subject “they”
6.         ____ at the isthmus of Panama, so animals were able to migrate between North and South America.
A.           A land bridge existed
B.           When a land bridge existed
C.           A land bridge
D.          With a land bridge
The sentence above uses the coordinate connector "so", so the first clause needs its own subject and verb: "a land bridge existed"
7.         ____ mostly made of granite, it also contains some human-made materials.
A.        The Empire State Building
B.        The Empire State Building is
C.        Although the Empire State Building is
D.       Although the Empire State Building is built
The clause above is an adverb clause of concession: "although" is the connector, followed by the subject "the Empire State Building"
8.        Pressure differences make the eardrum vibrate ____ the ear.
A.        Enters the sound wave
B.        As sound wave
C.        Sound waves enter
D.       As sound waves enter
The clause above is adverb of time, “as” is the connector and followed by the subject “sound waves”
9.         An optical microscope magnifies as much as 2,000 times, but an electron microscope ___ as much as a million times.
A.           Magnifying
B.           It magnifies
C.           Can magnify
D.          Magnify it
The clause above is coordinate connector “but” and followed by the subject“an electron microscope”
10.     If scientific estimates are accurate, ____ with the Earth about 20,000 years ago.
A.           The Canon Diablo meteorite collided
B.           The collision of the Canon Diablo meteorite
C.           The Canon Diablo meteorite colliding
D.          Colliding the Canon Diablo meteorite
The clause above is an adverb clause of condition: "if" is the connector, followed by the subject "scientific estimates"
11.      Nuclear power can be produced by fusion, ____ produced by fission.
A.           It can also be
B.           It can also
C.           And it can also be
D.          And it can also
The clause above is coordinate connector “and” and followed by the subject “ it”
12.     ____ igneous rocks may be changed into gneisses,
A.           The temperature is high
B.           If the temperature is high
C.           High temperatures
D.          If high temperature
The clause above is adverb of condition, “if” is the connector and followed by the subject “ the temperature”
13.     Because a family of birds set up housekeeping in Joel Chandler Harris’s mailbox when the birds were in need of a place to stay, ___ the Wren’s Nest.
A.           The home is named
B.           So the home is named
C.           Naming the home
D.          The home’s name
The clause above is an adverb clause of reason: "because" is the connector, followed by the subject "a family of birds"
14.     Today the true story of ___ at Little Bighorn remains a mystery.
A.           Happened
B.           It happened
C.           What happened
D.          What happening
The clause above is noun clause, “what” is connector and subject, and followed by verb “happened”
15.     For more than a decade, ___ that certain species are becoming scarce.
A.           The warning or bird-watchers
B.           Warn the bird-watchers
C.           Bird-watchers have warned
D.          A warning for bird-watchers
The clause above is noun clause, “that” is connector and followed by the subject “certain species”
16.     Early in the eighteenth century, Halley accurately predicted when ___ of 1682 would return.
A.           The comet
B.           Was the comet
C.           The comet was
D.          Had the comet
The clause above is a noun clause: "when" is the connector, followed by the subject "the comet of 1682"
17.      No single factor explains why ___ vary so greatly among individuals.
A.           Aging affects
B.           The effects of aging
C.           Aging has an effect
D.          The aging effect
The clause above is a noun clause: "why" is the connector, followed by the subject "the effects of aging"
18.     Lack of clarity about ___ the party in the coming year will be removed at the party’s convention.
A.              Will lead
B.              Lead
C.              They will lead
D.             Who will lead
The clause above is noun clause, “who” is connector and subject, and followed by verb “will lead”
19.     We do not ___ the bow drill was first developed for woodworking or fire making.
A.                 Whether it
B.                 Know whether it
C.                 Know whether
D.                Sure whether
The clause above is noun clause, “whether” is connector, and followed by the subject “it”
20.    Minute Man National Historical Park is a monument to where ____.
A.                 The beginning of the Revolutionary War
B.                 In the beginning of the Revolutionary War
C.                 The Revolutionary War to begin
D.                The Revolutionary War to begin
The clause above is a noun clause: "where" is the connector, and it must be followed by a subject and verb
21.     Tests on the colors of cars were conducted at the University of California to determine ___ the safest colors for cars.
A.                 Which
B.                 Which were
C.                 If
D.                How were
The clause above is a noun clause: "which" is the connector and subject, followed by the verb "were"
22.    The National Institute of Dental Research estimates ___ in fluoridated areas have about 25 percent less tooth decay than children elsewhere.
A.              For school children
B.              School children’s
C.              That school children
D.             That for school children
The clause above is a noun clause: "that" is the connector, followed by the subject "school children"
23.    The process of photosynthesis explains how ___ able to use the energy in sunlight to manufacture foods from the simple chemicals in air and water.
A.           Green plants
B.           Green plants are
C.           Planting greens
D.          With green plants are
The clause above is noun clause. “how” is connector and followed by the subject “green plants”
24.    The Moon’s gravity pulls water on the near side of the Earth toward the Moon, and this is what ____ tides to occur.
A.           The cause
B.           Causes
C.           Causing
D.          The cause of
The clause above is coordinate connector “and”, and followed by the subject “this”
25.    It is not clear whether the subdivisions of the neocortex ___ units.
A.        Individual
B.        Are individual
C.        They are individual
D.       Individually
The clause above is a noun clause: "whether" is the connector, followed by the subject "the subdivisions of the neocortex"
26.    Modern humans, who first appeared about 600,000 years ago, ____ Homo sapiens.
A.           calling
B.           were called
C.           they called
D.          they were called
The sentence contains an adjective clause: "who" is the connector and subject, followed by the verb "appeared"; the main clause still needs its verb, "were called"
27.     The first writing ___ evidence of is on Mesopotamian clay tablets.
A.           We
B.           That we
C.           Has
D.          That we have
The clause above is adjective clause. “that” is connector and followed by the subject “we”
28.    ___ drought-resistant plants which store water in fleshy tissue.
A.           Succulents are
B.           Succulents
C.           They are succulents
D.          Succulents which are
The sentence contains an adjective clause: "which" is the connector and subject, followed by the verb "store"; the main clause needs the subject and verb "Succulents are"
29.    Benjamin Kablesky, whom ___ as Jack Benny, was a famous comedian in vaudeville and on radio and television.
A.           Most people’s knowledge
B.           Most people know
C.           Knowing most people
D.          The knowledge of most people
The clause above is an adjective clause: "whom" is the connector, followed by the subject "most people" and the verb "know"
30.    ___ that hunted other animals tended to have very narrow, sharp, curved claws.
A.        For dinosaurs
B.        Dinosaurs are known
C.        Dinosaurs
D.       Like dinosaurs
The clause above is an adjective clause: "that" is the connector and subject, followed by the verb "hunted"
31.     The first eyeglasses had convex lenses for the aged who ___ farsighted.
A.        Had become
B.        They had become
C.        Becoming
D.       It became
The clause above is an adjective clause: "who" is the connector and subject, followed by the verb "had become"
32.    Chimney Rock, ___ 500 feet above the North Platte River, has eroded considerably in the last two centuries.
A.        Stands
B.        Is standing
C.        It stands
D.       Which stands
The clause above is an adjective clause: "which" is the connector and subject, followed by the verb "stands"
33.    ____ that accompany recurring bouts of severe depression reduce bone density.
A.        It changes hormones
B.        Hormonal changes
C.        The hormones change
D.       The changes in hormones is
The clause above is an adjective clause: "that" is the connector and subject, followed by the verb "accompany"
34.    Willa Cather is an author ___ for her evocative and memorable vision of frontier prairie life.
A.                 Whom readers
B.                 The praise of readers
C.                 Whom praising
D.                Whom readers praise
The clause above is adjective clause, “ whom” is connector and followed by the subject “readers”
35.    Mars’s tiny moon Phobos is a small mountain of rock that ___ from the asteroid belt by Mars’s gravitational pull.
A.              Was probably captured
B.              It probably
C.              The probable capture
D.             Probably the capture
The clause above is adjective clause, “that” is connector and subject, and followed by verb “was”
36.    Some scientists think ___ be a planet but a moon of Neptune.
A.              That Pluto does not seem
B.              Not Pluto
C.              Pluto that might not
D.             That Pluto might not
The clause above is a noun clause: "that" is the connector, followed by the subject "Pluto"
37.     Fort Union was the site of what ___ principal fur-trading post on the upper Missouri River.
A.              The
B.              Being the
C.              Was the
D.             It was the
The clause above is noun clause, “what” is connector and subject, and followed by verb “was”
38.    Since ___ commercial risk, it has to appeal to a large audience to justify its cost.
A.              The face of the movie
B.              Moving faces
C.              A movie faces
D.             To face a movie
The clause above is adverb clause, “since” is connector and followed by the subject “a movie”