    • Alignment method with application to gas chromatography / mass spectrometry screening

      Hitchcock, Jonathan James; Li, Dayou; Maple, Carsten; Keech, Malcolm (IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2012-09-08)
      The paper presents a new spectrum-based alignment method that is able to precisely adjust the retention times of Gas Chromatography / Mass Spectrometry data so that corresponding scans of two samples can be chosen for comparison purposes. It includes the ability to do precise alignment within fractions of a scan; this is equivalent to doing sub-pixel registration of images.
    • Applications of concurrent access patterns in web usage mining

      Lu, Jing; Keech, Malcolm; Wang, Cuiqing; University of Bedfordshire (Springer, 2013-08)
      This paper builds on the original data mining and modelling research which proposed the discovery of novel structural relation patterns, and applies the approach to web usage mining. The focus here is on concurrent access patterns (CAP), where an overarching framework sets out the methodology for post-processing web access patterns. Data pre-processing, pattern discovery and pattern analysis all proceed in association with access pattern mining, CAP mining and CAP modelling. Access patterns are pruned and selected as necessary, allowing further CAP mining and modelling to be pursued in the search for the most interesting concurrent access patterns. It is shown that higher level CAPs can be modelled in a way which brings greater structure to bear on the process of knowledge discovery. Experiments with real-world datasets highlight the applicability of the approach in web navigation.
    • Applications of concurrent sequential patterns in protein data mining

      Wang, Cuiqing; Keech, Malcolm; Lu, Jing; University of Bedfordshire (Springer, 2014)
      Protein sequences of the same family typically share common patterns which imply their structural function and biological relationship. Traditional sequential pattern mining focuses on frequently occurring sub-sequences. However, a number of applications, such as protein motif mining, motivate the search for more structured patterns. This paper builds on the original idea of structural relation patterns and applies the Concurrent Sequential Patterns (ConSP) mining approach in bioinformatics. Specifically, a new method and algorithms are presented using support vectors as the data structure for the extraction of novel patterns in protein sequences. Experiments with real-world protein datasets highlight the applicability of the ConSP methodology in protein data mining. The results show the potential for knowledge discovery in the field of protein structure identification.
    • Concurrent sequential patterns mining and frequent partial orders modelling

      Lu, Jing; Keech, Malcolm; Chen, Weiru; Wang, Cuiqing; University of Bedfordshire (Inderscience Publishers, 2013)
      Structural relation patterns have been introduced to extend the search for complex patterns often hidden behind large sequences of data, with applications in, for example, the analysis of customer behaviour, bioinformatics and web mining. In the overall context of frequent itemset mining, the focus of attention in the structural relation patterns family has been on the mining of concurrent sequential patterns, where a companion approach to graph-based modelling can be illuminating. This paper sets out to establish the connection between concurrent sequential patterns and frequent partial orders, which are well known for discovering ordering information from sequence databases. It is shown that frequent partial orders can be derived from concurrent sequential patterns, under certain conditions, and worked examples highlight the relationship. Experiments with real and synthetic datasets contrast the results of the data mining and modelling involved.
    • Discovering exclusive patterns in frequent sequences

      Chen, Weiru; Lu, Jing; Keech, Malcolm (Inderscience Publishers, 2010)
    • GC/MS data reduction using retention time alignment and spectral subtraction

      Hitchcock, Jonathan James; Li, Dayou; Maple, Carsten; Keech, Malcolm; Teale, Phil; Hudson, Simon; University of Bedfordshire; HFL Ltd (British Mass Spectrometry Society, 2007-09-10)
      The identification of chemical compounds in a complex mixture is a challenge. In the context of drug surveillance for the sporting world, large-scale screening of urine and blood samples is undertaken using methods such as GC/MS with low-resolution mass spectrometers that measure integer values of m/z ratio. The analysis of the GC/MS data can be automated using standard mass spectrometry software to detect peaks in the chromatograms and to search a library of mass spectra of known drugs. Because of noise and the presence of co-eluting compounds, the mass spectra are usually not exactly the same as those in the library. The match quality for a genuine match can be quite low, and the library search settings must be sufficiently sensitive so as not to miss positive samples. Therefore many false matches are reported for checking and validation by human analysts, and, since almost all the samples are negative, this process of checking is tedious, time-consuming and cost-inefficient. The usual technique to remove unwanted background is to subtract the spectrum of an adjacent scan of the same sample. Our proposed method instead subtracts the spectrum of a second similar sample. The intention is that any contributions from a substance common to the two samples will be eliminated, and that any substance that is in the first sample but not in the second will still be recorded in the subtracted dataset. Assuming a suitable second sample is available that does not contain banned substances, those that are present in the first sample can be more easily detected. The subtraction is applied to each scan of the test sample. For this to work, it is essential that retention times are precisely aligned so that a corresponding scan of the second sample can be chosen. Although many methods of alignment are described in the literature, simple linear alignment based on a correlation measure is found to be sufficient. 
      It is also necessary to scale the spectra being subtracted to allow for differences between the two samples in the concentration of the common compounds. Subtracting a similar dataset will reduce the number of peaks to be considered, and our hypothesis is that a library search of the resulting dataset will produce fewer false matches than the same library search applied to the original data. An experiment was carried out to test this, and the number of false matches was indeed found to be reduced. The more similar the second sample was to the first, the better the result. It was also verified that true matches of compounds of interest are still reported by the library search of the subtracted data.
    • A novel hybrid tabu search approach to container loading

      Liu, Jiamin; Yue, Yong; Dong, Zongran; Maple, Carsten; Keech, Malcolm (Elsevier, 2011-04)
      The container loading problem, which is significant for a number of industrial sectors, aims to obtain a high space utilisation in the container while satisfying practical constraints. This paper presents a novel hybrid tabu search approach to the container loading problem. A loading heuristic is devised that combines heuristic strategies with a handling method for remaining spaces to generate optimal loading arrangements of boxes with stability considered. The tabu search technique, which covers the encoding, evaluation criteria and configuration of neighbourhood and candidate solutions, is used to improve the performance of the loading heuristic. Experimental results with benchmark data show that the hybrid approach provides better space utilisation than the published approaches under the condition that all loaded boxes have one hundred percent support from below. Moreover, it is shown that the hybrid tabu search can solve problems with weight limit and weight distribution constraints using real-world data.
    • Protein data modelling for concurrent sequential patterns

      Lu, Jing; Keech, Malcolm; Wang, Cuiqing; University of Bedfordshire (DEXA, 2014-09)
      Protein sequences from the same family typically share common patterns which imply their structural function and biological relationship. The challenge of identifying protein motifs is often addressed through mining frequent itemsets and sequential patterns, where post-processing is a useful technique. Earlier work has shown that Concurrent Sequential Patterns mining can be applied in bioinformatics, e.g. to detect frequently occurring concurrent protein sub-sequences. This paper presents a companion approach to data modelling and visualisation, applying it to real-world protein datasets from the PROSITE and NCBI databases. The results show the potential for graph-based modelling in representing the integration of higher level patterns common to all or nearly all of the protein sequences.
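
The procedure outlined in the abstract of "GC/MS data reduction using retention time alignment and spectral subtraction" above (align the two samples, scale the reference spectrum, subtract scan by scan) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the function names, the representation of a sample as a list of scans (each a list of intensities at integer m/z values), the use of Pearson correlation of the total ion chromatograms over whole-scan offsets, the least-squares scale factor, and the clipping of negative intensities to zero. Fractional-scan alignment is not covered.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def best_shift(test_tic, ref_tic, max_shift):
    """Return the whole-scan offset s for which ref_tic[i + s] best matches test_tic[i]."""
    best, best_r = 0, float("-inf")
    for s in range(-max_shift, max_shift + 1):
        # Keep only the scans for which both chromatograms have a value.
        pairs = [(test_tic[i], ref_tic[i + s])
                 for i in range(len(test_tic))
                 if 0 <= i + s < len(ref_tic)]
        if len(pairs) < 3:
            continue
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if r > best_r:
            best, best_r = s, r
    return best


def subtract_scan(test_spec, ref_spec):
    """Subtract a least-squares-scaled reference spectrum, clipping negatives to zero."""
    denom = sum(b * b for b in ref_spec)
    # Assumed scaling rule: the factor that minimises the squared residual.
    scale = sum(a * b for a, b in zip(test_spec, ref_spec)) / denom if denom else 0.0
    return [max(a - scale * b, 0.0) for a, b in zip(test_spec, ref_spec)]


def subtract_sample(test, ref, max_shift=5):
    """Align ref to test by total ion current, then subtract scan by scan."""
    shift = best_shift([sum(s) for s in test], [sum(s) for s in ref], max_shift)
    out = []
    for i, spec in enumerate(test):
        j = i + shift
        if 0 <= j < len(ref):
            out.append(subtract_scan(spec, ref[j]))
        else:
            out.append(list(spec))  # no corresponding reference scan: keep unchanged
    return out
```

In this sketch a compound present in both samples is removed even when its concentration differs, because the scale factor absorbs the difference, while intensity at m/z values absent from the reference scan survives the subtraction. The least-squares factor is only one possible choice; it overestimates the scale when a compound unique to the test sample shares m/z channels with the common background.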