Data-driven Process Systems Engineering Lab

*2009 AIChE Annual Meeting*, Nashville

*November 12, 2009*

Current trends in pharmaceutical product development focus on the fundamental understanding and characterization of all process unit operations, which will lead to the development of predictive models that minimize the variability in product performance. The pharmaceutical industry is one of the most characteristic examples of an industry where high-quality products, with strict predefined performance specifications, must be produced. The successful completion of this task is obstructed by the fact that the mechanical and physicochemical properties of the pharmaceutical APIs and excipients used are not completely understood [1]. Other industrial fields (food, ceramics, catalyst manufacturing, etc.) that deal with the processing of materials in powder form face the same challenge, simply because powders share characteristics of both fluids and solids. It is very difficult to model the behavior of granular materials from first principles as they undergo key processes, such as feeding, mixing and tablet pressing, without making a series of assumptions and introducing a corresponding amount of uncertainty. Data-driven modeling methods, on the other hand, connect the system state variables (input, internal and output variables) using only limited knowledge of the physical behavior of the system together with the available experimental data. Data analysis techniques, however, are required to "prepare" the data sets used for modeling in order to minimize uncertainty. Data analysis poses a challenging set of problems, such as high dimensionality, computationally expensive objectives and incomplete or missing information.

Until now, several models have been proposed for individual process operations. Methods such as Monte Carlo simulations, particle-dynamic simulations, heuristic models and kinetic theory models have been proposed to characterize powder flow during mixing in continuous blenders. Based on the continuum approximation, Sudah et al. developed a dispersion model to describe the flow of powders in continuous blenders [2]. Berthiaux et al. proposed a model of continuous mixing based on the theory of Markov chains [3], while Ghaderi investigated axial mixing in continuous mixers and proposed a model using the Danckwerts-Weinekotter formula [4]. A number of papers have employed the discrete-element method (DEM) to model granular flow in continuous mixers. The drawback of these simulations, however, is their computational cost: in the work of Portillo et al. it is observed that, as the number of particles in the system increases, a DEM simulation becomes computationally prohibitive [5]. Work has also been done to improve the design of feeders ([6], [7]); however, limited attention has been directed at the analytical study of feeder performance in terms of the input process variables and operating conditions. Unlike the first-principles-based models above, data-driven methods were used by Jia et al. to develop a less computationally expensive model based on experimental data alone [8]. Specifically, Kriging and Response Surface methodologies were used to predict the output flow variability of a loss-in-weight feeder and the residence time distribution of a continuous mixer. It is known, however, that the uniformity of the output flow of a feeding unit affects the behavior and output of the blender and, consequently, the properties of the final product (tablet). No work has yet been done on designing an integrated model that relates both the properties of the material processed and the operating parameters to the properties of the final product.
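To illustrate the flavor of such data-driven metamodels, a Response Surface fit can be sketched as a quadratic least-squares regression over an operating variable. The data below are invented for illustration only; the actual feeder data, and the Kriging implementation of [8], are not reproduced here.

```python
# Minimal response-surface sketch (hypothetical data, not the authors'
# feeder measurements): fit y = b0 + b1*x + b2*x^2 by solving the
# normal equations, then predict at an unsampled operating point.

def fit_quadratic(xs, ys):
    # Accumulate X^T X and X^T y for the design matrix rows [1, x, x^2].
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for x, y in zip(xs, ys):
        row = [1.0, x, x * x]
        for i in range(3):
            b[i] += row[i] * y
            for j in range(3):
                A[i][j] += row[i] * row[j]
    # Gaussian elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back-substitution.
    coef = [0.0] * 3
    for r in (2, 1, 0):
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, 3))) / A[r][r]
    return coef

def predict(coef, x):
    return coef[0] + coef[1] * x + coef[2] * x * x
```

A Kriging metamodel would replace the fixed quadratic basis with a spatial-correlation model of the residuals, at the cost of fitting covariance hyperparameters.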
The design space in pharmaceutical development is defined as the collection of all multidimensional combinations of input variables that characterize material attributes (composition, cohesion, size distribution, etc.) and processing parameters (flow rate, mixer fill level, agitator speed, blade pattern, residence time, etc.). Thus, large multivariate data matrices are our resources for creating models that capture the effects of multiple variables on the final desired output, as well as the interactions and correlations between variables. Working towards combining all the collected data into an integrated model, we are often faced with the problem of having to work with incomplete data sets. Missing information may arise from insufficient sampling during experiments, from errors during data acquisition, or when data fusion techniques are required to combine information from different sources. By simply discarding the missing observations, valuable information for subsequent analysis of the data can be lost. An exploratory analysis of the popular imputation algorithms in the literature was performed by Huang et al. in order to compare their performance and suggest guidelines for selecting the appropriate algorithm for a specific application [9]. In this work, different imputation techniques, such as mean imputation, k-nearest neighbor imputation and multiple imputation, are incorporated into the predictive model algorithm for pharmaceutical unit operations. Experimental data are obtained for the different unit processes of tablet manufacturing (feeding, mixing and tablet pressing) and used for a thorough assessment of the efficiency and effect of the different methodologies on the average prediction error of the output variables. It is observed that different methods can be appropriate for different percentages of missing information.
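Two of the simpler imputation schemes named above can be sketched on a toy data matrix, with `None` marking a missing entry. The matrix and the choice of k are hypothetical; this is an illustration of the techniques, not the algorithm assessed in this work.

```python
# Mean imputation vs. k-nearest-neighbor imputation on a toy matrix
# (rows = samples, columns = variables; None = missing entry).

def mean_impute(rows):
    # Replace each missing entry with its column mean over observed values.
    ncols = len(rows[0])
    means = []
    for j in range(ncols):
        obs = [r[j] for r in rows if r[j] is not None]
        means.append(sum(obs) / len(obs))
    return [[means[j] if r[j] is None else r[j] for j in range(ncols)]
            for r in rows]

def knn_impute(rows, k=2):
    # Fill a missing entry with the mean of that column over the k rows
    # closest in the jointly observed coordinates (Euclidean distance).
    out = [list(r) for r in rows]
    for i, r in enumerate(rows):
        for j, v in enumerate(r):
            if v is None:
                dists = []
                for i2, r2 in enumerate(rows):
                    if i2 == i or r2[j] is None:
                        continue
                    shared = [(a, b) for a, b in zip(r, r2)
                              if a is not None and b is not None]
                    if shared:
                        d = sum((a - b) ** 2 for a, b in shared) ** 0.5
                        dists.append((d, r2[j]))
                dists.sort(key=lambda t: t[0])
                neigh = [val for _, val in dists[:k]]
                out[i][j] = sum(neigh) / len(neigh)
    return out
```

Mean imputation ignores between-variable correlations; k-NN exploits them, which is one reason the preferred method changes with the amount and pattern of missingness.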
A more sophisticated model-based method is formulated based on Maximum Likelihood Estimation (the Expectation-Maximization algorithm, EM) to estimate the parameters of the incomplete data sets and thus "fill in" the missing values at the positions where they would "most likely" be found. Principal Component Analysis is combined with the EM algorithm to produce a hybrid method for imputing missing values. This approach was first introduced by Andrews and Wentzell, who demonstrated that principal components extracted with up to 10% missing information were comparable to the projections of the complete data [10]. This iterative procedure is proposed because it simultaneously provides:

· reduction of dimensionality, by sorting the variables based on the correlations (covariance) between them,
· minimization of noise, and
· imputation of missing values.

Cross-validation and bootstrap techniques are used for validation of model selection and further accuracy estimation [11]. Data-driven predictive models have already shown very promising results in characterizing the behavior of separate unit operations in pharmaceutical product development [8]. Our ultimate goal is to gradually increase the complexity of the predictive model, allowing it to have as many input and output variables as necessary, in order to design the set of process parameters that results in optimal performance for a given set of material properties. Such a model would ideally ensure product quality and reduced manufacturing cost without computationally restrictive requirements. However, the key to successful modeling with such techniques is the proper preparation of the data, which includes recognition of outliers and imputation of missing values. The proposed algorithm will identify the imputation method that minimizes the uncertainty of missing and unreliable information, depending on the pattern and amount of missingness.
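The spirit of the EM/PCA hybrid can be sketched with a deliberately simplified single-component variant: fill missing entries with column means, fit a rank-1 principal-component model by alternating least squares, overwrite only the missing cells with the model reconstruction, and iterate. A real implementation would center the data, retain several components and use a convergence test; this toy version is not the authors' algorithm.

```python
# Simplified EM/PCA-style iterative imputation with a rank-1 model
# (rows = samples, columns = variables; None = missing entry).

def empca_rank1_impute(rows, iters=50):
    mask = [[v is not None for v in r] for r in rows]
    ncols = len(rows[0])
    # Initialization (E-step analogue): column-mean fill.
    X = [list(r) for r in rows]
    for j in range(ncols):
        obs = [r[j] for r in rows if r[j] is not None]
        m = sum(obs) / len(obs)
        for r in X:
            if r[j] is None:
                r[j] = m
    u = [1.0] * len(X)
    for _ in range(iters):
        # M-step analogue: alternating least squares for X ~ u v^T.
        v = [sum(u[i] * X[i][j] for i in range(len(X)))
             / sum(ui * ui for ui in u) for j in range(ncols)]
        nv = sum(vj * vj for vj in v)
        u = [sum(v[j] * X[i][j] for j in range(ncols)) / nv
             for i in range(len(X))]
        # E-step analogue: overwrite only the originally missing cells
        # with the current rank-1 reconstruction.
        for i in range(len(X)):
            for j in range(ncols):
                if not mask[i][j]:
                    X[i][j] = u[i] * v[j]
    return X
```

Because only the missing cells are ever overwritten, the observed data are preserved exactly, and the imputed values converge toward the points most consistent with the dominant correlation structure, which is the behavior exploited in [10].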
Incorporating these methods into interpolatory regression-based models, such as Kriging and Response Surface methods, will be a very important step towards the completion of the desired integrated model.

References:
1. Portillo, P., et al., Quality by Design Methodology for Development and Scale-up of Batch Mixing Processes. Journal of Pharmaceutical Innovation, 2008. 3(4): p. 258-270.
2. Sudah, O.S., et al., Quantitative characterization of mixing processes in rotary calciners. Powder Technology, 2002. 126(2): p. 166-173.
3. Berthiaux, H., Marikh, K., Mizonov, V., Ponomarev, D., and Barantzeva, Modeling continuous powder mixing by means of the theory of Markov chains. Particulate Science and Technology, 2004. 22: p. 379-389.
4. Ghaderi, A., On characterization of continuous mixing of particulate materials. Particulate Science and Technology, 2003. 21: p. 271-282.
5. Portillo, P.M., Muzzio, F.J., and Ierapetritou, M.G., Hybrid DEM-compartment modeling approach for granular mixing. AIChE Journal, 2007. 53(1): p. 119-128.
6. Gundogdu, M.Y., Design improvements on rotary valve particle feeders used for obtaining suspended airflows. Powder Technology, 2004. 139(1): p. 76-80.
7. Reist, P.C. and Taylor, L., Development and operation of an improved turntable dust feeder. Powder Technology, 2000. 107(1-2): p. 36-42.
8. Jia, Z., Davis, E., Muzzio, F., and Ierapetritou, M., Predictive Modeling for Mixing and Feeding Processes using Kriging and Response Surface. In press.
9. Huang, S.H., Kothamasu, R., and Rapur, N., Iterative imputation algorithms for process modeling with incomplete data. Intelligent Data Analysis, 2007. 11(2): p. 189-202.
10. Andrews, D.T. and Wentzell, P.D., Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer. Analytica Chimica Acta, 1997. 350(3): p. 341-352.
11. Kohavi, R., A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In: IJCAI, 1995.