OPTIMIZING THE EXECUTION OF DATA-INTENSIVE WORKFLOWS IN MAPREDUCE FRAMEWORKS

Abstract: In science, the analysis of large volumes of data is modeled as a scientific experiment, which raises issues such as data storage and formats, the chaining of programs, and the definition of the execution environment used during simulations. Scientists have used scientific workflows to express and computationally model analyses and experiments on data. Because of the processing complexity of these workflows and the volume of data involved, they have been executed in distributed environments, together with parallel programming models. The MapReduce (MR) model has been widely used to specify scientific experiments, especially those that analyze large volumes of data. Frameworks built on MR, such as Hadoop and Spark, allow data to be manipulated and analyzed in a distributed way, and they also manage the execution of experiments in distributed environments.

However, executing data-intensive workflows in distributed environments managed by MR frameworks still presents open challenges. Although these frameworks are fairly easy to install, many parameters must be configured before a workflow can be executed. In addition, exploiting the parallelism offered by the environment requires partitioning the input data. Several data partitioning strategies exist, and aspects such as the application's knowledge of the partitioning criterion, the partition size, and load balancing all affect workflow performance. To execute an MR workflow efficiently, the scientist must therefore tune several configuration parameters of both the framework and the partitioning of the input data. The correlations among these parameters, the workflow, and the execution environment make this tuning a complex and difficult task for the scientist.

This thesis proposes an approach for tuning the execution parameters of MR workflows in distributed environments. The approach is based on (i) collecting the workflow execution time under several values of the configuration parameters, (ii) applying machine learning techniques to find the parameters and values that execute the workflow efficiently, and (iii) using the same techniques to build a predictive model that estimates in advance the performance of a parameter configuration in later executions of the MR workflow. The experiments presented in this thesis showed that the proposed approach to parameter configuration leads to efficient performance of the MR workflow in a distributed environment.

Keywords: Scientific Workflows; Parameter Tuning; Machine Learning.
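The three steps of the approach can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the two tuned parameters (`num_partitions`, `executor_memory_gb`), the synthetic `run_workflow` cost function that stands in for actually timing a Hadoop or Spark job, and the use of k-nearest-neighbour regression as the learning technique. The abstract does not say which parameters or which machine learning techniques the thesis uses.

```python
import itertools
import math
import random


def run_workflow(num_partitions, executor_memory_gb, rng):
    """Hypothetical stand-in for running the MR workflow and timing it.

    A real setup would submit the job to the framework and measure
    wall-clock time; this synthetic cost only mimics a plausible shape.
    """
    per_partition_cost = 400.0 / num_partitions   # more partitions, more parallelism
    scheduling_overhead = 0.5 * num_partitions    # but more tasks to schedule
    memory_penalty = 20.0 / executor_memory_gb    # less memory, more spilling
    noise = rng.uniform(-1.0, 1.0)                # run-to-run variation
    return per_partition_cost + scheduling_overhead + memory_penalty + noise


def collect_samples(partition_values, memory_values, rng):
    """Step (i): time the workflow under each sampled configuration."""
    return [((p, m), run_workflow(p, m, rng))
            for p, m in itertools.product(partition_values, memory_values)]


def predict_time(samples, config, k=3):
    """Step (iii): predict the time of a configuration as the mean time
    of its k nearest sampled configurations (log-scaled parameters)."""
    def distance(a, b):
        return math.dist([math.log2(x) for x in a],
                         [math.log2(x) for x in b])

    nearest = sorted(samples, key=lambda s: distance(s[0], config))[:k]
    return sum(t for _, t in nearest) / len(nearest)


def best_configuration(samples, candidates, k=3):
    """Step (ii): choose the candidate with the lowest predicted time."""
    return min(candidates, key=lambda c: predict_time(samples, c, k))


rng = random.Random(0)
samples = collect_samples([4, 8, 16, 32, 64, 128], [2, 4, 8], rng)
candidates = list(itertools.product([6, 12, 24, 48, 96], [2, 4, 8]))
best = best_configuration(samples, candidates)
print("suggested configuration:", best)
print("predicted time (s):", round(predict_time(samples, best), 1))
```

In this sketch the same fitted model serves both purposes: it ranks candidate configurations that were never executed, and it gives a time estimate for any configuration before a later run. Any other regressor could be swapped in for `predict_time` without changing the surrounding structure.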